Indexing error : Invalid UTF-8 middle byte 0x3c

    mdruilhe
    Participant
    2 years, 10 months ago #28636

    Hello,
    After recreating the index (self-hosted Apache Solr) because of the integer cast error (following https://www.wpsolr.com/forums/topic/indexing-return-error/), I now get the following error:
    Solr HTTP error: Bad Request (400) { "responseHeader":{ "status":400, "QTime":60}, "error":{ "metadata":[ "error-class","org.apache.solr.common.SolrException", "root-error-class","java.io.CharConversionException"], "msg":"Invalid UTF-8 middle byte 0x3c (at char #155325, byte #155368)", "code":400}}
    I suppose it’s an encoding problem.
    Do you have any idea what could cause this?

    Thanks in advance

    wpsolr
    Keymaster
    2 years, 10 months ago #28637

    You can try the procedure described at https://www.wpsolr.com/forums/topic/wpsolr-error-on-term/

    (Basically, it is about finding the ID of the post that triggers the error by indexing with debug mode ON, then editing the post or updating its status to “no searched”.)
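
    The procedure above can be sketched as a loop that indexes one document at a time and records the IDs that fail. This is only an illustration in Python, not WPSOLR code: `find_failing_ids`, `index_one` and `fake_index_one` are hypothetical names, and the callback stands in for whatever sends a single post to Solr.

```python
from typing import Callable, Iterable, List


def find_failing_ids(post_ids: Iterable[int],
                     index_one: Callable[[int], None]) -> List[int]:
    """Index posts one at a time (batch size 1) and collect the IDs that fail."""
    failing: List[int] = []
    for post_id in post_ids:
        try:
            index_one(post_id)  # hypothetical: sends exactly one post to Solr
        except Exception:
            # Remember the post and carry on, so one bad post does not stop the run.
            failing.append(post_id)
    return failing


def fake_index_one(post_id: int) -> None:
    """Stand-in indexer for the demo: rejects two made-up 'bad' post IDs."""
    if post_id in {4163, 2448}:
        raise ValueError("Solr HTTP error: Bad Request (400)")


if __name__ == "__main__":
    print(find_failing_ids([242, 2448, 4163, 4227], fake_index_one))
```

    The IDs returned are the candidates to edit or exclude from the index before running a normal batch again.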

    mdruilhe
    Participant
    2 years, 10 months ago #28643

    Thank you for your help and quick reply.
    I was finally able to re-index the site by using a batch size of one and eliminating one post at a time.
    I had to exclude 5 media items from the index, without understanding what exactly was wrong with their contents.
    Maybe the problem lies in a wrong UTF-8 conversion of the PDF contents.
    The error is the same in all cases: “Invalid UTF-8 middle byte 0x3c”
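
    A note on reading that message: 0x3c is the ASCII code for `<`, while every byte after the first one of a multibyte UTF-8 character must lie in the range 0x80 to 0xBF. So the parser apparently saw the start of a multibyte character and then a plain `<` where a continuation byte was expected. The Python sketch below only illustrates these byte classes; `classify_utf8_byte` and `decodes_as_utf8` are hypothetical helper names, not part of Solr or WPSOLR.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a single byte value according to its role in UTF-8."""
    if b < 0x80:
        return "ascii"         # one-byte character, e.g. '<' (0x3C)
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx: second or later byte of a character
    if 0xC2 <= b <= 0xF4:
        return "lead"          # first byte of a 2-, 3- or 4-byte character
    return "invalid"           # 0xC0, 0xC1 and 0xF5..0xFF never occur in UTF-8


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print(classify_utf8_byte(0x3C), repr(chr(0x3C)))  # the byte named in the error
    print(classify_utf8_byte(0xC3))                   # first byte of 'ê' in UTF-8
    print(decodes_as_utf8(b"\xc3<"))                  # lead byte followed by '<'
```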

    For information, here is an example of the indexing error:

    
    Posts excluded from the index:<br><b>4163,2448,2454,2872,3045</b><br><br>******** DEBUG ACTIVATED - Beginning of new loop (batch size) *******<br><br>******** DEBUG ACTIVATED - Query documents from last post date *******<br><br>Query:<br><b>SELECT ID, post_modified, post_parent, post_type FROM 6276mjYKvM_posts AS A WHERE ((post_modified = '2021-12-14 08:25:55' AND ID > 4227) OR (post_modified > '2021-12-14 08:25:55')) AND ( post_status IN ('publish') AND ( post_type = 'post' ) ) AND ID NOT IN (4163,2448,2454,2872,3045) ORDER BY post_modified ASC, ID ASC LIMIT 1</b><br><br>Last post date:<br><b>2021-12-14 08:25:55</b><br><br>Last post ID:<br><b>4227</b><br><br>******** DEBUG ACTIVATED - No more documents, end of document loop *******<br><br>Posts excluded from the index:<br><b>4163,2448,2454,2872,3045</b><br><br>******** DEBUG ACTIVATED - Beginning of new loop (batch size) *******<br><br>******** DEBUG ACTIVATED - Query documents from last post date *******<br><br>Query:<br><b>SELECT ID, post_modified, post_parent, post_type FROM 6276mjYKvM_posts AS A WHERE ((post_modified = '2021-04-07 17:45:29' AND ID > 242) OR (post_modified > '2021-04-07 17:45:29')) AND ( post_status IN ('publish') AND ( post_type = 'page' ) ) AND ID NOT IN (4163,2448,2454,2872,3045) ORDER BY post_modified ASC, ID ASC LIMIT 1</b><br><br>Last post date:<br><b>2021-04-07 17:45:29</b><br><br>Last post ID:<br><b>242</b><br><br>******** DEBUG ACTIVATED - No more documents, end of document loop *******<br><br>Posts excluded from the index:<br><b>4163,2448,2454,2872,3045</b><br><br>******** DEBUG ACTIVATED - Beginning of new loop (batch size) *******<br><br>******** DEBUG ACTIVATED - Query documents from last post date *******<br><br>Query:<br><b>SELECT ID, post_modified, post_parent, post_type FROM 6276mjYKvM_posts AS A WHERE ((post_modified = '2021-12-10 17:58:17' AND ID > 4359) OR (post_modified > '2021-12-10 17:58:17')) AND ( ( post_status='publish' OR post_status='inherit' ) AND 
post_type='attachment' AND post_mime_type in ('application/java','application/javascript','application/msword','application/octet-stream','application/octet-stream','application/onenote','application/oxps','application/pdf','application/rar','application/rtf','application/ttaf+xml','application/vnd.apple.keynote','application/vnd.apple.numbers','application/vnd.apple.pages','application/vnd.ms-access','application/vnd.ms-excel','application/vnd.ms-excel.addin.macroEnabled.12','application/vnd.ms-excel.sheet.binary.macroEnabled.12','application/vnd.ms-excel.sheet.macroEnabled.12','application/vnd.ms-excel.template.macroEnabled.12','application/vnd.ms-powerpoint','application/vnd.ms-powerpoint.addin.macroEnabled.12','application/vnd.ms-powerpoint.presentation.macroEnabled.12','application/vnd.ms-powerpoint.slide.macroEnabled.12','application/vnd.ms-powerpoint.slideshow.macroEnabled.12','application/vnd.ms-powerpoint.template.macroEnabled.12','application/vnd.ms-project','application/vnd.ms-word.document.macroEnabled.12','application/vnd.ms-word.template.macroEnabled.12','application/vnd.ms-write','application/vnd.ms-xpsdocument','application/vnd.oasis.opendocument.chart','application/vnd.oasis.opendocument.database','application/vnd.oasis.opendocument.formula','application/vnd.oasis.opendocument.graphics','application/vnd.oasis.opendocument.presentation','application/vnd.oasis.opendocument.spreadsheet','application/vnd.oasis.opendocument.text','application/vnd.openxmlformats-officedocument.presentationml.presentation','application/vnd.openxmlformats-officedocument.presentationml.slide','application/vnd.openxmlformats-officedocument.presentationml.slideshow','application/vnd.openxmlformats-officedocument.presentationml.template','application/vnd.openxmlformats-officedocument.spreadsheetml.sheet','application/vnd.openxmlformats-officedocument.spreadsheetml.template','application/vnd.openxmlformats-officedocument.wordprocessingml.document','application/vnd.openxmlformats
-officedocument.wordprocessingml.template','application/wordperfect','application/x-7z-compressed','application/x-gzip','application/x-tar','application/zip','text/calendar','text/css','text/csv','text/html','text/plain','text/richtext','text/tab-separated-values','text/vtt') ) AND ID NOT IN (4163,2448,2454,2872,3045) ORDER BY post_modified ASC, ID ASC LIMIT 1</b><br><br>Last post date:<br><b>2021-12-10 17:58:17</b><br><br>Last post ID:<br><b>4359</b><br><br>Post ID to be indexed (attachment):<br><b>2344</b><br><br>Attachment to be sent:<br><b>{ "id": "2344", "PID": "2344", "type": "attachment", "meta_type_s": "post_type", "displaymodified": "2021-12-14T08:30:35Z", "title": "S.Arsene", "title_s": "S.Arsene", "permalink": "https:\/\/www.afsop.fr\/congres\/nantes-2008-la-chirurgie-des-muscles-oculo-moteurs\/s-arsene\/", "post_parent_i": "2336", "post_status_s": "inherit", "content": ". {\n \"S.Arsene.pdf\":\" Y-a-t-il un int\u00e9r\u00eat \u00e0 r\u00e9aliser des tests de d\u00e9compensation d\u2019angle dans le cadre du bilan pr\u00e9op\u00e9ratoire des strabismes divergents intermittents et constants ? Auteurs : ARSENE Sophie, SANTALLIER Martine, DUSCHENE Sophie, SULTANIM Ang\u00e9lina, PISELLA Pierre Jean Introduction : Le but de cette \u00e9tude \u00e9tait de comparer les r\u00e9sultats des tests de d\u00e9compensations d\u2019angle au sein de deux populations de strabismes divergents diff\u00e9rents par l\u2019\u00e9tat de la correspondance r\u00e9tinienne. Sujets et m\u00e9thode : notre \u00e9tude r\u00e9trospective portait sur 148 patients : 102 en correspondance r\u00e9tinienne normale (CRN) et 46 en correspondance r\u00e9tinienne anormale (CRA), tous corrig\u00e9s avec une correction optique totale. Les tests de d\u00e9compensation utilis\u00e9s \u00e9taient : +3 dioptries, -3 dioptries et l\u2019\u00e9preuve de Marlow. 
R\u00e9sultats : dans la population des CRN l\u2019angle moyen de loin \u00e9tait de 24,65 dioptries et il \u00e9tait statistiquement diff\u00e9rent avec les angles moyens de loin mesur\u00e9s avec -3 dioptries et apr\u00e8s une heure d\u2019occlusion. Pour l\u2019angle moyen de pr\u00e8s (20,34 dioptries), il \u00e9tait statistiquement diff\u00e9rent avec les 3 tests. Dans la population des CRA l\u2019angle moyen de loin \u00e9tait de 31,97 dioptries et il n\u2019\u00e9tait pas statistiquement diff\u00e9rent avec les 3 tests. Pour l\u2019angle moyen de pr\u00e8s (34,58 dioptries) il \u00e9tait statistiquement diff\u00e9rent avec -3 dioptries. Discussion : dans la population en CRN ces tests nous ont permis de la valeur maximale de l\u2019angle en divergence notamment de pr\u00e8s. Dans la population des CRA la pertinence de ces tests a pu \u00eatre remise en cause. Le test d\u2019occlusion d\u2019une heure reste utile en cas de doute sur la correspondance r\u00e9tinienne. Conclusion : il semble donc important de r\u00e9aliser ces tests de d\u00e9compensation d\u2019angle dans le bilan pr\u00e9op\u00e9ratoire notamment pour les strabismes divergents en CRN. 
\",\n \"S.Arsene.pdf_metadata\":,\n \"pdf:PDFVersion\",,\n \"xmp:CreatorTool\",,\n \"stream_content_type\",,\n \"access_permission:modify_annotations\",,\n \"access_permission:can_print_degraded\",,\n \"dc:creator\",,\n \"language\",,\n \"dcterms:created\",,\n \"Last-Modified\",,\n \"dcterms:modified\",,\n \"dc:format\",,\n \"Last-Save-Date\",,\n \"pdf:docinfo:creator_tool\",,\n \"access_permission:fill_in_form\",,\n \"pdf:docinfo:modified\",,\n \"stream_name\",,\n \"meta:save-date\",,\n \"pdf:encrypted\",,\n \"modified\",,\n \"Content-Type\",,\n \"stream_size\",,\n \"pdf:docinfo:creator\",,\n \"X-Parsed-By\",,\n \"creator\",,\n \"dc:language\",,\n \"meta:author\",,\n \"meta:creation-date\",,\n \"stream_source_info\",,\n \"created\",,\n \"access_permission:extract_for_accessibility\",,\n \"access_permission:assemble_document\",,\n \"xmpTPg:NPages\",,\n \"Creation-Date\",,\n \"resourceName\",,\n \"access_permission:extract_content\",,\n \"access_permission:can_print\",,\n \"Author\",,\n \"producer\",,\n \"access_permission:can_modify\",,\n \"pdf:docinfo:producer\",,\n \"pdf:docinfo:created\",]}\n 2018\/01\/S.Arsene.pdf", "snippet_s": ". 
{\n \"S.Arsene.pdf\":\" Y-a-t-il un int\u00e9r?", "post_author_s": "3", "author": "AFSOP", "menu_order_i": 0, "PID_i": "2344", "author_s": "https:\/\/www.afsop.fr\/author\/admin_afsop\/", "displaydate": "2018-01-22T02:27:45Z", "displaydate_dt": "2018-01-22T02:27:45Z", "date": "2018-01-22T01:27:45Z", "displaymodified_dt": "2021-12-14T08:30:35Z", "modified": "2021-12-14T07:30:35Z", "displaymodified_dt_i": "1639470635000", "displaymodified_dt_y_i": "2021", "displaymodified_dt_ym_i": "12", "displaymodified_dt_yw_i": "50", "displaymodified_dt_yd_i": "348", "displaymodified_dt_md_i": "14", "displaymodified_dt_wd_i": "3", "displaymodified_dt_dh_i": "8", "displaymodified_dt_dm_i": "30", "displaymodified_dt_ds_i": "35", "displaydate_dt_i": "1516588065000", "displaydate_dt_y_i": "2018", "displaydate_dt_ym_i": "1", "displaydate_dt_yw_i": "4", "displaydate_dt_yd_i": "22", "displaydate_dt_md_i": "22", "displaydate_dt_wd_i": "2", "displaydate_dt_dh_i": "2", "displaydate_dt_dm_i": "27", "displaydate_dt_ds_i": "45", "date_i": "1516584465000", "date_y_i": "2018", "date_ym_i": "1", "date_yw_i": "4", "date_yd_i": "22", "date_md_i": "22", "date_wd_i": "2", "date_dh_i": "1", "date_dm_i": "27", "date_ds_i": "45", "displaydate_i": "1516588065000", "displaydate_y_i": "2018", "displaydate_ym_i": "1", "displaydate_yw_i": "4", "displaydate_yd_i": "22", "displaydate_md_i": "22", "displaydate_wd_i": "2", "displaydate_dh_i": "2", "displaydate_dm_i": "27", "displaydate_ds_i": "45", "modified_i": "1639467035000", "modified_y_i": "2021", "modified_ym_i": "12", "modified_yw_i": "50", "modified_yd_i": "348", "modified_md_i": "14", "modified_wd_i": "3", "modified_dh_i": "7", "modified_dm_i": "30", "modified_ds_i": "35", "comments": [], "numcomments": 0, "categories_str": [], "categories": [ "2018\/01\/S.Arsene.pdf" ], "flat_hierarchy_categories_str": [], "non_flat_hierarchy_categories_str": [], "tags": [], "_wp_attached_file_str": [ "2018\/01\/S.Arsene.pdf" ] 
}</b><br><br>{"nb_results":0,"status":400,"message":"Solr HTTP error: Bad Request (400)\n{\n &quot;responseHeader&quot;:{\n &quot;status&quot;:400,\n &quot;QTime&quot;:0},\n &quot;error&quot;:{\n &quot;metadata&quot;:[\n &quot;error-class&quot;,&quot;org.apache.solr.common.SolrException&quot;,\n &quot;root-error-class&quot;,&quot;java.io.CharConversionException&quot;],\n &quot;msg&quot;:&quot;Invalid UTF-8 middle byte 0x3c (at char #3735, byte #127)&quot;,\n &quot;code&quot;:400}}\n","indexing_complete":false}
    

    Best regards,
    M.

    wpsolr
    Keymaster
    2 years, 10 months ago #28647

    Great. And thanks for the feedback!

    AndreasJ
    Participant
    1 year, 6 months ago #33164

    I had a similar problem. I think the cause is a multibyte character that gets truncated in the field snippet_s. See the question mark at the end of the snippet:
    "snippet_s": ". {\n \"S.Arsene.pdf\":\" Y-a-t-il un int\u00e9r?"

    This originates from the content field:
    "content": ". {\n \"S.Arsene.pdf\":\" Y-a-t-il un int\u00e9r\u00eat \u00e0 r\u00e9aliser
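
    To make the mechanism concrete, here is a minimal Python sketch (not the plugin's PHP code) of what happens when UTF-8 text is cut at a byte offset instead of a character offset. The sample string is taken from the content above; the function names and the limit of 19 are assumptions chosen so that the cut lands in the middle of "ê".

```python
SAMPLE = "Y-a-t-il un intérêt à réaliser"


def truncate_bytes(text: str, limit: int) -> bytes:
    """Cut after `limit` bytes, the way a byte-oriented substring does."""
    return text.encode("utf-8")[:limit]


def truncate_chars(text: str, limit: int) -> bytes:
    """Cut after `limit` characters, the way a multibyte-aware substring does."""
    return text[:limit].encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    by_bytes = truncate_bytes(SAMPLE, 19)
    by_chars = truncate_chars(SAMPLE, 19)
    # The byte cut ends on 0xC3, the first half of the two-byte "ê".
    print(by_bytes, is_valid_utf8(by_bytes))
    # The character cut keeps "ê" whole, so the result stays valid.
    print(by_chars.decode("utf-8"), is_valid_utf8(by_chars))
```

    Note that a character-based cut can yield more bytes than the same limit applied to bytes, so the limit changes meaning slightly when switching from one to the other.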

    Fix for Version 23.0:

    I suggest fixing the code at wpsolr-pro/wpsolr/core/classes/models/post/class-wpsolr-model-post.php:257 by replacing substr with mb_substr.

    Patch:

    
    diff --git a/wp-content/plugins/wpsolr-pro/wpsolr/core/classes/models/post/class-wpsolr-model-post.php b/wp-content/plugins/wpsolr-pro/wpsolr/core/classes/models/post/class-wpsolr-model-post.php
    index a13bde04..2b10a92c 100644
    --- a/wp-content/plugins/wpsolr-pro/wpsolr/core/classes/models/post/class-wpsolr-model-post.php
    +++ b/wp-content/plugins/wpsolr-pro/wpsolr/core/classes/models/post/class-wpsolr-model-post.php
    @@ -253,8 +253,8 @@ class WPSOLR_Model_Post extends WPSOLR_Model_Abstract {
     				static::$highlight_fragsize = WPSOLR_Service_Container::getOption()->get_search_max_length_highlighting();
     			}
     			$snippet                                                                   = strip_tags( $pexcerpt );
    -			$this->solarium_document_for_update[ WpSolrSchema::_FIELD_NAME_SNIPPET_S ] =
    -				( ! empty( $snippet ) ) ? $snippet : substr( $this->solarium_document_for_update[ WpSolrSchema::_FIELD_NAME_CONTENT ], 0, static::$highlight_fragsize );
    +			$this->solarium_document_for_update[ WpSolrSchema::_FIELD_NAME_SNIPPET_S ] = 
    +				( ! empty( $snippet ) ) ? $snippet : mb_substr( $this->solarium_document_for_update[ WpSolrSchema::_FIELD_NAME_CONTENT ], 0, static::$highlight_fragsize );
    		}
    
    wpsolr
    Keymaster
    1 year, 6 months ago #33166

    @AndreasJ Did mb_substr() fix the issue for you?

    AndreasJ
    Participant
    1 year, 6 months ago #33168

    Yes, this works for me.
    This resolved problems with around 15 documents out of 1500.

    Would you consider making this change in an upcoming release?

    I found similar code in class-wpsolr-model-abstract.php:132 that might need the same change (and perhaps there are more places?).

    I might add that I am using:
    Apache Solr (Opensolr SW-SOLR-8-0),
    PHP 8.0 and WordPress 6.2

    wpsolr
    Keymaster
    1 year, 6 months ago #33169

    Thanks for the feedback. This is an old unsolved problem.

    I added this issue to the next release: https://www.wpsolr.com/forums/topic/release-23-1/
