Indexing error: Invalid UTF-8 middle byte 0x3c

  • mdruilhe
    Participant
    # 6 months, 3 weeks ago

    Hello,
    After recreating the index (self-hosted Apache Solr) because of the integer cast error (following https://www.wpsolr.com/forums/topic/indexing-return-error/), I’m now getting the following error:
    Solr HTTP error: Bad Request (400) { "responseHeader":{ "status":400, "QTime":60}, "error":{ "metadata":[ "error-class","org.apache.solr.common.SolrException", "root-error-class","java.io.CharConversionException"], "msg":"Invalid UTF-8 middle byte 0x3c (at char #155325, byte #155368)", "code":400}}
    I suppose it’s an encoding problem.
    Do you have any idea what could be causing it?

    Thanks in advance
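
    For readers unfamiliar with the message: in UTF-8, the byte that starts a multi-byte character must be followed by continuation bytes in the range 0x80 to 0xBF. 0x3c is the ASCII character "<", so the parser met a "<" where it expected the rest of a multi-byte character. A minimal Python sketch of how such a fault can be located in a byte string follows; the function name and the sample bytes are made up for illustration and are not taken from the failing request.

```python
from typing import Optional


def describe_utf8_fault(data: bytes) -> Optional[dict]:
    """Return details about the first UTF-8 decoding fault in data, or None if it is valid."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        # err.start is the offset of the byte that opened the broken sequence;
        # err.end is the offset of the first byte that could not continue it.
        next_byte = data[err.end] if err.end < len(data) else None
        return {
            "lead_offset": err.start,
            "reason": err.reason,
            "next_byte": next_byte,
        }


# Illustrative input: a lone lead byte 0xC3 followed by "<" (0x3c).
sample = "intér".encode("utf-8") + b"\xc3" + b"<b>"
fault = describe_utf8_fault(sample)
print(fault["lead_offset"], hex(fault["next_byte"]))
```

    Running a check like this over the text extracted from each document is one way to see which documents contain broken sequences before they are sent for indexing.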

    wpsolr
    Keymaster
    # 6 months, 3 weeks ago

    You can try the procedure described at https://www.wpsolr.com/forums/topic/wpsolr-error-on-term/

    (Basically, you find the ID of the post causing the error by indexing with debug mode ON, then either edit the post or set its status to “no searched”.)
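
    The idea behind that procedure can be sketched in a few lines of Python. This is only an illustration of the "one document at a time" approach, not WPSOLR code: `index_one` is a hypothetical callable standing in for whatever sends a single post to the search server and raises an exception when the server rejects it.

```python
from typing import Callable, Iterable, List


def find_failing_ids(post_ids: Iterable[int], index_one: Callable[[int], None]) -> List[int]:
    """Index posts one at a time and collect the IDs that the server rejects."""
    failing: List[int] = []
    for post_id in post_ids:
        try:
            # Hypothetical call: sends exactly one post (batch size of one).
            index_one(post_id)
        except Exception as err:  # broad on purpose: any rejection marks the post as failing
            print(f"Post {post_id} failed: {err}")
            failing.append(post_id)
    return failing
```

    The returned IDs are the candidates to edit or exclude from the index before running the full indexing again.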

    mdruilhe
    Participant
    # 6 months, 3 weeks ago

    Thank you for your help and quick reply.
    I was finally able to re-index the site by using a batch size of one and excluding the failing posts one at a time.
    I had to exclude 5 media items from the index, without understanding what exactly was wrong with their contents.
    Maybe the problem lies in a faulty UTF-8 conversion of the PDF contents.
    The error is the same in all cases: “Invalid UTF-8 middle byte 0x3c”
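
    If the extracted text really does contain broken UTF-8, two generic precautions are to replace invalid sequences before sending the text, and to shorten text on character boundaries rather than byte boundaries (cutting a multi-byte character such as "é" in half leaves a lone lead byte behind). The Python sketch below illustrates both; the function names are hypothetical, and the thread does not confirm that either situation is what happens inside the plugin.

```python
def sanitize_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, replacing each invalid sequence with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten text to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")[:max_bytes]
    # The input was valid, so only a trailing partial character can be invalid;
    # errors="ignore" drops it instead of leaving a lone lead byte.
    return encoded.decode("utf-8", errors="ignore")


print(sanitize_utf8(b"int\xc3<b>"))
print(truncate_utf8("intérêt", 7))
```

    Replacing invalid bytes loses the original character, so this is a workaround for getting documents indexed, not a fix for the faulty conversion itself.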

    For reference, here is an example of the indexing error:

    
    Posts excluded from the index:
    4163,2448,2454,2872,3045

    ******** DEBUG ACTIVATED - Beginning of new loop (batch size) *******

    ******** DEBUG ACTIVATED - Query documents from last post date *******

    Query:
    SELECT ID, post_modified, post_parent, post_type FROM 6276mjYKvM_posts AS A WHERE ((post_modified = '2021-12-14 08:25:55' AND ID > 4227) OR (post_modified > '2021-12-14 08:25:55')) AND ( post_status IN ('publish') AND ( post_type = 'post' ) ) AND ID NOT IN (4163,2448,2454,2872,3045) ORDER BY post_modified ASC, ID ASC LIMIT 1

    Last post date:
    2021-12-14 08:25:55

    Last post ID:
    4227

    ******** DEBUG ACTIVATED - No more documents, end of document loop *******

    Posts excluded from the index:
    4163,2448,2454,2872,3045

    ******** DEBUG ACTIVATED - Beginning of new loop (batch size) *******

    ******** DEBUG ACTIVATED - Query documents from last post date *******

    Query:
    SELECT ID, post_modified, post_parent, post_type FROM 6276mjYKvM_posts AS A WHERE ((post_modified = '2021-04-07 17:45:29' AND ID > 242) OR (post_modified > '2021-04-07 17:45:29')) AND ( post_status IN ('publish') AND ( post_type = 'page' ) ) AND ID NOT IN (4163,2448,2454,2872,3045) ORDER BY post_modified ASC, ID ASC LIMIT 1

    Last post date:
    2021-04-07 17:45:29

    Last post ID:
    242

    ******** DEBUG ACTIVATED - No more documents, end of document loop *******

    Posts excluded from the index:
    4163,2448,2454,2872,3045

    ******** DEBUG ACTIVATED - Beginning of new loop (batch size) *******

    ******** DEBUG ACTIVATED - Query documents from last post date *******

    Query:
    SELECT ID, post_modified, post_parent, post_type FROM 6276mjYKvM_posts AS A WHERE ((post_modified = '2021-12-10 17:58:17' AND ID > 4359) OR (post_modified > '2021-12-10 17:58:17')) AND ( ( post_status='publish' OR post_status='inherit' ) AND post_type='attachment' AND post_mime_type in ('application/java','application/javascript','application/msword','application/octet-stream','application/octet-stream','application/onenote','application/oxps','application/pdf','application/rar','application/rtf','application/ttaf+xml','application/vnd.apple.keynote','application/vnd.apple.numbers','application/vnd.apple.pages','application/vnd.ms-access','application/vnd.ms-excel','application/vnd.ms-excel.addin.macroEnabled.12','application/vnd.ms-excel.sheet.binary.macroEnabled.12','application/vnd.ms-excel.sheet.macroEnabled.12','application/vnd.ms-excel.template.macroEnabled.12','application/vnd.ms-powerpoint','application/vnd.ms-powerpoint.addin.macroEnabled.12','application/vnd.ms-powerpoint.presentation.macroEnabled.12','application/vnd.ms-powerpoint.slide.macroEnabled.12','application/vnd.ms-powerpoint.slideshow.macroEnabled.12','application/vnd.ms-powerpoint.template.macroEnabled.12','application/vnd.ms-project','application/vnd.ms-word.document.macroEnabled.12','application/vnd.ms-word.template.macroEnabled.12','application/vnd.ms-write','application/vnd.ms-xpsdocument','application/vnd.oasis.opendocument.chart','application/vnd.oasis.opendocument.database','application/vnd.oasis.opendocument.formula','application/vnd.oasis.opendocument.graphics','application/vnd.oasis.opendocument.presentation','application/vnd.oasis.opendocument.spreadsheet','application/vnd.oasis.opendocument.text','application/vnd.openxmlformats-officedocument.presentationml.presentation','application/vnd.openxmlformats-officedocument.presentationml.slide','application/vnd.openxmlformats-officedocument.presentationml.slideshow','application/vnd.openxmlformats-officedocument.presentationml.template','application/vnd.openxmlformats-officedocument.spreadsheetml.sheet','application/vnd.openxmlformats-officedocument.spreadsheetml.template','application/vnd.openxmlformats-officedocument.wordprocessingml.document','application/vnd.openxmlformats-officedocument.wordprocessingml.template','application/wordperfect','application/x-7z-compressed','application/x-gzip','application/x-tar','application/zip','text/calendar','text/css','text/csv','text/html','text/plain','text/richtext','text/tab-separated-values','text/vtt') ) AND ID NOT IN (4163,2448,2454,2872,3045) ORDER BY post_modified ASC, ID ASC LIMIT 1

    Last post date:
    2021-12-10 17:58:17

    Last post ID:
    4359

    Post ID to be indexed (attachment):
    2344

    Attachment to be sent:
    { "id": "2344", "PID": "2344", "type": "attachment", "meta_type_s": "post_type", "displaymodified": "2021-12-14T08:30:35Z", "title": "S.Arsene", "title_s": "S.Arsene", "permalink": "https:\/\/www.afsop.fr\/congres\/nantes-2008-la-chirurgie-des-muscles-oculo-moteurs\/s-arsene\/", "post_parent_i": "2336", "post_status_s": "inherit", "content": ". {\n \"S.Arsene.pdf\":\" Y-a-t-il un int\u00e9r\u00eat \u00e0 r\u00e9aliser des tests de d\u00e9compensation d\u2019angle dans le cadre du bilan pr\u00e9op\u00e9ratoire des strabismes divergents intermittents et constants ? Auteurs : ARSENE Sophie, SANTALLIER Martine, DUSCHENE Sophie, SULTANIM Ang\u00e9lina, PISELLA Pierre Jean Introduction : Le but de cette \u00e9tude \u00e9tait de comparer les r\u00e9sultats des tests de d\u00e9compensations d\u2019angle au sein de deux populations de strabismes divergents diff\u00e9rents par l\u2019\u00e9tat de la correspondance r\u00e9tinienne. Sujets et m\u00e9thode : notre \u00e9tude r\u00e9trospective portait sur 148 patients : 102 en correspondance r\u00e9tinienne normale (CRN) et 46 en correspondance r\u00e9tinienne anormale (CRA), tous corrig\u00e9s avec une correction optique totale. Les tests de d\u00e9compensation utilis\u00e9s \u00e9taient : +3 dioptries, -3 dioptries et l\u2019\u00e9preuve de Marlow. R\u00e9sultats : dans la population des CRN l\u2019angle moyen de loin \u00e9tait de 24,65 dioptries et il \u00e9tait statistiquement diff\u00e9rent avec les angles moyens de loin mesur\u00e9s avec -3 dioptries et apr\u00e8s une heure d\u2019occlusion. Pour l\u2019angle moyen de pr\u00e8s (20,34 dioptries), il \u00e9tait statistiquement diff\u00e9rent avec les 3 tests. Dans la population des CRA l\u2019angle moyen de loin \u00e9tait de 31,97 dioptries et il n\u2019\u00e9tait pas statistiquement diff\u00e9rent avec les 3 tests. Pour l\u2019angle moyen de pr\u00e8s (34,58 dioptries) il \u00e9tait statistiquement diff\u00e9rent avec -3 dioptries. Discussion : dans la population en CRN ces tests nous ont permis de la valeur maximale de l\u2019angle en divergence notamment de pr\u00e8s. Dans la population des CRA la pertinence de ces tests a pu \u00eatre remise en cause. Le test d\u2019occlusion d\u2019une heure reste utile en cas de doute sur la correspondance r\u00e9tinienne. Conclusion : il semble donc important de r\u00e9aliser ces tests de d\u00e9compensation d\u2019angle dans le bilan pr\u00e9op\u00e9ratoire notamment pour les strabismes divergents en CRN. \",\n \"S.Arsene.pdf_metadata\":,\n \"pdf:PDFVersion\",,\n \"xmp:CreatorTool\",,\n \"stream_content_type\",,\n \"access_permission:modify_annotations\",,\n \"access_permission:can_print_degraded\",,\n \"dc:creator\",,\n \"language\",,\n \"dcterms:created\",,\n \"Last-Modified\",,\n \"dcterms:modified\",,\n \"dc:format\",,\n \"Last-Save-Date\",,\n \"pdf:docinfo:creator_tool\",,\n \"access_permission:fill_in_form\",,\n \"pdf:docinfo:modified\",,\n \"stream_name\",,\n \"meta:save-date\",,\n \"pdf:encrypted\",,\n \"modified\",,\n \"Content-Type\",,\n \"stream_size\",,\n \"pdf:docinfo:creator\",,\n \"X-Parsed-By\",,\n \"creator\",,\n \"dc:language\",,\n \"meta:author\",,\n \"meta:creation-date\",,\n \"stream_source_info\",,\n \"created\",,\n \"access_permission:extract_for_accessibility\",,\n \"access_permission:assemble_document\",,\n \"xmpTPg:NPages\",,\n \"Creation-Date\",,\n \"resourceName\",,\n \"access_permission:extract_content\",,\n \"access_permission:can_print\",,\n \"Author\",,\n \"producer\",,\n \"access_permission:can_modify\",,\n \"pdf:docinfo:producer\",,\n \"pdf:docinfo:created\",]}\n 2018\/01\/S.Arsene.pdf", "snippet_s": ". {\n \"S.Arsene.pdf\":\" Y-a-t-il un int\u00e9r?", "post_author_s": "3", "author": "AFSOP", "menu_order_i": 0, "PID_i": "2344", "author_s": "https:\/\/www.afsop.fr\/author\/admin_afsop\/", "displaydate": "2018-01-22T02:27:45Z", "displaydate_dt": "2018-01-22T02:27:45Z", "date": "2018-01-22T01:27:45Z", "displaymodified_dt": "2021-12-14T08:30:35Z", "modified": "2021-12-14T07:30:35Z", "displaymodified_dt_i": "1639470635000", "displaymodified_dt_y_i": "2021", "displaymodified_dt_ym_i": "12", "displaymodified_dt_yw_i": "50", "displaymodified_dt_yd_i": "348", "displaymodified_dt_md_i": "14", "displaymodified_dt_wd_i": "3", "displaymodified_dt_dh_i": "8", "displaymodified_dt_dm_i": "30", "displaymodified_dt_ds_i": "35", "displaydate_dt_i": "1516588065000", "displaydate_dt_y_i": "2018", "displaydate_dt_ym_i": "1", "displaydate_dt_yw_i": "4", "displaydate_dt_yd_i": "22", "displaydate_dt_md_i": "22", "displaydate_dt_wd_i": "2", "displaydate_dt_dh_i": "2", "displaydate_dt_dm_i": "27", "displaydate_dt_ds_i": "45", "date_i": "1516584465000", "date_y_i": "2018", "date_ym_i": "1", "date_yw_i": "4", "date_yd_i": "22", "date_md_i": "22", "date_wd_i": "2", "date_dh_i": "1", "date_dm_i": "27", "date_ds_i": "45", "displaydate_i": "1516588065000", "displaydate_y_i": "2018", "displaydate_ym_i": "1", "displaydate_yw_i": "4", "displaydate_yd_i": "22", "displaydate_md_i": "22", "displaydate_wd_i": "2", "displaydate_dh_i": "2", "displaydate_dm_i": "27", "displaydate_ds_i": "45", "modified_i": "1639467035000", "modified_y_i": "2021", "modified_ym_i": "12", "modified_yw_i": "50", "modified_yd_i": "348", "modified_md_i": "14", "modified_wd_i": "3", "modified_dh_i": "7", "modified_dm_i": "30", "modified_ds_i": "35", "comments": [], "numcomments": 0, "categories_str": [], "categories": [ "2018\/01\/S.Arsene.pdf" ], "flat_hierarchy_categories_str": [], "non_flat_hierarchy_categories_str": [], "tags": [], "_wp_attached_file_str": [ "2018\/01\/S.Arsene.pdf" ] }

    {"nb_results":0,"status":400,"message":"Solr HTTP error: Bad Request (400)\n{\n \"responseHeader\":{\n \"status\":400,\n \"QTime\":0},\n \"error\":{\n \"metadata\":[\n \"error-class\",\"org.apache.solr.common.SolrException\",\n \"root-error-class\",\"java.io.CharConversionException\"],\n \"msg\":\"Invalid UTF-8 middle byte 0x3c (at char #3735, byte #127)\",\n \"code\":400}}\n","indexing_complete":false}
    

    Best regards,
    M.

    wpsolr
    Keymaster
    # 6 months, 3 weeks ago

    Great. And thanks for the feedback!

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic.