Problem indexing PDFs that are scanned documents

  • estevaoacioli
    Participant
    5 years ago #11183

    I have a library of 9 thousand PDFs, and almost half of them are scanned documents. Whenever I try to index my PDFs, I come across the following error:

    Wrong format returned for the extracted file, can not extract the <body>.

    Do you have any way to fix this? Or at least to configure the plugin so that when it finds a PDF that cannot be indexed, it skips it and moves on to the next one?
    Editing the PDFs one by one and checking the box to exclude them from indexing is not something I can do, because there are thousands of them. If there were some way to do it in bulk, I know exactly which PDFs need to be ignored, since they all have names with the same prefix.

    wpsolr
    Keymaster
    5 years ago #11185

    How many PDFs should be ignored?

    estevaoacioli
    Participant
    5 years ago #11186

    There are 1699 files. In the image, I searched for “PAD”, which is the prefix common to all of them.

    Screenshot from the media library
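
    Since the files to ignore share a name prefix, selecting them in bulk could look roughly like the sketch below. This is plain PHP only, not WPSOLR code: `partition_by_prefix` and the sample file names are made up for illustration, and how the resulting list would then be excluded from indexing in the plugin is not covered here.

```php
<?php
// Hypothetical helper, not part of WPSOLR: split file names into those to
// skip and those to index, based on a shared file-name prefix.
function partition_by_prefix(array $filenames, string $prefix): array
{
    $result = ['skip' => [], 'index' => []];
    foreach ($filenames as $name) {
        // basename() drops any directory part so only the file name is compared.
        $matches = strncmp(basename($name), $prefix, strlen($prefix)) === 0;
        $result[$matches ? 'skip' : 'index'][] = $name;
    }
    return $result;
}

// Made-up sample names; "PAD" is the prefix mentioned in the post above.
$parts = partition_by_prefix(
    ['2019/05/PAD-0001.pdf', 'report.pdf', 'PAD-0002.pdf'],
    'PAD'
);
print_r($parts['skip']);
```

    The comparison is case-sensitive, so a file named "pad-0003.pdf" would not be matched by this sketch.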

    estevaoacioli
    Participant
    5 years ago #11187

    I believe the correct behavior for the plugin in these cases would be this: whenever it finds a PDF that cannot be indexed, it skips to the next one and displays an alert message giving the name and ID of the PDF as well as the reason it was ignored during indexing. Another option would be to display the alert messages at the end of the indexing process.
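
    To make the idea concrete, here is a minimal sketch of that skip-and-report behavior. It is not the plugin's actual code: `index_all`, the `$extract` callback, the IDs and the file names are all hypothetical stand-ins, and only the error text is taken from this thread.

```php
<?php
// Hypothetical sketch, not WPSOLR code: index each document, skip the ones
// whose text extraction fails, and collect a report of what was skipped.
function index_all(array $documents, callable $extract): array
{
    $indexed = [];
    $skipped = [];
    foreach ($documents as $id => $name) {
        try {
            // $extract stands in for the real text-extraction step.
            $indexed[$id] = $extract($name);
        } catch (\Exception $e) {
            // Record the ID, name and reason, then move on to the next file.
            $skipped[] = ['id' => $id, 'name' => $name, 'reason' => $e->getMessage()];
        }
    }
    return ['indexed' => $indexed, 'skipped' => $skipped];
}

// Stand-in extractor: pretends that files starting with "PAD" contain no text.
$extract = function (string $name): string {
    if (strncmp($name, 'PAD', 3) === 0) {
        throw new \Exception('Wrong format returned for the extracted file, cannot extract the <body>.');
    }
    return "text of $name";
};

$result = index_all([101 => 'report.pdf', 102 => 'PAD-0001.pdf'], $extract);

// Alert messages shown at the end of the run, as suggested above.
foreach ($result['skipped'] as $s) {
    echo "Skipped #{$s['id']} ({$s['name']}): {$s['reason']}\n";
}
```

    Catching the exception per document keeps one unreadable file from stopping the whole run, while the collected list still tells you which files were left out and why.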

    wpsolr
    Keymaster
    5 years ago #11188

    Do those PDFs contain only an image, with no text?

    wpsolr
    Keymaster
    5 years ago #11190

    Could you send me a link to one of the PDFs, so I can run an indexing test?

    wpsolr
    Keymaster
    5 years ago #11194

    Thanks. I got it.

    wpsolr
    Keymaster
    5 years ago #11195

    Here is a quick fix to help you complete your indexing.

    In the file /wp-content/plugins/wpsolr-pro/wpsolr/core/classes/engines/class-wpsolr-abstractindexclient.php, comment out the line:
    throw new \Exception( 'Wrong format returned for the extracted file, cannot extract the <body>.' );

    If that works, the fix will be part of the next release.

    estevaoacioli
    Participant
    5 years ago #11197

    OK, the plugin is working now.
    Sorry for the bad English 😉

    Thank you very much
