Problem indexing PDFs that are scanned documents

Tagged: PDF

estevaoacioli
Participant
5 years ago #11183
I have a library of 9,000 PDFs, and almost half of them are scanned documents. Whenever I try to index my PDFs, I get the following error:
Wrong format returned for the extracted file, cannot extract the <body>.
Is there any way to fix this? Or at least to configure the plugin so that when it finds a PDF that cannot be indexed, it skips it and moves on to the next one?
Editing the PDFs one by one and checking the box to exclude each from indexing is not feasible, because there are thousands of them. If there were a way to do it in bulk, I know exactly which PDFs need to be ignored, since all of their names share the same prefix.
wpsolr
Keymaster
5 years ago #11185
How many PDFs should be ignored?
estevaoacioli
Participant
5 years ago #11186
There are 1699 files. In the image, I searched for “PAD”, which is the prefix common to all of them.
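As a rough illustration of selecting those files in bulk, here is a minimal PHP sketch that splits a list of file names by a shared prefix. The function name `split_by_prefix` and the sample file names are hypothetical and are not part of WPSOLR. How the resulting list would then be passed to the plugin is left open.

```php
<?php
// Hypothetical helper, not part of WPSOLR: separate file names that start
// with a given prefix (to be ignored) from the rest (to be indexed).
function split_by_prefix(array $file_names, string $prefix): array
{
    $ignored = [];
    $indexed = [];
    foreach ($file_names as $name) {
        // strncmp compares only the first strlen($prefix) bytes.
        if (strncmp($name, $prefix, strlen($prefix)) === 0) {
            $ignored[] = $name;
        } else {
            $indexed[] = $name;
        }
    }
    return ['ignored' => $ignored, 'indexed' => $indexed];
}

// Sample names are made up for illustration.
$result = split_by_prefix(['PAD-0001.pdf', 'report.pdf', 'PAD-0002.pdf'], 'PAD');
```

The comparison is case-sensitive, so a file named "pad-0003.pdf" would not be treated as ignored in this sketch.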
estevaoacioli
Participant
5 years ago #11187
I believe the correct behavior for the plugin in these cases would be this: whenever it finds a PDF that cannot be indexed, it skips to the next one and displays an alert message giving the name and ID of the PDF and the reason it was ignored during indexing. Another option would be to display the alert messages at the end of the indexing process.
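The behavior described above could look roughly like the following PHP sketch. It is not WPSOLR code: `index_all`, the `$extract` callback, and the shape of the document arrays are assumptions made for illustration. Each document whose extraction throws is skipped, and its ID, name, and reason are collected so they can be shown at the end of the run.

```php
<?php
// Hypothetical sketch, not WPSOLR code: index every document, skip the ones
// whose text extraction fails, and collect a report of what was skipped.
function index_all(array $documents, callable $extract): array
{
    $indexed = [];
    $skipped = [];
    foreach ($documents as $doc) {
        try {
            // $extract stands in for whatever extracts the text of a file.
            $indexed[$doc['id']] = $extract($doc);
        } catch (\Exception $e) {
            // Record ID, name and reason, then move on to the next document.
            $skipped[] = [
                'id'     => $doc['id'],
                'name'   => $doc['name'],
                'reason' => $e->getMessage(),
            ];
        }
    }
    return ['indexed' => $indexed, 'skipped' => $skipped];
}

// Made-up documents and extractor for illustration only.
$docs = [
    ['id' => 1, 'name' => 'report.pdf'],
    ['id' => 2, 'name' => 'PAD-0001.pdf'],
];
$extract = function (array $doc): string {
    if (strncmp($doc['name'], 'PAD', 3) === 0) {
        throw new \Exception('cannot extract the <body>');
    }
    return 'text of ' . $doc['name'];
};
$report = index_all($docs, $extract);
```

Collecting the skipped entries in one array covers both options: the messages can be printed as each failure happens or all together once indexing has finished.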
wpsolr
Keymaster
5 years ago #11188
Do those PDFs contain only an image and no text?
wpsolr
Keymaster
5 years ago #11190
Could you send me a link to one of the PDFs so I can run an indexing test?
wpsolr
Keymaster
5 years ago #11194
Thanks. I got it.
wpsolr
Keymaster
5 years ago #11195
Here is a quick fix to help you complete your indexing.
In the file /wp-content/plugins/wpsolr-pro/wpsolr/core/classes/engines/class-wpsolr-abstractindexclient.php, comment out this line:
throw new \Exception( 'Wrong format returned for the extracted file, cannot extract the <body>.' );
If that works, the fix will be part of the next release.
estevaoacioli
Participant
5 years ago #11197
OK, the plugin is working now.
Sorry for the bad English 😉
Thank you very much.
