Problem indexing PDFs that are scanned documents
3 years, 12 months ago #11183
I have a library of about 9 thousand PDFs, and almost half of them are scanned documents. Whenever I try to index my PDFs, I come across the following error:
Wrong format returned for the extracted file, can not extract the <body>.
Is there any way to fix this? Or at least to configure the plugin so that when it finds a PDF that cannot be indexed, it skips it and moves on to the next one?
Editing the PDFs one by one and checking the box to exclude each from indexing is not something I can do, because there are thousands of them. If there were some way to do it in bulk, I know exactly which PDFs need to be ignored, because all of their names share the same prefix.
3 years, 11 months ago #11186
There are 1699 files. In the image, I searched for “PAD”, which is the prefix common to all of them.
3 years, 11 months ago #11187
I believe the correct behavior for the plugin in these cases would be this: whenever it finds a PDF that cannot be indexed, it skips to the next one and displays an alert message giving the name and ID of the PDF and the reason it was skipped during indexing. Another option would be to display the alert messages at the end of the indexing process.
3 years, 11 months ago #11195
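The skip-and-report behavior requested above could look roughly like the following. This is only a minimal sketch, not WPSOLR's actual code: the function `index_with_skips`, the `$extract` callable, the sample IDs and the sample file names are all hypothetical and made up for illustration. It catches the extraction exception per file, records the ID, name and reason, and carries on with the next file.

```php
<?php
// Hypothetical sketch: none of these names come from WPSOLR.

/**
 * Try to process every document; skip those whose text cannot be extracted.
 *
 * @param array    $documents Map of attachment ID => file name.
 * @param callable $extract   Hypothetical extractor: takes a file name,
 *                            returns the text or throws an \Exception.
 * @return array Two-element array: [ indexed IDs, skipped entries ].
 */
function index_with_skips( array $documents, callable $extract ): array {
    $indexed = [];
    $skipped = [];
    foreach ( $documents as $id => $name ) {
        try {
            $extract( $name ); // A real indexer would send the text to the search engine here.
            $indexed[] = $id;
        } catch ( \Exception $e ) {
            // Record the failure and continue with the next file.
            $skipped[] = [ 'id' => $id, 'name' => $name, 'reason' => $e->getMessage() ];
        }
    }
    return [ $indexed, $skipped ];
}

// Stand-in extractor for the demo: pretends every file prefixed "PAD" is a scan.
$fake_extract = function ( string $name ): string {
    if ( strpos( $name, 'PAD' ) === 0 ) {
        throw new \Exception( 'Wrong format returned for the extracted file, cannot extract the <body>.' );
    }
    return 'text of ' . $name;
};

[ $indexed, $skipped ] = index_with_skips(
    [ 101 => 'report.pdf', 102 => 'PAD-0001.pdf', 103 => 'manual.pdf' ],
    $fake_extract
);

// Report all skipped files once, at the end of the run.
foreach ( $skipped as $s ) {
    echo "Skipped #{$s['id']} ({$s['name']}): {$s['reason']}\n";
}
```

Collecting the failures in an array rather than printing them immediately covers both options mentioned above: the messages can be shown as each file is skipped or all together when indexing finishes.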
Here is a quick fix to help you complete your indexing.
In the file /wp-content/plugins/wpsolr-pro/wpsolr/core/classes/engines/class-wpsolr-abstractindexclient.php, comment out this line:
throw new \Exception( 'Wrong format returned for the extracted file, cannot extract the <body>.' );
If that works for you, the fix will be part of the next release.
3 years, 11 months ago #11197
OK, the plugin is working now.
Sorry for the bad English 😉
Thank you very much