Problem indexing PDFs that are scanned documents
I have a library of about 9,000 PDFs, and almost half of them are scanned documents. Whenever I try to index my PDFs, I run into the following error:
Wrong format returned for the extracted file, can not extract the <body>.
Is there a way to fix this? Or at least to configure the plugin so that when it finds a PDF that cannot be indexed, it skips it and moves on to the next one?
Editing the PDFs one by one and checking the "do not index" box is not feasible, because there are thousands of them. If there were a way to do it in bulk, I know exactly which PDFs need to be ignored, since all of their names share the same prefix.
I believe the correct behavior for the plugin in these cases would be this: whenever it finds a PDF that cannot be indexed, it jumps to the next one and displays an alert message giving the name and ID of the PDF, as well as the reason it was skipped during indexing. Another option would be to display all the alert messages at the end of the indexing process.
Here is a quick fix to help you complete your indexing.
In the file /wp-content/plugins/wpsolr-pro/wpsolr/core/classes/engines/class-wpsolr-abstractindexclient.php, comment out the line:
throw new \Exception( 'Wrong format returned for the extracted file, cannot extract the <body>.' );
If that works for you, the fix will be part of the next release.
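For the skip-and-report behavior requested above, here is a minimal sketch of the general pattern in plain PHP. It is not WPSOLR's actual code: the functions `extract_body_or_fail()` and `index_documents()`, the array layout, and the sample data are all hypothetical, and it assumes a scanned PDF yields extracted output with no `<body>` element. Only the exception message is taken from the plugin.

```php
<?php
// Hypothetical stand-in for the plugin's extraction step (not WPSOLR's real API).
// It throws the same message as the plugin when no <body> can be found.
function extract_body_or_fail( string $extracted ): string {
    if ( ! preg_match( '/<body[^>]*>(.*?)<\/body>/is', $extracted, $matches ) ) {
        throw new \Exception( 'Wrong format returned for the extracted file, cannot extract the <body>.' );
    }
    return trim( $matches[1] );
}

// Process every document; on failure, record the ID, name and reason,
// then continue with the next document instead of aborting the whole run.
function index_documents( array $documents ): array {
    $indexed = [];
    $skipped = [];
    foreach ( $documents as $id => $document ) {
        try {
            $indexed[ $id ] = extract_body_or_fail( $document['extracted'] );
        } catch ( \Exception $e ) {
            $skipped[] = [
                'id'     => $id,
                'name'   => $document['name'],
                'reason' => $e->getMessage(),
            ];
        }
    }
    return [ 'indexed' => $indexed, 'skipped' => $skipped ];
}

// Made-up sample data: one text PDF and one scanned PDF with empty output (assumption).
$documents = [
    101 => [ 'name' => 'report.pdf', 'extracted' => '<html><body>Annual report</body></html>' ],
    102 => [ 'name' => 'scan-0001.pdf', 'extracted' => '' ],
];

$result = index_documents( $documents );

// Report all skipped files at the end of the run.
foreach ( $result['skipped'] as $entry ) {
    printf( "Skipped %s (ID %d): %s\n", $entry['name'], $entry['id'], $entry['reason'] );
}
```

Catching the exception per document, rather than commenting out the `throw`, keeps the reason for each failure available so it can be shown either immediately or in a summary at the end of indexing.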