Inquiry: Solr Search indexing speed for large pdfs ~10-60mb and in bulk quantity

  • smeett
    Participant
    1 year, 5 months ago #31502

    Hi, I have been exploring Solr search to be used on a custom wordpress setup hosted on AWS. Our website project has ~4000 pdfs which are of size anywhere between 10-60mb. If we wanted to OCR index all of them, how long does it take with this plugin? I am happy to update php memory limits or timeout settings to expedite the process but want to ensure there is support for the aforementioned size/quantity. Any specific search API to be utilised/recommended if not Solr search for this use case: https://www.wpsolr.com/pricing/

    wpsolr
    Keymaster
    1 year, 5 months ago #31507

    Our AI extraction add-on can send your documents to Google Vision before indexing the extracted content.
    This API can send documents in batch (as opposed to the Amazon Rekognition API).

    wpsolr
    Keymaster
    1 year, 5 months ago #31511

    (I assumed your PDFs contain an image rather than text, hence the suggestion of using an OCR service to convert the images to text. If not, you need nothing special to extract texts from your PDFs)

    wpsolr
    Keymaster
    1 year, 5 months ago #31512

    Concerning speed, I really do not know. You will have to try to find out I’m afraid.

    smeett
    Participant
    1 year, 5 months ago #31513

    Nope, text-heavy pdfs. These are primarily handbooks, handouts or other legal documents. There are images within the pdfs but they’re not required to be indexed or simply need to be ignored. Is there a short trial where I can experiment/install the plugin to identify the time required/taken?

    wpsolr
    Keymaster
    1 year, 5 months ago #31514

    Trial is on our pricing page

Viewing 6 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic.