meta data shown in search results

  • Yann
    Participant
    2 years, 1 month ago #29159

    Hi,
    since I upgraded, the excerpts of indexed results show metadata garbage:

    L’Avent – l’amour
    Modified”,, “access_permission:can_modify”,, “pdf:docinfo:created”,]} « Croyons au nom de son Fils Jésus-Christ et aimons-nous les uns les autres, comme il nous l’a ordonné

    What can we do to keep only the document text as it was before?

    Thanks.

    • This topic was modified 2 years, 1 month ago by wpsolr.
    wpsolr
    Keymaster
    2 years, 1 month ago #29160
    wpsolr
    Keymaster
    2 years, 1 month ago #29163

    In the solrconfig.xml that WPSOLR installs with each new index, metadata fields are marked to be ignored:

    <requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="fmap.meta">ignored_</str>
        <str name="fmap.content">_text_</str>
      </lst>
    </requestHandler>

    • This reply was modified 2 years, 1 month ago by wpsolr.
    Yann
    Participant
    2 years, 1 month ago #29167

    Hi,
    I do not understand what the solution is. It used to work properly.

    wpsolr
    Keymaster
    2 years, 1 month ago #29170

    PDF is one of the many document formats from which the Tika library, embedded in Solr, extracts text.

    The way this is handled is shown in my previous answer: the ignored_ value in the solrconfig.xml that WPSOLR installs on your index is Solr’s way of asking that no metadata be extracted from the PDF.

    This is the default setup, but you can tweak it if needed, as discussed in https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content
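
    Before tweaking anything, it can help to see what the extract handler returns for the affected PDF. The sketch below only builds an "extract only" URL for the /update/extract handler shown earlier. The host, port, and index name are hypothetical placeholders, and build_extract_only_url is an illustrative helper, not part of WPSOLR or Solr. extractOnly and extractFormat are standard Solr Cell request parameters.

```python
from urllib.parse import urlencode


def build_extract_only_url(solr_base_url: str, index_name: str) -> str:
    """Build a Solr Cell URL that extracts a document without indexing it.

    Both arguments are assumptions to adapt to your own installation.
    """
    params = {
        "extractOnly": "true",    # return the extraction result, index nothing
        "extractFormat": "text",  # plain text instead of the default XHTML
        "wt": "json",             # JSON response writer
    }
    base = solr_base_url.rstrip("/")
    return f"{base}/{index_name}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr and index name.
    print(build_extract_only_url("http://localhost:8983/solr", "my_index"))
```

    Posting the PDF to that URL as a multipart file upload (for example with curl's -F option) should return the extracted text and the metadata as separate entries. That shows whether the metadata is already mixed into the text on the Solr side, or only later when the excerpt is built.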
