meta data shown in search results
- Yann (Participant), 1 year, 9 months ago #29159
Since I upgraded, the excerpts of indexed results show metadata garbage:
L’Avent – l’amour
Modified”,, “access_permission:can_modify”,, “pdf:docinfo:created”,]} « Croyons au nom de son Fils Jésus-Christ et aimons-nous les uns les autres, comme il nous l’a ordonné
What can we do to keep only the document text as it was before?
1 year, 9 months ago #29160
- This topic was modified 1 year, 9 months ago by wpsolr.
Unfortunately, this is standard Solr Tika behavior.
For instance: https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content
1 year, 9 months ago #29163
In the solrconfig.xml installed by WPSOLR with each new index, metadata fields are marked to be ignored:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>
1 year, 9 months ago #29170
PDF is one of the many document formats that the Tika library extracts text from in Solr.
How this is configured is shown in my previous answer: the parameter fmap.meta, set to ignored_ in the solrconfig.xml installed by WPSOLR on your index, is Solr’s way of asking that no metadata be extracted from the PDF.
This is the default setup, but you can tweak it if needed, as discussed in https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content
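A minimal sketch of one such tweak, assuming the commonly documented Solr Cell pattern: prefix every extracted field that the schema does not know with uprefix, then discard those fields through an ignored_* dynamic field. The uprefix line and the two schema entries are assumptions, not part of the configuration WPSOLR ships. This is untested against a WPSOLR index, and whether it cleans up the excerpt may depend on your Solr and Tika versions.

```xml
<!-- solrconfig.xml: hypothetical variant of the handler shown earlier in this thread -->
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <!-- Assumption: prefix any extracted field not defined in the schema with ignored_ -->
    <str name="uprefix">ignored_</str>
    <!-- Send the extracted body text to the _text_ field, as in the default setup -->
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>

<!-- Schema file (separate from solrconfig.xml): a field type that neither indexes nor stores,
     and a dynamic field that routes every ignored_* field to it -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```

After changing either file, the core has to be reloaded and the PDFs re-indexed before the excerpts can change.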