meta data shown in search results
- Yann (Participant), 3 years, 2 months ago #29159
Hi,
Since I upgraded, the excerpts of indexed results show metadata garbage:
L’Avent – l’amour
Modified”,, “access_permission:can_modify”,, “pdf:docinfo:created”,]} « Croyons au nom de son Fils Jésus-Christ et aimons-nous les uns les autres, comme il nous l’a ordonné
What can we do to keep only the document text, as it was before?
Thanks.
- This topic was modified 3 years, 2 months ago by wpsolr.
- wpsolr (Keymaster), 3 years, 2 months ago #29160
This is a standard Solr Tika feature, unfortunately.
For instance: https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content
- wpsolr (Keymaster), 3 years, 2 months ago #29163
In the solrconfig.xml installed by WPSOLR with each new index, metadata is marked to be ignored:
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>
- wpsolr (Keymaster), 3 years, 2 months ago #29170
PDF is one of the many formats from which the Tika library extracts text in Solr.
The way it is done is shown in my previous answer: the fmap.meta parameter set to ignored_ in the solrconfig.xml installed by WPSOLR on your index is Solr’s way of asking that no metadata be extracted from the PDF.
This is the default setup, but you can tweak it if needed, as discussed in https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content
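As a minimal sketch of the kind of tweak meant here, and not WPSOLR's shipped configuration: Solr Cell also documents a uprefix parameter, which prefixes every extracted field that is not defined in the schema. Combined with an ignored_* dynamic field that is neither indexed nor stored, this routes unknown Tika metadata fields away from the searchable text. The ignored field type and the dynamic field below are assumptions about the schema, so check that your index defines them; whether this cleans up the excerpts also depends on the Solr version.

```xml
<!-- solrconfig.xml: sketch only; uprefix is an addition to the defaults shown above -->
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <!-- send any extracted field unknown to the schema to ignored_<name> -->
    <str name="uprefix">ignored_</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">_text_</str>
  </lst>
</requestHandler>

<!-- schema (managed-schema or schema.xml): assumed definitions;
     fields matching ignored_* are neither indexed nor stored -->
<fieldType name="ignored" class="solr.StrField" indexed="false" stored="false" multiValued="true"/>
<dynamicField name="ignored_*" type="ignored"/>
```

After a change like this, the core has to be reloaded and the affected documents re-indexed before the excerpts change.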