Keywords in text followed by a colon not found with SOLR
webmasterbslnl (Participant), 2 years, 3 months ago, #29736
Hi,
We have posts whose titles contain words followed by a colon. Like this:
post_title = ‘Wetenschap: Ouderen na verpleeghuisopname even gelukkig als daarvoor’
Searching for ‘Wetenschap’ returns fewer results than searching for ‘Wetenschap:’. Compare these:
https://www.nursing.nl/?s=wetenschap%3A+eiwitrijke+operatie (1 result)
https://www.nursing.nl/?s=wetenschap+eiwitrijke+operatie (no result)
If I perform the query without the colon on my SOLR index directly, I get 1 result:
https://mysolrserver.com:8983/solr/#/prd-nursing-nl/query?q=wetenschap%20eiwitrijke%20operatie&q.op=AND&indent=true
What can I change in WPSolr or SOLR to have it ignore trailing colons?
Thanks for your advice,
Harmen

wpsolr (Keymaster), 2 years, 3 months ago, #29738
Can you activate the Query Monitor extension, so you can check WPSOLR’s Solr query?

wpsolr (Keymaster), 2 years, 3 months ago, #29739
As a fix, you could remove “:” at indexing time by adding a charfilter to the type “text_lws” in your Solr schema.xml file, as described at https://solr.apache.org/guide/solr/latest/indexing-guide/charfilterfactories.html
<field name="title" type="text_lws" indexed="true" stored="true"/>
<fieldType name="text_lws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    YOUR FILTER HERE
  </analyzer>
</fieldType>
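One way the placeholder above could be filled in is sketched below, assuming Solr's stock solr.PatternReplaceCharFilterFactory is used to strip colons. The pattern and the empty replacement are illustrative assumptions, not something confirmed in this thread. Note that a charFilter element is declared before the tokenizer, because char filters run on the raw text before it is split into tokens.

```xml
<fieldType name="text_lws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Assumption: remove every ":" from the text before tokenizing.
         A narrower regex could be used to target trailing colons only. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern=":" replacement=""/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because this analyzer has no type attribute, it applies at both index and query time, so “wetenschap:” and “wetenschap” should end up as the same token. Existing documents need to be reindexed before a change to index-time analysis takes effect.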

webmasterbslnl (Participant), 2 years, 3 months ago, #29795
Hi, thanks for your suggestions.
I tried the first one, but I can’t activate the Query Monitor add-on because the plugin thinks it is unlicensed (it shows “Save Options / (Feature-limited version, click to activate)” instead of a save button). Other add-ons work fine. Strange.
I was able to log the query in a WordPress debug.log file. This is a snippet of what it said when searching with a trailing colon:
Index: "/solr/acc-nursing-nl"
Nb results shown: 1
Total nb results: 1
Speed: 497 ms
Query: { "options": { "handler": "select", "resultclass": "Solarium\\QueryType\\Select\\Result\\Result", "documentclass": "Solarium\\QueryType\\Select\\Result\\Document", "query": "(wetenschap\\:\\ eiwitrijke\\ operatie)", "start": 0, "rows": 10, "fields": "*,score", "omitheader": true, "querydefaultoperator": "AND" },
And this is searching without the colon:
Index: "/solr/acc-nursing-nl"
Nb results shown: 1
Total nb results: 1
Speed: 497 ms
Query: { "options": { "handler": "select", "resultclass": "Solarium\\QueryType\\Select\\Result\\Result", "documentclass": "Solarium\\QueryType\\Select\\Result\\Document", "query": "(wetenschap\\:\\ eiwitrijke\\ operatie)", "start": 0, "rows": 10, "fields": "*,score", "omitheader": true, "querydefaultoperator": "AND" },
I have not added the full debug log here. It does not list the calls being made to the SOLR backend but I got those from SOLR logs:
2022-05-30 20:29:53.344 INFO (qtp1394940518-24) [ x:prd-nursing-nl] o.a.s.c.S.Request [prd-nursing-nl] webapp=/solr path=/select params={hl=true&fl=id,PID,type,meta_type_s,title,numcomments,comments,displaydate,displaymodified,*categories_str,author,*post_thumbnail_href_str,*post_href_str,snippet_s&q.op=AND&fq={!tag%3Dfct_excl_type}type:("post"+OR+"blog"+OR+"magazine"+OR+"magazine-article"+OR+"video")&fq=-post_status_s:("draft"+OR+"pending"+OR+"trash"+OR+"future"+OR+"private"+OR+"auto-draft")&fq=(*:*+-is_excluded_s:[*+TO+*])+OR+is_excluded_s:(n)&f.content.hl.simple.pre=<b>&defType=edismax&qf=content+title^2.5+categories^2+tags+introduction_str^2.5&hl.fl=title,content,comments&wt=json&f.content.hl.fragsize=400&facet.field={!key%3Dtheme_id_str}theme_id_str&facet.field={!key%3Dtags}tags&facet.field={!key%3Dtype}type&facet.field={!key%3Dcategories_str}categories_str&json.nl=flat&start=0&f.content.hl.simple.post=</b>&f.title.hl.fragsize=400&sort=date+desc&f.comments.hl.simple.pre=<b>&rows=10&f.title.hl.simple.pre=<b>&f.title.hl.simple.post=</b>&q=(wetenschap\:\+eiwitrijke\+operatie)&facet.limit=10&omitHeader=true&f.comments.hl.fragsize=400&f.comments.hl.simple.post=</b>&facet.mincount=1&facet=true} hits=1 status=0 QTime=1
and
2022-05-30 20:28:46.728 INFO (qtp1394940518-22) [ x:prd-nursing-nl] o.a.s.c.S.Request [prd-nursing-nl] webapp=/solr path=/select params={hl=true&fl=id,PID,type,meta_type_s,title,numcomments,comments,displaydate,displaymodified,*categories_str,author,*post_thumbnail_href_str,*post_href_str,snippet_s&q.op=AND&fq={!tag%3Dfct_excl_type}type:("post"+OR+"blog"+OR+"magazine"+OR+"magazine-article"+OR+"video")&fq=-post_status_s:("draft"+OR+"pending"+OR+"trash"+OR+"future"+OR+"private"+OR+"auto-draft")&fq=(*:*+-is_excluded_s:[*+TO+*])+OR+is_excluded_s:(n)&f.content.hl.simple.pre=<b>&defType=edismax&qf=content+title^2.5+categories^2+tags+introduction_str^2.5&hl.fl=title,content,comments&wt=json&f.content.hl.fragsize=400&facet.field={!key%3Dtheme_id_str}theme_id_str&facet.field={!key%3Dtags}tags&facet.field={!key%3Dtype}type&facet.field={!key%3Dcategories_str}categories_str&json.nl=flat&start=0&f.content.hl.simple.post=</b>&f.title.hl.fragsize=400&sort=date+desc&f.comments.hl.simple.pre=<b>&rows=10&f.title.hl.simple.pre=<b>&f.title.hl.simple.post=</b>&q=(wetenschap\+eiwitrijke\+operatie)&facet.limit=10&omitHeader=true&f.comments.hl.fragsize=400&f.comments.hl.simple.post=</b>&facet.mincount=1&facet=true} hits=0 status=0 QTime=0
Do you still think removing trailing colons at indexing time is the solution?
Thanks for your help,
Harmen

wpsolr (Keymaster), 2 years, 3 months ago, #29796
Your search is boosting the “title” field, which uses the field type “text_lws”, a light analyser which does not remove stop words.
You could change the type of title from “text_lws”:
<field name="title" type="text_lws" indexed="true" stored="true"/>
to “text”:
<field name="title" type="text" indexed="true" stored="true"/>
Or create a new type and apply it to “title”.
Or you can stop using the boosts.

webmasterbslnl (Participant), 2 years, 3 months ago, #29800
Thank you, your suggestion to change from the lightweight analyser to the standard one solved it!
Kind regards, Harmen

wpsolr (Keymaster), 2 years, 3 months ago, #29802
You may also have noticed that the default analyser is English.
You could update it to Dutch to improve search results:
https://solr.apache.org/guide/solr/latest/indexing-guide/language-analysis.html#dutch

webmasterbslnl (Participant), 2 years, 3 months ago, #29803
We use this in our config:
<!-- Default Dutch analyser -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Kp"></filter>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_nl.txt" />
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt" ignoreCase="false"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Kp"></filter>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_nl.txt" />
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt" ignoreCase="false"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
  </analyzer>
</fieldType>
“Kp” is an alternative Dutch stemmer that seems to work better than the default Dutch one, although that is hard to tell without thorough testing: https://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html
I found more information on Dutch text in SOLR here: https://dropsolid.io/knowledge-hub/solr-search-and-multilingual-content-drupal
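
For anyone who wants to compare the two stemmers side by side, a minimal sketch of a second field type using the stock Snowball Dutch stemmer might look like this. The type name “text_nl_snowball” is hypothetical, and the stop word file name is simply reused from the config above; neither is a recommendation from this thread.

```xml
<!-- Hypothetical comparison type: stock Snowball "Dutch" stemmer instead of "Kp" -->
<fieldType name="text_nl_snowball" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumption: same stop word file as in the "text" type above -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_nl.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
  </analyzer>
</fieldType>
```

Both types can then be tried against the same sample text on the Analysis screen of the Solr admin UI to see which tokens each stemmer produces.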