keywords in text followed by a colon not found with SOLR

  • webmasterbslnl
    Participant
    2 years, 3 months ago #29736

    Hi,
    We have texts with post name titles that contain words followed by a colon. Like this:
    post_title = ‘Wetenschap: Ouderen na verpleeghuisopname even gelukkig als daarvoor’
    Searching for ‘Wetenschap’ returns fewer results than searching for ‘Wetenschap:’. Compare these:
    https://www.nursing.nl/?s=wetenschap%3A+eiwitrijke+operatie (1 result)
    https://www.nursing.nl/?s=wetenschap+eiwitrijke+operatie (no result)
    If I run the query without the colon directly on my SOLR index, I get 1 result:
    https://mysolrserver.com:8983/solr/#/prd-nursing-nl/query?q=wetenschap%20eiwitrijke%20operatie&q.op=AND&indent=true
    What can I change in WPSolr or Solr to have it ignore trailing colons?
    Thanks for your advice,
    Harmen

    wpsolr
    Keymaster
    2 years, 3 months ago #29738

    Can you activate the Query Monitor extension, so you can check the WPSOLR’s Solr query?

    wpsolr
    Keymaster
    2 years, 3 months ago #29739

    As a fix, you could remove “:” at indexing time, by adding a charfilter on type “text_lws” in your Solr schema.xml file as described at https://solr.apache.org/guide/solr/latest/indexing-guide/charfilterfactories.html

    <field name="title" type="text_lws" indexed="true" stored="true"/>

    <fieldType name="text_lws" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <!-- YOUR CHAR FILTER HERE (char filters run before the tokenizer) -->
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
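
    For illustration, here is a minimal sketch of such a char filter, assuming the goal is to drop every colon (not only trailing ones, so "12:30" would become "1230"). The pattern and replacement values are an assumption, not something taken from your schema. Since the type has a single analyzer, the char filter applies at both index and query time, and existing documents need a reindex before the change shows up:

    ```xml
    <fieldType name="text_lws" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <!-- Assumed pattern: strips every ":" from the raw text before tokenizing -->
            <charFilter class="solr.PatternReplaceCharFilterFactory" pattern=":" replacement=""/>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
    ```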

    webmasterbslnl
    Participant
    2 years, 3 months ago #29795

    Hi, Thanks for your suggestions.

    I tried the first suggestion, but I can’t activate the Query Monitor add-on because the plugin thinks it is unlicensed (it shows “Save Options / (Feature-limited version, click to activate)” instead of a save button). Other add-ons work fine. Strange.
    I was able to log the query in a WordPress debug.log file. This is a snippet of what it said when searching with a trailing colon:

    Index: "/solr/acc-nursing-nl"
    Nb results shown: 1
    Total nb results: 1
    Speed: 497 ms 
    Query: {
        "options": {
            "handler": "select",
            "resultclass": "Solarium\\QueryType\\Select\\Result\\Result",
            "documentclass": "Solarium\\QueryType\\Select\\Result\\Document",
            "query": "(wetenschap\\:\\ eiwitrijke\\ operatie)",
            "start": 0,
            "rows": 10,
            "fields": "*,score",
            "omitheader": true,
            "querydefaultoperator": "AND"
        },

    And this is searching without the colon:

    Index: "/solr/acc-nursing-nl"
    Nb results shown: 1
    Total nb results: 1
    Speed: 497 ms 
    Query: {
        "options": {
            "handler": "select",
            "resultclass": "Solarium\\QueryType\\Select\\Result\\Result",
            "documentclass": "Solarium\\QueryType\\Select\\Result\\Document",
            "query": "(wetenschap\\:\\ eiwitrijke\\ operatie)",
            "start": 0,
            "rows": 10,
            "fields": "*,score",
            "omitheader": true,
            "querydefaultoperator": "AND"
        },

    I have not added the full debug log here. It does not list the calls being made to the SOLR backend but I got those from SOLR logs:

    2022-05-30 20:29:53.344 INFO (qtp1394940518-24) [ x:prd-nursing-nl] o.a.s.c.S.Request [prd-nursing-nl] webapp=/solr path=/select params={hl=true&fl=id,PID,type,meta_type_s,title,numcomments,comments,displaydate,displaymodified,*categories_str,author,*post_thumbnail_href_str,*post_href_str,snippet_s&q.op=AND&fq={!tag%3Dfct_excl_type}type:("post"+OR+"blog"+OR+"magazine"+OR+"magazine-article"+OR+"video")&fq=-post_status_s:("draft"+OR+"pending"+OR+"trash"+OR+"future"+OR+"private"+OR+"auto-draft")&fq=(*:*+-is_excluded_s:[*+TO+*])+OR+is_excluded_s:(n)&f.content.hl.simple.pre=<b>&defType=edismax&qf=content+title^2.5+categories^2+tags+introduction_str^2.5&hl.fl=title,content,comments&wt=json&f.content.hl.fragsize=400&facet.field={!key%3Dtheme_id_str}theme_id_str&facet.field={!key%3Dtags}tags&facet.field={!key%3Dtype}type&facet.field={!key%3Dcategories_str}categories_str&json.nl=flat&start=0&f.content.hl.simple.post=</b>&f.title.hl.fragsize=400&sort=date+desc&f.comments.hl.simple.pre=<b>&rows=10&f.title.hl.simple.pre=<b>&f.title.hl.simple.post=</b>&q=(wetenschap\:\+eiwitrijke\+operatie)&facet.limit=10&omitHeader=true&f.comments.hl.fragsize=400&f.comments.hl.simple.post=</b>&facet.mincount=1&facet=true} hits=1 status=0 QTime=1

    and
    2022-05-30 20:28:46.728 INFO (qtp1394940518-22) [ x:prd-nursing-nl] o.a.s.c.S.Request [prd-nursing-nl] webapp=/solr path=/select params={hl=true&fl=id,PID,type,meta_type_s,title,numcomments,comments,displaydate,displaymodified,*categories_str,author,*post_thumbnail_href_str,*post_href_str,snippet_s&q.op=AND&fq={!tag%3Dfct_excl_type}type:("post"+OR+"blog"+OR+"magazine"+OR+"magazine-article"+OR+"video")&fq=-post_status_s:("draft"+OR+"pending"+OR+"trash"+OR+"future"+OR+"private"+OR+"auto-draft")&fq=(*:*+-is_excluded_s:[*+TO+*])+OR+is_excluded_s:(n)&f.content.hl.simple.pre=<b>&defType=edismax&qf=content+title^2.5+categories^2+tags+introduction_str^2.5&hl.fl=title,content,comments&wt=json&f.content.hl.fragsize=400&facet.field={!key%3Dtheme_id_str}theme_id_str&facet.field={!key%3Dtags}tags&facet.field={!key%3Dtype}type&facet.field={!key%3Dcategories_str}categories_str&json.nl=flat&start=0&f.content.hl.simple.post=</b>&f.title.hl.fragsize=400&sort=date+desc&f.comments.hl.simple.pre=<b>&rows=10&f.title.hl.simple.pre=<b>&f.title.hl.simple.post=</b>&q=(wetenschap\+eiwitrijke\+operatie)&facet.limit=10&omitHeader=true&f.comments.hl.fragsize=400&f.comments.hl.simple.post=</b>&facet.mincount=1&facet=true} hits=0 status=0 QTime=0

    Do you still think removing trailing colons at indexing time is the solution?
    Thanks for your help,
    Harmen

    wpsolr
    Keymaster
    2 years, 3 months ago #29796

    Your search is boosting the “title” field, which uses the field type “text_lws”, a light analyser which does not remove stop words.
    You could change the type of title from “text_lws”:
    <field name="title" type="text_lws" indexed="true" stored="true"/>
    to “text”:
    <field name="title" type="text" indexed="true" stored="true"/>
    Or create a new type and apply it to “title”.
    Or you can also not use the boosts.
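
    A rough sketch of the “new type” option, in case you prefer not to switch to the full “text” type. The name “text_title” is made up for this example. StandardTokenizerFactory splits on punctuation, so “Wetenschap:” is indexed as “wetenschap”, whereas WhitespaceTokenizerFactory keeps the colon attached to the token. A reindex is needed after changing the type:

    ```xml
    <!-- Hypothetical type name; light analysis, but punctuation is dropped by the tokenizer -->
    <fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>

    <field name="title" type="text_title" indexed="true" stored="true"/>
    ```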

    wpsolr
    Keymaster
    2 years, 3 months ago #29799

    Option “Use partial keyword matches in results” on screen 2.1 is also a possibility.

    webmasterbslnl
    Participant
    2 years, 3 months ago #29800

    Thank you, your suggestion on changing from the lightweight analyser to the standard one solved it!
    Kind regards, Harmen

    wpsolr
    Keymaster
    2 years, 3 months ago #29802

    You also may have noticed that the default analyser is English.
    You could update it to Dutch to improve search results:
    https://solr.apache.org/guide/solr/latest/indexing-guide/language-analysis.html#dutch
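
    As a starting point, a Dutch type could look roughly like the sketch below. It is modeled on the “text_nl” type shipped in Solr’s sample configsets; the type name and the file paths are assumptions and must match files that exist in your own config directory:

    ```xml
    <fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <!-- Assumed path; the file must be in Snowball stop word format -->
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball"/>
            <!-- Placed before the stemmer so that overridden terms are protected from stemming -->
            <filter class="solr.StemmerOverrideFilterFactory" dictionary="lang/stemdict_nl.txt" ignoreCase="false"/>
            <filter class="solr.SnowballPorterFilterFactory" language="Dutch"/>
        </analyzer>
    </fieldType>
    ```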

    webmasterbslnl
    Participant
    2 years, 3 months ago #29803

    We use this in our config:

            <!-- Default Dutch analyser -->
            <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
               <analyzer type="index">
                   <tokenizer class="solr.StandardTokenizerFactory"/>
                   <filter class="solr.LowerCaseFilterFactory"/>
                   <filter class="solr.SnowballPorterFilterFactory" language="Kp"></filter>
                   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_nl.txt" />
                   <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt" ignoreCase="false"/>
                   <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
               </analyzer>
               <analyzer type="query">
                   <tokenizer class="solr.StandardTokenizerFactory"/>
                   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                   <filter class="solr.LowerCaseFilterFactory"/>
                   <filter class="solr.SnowballPorterFilterFactory" language="Kp"></filter>
                   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_nl.txt" />
                   <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict_nl.txt" ignoreCase="false"/>
                   <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
               </analyzer>
            </fieldType>

    “Kp” is an alternative Dutch stemmer (Kraaij-Pohlmann) that seems to work better than the default Dutch one, but that is hard to tell without thorough testing: https://snowball.tartarus.org/algorithms/kraaij_pohlmann/stemmer.html
    I found more info on Dutch texts in SOLR here: https://dropsolid.io/knowledge-hub/solr-search-and-multilingual-content-drupal
