Do we need new retrieval benchmarks to include all features that BM25 cannot tackle but vectors can?


What I love most about Vespa.ai's presentations, and what makes them unique, is their intensive use of benchmarks.

— Why is BM25 still relevant? —
Oddly, to me, BM25 still ranks quite well on all retrieval benchmarks, compared to all the other fancy, exotic, or hybrid vector architectures.

— Are benchmarks biased to BM25? —
Is that because the benchmarks are biased towards keyword (inverted index) search? Do we need new benchmarks?

— eCommerce —
Because in e-commerce, vector search is not just beating BM25, it is crushing it:
– Vector search never returns “no results”
– Vector search can expose the long tail of keywords
– Multilingual search is completely solved
– Synonyms are completely solved
– Typo correction is completely solved
– Contextual meaning is completely solved
– and so on …
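The “never returns no results” point follows from how nearest-neighbor retrieval works. It scores every item by similarity and keeps the top k, so it returns k items whenever the catalog holds at least k. Below is a minimal brute-force sketch. It assumes hand-made 3-dimensional toy vectors, which are hypothetical. A real setup would get them from an embedding model and use an approximate index such as HNSW instead of scoring every item.

```python
import math

# Hypothetical 3-d toy "embeddings": (footwear, running, bright color).
# A real system would compute these with an embedding model.
CATALOG = {
    "comfortable running shoes for marathon": [0.9, 0.9, 0.1],
    "bright yellow sneakers": [0.8, 0.3, 0.9],
    "red leather boots": [0.7, 0.0, 0.6],
    "stainless steel kettle": [0.0, 0.0, 0.1],
}


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn(query_vec: list[float], catalog: dict[str, list[float]], k: int = 3) -> list[str]:
    """Exact (brute-force) top-k retrieval: score every item, keep the k most similar."""
    ranked = sorted(catalog, key=lambda title: cosine(query_vec, catalog[title]), reverse=True)
    return ranked[:k]


# Hypothetical vector for the query "sensitive feet with bright color".
query_vec = [0.8, 0.5, 0.8]
print(knn(query_vec, CATALOG, k=3))  # always 3 titles, ranked by similarity
```

With k=3 the function returns three titles for any non-zero query vector. The flip side is that something is always returned, even when the closest item is only loosely related. With k=4 the kettle shows up too.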

Imagine adding one or more of those features to the current benchmarks: BM25 would instantly score ZERO!!
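BM25 would score zero because its score is a sum over only the query terms that actually occur in a document. A query sharing no token with a product (a synonym, a typo, another language) therefore gets 0 for every product. Below is a minimal Okapi BM25 sketch. It assumes a naive whitespace tokenizer with no stemming, the common defaults k1 = 1.2 and b = 0.75, and a Lucene-style IDF. The product titles and queries are made-up examples.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Naive tokenizer: lowercase + whitespace split. No stemming, synonyms or typo tolerance.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Okapi BM25 score of `query` against each document in `docs`."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if tf[term] == 0:
                continue  # A query term missing from the document adds nothing.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            saturation = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(doc) / avgdl))
            score += idf * saturation
        scores.append(score)
    return scores


# Made-up product titles for illustration.
docs = [
    "comfortable running shoes for marathon",
    "bright yellow sneakers",
    "red leather boots",
]

print(bm25_scores("running shoes", docs))         # only the first title scores > 0
print(bm25_scores("trainers", docs))              # synonym: [0.0, 0.0, 0.0]
print(bm25_scores("sneekers", docs))              # typo: [0.0, 0.0, 0.0]
print(bm25_scores("chaussures de course", docs))  # French: [0.0, 0.0, 0.0]
```

When every score is 0.0, BM25 has nothing to rank, which is exactly the “no results” case listed above.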

— Section added from comments 🙂 —
And those points are absolutely crucial for good e-commerce search.

Even more so if you replace the search bar (where we are used to typing simplified queries like “comfortable shoes”) with a search chatbot (“I have sensitive feet, I need to run a marathon tomorrow. What can you propose in stock for less than $50, with a flashy color? I forgot, I hate red.”).

Example with “sensitive feet with bright color”:
– Demo with a keyword search, yielding all bad results: https://lnkd.in/eEQThWEm
– Same demo with a vector search, yielding some good results: https://lnkd.in/e6tn7hgg

I cannot see in what respects these benchmarks are valid in the context of a “real” e-commerce site.

You can try all those points with WPSOLR’s WooCommerce Weaviate (soon Vespa.ai?) demos: https://lnkd.in/dzucnPtZ

