This is the first article of a series on “Artificial Intelligence search” or “How to get closer to Google search accuracy with Elasticsearch”.
Here we discuss generalities about search, Artificial Intelligence, and Artificial Intelligence applied to search.
The next chapters will add much more detail on each of these topics.
Lucene: lack of semantic search?
Lucene search, and therefore Elasticsearch and Solr, is based on syntax rather than semantics. This means that documents are scored on a similarity principle: the more the keywords look like the document content, the higher the score. (We will not mention term frequency here.)
Of course this can be improved with dedicated analysers. Analysers are plugins, written in Java, that modify documents and keywords to get specific results for specific use cases: applying lowercasing to match both “Cat” and “cat”, stemming (reducing words to a common root) to retrieve both “cats” and “cat”, synonyms to retrieve both “cat” and “feline”, and so on.
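To make this concrete, here is a toy sketch (not real Lucene code) of the kind of normalization an analyser chain performs, with a deliberately naive stemmer and a one-entry synonym map, both invented for the example:

```python
# Toy illustration of an analyser chain: lowercase filter, naive stemmer,
# synonym filter. Real analysers (e.g. Lucene's) are far more sophisticated.

SYNONYMS = {"feline": "cat"}  # hypothetical synonym map: canonicalize "feline"

def naive_stem(token: str) -> str:
    # Crude plural stripping, standing in for a real stemmer (e.g. Porter's).
    return token[:-1] if token.endswith("s") else token

def analyze(text: str) -> list[str]:
    tokens = text.lower().split()                # lowercasing + whitespace tokenizing
    tokens = [naive_stem(t) for t in tokens]     # stemming
    return [SYNONYMS.get(t, t) for t in tokens]  # synonym mapping

# "Cat", "cats" and "Feline" all index to the same term:
print(analyze("Cat"), analyze("cats"), analyze("Feline"))  # ['cat'] ['cat'] ['cat']
```

Note that every step here is purely textual: the chain has no idea what a cat is, which is exactly the limitation discussed next.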
But those analysers still work on syntax. They do not understand the context or meaning of words, or only in a very limited manner. They are tailor-made, sometimes over decades of collective work, for a restricted domain or language. They often contain mistakes that are obvious to a human eye, and they are not easily or frequently updated.
NLP: a better Lucene analyser?
An alternative is Natural Language Processing (NLP).
This is the science of human language for computers.
It has been developed for decades, but with limited practical results until the last few years.
All the theory was there, sometimes for 50 years, but the lack of computer power and public data stopped practical applications.
Until Deep Learning came out of almost nowhere.
LTR: rank better than Lucene?
Elasticsearch (Lucene), possibly equipped with NLP analysers, is good at retrieving data. But this is only the first stage of the search process.
The second stage is even more difficult: the ranking of results.
After finding, let’s say, 1,000 results in your documents, one has to sort them by order of relevance to show only the best 10 results.
If one wants to tweak the ranking for specific queries, things get difficult. You can use boosts and re-ranking functions to do so, but this is a never-ending process, as it has to be done query by query. And, as with piano tuning, fine-tuning one query can un-tune other queries. Also, this is a manual process.
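As an illustration, per-query manual tuning in Elasticsearch often looks like the following query, expressed here as a Python dict in query DSL form; the field names (“title”, “body”) and the boost value are hypothetical:

```python
# Hedged sketch of manual boosting in the Elasticsearch query DSL.
# "title" and "body" are hypothetical field names; "^3" boosts title matches.
query = {
    "query": {
        "multi_match": {
            "query": "cat food",
            "fields": ["title^3", "body"],  # title matches weigh 3x more
        }
    }
}
print(query["query"]["multi_match"]["fields"])  # ['title^3', 'body']
```

Every query that misbehaves needs its own variant of such tweaks, which is exactly the manual, query-by-query process described above.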
Is there a more systematic way of tuning the ranking? Something a bit more automatic, able to generalize the ranking of a set of query examples to a larger set of queries?
This is the purpose of Learning to Rank (LTR).
Learning to Rank is a class of algorithms, based on Machine Learning, with the ability to re-rank results.
First you provide query examples (called judgments): a query, a document, a numeric judgment of the quality of this query-document match, and some features.
Features are properties attached to the judgment, based on the query alone (length of the query, number of tokens), on the document alone (rating of the movie, how many times the product was bought), or on both (number of times the document was clicked/bought/wishlisted/visited after the query was performed).
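As a sketch, a single judgment could be represented like this; the feature names and values are made up for illustration:

```python
# Hypothetical judgment for an e-commerce search; all names/values are invented.
judgment = {
    "query": "wireless headphones",
    "doc_id": "prod-42",
    "grade": 3,  # relevance judgment, e.g. on a 0 (irrelevant) to 4 (perfect) scale
    "features": {
        "query_length": 2,      # query-only feature: number of tokens
        "product_rating": 4.5,  # document-only feature: average user rating
        "click_count": 118,     # query-document feature: clicks on this doc for this query
    },
}
print(sorted(judgment["features"]))  # ['click_count', 'product_rating', 'query_length']
```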
Features help the LTR algorithm build the ranking model: features are seen as numeric parameters that the Learning to Rank algorithm will optimize to produce the ranking model (you can think of the ranking model as a function with a set of parameters optimized against the examples).
Finally, the LTR model is used at query time to rerank results as expected.
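To give an intuition of what such a model does at query time, here is a minimal sketch: a linear model whose weights we pretend were produced by LTR training (the weights, features, and documents are all invented):

```python
# Minimal sketch of query-time re-ranking with a linear model.
# The weights stand in for what an LTR algorithm would have learned.
weights = {"product_rating": 0.6, "click_count": 0.01}  # hypothetical learned weights

def ltr_score(features: dict) -> float:
    # Weighted sum of feature values: the simplest possible ranking model.
    return sum(weights[name] * value for name, value in features.items())

candidates = [  # top results from the first (retrieval) stage
    ("doc_a", {"product_rating": 3.0, "click_count": 5}),
    ("doc_b", {"product_rating": 4.5, "click_count": 200}),
]
reranked = sorted(candidates, key=lambda d: ltr_score(d[1]), reverse=True)
print([doc_id for doc_id, _ in reranked])  # ['doc_b', 'doc_a']
```

Real LTR models are usually gradient-boosted trees or neural networks rather than a plain weighted sum, but the query-time role is the same: score each candidate’s feature vector, then re-sort.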
More details on Learning to Rank with Elasticsearch (but also Solr and Algolia) in the next chapters.
Deep Learning: what is it?
Deep learning (DL) is a branch of Machine Learning (ML).
It is often loosely called Artificial Intelligence (AI); technically, it is based on multi-layer neural networks.
You may have heard of it through Google DeepMind’s spectacular programs crushing Go players, then Chess players, then StarCraft players.
But it is also behind autonomous cars, face recognition, speech recognition, and assistants like Alexa or Siri.
Deep Learning vs Machine Learning
But Deep Learning is just a special implementation of ML. And ML is not new, far from it.
For instance, classification from training data has been an ML research subject for decades.
Many libraries, in Python or R, have been solving ML problems for many years. So, what’s new?
Well, classical ML’s goal is to find results from data and rules. But what happens when no clear rules can be written, as in image recognition of cats? Because a cat, any cat, cannot be abstracted into a collection of rules.
Deep Learning’s goal, by contrast, is to discover those “cat” rules from “cat” data and the expected results. With DL, if you have enough cat images and their “cat/not cat” labels, the program is able, in a certain way, to abstract the essence of a cat (the “cat” rules). No need to rely on an army of “cat experts” anymore.
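As a toy illustration of discovering the rules from data and results, here is a single artificial neuron (logistic regression, the one-layer ancestor of deep networks) trained by gradient descent; the two features and the examples are invented for the demonstration:

```python
import math

# Made-up examples: (has_whiskers, has_fur) -> is_cat. Only whiskers matter here,
# but we never tell the program that: it must discover the rule from the labels.
data = [((1.0, 1.0), 1), ((1.0, 0.0), 1), ((0.0, 1.0), 0), ((0.0, 0.0), 0)]

w = [0.0, 0.0]  # weights, to be learned
b = 0.0         # bias, to be learned
lr = 0.5        # learning rate

def predict(x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid: probability of "cat"

for _ in range(1000):  # gradient descent on the log-loss
    for x, y in data:
        err = predict(x) - y
        w[0] -= lr * err * x[0]
        w[1] -= lr * err * x[1]
        b -= lr * err

# The learned "rule" (whiskers => cat, fur alone => not cat) was never hand-written:
print(predict((1.0, 1.0)) > 0.5, predict((0.0, 1.0)) > 0.5)  # True False
```

A deep network stacks many layers of such neurons, which is what lets it abstract something as fuzzy as “the essence of a cat” from raw pixels instead of two hand-picked features.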
End of part I
In this article, we have briefly presented some key concepts on Artificial Intelligence, and current limitations of Lucene search:
- Syntax vs semantic search
Lucene works on syntax (words), not semantics (relations between words), to retrieve results. This is where Natural Language Processing can help.
- Query-independent vs query-dependent ranking
Lucene uses statistics to rank results the same way for all queries (query-independent). Learning to Rank can generalize a set of query-document examples to produce a tailored ranking of results for each query; for instance, ranking based on user clickstreams.
In the next article (part II), we will present a series of links to great Deep Learning tutorials.
See you soon.