This is the first article of a series on “Artificial Intelligence search” or “How to get closer to Google search accuracy with Elasticsearch”.
Lucene search, and therefore Elasticsearch and Solr, is based on syntax rather than semantics. Documents are scored on a similarity principle: the more the query keywords resemble the document content, the higher the score. (We will not go into term frequency here.)
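As a toy illustration of purely syntactic matching (this is not Lucene's actual scoring, which uses far more sophisticated formulas such as TF-IDF or BM25), a similarity score can be sketched as keyword overlap:

```python
def overlap_score(query, document):
    """Toy syntactic score: count how many query keywords appear in the document.
    No meaning is involved: only exact character-for-character matches count."""
    doc_terms = set(document.split())
    return sum(1 for term in query.split() if term in doc_terms)

# Pure syntax: "Cat" and "cat" are different strings, so one matches and the other does not.
print(overlap_score("cat", "My cat sleeps"))   # 1
print(overlap_score("Cat", "My cat sleeps"))   # 0
```

This is exactly the weakness the rest of the article discusses: without further processing, capitalisation, plurals, and synonyms all defeat the match.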
Of course this can be improved with dedicated analysers. Analysers are plugins, written in Java, that modify documents and keywords to get specific results for specific use cases. For instance, lowercasing, so that both “Cat” and “cat” return results. Or stemming (reducing words to a common root), so that “cats” also retrieves “cat”. Or synonyms, so that “cat” also retrieves “feline”. And so on.
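The chain of transformations an analyser applies can be sketched in plain Python (real Lucene analysers are Java plugins, and real stemmers such as Porter's are much smarter than the naive rule below; the synonym table here is purely hypothetical):

```python
# Hypothetical one-entry synonym table, for illustration only.
SYNONYMS = {"feline": "cat"}

def naive_stem(token):
    # Extremely naive stemming rule: strip a trailing "s".
    # Real stemmers handle irregular plurals, verb forms, etc.
    return token[:-1] if token.endswith("s") else token

def analyze(text):
    """Mimic an analyser chain: tokenize -> lowercase -> stem -> map synonyms."""
    tokens = text.split()
    tokens = [t.lower() for t in tokens]
    tokens = [naive_stem(t) for t in tokens]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return tokens

print(analyze("Cats"))     # ['cat']
print(analyze("feline"))   # ['cat']
```

Because the same chain is applied to both documents and queries at search time, “Cats”, “cat”, and “feline” all end up as the same token and therefore match each other.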
But those analysers still work on syntax. They do not understand the context of words or their meaning, or only in a very limited way. They are tailor-made, sometimes over decades of collective work, for a restricted domain or language. They often contain mistakes that are obvious to a human eye, and they are not easily or frequently updated.
An alternative is Natural Language Processing (NLP), the science of making computers handle human language. It has been developed for decades, but with limited practical results until the last few years. The theory was there, sometimes for 50 years, but the lack of computing power and public data held back practical applications. Until Deep Learning came along, seemingly out of nowhere.
Deep Learning (DL) is a branch of Machine Learning (ML). It is also called Artificial Intelligence (AI) or multi-layer neural networks. You may have heard of it through Google DeepMind’s spectacular programs crushing Go players, then Chess players, then StarCraft players. But it is also behind autonomous cars, face recognition, speech recognition, and assistants like Alexa or Siri.
Deep Learning vs Machine Learning
Suddenly, everything seems to be made with Deep Learning. But Deep Learning is just a particular branch of ML, and ML is not new, far from it. For instance, classification from training data has been an ML research subject for decades. Many libraries, in Python or R, have been built over the years to solve ML problems. So, what’s new?
Well, ML’s goal is to derive results from data and rules. But what happens when no clear rules can be written down, as in recognizing cats in images? A cat cannot be abstracted into a collection of rules.
Deep Learning’s goal, on the other hand, is to discover those “cat” rules from “cat” data and the expected results. With DL, if you have enough cat images labelled “cat/not cat”, the program is able, in a certain way, to abstract the essence of a cat (the “cat” rules). No need to rely on an army of “cat experts” anymore.
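To make the “rules from data” idea concrete, here is a minimal sketch (standard library only, toy data): a single artificial neuron that learns the logical AND rule purely from labelled examples, instead of having the rule written by hand. Deep Learning stacks many such units in many layers, which is what lets it learn far richer “rules” such as cat-ness.

```python
# Labelled examples: inputs and the expected "rule" output (logical AND).
DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

# The neuron's parameters start with no knowledge at all.
w = [0.0, 0.0]
b = 0.0
lr = 0.1  # learning rate

def predict(x):
    # Fire (output 1) if the weighted sum of inputs exceeds the threshold.
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Training: nudge the parameters toward each example's expected result.
for _ in range(20):  # a handful of passes suffices for this toy problem
    for x, target in DATA:
        error = target - predict(x)
        w[0] += lr * error * x[0]
        w[1] += lr * error * x[1]
        b += lr * error

print([predict(x) for x, _ in DATA])  # learned AND: [0, 0, 0, 1]
```

Nobody told the program what AND means; the rule emerged from data and expected results, which is exactly the inversion that distinguishes this approach from classic rule-based ML.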
End of part I
In this article, we have briefly presented some key concepts on Artificial Intelligence, and current limitations of Lucene search.
In the next article, we will present a series of links to great Deep Learning tutorials.
See you soon.