– Chapter 1: embeddings –
After years of exploiting only the last layer of ML models, people realized around the time of BERT that the real gem was the second-to-last layer, whose output we now call embeddings.
Because embeddings capture semantics.
– Chapter 2: vector databases from vector metrics –
But there is something deeper: embeddings are vectors, and vector spaces come with vector metrics.
With vector metrics you can compare, and therefore cluster, things that are otherwise not comparable. How else would you compare 2 sentences, 2 images, or a sentence and an image?
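To make "vector metric" concrete, here is a minimal sketch of cosine similarity, one common choice of metric. The three vectors are made-up 3-dimensional stand-ins (real embeddings have hundreds of dimensions), and the variable names are only illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: a sentence, an image of the same thing, something unrelated.
sentence_vec = [0.9, 0.1, 0.0]
image_vec = [0.8, 0.2, 0.1]
unrelated_vec = [0.0, 0.1, 0.9]

close = cosine_similarity(sentence_vec, image_vec)
far = cosine_similarity(sentence_vec, unrelated_vec)
print(close > far)  # True
```

Once a sentence and an image live in the same vector space, the same function compares both, which is the whole point.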
And so, vector databases were born to store vectors and use vector metrics to find related vectors and concepts.
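As a rough sketch of what "store vectors and find related ones" means, here is a toy in-memory store with exhaustive cosine search. `TinyVectorStore` and its methods are hypothetical names, not the API of any real product, and production vector databases typically rely on approximate nearest-neighbor indexes instead of scanning every vector.

```python
import heapq
import math

def _normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

class TinyVectorStore:
    """Hypothetical toy store: keeps unit vectors in a list and scans them all."""

    def __init__(self):
        self._items = []  # list of (item_id, unit_vector)

    def add(self, item_id, vector):
        self._items.append((item_id, _normalize(vector)))

    def search(self, query, k=3):
        """Return the k most similar items as (item_id, score), best first."""
        q = _normalize(query)
        scored = (
            (sum(a * b for a, b in zip(q, v)), item_id)
            for item_id, v in self._items
        )
        return [(item_id, score) for score, item_id in heapq.nlargest(k, scored)]

store = TinyVectorStore()
store.add("red shoes", [0.9, 0.1, 0.0])          # made-up vectors
store.add("crimson sneakers", [0.8, 0.2, 0.1])
store.add("garden hose", [0.0, 0.1, 0.9])

for item_id, score in store.search([1.0, 0.0, 0.0], k=2):
    print(item_id, round(score, 3))
```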
– Chapter 3: embed embeddings –
Building embeddings remains difficult, despite the rise of Python libraries like the tensor frameworks and Hugging Face Transformers.
Therefore, some vector databases built the embedding step in, as vectorizers. Rather than only indexing vectors produced elsewhere, they accept texts and images and vectorize them internally.
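The difference is easier to see in code. This is a sketch under heavy assumptions: `VectorizingStore` and `toy_vectorizer` are hypothetical, this is not how Weaviate or Vespa expose the feature, and the hashed bag-of-words function is only a placeholder for a real embedding model (it captures word overlap, not semantics).

```python
import math
import zlib

def toy_vectorizer(text, dim=64):
    """Placeholder for a real model: hashed bag-of-words, NOT a semantic embedding."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash() on str.
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class VectorizingStore:
    """Hypothetical store that accepts raw text and vectorizes it internally."""

    def __init__(self, vectorizer):
        self._vectorizer = vectorizer
        self._items = []  # list of (item_id, vector)

    def add_text(self, item_id, text):
        # The caller never sees a vector: the store produces it.
        self._items.append((item_id, self._vectorizer(text)))

    def search_text(self, query, k=1):
        q = self._vectorizer(query)
        ranked = sorted(self._items, key=lambda item: _cosine(q, item[1]), reverse=True)
        return [item_id for item_id, _ in ranked[:k]]

store = VectorizingStore(toy_vectorizer)
store.add_text("p1", "red running shoes")
store.add_text("p2", "garden hose reel")
print(store.search_text("red shoes"))
```

Swapping `toy_vectorizer` for a real model would change the quality of the results but not the calling code, which is what makes built-in vectorizers convenient.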
– Chapter 4: train your embeddings –
This is the last chapter of the so-called “VectorOps” cycle.
Building embeddings from off-the-shelf LLMs is already impressive, but why not build your own LLMs, fine-tuned on your data?
– Conclusion –
Today, I can already feed WooCommerce product texts and images to a CLIP model inside Weaviate or Vespa.
I can also train my own CLIP model elsewhere, with a lot of hard work, and then use it.
But I cannot push a button to get my CLIP model fine-tuned periodically on my products and user events.
WPSOLR + Weaviate + Vespa: https://www.wpsolr.com