For more info about Weaviate, check out our documentation.
If you’ve ever wanted to use Weaviate but worried that you couldn’t pick the most efficient or relevant vectorization model for your data, I have just the thing for you.
In this notebook/guide, I detail the steps and code needed to set up Weaviate with any Hugging Face vectorization model.
TL;DR
- Choose from a wide selection of Hugging Face models using the official rankings page.
- Create your own transformers inference container, which Weaviate will use to vectorize the data, and learn how to add your chosen Hugging Face model to it.
- Start up the containers and create the class that will use your new vectorizer model.
- Send the data to Weaviate, where it will be automatically vectorized by the custom model.
Voilà! You can now send queries to your Weaviate installation that will return the correct and relevant indexed objects.
Details
Or read the following for the full walkthrough:
Create the containers
First, write a small Dockerfile that bakes your chosen Hugging Face model (here, BAAI/bge-small-en-v1.5) into Weaviate’s custom transformers inference image:
FROM semitechnologies/transformers-inference:custom
RUN MODEL_NAME=BAAI/bge-small-en-v1.5 ./download.py
Then reference that Dockerfile in your docker-compose.yml:
version: '3.4'
services:
  weaviate:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: semitechnologies/weaviate:1.21.3
    ports:
    - 8080:8080
    volumes:
    - ./weaviate_data:/var/lib/weaviate
    restart: on-failure:0
    environment:
      TRANSFORMERS_INFERENCE_API: 'http://t2v-transformers:8080'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      DEFAULT_VECTORIZER_MODULE: 'text2vec-transformers'
      ENABLE_MODULES: 'text2vec-transformers'
      CLUSTER_HOSTNAME: 'node1'
  t2v-transformers:
    build:
      context: .
      dockerfile: Dockerfile_custom_model
    image: weaviate_custom_model
    environment:
      ENABLE_CUDA: '0' # set to '1' if a GPU is available
docker-compose up -d
The Weaviate container is now running at the URL http://localhost:8080.
WARNING: The containers don’t have any kind of security configured, since this guide is set up for testing purposes. If you want to set up a Weaviate production server, you should add authentication and TLS protection: https://weaviate.io/developers/weaviate/configuration/authentication.
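If you do move toward production, a minimal sketch of enabling API-key authentication in the compose file might look like the fragment below (the key and user values are placeholders, and you should check the linked docs for the full set of options):

```yaml
  weaviate:
    environment:
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'false'
      AUTHENTICATION_APIKEY_ENABLED: 'true'
      # Placeholder secret and user; replace with your own values
      AUTHENTICATION_APIKEY_ALLOWED_KEYS: 'replace-with-a-secret-key'
      AUTHENTICATION_APIKEY_USERS: 'admin-user'
```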
Connect to Weaviate
pip install "weaviate-client<4"  # this guide uses the v3 client API (weaviate.Client)
import weaviate

client = weaviate.Client(
    url="http://localhost:8080",
)
Create a class
# Delete the class if it already exists (e.g. when re-running this guide)
client.schema.delete_class("Sentences")

class_obj = {
    "class": "Sentences",
    "vectorizer": "text2vec-transformers",
    "properties": [
        {
            "name": "passage",
            "dataType": ["text[]"]
        },
        {
            "name": "answer",
            "dataType": ["text[]"]
        },
        {
            "name": "query",
            "dataType": ["text"]
        },
    ],
}

client.schema.create_class(class_obj)
Load the data
The properties of the object will be displayed to the user when a query finds a match. The ms_marco dataset has “answers”, “passages” and “query” columns, so if you use your own data you can define whatever properties you need, and as many or as few as you like.
In this step, we collect the properties of each record into the elements array; Weaviate will create the embeddings automatically at import time using the custom model.
from datasets import load_dataset

# Import the "ms_marco" dataset and load the sentences
dataset = load_dataset("ms_marco", 'v1.1')
passages_data = dataset["train"]['passages']
answers_data = dataset["train"]['answers']
query_data = dataset["train"]['query']

elements = []
# Select the first 50 sentences of the dataset
for i in range(50):
    element = {}
    passage = passages_data[i]['passage_text']
    answer = answers_data[i]
    query = query_data[i]
    # Gather the properties for this element
    element["Passage"] = passage
    element["Answer"] = answer
    element["Query"] = query
    elements.append(element)
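Before importing, a quick structural check of the assembled elements can catch records with missing fields. This helper is illustrative, not part of the original guide, and the sample rows are placeholders rather than real ms_marco data:

```python
def validate_elements(elements, required=("Passage", "Answer", "Query")):
    """Return the indices of elements missing any required key."""
    return [i for i, e in enumerate(elements) if not all(k in e for k in required)]

# Placeholder data shaped like the elements built above
sample = [
    {"Passage": ["a passage"], "Answer": ["an answer"], "Query": "a query"},
    {"Passage": ["another passage"], "Query": "incomplete"},  # missing "Answer"
]
print(validate_elements(sample))  # → [1]
```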
Import data
client.batch.configure(batch_size=10)  # Import in batches of 10
with client.batch as batch:
    # Batch import all elements
    for i, d in enumerate(elements):
        properties = {
            "passage": d["Passage"],
            "answer": d["Answer"],
            "query": d["Query"],
        }
        batch.add_data_object(properties, "Sentences")
Send queries
import json

query = "What to do with food"

nearText = {
    "concepts": [query],
    "distance": 0.6
}

result = client.query.get(
    "Sentences", ["passage", "answer", "query"]
).with_near_text(
    nearText
).with_limit(2).with_additional(['certainty']).do()

print(json.dumps(result, indent=4))
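To work with the matches programmatically rather than printing raw JSON, you can walk the response dict the client returns. A minimal sketch, assuming the usual GraphQL response shape ({"data": {"Get": {<ClassName>: [...]}}}); the sample values below are placeholders, not real query results:

```python
def extract_hits(result, class_name="Sentences"):
    """Pull (query, certainty) pairs out of a Weaviate GraphQL response dict."""
    hits = result.get("data", {}).get("Get", {}).get(class_name, [])
    return [(h.get("query"), h.get("_additional", {}).get("certainty")) for h in hits]

# Illustrative response shaped like the client's output (placeholder values)
sample_result = {
    "data": {"Get": {"Sentences": [
        {"query": "what to do with leftover food",
         "_additional": {"certainty": 0.91}},
    ]}}
}
print(extract_hits(sample_result))  # → [('what to do with leftover food', 0.91)]
```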