In this example we'll use the multilingual embedding model multilingual-e5-base to perform search over a dataset of mixed-language documents. With this model, we can search in two ways:
- Across languages, for example using a query in German to find documents in English
- Within a non-English language, for example using a query in German to find documents in German
While this example uses dense retrieval only, it's also possible to combine dense and traditional lexical retrieval with hybrid search. For more information on lexical multilingual search, please see the blog post Multilingual search using language identification in Elasticsearch.
The dataset used contains snippets of Wikipedia passages from the MIRACL dataset.
For this example, you will need:
- Python 3.6 or later
- An Elastic deployment with a machine learning node
- We'll be using Elastic Cloud for this example (available with a free trial)
- The Elastic Python client
Create Elastic Cloud deployment
If you don't have an Elastic Cloud deployment, sign up here for a free trial.
Once logged in to your Elastic Cloud account, go to the Create deployment page and select Create deployment. Leave all settings with their default values.
To get started, we'll need to connect to our Elastic deployment using the Python client. Because we're using an Elastic Cloud deployment, we'll use the Cloud ID to identify our deployment.
First we need to pip install the packages we need for this example.
Next we need to import the elasticsearch module and the getpass module.
getpass is part of the Python standard library and is used to securely prompt for credentials.
Now we can instantiate the Python Elasticsearch client. First we prompt the user for their password and Cloud ID.
NOTE: getpass enables us to securely prompt the user for credentials without echoing them to the terminal.
Then we create a client object that instantiates an instance of the Elasticsearch class.
Enable Telemetry
Knowing that you are using this notebook helps us decide where to invest our efforts to improve our products. We would appreciate it if you ran the following code to let us gather anonymous usage statistics. See telemetry.py for details. Thank you!
Test the Client
Before you continue, confirm that the client has connected with this test.
Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.
Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect using API keys.
We need to add a field to support dense vector storage and search.
Note the passage_embedding field below, which is used to store the dense vector representation of the passage field.
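A sketch of what such a mapping could look like as a Python dict. The passage and passage_embedding fields come from the text above; the other field names, and the similarity choice, are assumptions for illustration. The 768 dimensions match the multilingual-e5-base model; adjust this if you use a different model.

```python
# Index mapping sketch: passage_embedding is a dense_vector sized for
# multilingual-e5-base (768 dimensions). Field names other than "passage"
# and "passage_embedding" are assumptions.
mapping = {
    "properties": {
        "title": {"type": "text"},
        "passage": {"type": "text"},
        "passage_embedding": {
            "type": "dense_vector",
            "dims": 768,
            "index": True,
            "similarity": "cosine",
        },
        "language": {"type": "keyword"},
    }
}

# The index would then be created with something like:
# client.indices.create(index="articles", mappings=mapping)
```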
ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'articles'})

Dataset
Let's index some data.
Note that we are embedding the passage field using the sentence transformer model.
Once indexed, you'll see that your documents contain a passage_embedding field ("type": "dense_vector") which contains a vector of floating point values.
This is the embedding of the passage field in vector space.
We'll use this field to perform semantic search using kNN.
Index documents
Our dataset is a Python list that contains dictionaries of passages from Wikipedia articles in two languages.
We'll use the client.bulk method to index our documents in batches.
The following code iterates over the articles and creates a list of actions to be performed. Each action is a dictionary containing an "index" operation on our Elasticsearch index. The passage is encoded using our selected model, and the encoded vector is added to the article document. Note that the E5 models require the prefix "passage: " to tell the model that it is embedding a passage; on the query side, the query string is prefixed with "query: " instead. The article document is then added to the list of operations.
Finally, we call the bulk method, specifying the index name and the list of actions.
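The steps above can be sketched as follows. The encoder here is a stub standing in for the real sentence transformer model (which would return a 768-dimensional vector), and the two sample articles are made-up illustrations; note the "passage: " prefix on the document side.

```python
# Sample articles standing in for the MIRACL-derived dataset.
articles = [
    {"title": "Sample A", "passage": "Ein Beispieltext auf Deutsch.", "language": "de"},
    {"title": "Sample B", "passage": "An example passage in English.", "language": "en"},
]

def encode(text):
    # Stub for model.encode(...); the real multilingual-e5-base model
    # returns a 768-dimensional embedding for the input text.
    return [0.0] * 768

# Build the alternating action/document list expected by client.bulk.
operations = []
for article in articles:
    operations.append({"index": {"_index": "articles"}})
    doc = dict(article)
    # E5 models expect the "passage: " prefix on the document side.
    doc["passage_embedding"] = encode("passage: " + article["passage"])
    operations.append(doc)

# The notebook then calls something like:
# client.bulk(index="articles", operations=operations, refresh=True)
```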
ObjectApiResponse({'errors': False, 'took': 49, 'items': [{'index': {'_index': 'articles', '_id': 'XvGt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'X_Gt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'YPGt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'YfGt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 3, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'YvGt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 4, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'Y_Gt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 5, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'ZPGt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 6, '_primary_term': 1, 'status': 201}}]})

The query function shown below can search for a given text in the dataset, with the query text given in any language. The function supports an optional language argument, which when given, restricts the search to passages in the selected language.
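A minimal sketch of how such a query function could build its kNN search request, assuming the field names from the mapping step; the k and num_candidates values are illustrative choices, and the language filter is added only when a language is requested. Note the "query: " prefix on the query side.

```python
def build_knn_query(query_vector, language=None):
    """Build the knn section of a search request (hypothetical helper)."""
    knn = {
        "field": "passage_embedding",
        "query_vector": query_vector,
        "k": 10,                 # illustrative value
        "num_candidates": 100,   # illustrative value
    }
    if language is not None:
        # Restrict the vector search to passages in the requested language.
        knn["filter"] = {"term": {"language": language}}
    return knn

# The query text is encoded with the "query: " prefix before searching, e.g.:
# vector = model.encode("query: " + query_text)
# client.search(index="articles", knn=build_knn_query(vector, language="de"))
```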
To make it convenient to experiment with this dataset, we will use the following function to format query responses.
For those unfamiliar with German, here is a quick translation of search words used in the examples:
- "health" -> "Gesundheit"
- "wall" -> "Mauer"
The first example searches for a word in English.
Note that in the results above, the German document about healthcare matches the query "health" better than the English document, which is not specifically about health but about doctors more generally. This is the power of a multilingual embedding model, which captures meaning across languages.
The next example also searches for a word in English, but only retrieves results in German.
In the final example, the query is given in German, and only German results are requested.