In this example we'll use the multilingual embedding model multilingual-e5-base to perform search over a dataset of mixed-language documents. With this model, we can search in two ways:
- Across languages, for example using a query in German to find documents in English
- Within a non-English language, for example using a query in German to find documents in German
While this example uses dense retrieval only, it's also possible to combine dense and traditional lexical retrieval with hybrid search. For more information on lexical multilingual search, please see the blog post Multilingual search using language identification in Elasticsearch.
The dataset used contains snippets of Wikipedia passages from the MIRACL dataset.
For this example, you will need:
- Python 3.6 or later
- An Elastic deployment with a machine learning node
- We'll be using Elastic Cloud for this example (available with a free trial)
- The Elastic Python client
Create Elastic Cloud deployment
If you don't have an Elastic Cloud deployment, sign up here for a free trial.
Once logged in to your Elastic Cloud account, go to the Create deployment page and select Create deployment. Leave all settings with their default values.
To get started, we'll need to connect to our Elastic deployment using the Python client. Because we're using an Elastic Cloud deployment, we'll use the Cloud ID to identify our deployment.
First we need to pip install the packages we need for this example.
Next we need to import the elasticsearch module and the getpass module.
getpass is part of the Python standard library and is used to securely prompt for credentials.
Now we can instantiate the Python Elasticsearch client. First we prompt the user for their password and Cloud ID.
NOTE: getpass enables us to securely prompt the user for credentials without echoing them to the terminal.
Then we create a client object that instantiates an instance of the Elasticsearch class.
Enable Telemetry
Knowing that you are using this notebook helps us decide where to invest our efforts to improve our products. We would appreciate it if you ran the following code to let us gather anonymous usage statistics. See telemetry.py for details. Thank you!
Test the Client
Before you continue, confirm that the client has connected with this test.
Refer to https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect to a self-managed deployment.
Read https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/connecting.html#connect-self-managed-new to learn how to connect using API keys.
We need to add a field to support dense vector storage and search.
Note the passage_embedding field below, which is used to store the dense vector representation of the passage field.
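A sketch of what such a mapping could look like as a Python dict. The passage and passage_embedding fields come from the text above; the other field names, and the similarity choice, are assumptions for illustration. The 768 dimensions match the multilingual-e5-base model; adjust this if you use a different model.

```python
# Index mapping sketch: passage_embedding is a dense_vector sized for
# multilingual-e5-base (768 dimensions). Field names other than "passage"
# and "passage_embedding" are assumptions.
mapping = {
    "properties": {
        "title": {"type": "text"},
        "passage": {"type": "text"},
        "passage_embedding": {
            "type": "dense_vector",
            "dims": 768,
            "index": True,
            "similarity": "cosine",
        },
        "language": {"type": "keyword"},
    }
}

# The index would then be created with something like:
# client.indices.create(index="articles", mappings=mapping)
```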
ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'articles'})

Dataset
Let's index some data.
Note that we are embedding the passage field using the sentence transformer model.
Once indexed, you'll see that your documents contain a passage_embedding field ("type": "dense_vector") which contains a vector of floating point values.
This is the embedding of the passage field in vector space.
We'll use this field to perform semantic search using kNN.
Index documents
Our dataset is a Python list that contains dictionaries of passages from Wikipedia articles in two languages.
We'll use the client.bulk method to index our documents in batches.
The following code iterates over the articles and creates a list of actions to be performed. Each action is a dictionary containing an "index" operation on our Elasticsearch index. The passage is encoded using our selected model, and the encoded vector is added to the article document. Note that the E5 models require the prefix "passage: " to tell the model that it is embedding a passage; on the query side, the query string is prefixed with "query: " instead. The article document is then added to the list of operations.
Finally, we call the bulk method, specifying the index name and the list of actions.
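The steps above can be sketched as follows. The encoder here is a stub standing in for the real sentence transformer model (which would return a 768-dimensional vector), and the two sample articles are made-up illustrations; note the "passage: " prefix on the document side.

```python
# Sample articles standing in for the MIRACL-derived dataset.
articles = [
    {"title": "Sample A", "passage": "Ein Beispieltext auf Deutsch.", "language": "de"},
    {"title": "Sample B", "passage": "An example passage in English.", "language": "en"},
]

def encode(text):
    # Stub for model.encode(...); the real multilingual-e5-base model
    # returns a 768-dimensional embedding for the input text.
    return [0.0] * 768

# Build the alternating action/document list expected by client.bulk.
operations = []
for article in articles:
    operations.append({"index": {"_index": "articles"}})
    doc = dict(article)
    # E5 models expect the "passage: " prefix on the document side.
    doc["passage_embedding"] = encode("passage: " + article["passage"])
    operations.append(doc)

# The notebook then calls something like:
# client.bulk(index="articles", operations=operations, refresh=True)
```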
ObjectApiResponse({'errors': False, 'took': 49, 'items': [{'index': {'_index': 'articles', '_id': 'XvGt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'X_Gt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'YPGt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'YfGt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 3, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'YvGt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 4, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'Y_Gt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 5, '_primary_term': 1, 'status': 201}}, {'index': {'_index': 'articles', '_id': 'ZPGt7osBeCQuLJUsDG6m', '_version': 1, 'result': 'created', 'forced_refresh': True, '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 6, '_primary_term': 1, 'status': 201}}]})

The query function shown below can search for a given text in the dataset, with the query text given in any language. The function supports an optional language argument, which when given, restricts the search to passages in the selected language.
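A minimal sketch of how such a query function could build its kNN search request, assuming the field names from the mapping step; the k and num_candidates values are illustrative choices, and the language filter is added only when a language is requested. Note the "query: " prefix on the query side.

```python
def build_knn_query(query_vector, language=None):
    """Build the knn section of a search request (hypothetical helper)."""
    knn = {
        "field": "passage_embedding",
        "query_vector": query_vector,
        "k": 10,                 # illustrative value
        "num_candidates": 100,   # illustrative value
    }
    if language is not None:
        # Restrict the vector search to passages in the requested language.
        knn["filter"] = {"term": {"language": language}}
    return knn

# The query text is encoded with the "query: " prefix before searching, e.g.:
# vector = model.encode("query: " + query_text)
# client.search(index="articles", knn=build_knn_query(vector, language="de"))
```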
To make it convenient to experiment with this dataset, we will use the following function to format query responses.
For those unfamiliar with German, here is a quick translation of search words used in the examples:
- "health" -> "Gesundheit"
- "wall" -> "Mauer"
The first example searches for a word in English.
Note that in the results above, the German document about healthcare matches the query "health" better than the English document, which is not specifically about health but about doctors more generally. This is the power of a multilingual embedding model, which captures meaning across languages.
The next example also searches for a word in English, but only retrieves results in German.
In the final example, the query is given in German, and only German results are requested.