Serverless semantic search with ELSER in Python: Exploring Summer Olympic games history

This blog shows how to fetch information from an Elasticsearch index, in a natural language expression, using semantic search. We will create a serverless Elasticsearch project, load previous olympic games data set into an index, generate inferred data (in a sparse vector field) using the inference processor along with ELSER model, and finally search for historical olympic competition information in a natural language expression, thanks to text expansion query.

The tools and the data set

For this project we will use an Elasticsearch serverless project, and the serverless Python client (elasticsearch_serverless) for interactions with Elasticsearch. To create a serverless project, simply follow the get started with serverless guide. More information on serverless including pricing can be found here.

When setting up a serverless project, be sure to select the Elasticsearch option and the general purpose configuration for working through this tutorial.

The data set used is that of summer Olympic games competitors from 1896 to 2020, obtained from Kaggle (Athletes_summer_games.csv). It contains information about the competition year, the type of competition, the name of the participant, and whether they won a medal and, if so, which one, along with other details.

For the data set manipulation, we will use Eland, a Python client and toolkit for DataFrames and machine learning in Elasticsearch.

Finally, the natural language processing (NLP) model used is Elastic Learned Sparse EncodeR (ELSER), a retrieval model trained by Elastic that retrieves more relevant search results through semantic search.

Before following the steps below, please make sure you have installed the serverless Python client and Eland.

pip install elasticsearch_serverless
pip install eland

Please note the versions I used below. If you are not using the same versions, you might need to adjust the code for any syntax changes in the versions you are using.

➜  ~ python3 --version
Python 3.9.6
➜  ~ pip3 list | grep -E 'elasticsearch-serverless|eland'
eland                     8.14.0
elasticsearch-serverless  0.3.0.20231031

Download and deploy ELSER model

We will use the Python client to download and deploy the ELSER model. Before doing that, let's first confirm that we can connect to our serverless project. The URL and API key below are read from environment variables; you need to use the appropriate values in your case, or use whichever method you prefer for reading credentials.

from elasticsearch_serverless import Elasticsearch
from os import environ


serverless_endpoint = environ.get("SERVERLESS_ENDPOINT_URL")
serverless_api_key = environ.get("SERVERLESS_API_KEY")


client = Elasticsearch(
    serverless_endpoint,
    api_key=serverless_api_key
)


client.info()

If everything is properly configured, you should get an output like the one below:

ObjectApiResponse({'name': 'serverless', 'cluster_name': 'd6c6698e28c34e58b6f858df9442abac', 'cluster_uuid': 'hOuAhMUPQkumEM-PxW_r-Q', 'version': {'number': '8.11.0', 'build_flavor': 'serverless', 'build_type': 'docker', 'build_hash': '00000000', 'build_date': '2023-10-31', 'build_snapshot': False, 'lucene_version': '9.7.0', 'minimum_wire_compatibility_version': '8.11.0', 'minimum_index_compatibility_version': '8.11.0'}, 'tagline': 'You Know, for Search'})

Now that we've confirmed that the Python client is successfully connecting to the serverless Elasticsearch project, let’s download and deploy the ELSER model. We will check whether the model was previously deployed and delete it in order to perform a fresh install. Also, as the deployment phase can take a few minutes, we will repeatedly check the model configuration information to make sure that the model definition is present before moving to the next phase. For more information, check the Get trained models API.

from elasticsearch_serverless import Elasticsearch, exceptions
import time


# delete model if already downloaded and deployed
try:
   client.ml.delete_trained_model(model_id=".elser_model_2", force=True)
   print("Model deleted successfully, We will proceed with creating one")
except exceptions.NotFoundError:
   print("Model doesn't exist, but We will proceed with creating one")


# Creates the ELSER model configuration. Automatically downloads the model if it doesn't exist.
client.ml.put_trained_model(
   model_id=".elser_model_2",
   input={
       "field_names": [
           "concatenated_textl"
       ]
   }
)


# Check the download and deploy progress
while True:
   status = client.ml.get_trained_models(
       model_id=".elser_model_2", include="definition_status"
   )


   if status["trained_model_configs"][0]["fully_defined"]:
       print("ELSER Model is downloaded and ready to be deployed.")
       break
   else:
       print("ELSER Model is downloaded but not ready to be deployed.")
   time.sleep(5)

Once we get the confirmation that the model is downloaded and ready to be deployed, we can go ahead and start ELSER. It can take a little while for the deployment to become fully ready.

# A function to check the model's routing state
# https://www.elastic.co/guide/en/elasticsearch/reference/current/get-trained-models-stats.html
def get_model_routing_state(model_id=".elser_model_2"):
   try:
       status = client.ml.get_trained_models_stats(
           model_id=model_id,
       )
       return status["trained_model_stats"][0]["deployment_stats"]["nodes"][0]["routing_state"]["routing_state"]
   except Exception:
       return None


# If ELSER is already started, then we are fine.
if get_model_routing_state(".elser_model_2") == "started":
   print("ELSER Model has been already deployed and is currently started.")


# Otherwise, we will deploy it, and monitor the routing state to make sure it is started.
else:
   print("ELSER Model will be deployed.")


   # Start trained model deployment
   client.ml.start_trained_model_deployment(
       model_id=".elser_model_2",
       number_of_allocations=16,
       threads_per_allocation=4,
       wait_for="starting"
   )


   while True:
       if get_model_routing_state(".elser_model_2") == "started":
           print("ELSER Model has been successfully deployed.")
           break
       else:
           print("ELSER Model is currently being deployed.")
       time.sleep(5)

Load the data set into Elasticsearch using Eland

eland.csv_to_eland allows reading a comma-separated values (csv) file into a data frame stored in an Elasticsearch index. We will use it to load the Olympics data (Athletes_summer_games.csv) into Elasticsearch. The es_type_overrides parameter allows you to override the default mappings.

import eland as ed


index="elser-olympic-games"
csv_file="Athletes_summer_games.csv"


ed.csv_to_eland(
   csv_file,
   es_client=client,
   es_dest_index=index,
   es_if_exists='replace',
   es_dropna=True,
   es_refresh=True,
   index_col=0,
   es_type_overrides={
       "City": "text",
       "Event": "text",
       "Games": "text",
       "Medal": "text",
       "NOC": "text",
       "Name": "text",
       "Season": "text",
       "Sport": "text",
       "Team": "text"
   }
)

After executing the lines above, the data will be written to the index elser-olympic-games. You can also capture the resulting dataframe (eland.DataFrame) in a variable for further manipulation.
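
For instance, here is a minimal sketch (an illustration, not part of the original walkthrough) that wraps the existing index in an eland.DataFrame and inspects it with a pandas-style API:

import eland as ed


# Wrap the existing index in an eland.DataFrame.
# No data is copied locally; operations are pushed down to Elasticsearch.
df = ed.DataFrame(es_client=client, es_index_pattern="elser-olympic-games")


print(df.shape)    # (number of documents, number of columns)
print(df.head())   # a small sample of rows fetched from the index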

Create an ingest pipeline for inference based on ELSER

The next step in our journey to explore past Olympic competition data using semantic search is to create an ingest pipeline containing an inference processor that runs the ELSER model. A set of fields has been selected and concatenated into a single field on which the inference processor will work. Depending on your use case, you might want to use another strategy.

The concatenation is done using the script processor. The inference processor uses the previously deployed ELSER model, taking the concatenated field as input and storing the output in a sparse vector field (see the next section).

client.ingest.put_pipeline(
   id="elser-ingest-pipeline",
   description="Ingest pipeline for ELSER",
   processors=[
       {
           "script": {
               "description": "Concatenate selected field values into the `concatenated_text` field",
               "lang": "painless",
               "source": """
                   ctx['concatenated_text'] = ctx['Name'] + ' ' + ctx['Team'] + ' ' + ctx['Games'] + ' ' + ctx['City'] + ' ' + ctx['Event'];
               """
           }
       },
       {
           "inference": {
               "model_id": ".elser_model_2",
               "ignore_missing": True,
               "input_output": [
                   {
                       "input_field": "concatenated_text",
                       "output_field": "concatenated_text_embedding"
                   }
               ]
           }
       }
   ]
)
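
Optionally, you can sanity-check the pipeline on a sample document before running it against the whole index. The snippet below is a minimal sketch using the simulate pipeline API (assuming it is available in your serverless project) with a hypothetical document containing the fields the script processor expects:

# Simulate the pipeline on a hypothetical sample document (illustrative values only)
response = client.ingest.simulate(
    id="elser-ingest-pipeline",
    docs=[
        {
            "_source": {
                "Name": "Jane Doe",
                "Team": "France",
                "Games": "1900 Summer",
                "City": "Paris",
                "Event": "Golf Women's Individual"
            }
        }
    ]
)


# The simulated document should now contain `concatenated_text`
# and the inferred `concatenated_text_embedding` field.
print(response["docs"][0]["doc"]["_source"].keys())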

Preparing the index

This is the last stage before being able to query past Olympic competition data using natural language expressions. We will update the previously created index’s mapping by adding a sparse vector field.

Update the mapping: add a sparse vector field

We will update the index mapping by adding a field that will hold the concatenated data, and a sparse vector field that will hold the inferred information computed by the inference processor using the ELSER model.

index="elser-olympic-games"


mappings_properties={
   "concatenated_text": {
       "type": "text"
   },
   "concatenated_text_embedding": {
       "type": "sparse_vector"
   }
}


client.indices.put_mapping(
   index=index,
   properties=mappings_properties
)

Populate the sparse vector field

We will run an update by query to call the previously created ingest pipeline in order to populate the sparse vector field in each document.

client.update_by_query(
   index="elser-olympic-games",
   pipeline="elser-ingest-pipeline",
   wait_for_completion=False
)

The request will take a few moments, depending on the number of documents and on the number of allocations and threads per allocation used when deploying ELSER. Once this step is completed, we can start exploring the Olympic data set using semantic search.
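
If you want to verify that the enrichment has finished, one simple option (a sketch, not part of the original walkthrough) is to compare the number of documents that already have the inferred field with the total document count:

import time


index = "elser-olympic-games"


while True:
    total = client.count(index=index)["count"]
    # Documents already processed by the pipeline have the sparse vector field populated
    enriched = client.count(
        index=index,
        query={"exists": {"field": "concatenated_text_embedding"}}
    )["count"]
    print(f"{enriched}/{total} documents enriched")
    if enriched >= total:
        break
    time.sleep(30)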

Now we will use text expansion queries to retrieve information about past Olympic game competitions using natural language expressions. Before moving on to the demonstration, let's create a function to retrieve and format the search results.

def semantic_search(search_text):
   response = client.search(
       index="elser-olympic-games",
       size=3,
       query={
           "bool": {
               "must": [
                   {
                       "text_expansion": {
                           "concatenated_text_embedding": {
                               "model_id": ".elser_model_2",
                               "model_text": search_text
                           }
                       }
                   },
                   {
                       "exists": {
                           "field": "Medal"
                       }
                   }
               ]
           }
       },
       source_excludes=["*_embedding", "concatenated_text"]
   )


   for hit in response["hits"]["hits"]:
       doc_id = hit["_id"]
       score = hit["_score"]
       year = hit["_source"]["Year"]
       event = hit["_source"]["Event"]
       games = hit["_source"]["Games"]
       sport = hit["_source"]["Sport"]
       city = hit["_source"]["City"]
       team = hit["_source"]["Team"]
       name = hit["_source"]["Name"]
       medal = hit["_source"]["Medal"]


       print(f"Score: {score}\nDocument ID: {doc_id}\nYear: {year}\nEvent: {event}\nName: {name}\nCity: {city}\nTeam: {team}\nMedal: {medal}\n")

The function above receives a question about past Olympic games competition winners and performs a semantic search using Elastic’s text expansion query. The retrieved results are formatted and printed. Notice that we require the Medal field to exist, as we are only interested in the winners. We also limit the result size to 3, as we expect three winners (gold, silver, bronze). Again, based on your use case, you might not necessarily do the same thing.

🏌️‍♂️ “Who won the Golf competition in 1900?”

Request:

semantic_search("Who won the Golf competition in 1900?")

Output:

Score: 18.184263
Document ID: 206566
Year: 1900
Event: Golf Men's Individual
Name: Walter Mathers Rutherford
City: Paris
Team: Great Britain
Medal: Silver

Score: 17.443663
Document ID: 209892
Year: 1900
Event: Golf Men's Individual
Name: Charles Edward Sands
City: Paris
Team: United States
Medal: Gold

Score: 16.939331
Document ID: 192747
Year: 1900
Event: Golf Women's Individual
Name: Myra Abigail "Abbie" Pratt (Pankhurst-, Wright-, -Karageorgevich)
City: Paris
Team: United States
Medal: Bronze

🏃‍♀️ “2004 Women's Marathon winners”

Request:

semantic_search("2004 Women's Marathon winners")

Output:

Score: 24.948284
Document ID: 168955
Year: 2004
Event: Athletics Women's Marathon
Name: Wincatherine Nyambura "Catherine" Ndereba
City: Athina
Team: Kenya
Medal: Silver

Score: 24.08922
Document ID: 58799
Year: 2004
Event: Athletics Women's Marathon
Name: Deena Michelle Drossin-Kastor
City: Athina
Team: United States
Medal: Bronze

Score: 21.391462
Document ID: 172670
Year: 2004
Event: Athletics Women's Marathon
Name: Mizuki Noguchi
City: Athina
Team: Japan
Medal: Gold

🏹 “Women archery winners of 1908”

Request:

semantic_search("Women archery winners of 1908")

Output:

Score: 21.876282
Document ID: 96010
Year: 1908
Event: Archery Women's Double National Round
Name: Beatrice Geraldine Hill-Lowe (Ruxton-, -Thompson)
City: London
Team: Great Britain
Medal: Bronze

Score: 21.0998
Document ID: 170250
Year: 1908
Event: Archery Women's Double National Round
Name: Sybil Fenton Newall
City: London
Team: Great Britain
Medal: Gold

Score: 21.079535
Document ID: 56686
Year: 1908
Event: Archery Women's Double National Round
Name: Charlotte "Lottie" Dod
City: London
Team: Great Britain
Medal: Silver

🚴‍♂️ “Who won the cycling competition in 1972?”

Request:

semantic_search("Who won the cycling competition in 1972?")

Output:

Score: 20.554308
Document ID: 215559
Year: 1972
Event: Cycling Men's Road Race, Individual
Name: Kevin "Clyde" Sefton
City: Munich
Team: Australia
Medal: Silver

Score: 20.267525
Document ID: 128598
Year: 1972
Event: Cycling Men's Road Race, Individual
Name: Hendrikus Andreas "Hennie" Kuiper
City: Munich
Team: Netherlands
Medal: Gold

Score: 19.108923
Document ID: 19225
Year: 1972
Event: Cycling Men's Team Pursuit, 4,000 metres
Name: Michael John "Mick" Bennett
City: Munich
Team: Great Britain
Medal: Bronze

Conclusion

This blog showed how you can perform semantic search with the Elastic Learned Sparse EncodeR (ELSER) NLP model in Python, using an Elasticsearch serverless project. Make sure you shut down your serverless project after running this tutorial to avoid any extra charges. To go further, feel free to check out our Elasticsearch Relevance Engine (ESRE) Engineer course, where you can learn how to leverage ESRE and large language models (LLMs) to build advanced RAG (Retrieval-Augmented Generation) applications that combine the storage, processing, and search features of Elasticsearch with the generative power of an LLM.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.

Ready to try this out on your own? Start a free trial.

Want to get Elastic certified? Find out when the next Elasticsearch Engineer training is running!

Ready to build state of the art search experiences?

Sufficiently advanced search isn’t achieved with the efforts of one. Elasticsearch is powered by data scientists, ML ops, engineers, and many more who are just as passionate about search as you are. Let’s connect and work together to build the magical search experience that will get you the results you want.

Try it yourself