This tutorial shows you how to compute embeddings with Cohere using the inference API and store them for efficient vector or hybrid search in Elasticsearch. This tutorial uses the Python Elasticsearch client to perform the operations.
You'll learn how to:
- create an inference endpoint for text embedding using the Cohere service,
- create the necessary index mapping for the Elasticsearch index,
- build an inference pipeline to ingest documents into the index together with the embeddings,
- perform hybrid search on the data,
- rerank search results by using Cohere's rerank model,
- design a RAG system with Cohere's Chat API.
The tutorial uses the SciFact data set.
Refer to Cohere's tutorial for an example using a different data set.
🧰 Requirements
For this example, you will need:
- An Elastic deployment with a machine learning node that has at least 4GB of RAM
  - We'll be using Elastic Cloud for this example (available with a free trial)
- A paid Cohere account; Cohere's free trial API usage is limited, so a paid account is required to use the inference API with the Cohere service
- Python 3.7 or later
Install and import required packages
Install the Elasticsearch and Cohere Python clients:
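In a notebook environment the installation can look like this (drop the leading `!` in a plain shell):

```python
!pip install elasticsearch cohere
```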
Import the required packages:
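The imports below cover everything used in the code sketches throughout this tutorial:

```python
import json
from getpass import getpass

import cohere
import requests
from elasticsearch import Elasticsearch, helpers
```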
Create an Elasticsearch client
Now you can instantiate the Python Elasticsearch client.
First provide your password and Cloud ID.
Then create a client object by instantiating the Elasticsearch class.
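A minimal sketch, assuming an Elastic Cloud deployment and the default elastic user:

```python
ELASTIC_CLOUD_ID = getpass("Elastic Cloud ID: ")
ELASTIC_PASSWORD = getpass("Elastic password: ")

client = Elasticsearch(
    cloud_id=ELASTIC_CLOUD_ID,
    basic_auth=("elastic", ELASTIC_PASSWORD),
)

# Confirm the client can connect to the deployment
print(client.info())
```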
Create the inference endpoint
Create the inference endpoint first. In this example, the inference endpoint uses Cohere's embed-english-v3.0 model, with the embedding_type set to byte.
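A sketch of the endpoint creation. The inference ID cohere_embeddings is a name chosen for this tutorial, and the exact client method has changed across 8.x releases (inference.put in recent versions, inference.put_model in older ones):

```python
COHERE_API_KEY = getpass("Cohere API key: ")

client.inference.put(
    task_type="text_embedding",
    inference_id="cohere_embeddings",
    inference_config={
        "service": "cohere",
        "service_settings": {
            "api_key": COHERE_API_KEY,
            "model_id": "embed-english-v3.0",
            "embedding_type": "byte",
        },
    },
)
```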
You can find your API keys in your Cohere dashboard under the API keys section.
Create the index mapping
Create the mapping for the index that will contain the embeddings.
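A sketch, using cohere-embeddings as the index name. embed-english-v3.0 returns 1024-dimensional vectors, and element_type: byte matches the embedding_type chosen when creating the endpoint:

```python
client.indices.create(
    index="cohere-embeddings",
    mappings={
        "properties": {
            "text_embedding": {
                "type": "dense_vector",
                "dims": 1024,           # embed-english-v3.0 produces 1024-dim embeddings
                "element_type": "byte", # matches the byte embedding_type of the endpoint
            },
            "text": {"type": "text"},
            "title": {"type": "text"},
        }
    },
)
```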
Create the inference pipeline
Now you have an inference endpoint and an index ready to store embeddings. The next step is to create an ingest pipeline that creates the embeddings using the inference endpoint and stores them in the index.
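A sketch with cohere_embeddings reused as the pipeline ID; the inference processor's model_id refers to the inference endpoint created earlier:

```python
client.ingest.put_pipeline(
    id="cohere_embeddings",
    description="Ingest pipeline for Cohere inference.",
    processors=[
        {
            "inference": {
                "model_id": "cohere_embeddings",  # the inference endpoint ID
                "input_output": {
                    "input_field": "text",          # the field to embed
                    "output_field": "text_embedding",
                },
            }
        }
    ],
)
```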
Prepare data and insert documents
This example uses the SciFact data set that you can find on HuggingFace.
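A sketch of the download and bulk ingest, assuming the corpus is published as a corpus.jsonl file with _id, title, and text fields in the mteb/scifact dataset repository. Because every document passes through the inference pipeline, ingest calls Cohere for each one and can take a few minutes:

```python
url = "https://huggingface.co/datasets/mteb/scifact/raw/main/corpus.jsonl"
response = requests.get(url)
response.raise_for_status()

# Build one bulk action per JSONL line
actions = []
for line in response.text.strip().split("\n"):
    doc = json.loads(line)
    actions.append(
        {
            "_index": "cohere-embeddings",
            "_id": doc["_id"],
            "title": doc["title"],
            "text": doc["text"],
        }
    )

# Route the documents through the ingest pipeline to generate the embeddings
helpers.bulk(client, actions, pipeline="cohere_embeddings")
```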
Your index is populated with the SciFact data and text embeddings for the text field.
Hybrid search
Let's start querying the index!
The code below performs a hybrid search. The kNN query computes the relevance
of search results based on vector similarity using the text_embedding field.
The lexical search query uses BM25 retrieval to compute keyword similarity on
the title and text fields.
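A sketch of the hybrid query. The question is just a sample; the kNN branch uses a query_vector_builder so that the query text is embedded through the same cohere_embeddings endpoint at search time:

```python
query = "What is biosimilarity?"  # sample question

response = client.search(
    index="cohere-embeddings",
    size=100,
    knn={
        "field": "text_embedding",
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "cohere_embeddings",
                "model_text": query,
            }
        },
        "k": 10,
        "num_candidates": 50,
    },
    query={"multi_match": {"query": query, "fields": ["text", "title"]}},
)

raw_documents = response["hits"]["hits"]
for hit in raw_documents[:10]:
    print(hit["_source"]["title"])
```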
Rerank search results
To combine the two result sets more effectively, use Cohere's Rerank v3 model through the inference API to apply a more precise semantic reranking of the results.
Create an inference endpoint with your Cohere API key and the name of the model you want to use as the model_id (rerank-english-v3.0 in this example).
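A sketch, with cohere_rerank as the endpoint ID and top_n limiting how many documents the reranker returns (both chosen for this tutorial):

```python
client.inference.put(
    task_type="rerank",
    inference_id="cohere_rerank",
    inference_config={
        "service": "cohere",
        "service_settings": {
            "api_key": COHERE_API_KEY,
            "model_id": "rerank-english-v3.0",
        },
        "task_settings": {"top_n": 10},
    },
)
```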
Rerank the results using the new inference endpoint.
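A sketch of the rerank call, assuming the response contains a rerank list whose index values point back into the input documents:

```python
# Rerank the text of the hybrid search hits against the query
docs = [hit["_source"]["text"] for hit in raw_documents]

response = client.inference.inference(
    inference_id="cohere_rerank",
    query=query,
    input=docs,
)

# Map the reranked order back to the original titles and texts
ranked_documents = [
    {
        "title": raw_documents[entry["index"]]["_source"]["title"],
        "text": raw_documents[entry["index"]]["_source"]["text"],
    }
    for entry in response["rerank"]
]
```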
Retrieval Augmented Generation (RAG) with Cohere and Elasticsearch
RAG is a method for generating text using additional information fetched from an external data source. With the ranked results, you can build a RAG system on top of what you created with Cohere's Chat API.
Pass in the retrieved documents and the query to receive a grounded response using Cohere's newest generative model, Command R+.
Then pass in the query and the documents to the Chat API, and print out the response.
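A sketch using version 1 of Cohere's Python SDK, where the documents parameter accepts a list of dicts with free-form string fields:

```python
co = cohere.Client(COHERE_API_KEY)

response = co.chat(
    message=query,
    documents=ranked_documents,  # grounds the answer in the reranked results
    model="command-r-plus",
)

print(response.text)
```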