We’ve integrated Azure OpenAI chat completions into the inference API, which allows our customers to build powerful GenAI applications based on chat completion using large language models like GPT-4. Azure and Elasticsearch developers can combine the unique capabilities of the Elasticsearch vector database and the Azure AI ecosystem to power GenAI applications with the model of their choice.
This blog quickly goes over the catalog of supported providers in the open inference API and explains how to use Azure’s OpenAI chat completions to answer questions through an example.
The inference API is growing…fast!
We’re heavily extending the catalog of supported providers in the open inference API. Check out some of our latest blog posts on Elastic Search Labs to learn more about recent integrations around embeddings, completions and reranking:
- Elasticsearch open inference API adds support for Azure AI Studio
- Elasticsearch open inference API adds support for Azure OpenAI embeddings
- Elasticsearch open inference API adds support for OpenAI chat completions
- Elasticsearch open inference API adds support for Cohere’s Rerank 3 model
- Elasticsearch open inference API adds support for Cohere Embeddings
- ...more to come!
Azure OpenAI chat completions support is available through the open inference API in our stateless offering on Elastic Cloud. It will also soon be available to everyone in an upcoming versioned Elasticsearch release. This complements the existing capability to use the Elasticsearch vector database in the Azure OpenAI Service.
Using Azure’s OpenAI chat completions to answer questions
In my last blog post about OpenAI chat completions, we learned how to summarize text using OpenAI’s chat completions. In this guide we’ll use Azure OpenAI chat completions to answer questions during ingestion, so the answers are ready ahead of searching. Make sure you have your Azure OpenAI API key, deployment ID and resource name ready by first creating a free Azure account and setting up a model suited for chat completions. You can follow Azure's OpenAI Service GPT quickstart guide to get a model up and running. In the following example we’ve used `gpt-4` with API version `2024-02-01`. You can read more about supported models and versions here.
In Kibana, you have access to a console where you can run these next steps against Elasticsearch without even needing to set up an IDE.
First, we configure a model, which will perform completions:
PUT _inference/completion/azure_openai_completion
{
  "service": "azureopenai",
  "service_settings": {
    "resource_name": "<resource-name>",
    "deployment_id": "<deployment-id>",
    "api_version": "2024-02-01",
    "api_key": "<api-key>"
  }
}
You’ll get back a response similar to the following with status code `200 OK` upon successful creation of the inference endpoint:
{
  "model_id": "azure_openai_completion",
  "task_type": "completion",
  "service": "azureopenai",
  "service_settings": {
    "resource_name": "<resource-name>",
    "deployment_id": "<deployment-id>",
    "api_version": "2024-02-01"
  },
  "task_settings": {}
}
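You can also retrieve the endpoint’s configuration at any time (the API key is not returned) using the inference GET API:

GET _inference/completion/azure_openai_completion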
You can now call the configured model to perform a completion on any text input. Let’s ask the model what inference is in the context of GenAI:
POST _inference/completion/azure_openai_completion
{
  "input": "What is inference in the context of GenAI?"
}
You should get back a response with status code `200 OK` explaining what inference is:
{
  "completion": [
    {
      "result": "In the context of generative AI, inference refers to the process of generating new data based on the patterns, structures, and relationships the AI has learned from the training data. It involves using a model that has been trained on a lot of data to infer or generate new, similar data. For instance, a generative AI model trained on a collection of paintings might infer or generate new, similar paintings. This is the useful part of machine learning where the actual task is performed."
    }
  ]
}
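The inference API also lets you pass task settings at request time. For the Azure OpenAI completion task this is, at the time of writing, an optional `user` setting that is forwarded to Azure to identify the caller; the value below is just a placeholder, and you should check the inference API documentation for your Elasticsearch version for the exact settings supported:

POST _inference/completion/azure_openai_completion
{
  "input": "What is inference in the context of GenAI?",
  "task_settings": {
    "user": "example-user"
  }
}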
Now we can set up a small catalog of questions that we want answered during ingestion. We’ll use the Bulk API to index three questions about Elastic products:
POST _bulk
{ "index" : { "_index" : "questions" } }
{"question": "What is Elasticsearch?"}
{ "index" : { "_index" : "questions" } }
{"question": "What is Kibana?"}
{ "index" : { "_index" : "questions" } }
{"question": "What is Logstash?"}
You’ll get back a response with status code `200 OK` similar to the following upon successful indexing:
{
  "errors": false,
  "took": 385,
  "items": [
    {
      "index": {
        "_index": "questions",
        "_id": "4RO6YY8Bv2OsAP2iNusn",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 0,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "questions",
        "_id": "4hO6YY8Bv2OsAP2iNuso",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 1,
        "_primary_term": 1,
        "status": 201
      }
    },
    {
      "index": {
        "_index": "questions",
        "_id": "4xO6YY8Bv2OsAP2iNuso",
        "_version": 1,
        "result": "created",
        "_shards": {
          "total": 2,
          "successful": 1,
          "failed": 0
        },
        "_seq_no": 2,
        "_primary_term": 1,
        "status": 201
      }
    }
  ]
}
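Optionally, you can run a quick match_all search against the `questions` index to verify that all three documents are searchable:

POST questions/_search
{
  "query": {
    "match_all": { }
  }
}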
Now we’ll create our question-answering ingest pipeline using the script, inference and remove processors:
PUT _ingest/pipeline/question_answering_pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx.prompt = 'Please answer the following question: ' + ctx.question"
      }
    },
    {
      "inference": {
        "model_id": "azure_openai_completion",
        "input_output": {
          "input_field": "prompt",
          "output_field": "answer"
        }
      }
    },
    {
      "remove": {
        "field": "prompt"
      }
    }
  ]
}
This pipeline prefixes the content with the instruction “Please answer the following question: “ in a temporary field named `prompt`. The content of this temporary `prompt` field is then sent to Azure’s OpenAI Service through the inference API to perform a completion. Using an ingest pipeline allows for immense flexibility, as you can change the pre-prompt to anything you would like. This also allows you to summarize documents, for example. Check out Elasticsearch open inference API adds support for OpenAI chat completions to learn how to build a summarization ingest pipeline!
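Before reindexing everything, you can test the pipeline on a single sample document with the simulate pipeline API. Keep in mind that this already performs one real completion call against Azure’s OpenAI Service; the question below is just a made-up example:

POST _ingest/pipeline/question_answering_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "question": "What is Elastic Cloud?"
      }
    }
  ]
}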
We now send our documents containing questions through the question-answering pipeline by calling the reindex API:
POST _reindex
{
  "source": {
    "index": "questions",
    "size": 50
  },
  "dest": {
    "index": "answers",
    "pipeline": "question_answering_pipeline"
  }
}
You'll get back a response with status code `200 OK` similar to the following:
{
  "took": 10651,
  "timed_out": false,
  "total": 3,
  "updated": 0,
  "created": 3,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "failures": []
}
In a real-world setup you’ll probably use another ingestion mechanism to ingest your documents in an automated way. Check out our Adding data to Elasticsearch guide to learn more about the various options offered by Elastic for ingesting data into Elasticsearch. We’re also committed to showcasing ingestion mechanisms and providing guidance on how to bring data into Elasticsearch using third-party tools. For example, take a look at Ingest Data from Snowflake to Elasticsearch using Meltano: A developer’s journey to learn how to use Meltano to ingest data.
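Back in our example, a quick count against the destination index confirms that all three answers were created:

GET answers/_count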
You're now able to search for your pre-generated answers using the Search API:
POST answers/_search
{
  "query": {
    "match_all": { }
  }
}
In the response you'll get back your pre-generated answers:
{
  "took": 11,
  "timed_out": false,
  "_shards": { ... },
  "hits": {
    "total": { ... },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "answers",
        "_id": "4RO6YY8Bv2OsAP2iNusn",
        "_score": 1.0,
        "_ignored": [
          "answer.keyword"
        ],
        "_source": {
          "model_id": "azure_openai_completion",
          "question": "What is Elasticsearch?",
          "answer": "Elasticsearch is an open-source, RESTful, distributed search and analytics engine built on Apache Lucene. It can handle a wide variety of data types, including textual, numerical, geospatial, structured, and unstructured data. Elasticsearch is scalable and designed to operate in real-time, making it an ideal choice for use cases such as application search, log and event data analysis, and anomaly detection."
        }
      },
      { ... },
      { ... }
    ]
  }
}
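Instead of returning everything with `match_all`, you can also query for a specific question. For example, a simple match query on the `question` field returns the pre-generated answer about Kibana:

POST answers/_search
{
  "query": {
    "match": {
      "question": "Kibana"
    }
  }
}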
Pre-generating answers to frequently asked questions is particularly effective at reducing operational costs. By minimizing the need for on-the-fly response generation, you can significantly cut down on the computational resources required, such as token usage. Additionally, this method ensures that every user receives the same precise information. Consistency is crucial, especially in fields requiring high reliability and accuracy, such as medical, legal, or technical support.
More to come!
We’re already working on adding support for more task types using Cohere, Google Vertex AI and many more. Furthermore, we’re actively developing an intuitive UI in Kibana for managing inference endpoints. Lots of exciting stuff to come! Bookmark Elastic Search Labs now to keep up with Elastic’s innovations in the GenAI space!
Ready to try this out on your own? Start a free trial.
Elasticsearch has integrations for tools from LangChain, Cohere and more. Join our advanced semantic search webinar to build your next GenAI app!