Create inference API

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
Creates an inference endpoint to perform an inference task.
The inference APIs enable you to use certain services, such as built-in machine learning models (ELSER, E5), models uploaded through Eland, Cohere, OpenAI, or Hugging Face. For built-in models and models uploaded through Eland, the inference APIs offer an alternative way to use and manage trained models. However, if you do not plan to use the inference APIs to use these models or if you want to use non-NLP models, use the Machine learning trained model APIs.
Request
PUT /_inference/<task_type>/<inference_id>
Prerequisites
- Requires the manage cluster privilege.
Description
The create inference API enables you to create an inference endpoint and configure a machine learning model to perform a specific inference task.
The following services are available through the inference API:
- Cohere
- ELSER
- Hugging Face
- OpenAI
- Elasticsearch (for built-in models and models uploaded through Eland)
Path parameters

- <inference_id>: (Required, string) The unique identifier of the inference endpoint.
- <task_type>: (Required, string) The type of the inference task that the model will perform. Available task types:
  - sparse_embedding
  - text_embedding
Request body

- service: (Required, string) The type of service supported for the specified task type. Available services:
  - cohere: specify the text_embedding task type to use the Cohere service.
  - elser: specify the sparse_embedding task type to use the ELSER service.
  - hugging_face: specify the text_embedding task type to use the Hugging Face service.
  - openai: specify the text_embedding task type to use the OpenAI service.
  - elasticsearch: specify the text_embedding task type to use the E5 built-in model or text embedding models uploaded by Eland.
- service_settings: (Required, object) Settings used to install the inference model. These settings are specific to the service you specified.

service_settings for the cohere service:
- api_key: (Required, string) A valid API key for your Cohere account. You can find your Cohere API keys, or create a new one, on the API keys settings page.

  You need to provide the API key only once, during inference model creation. The Get inference API does not retrieve your API key. After creating the inference model, you cannot change the associated API key. If you want to use a different API key, delete the inference model and recreate it with the same name and the updated API key.
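Because the key cannot be changed in place, rotating it amounts to a delete-and-recreate under the same endpoint name. A minimal sketch with the Python client used in the examples below (the helper name is hypothetical, and client is assumed to be a connected Elasticsearch Python client):

```python
# Hypothetical helper: swap the API key of an existing inference endpoint by
# deleting it and recreating it under the same inference_id with the new key.
def rotate_inference_api_key(client, task_type, inference_id, body_with_new_key):
    # The Get inference API never returns the stored key, and the key cannot
    # be updated in place, so delete-and-recreate is the only option.
    client.inference.delete_model(task_type=task_type, inference_id=inference_id)
    client.inference.put_model(
        task_type=task_type,
        inference_id=inference_id,
        body=body_with_new_key,
    )
```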
- embedding_type: (Optional, string) Specifies the type of embeddings you want to get back. Defaults to float. Valid values are:
  - byte: use it for signed int8 embeddings (this is a synonym of int8).
  - float: use it for the default float embeddings.
  - int8: use it for signed int8 embeddings.
- model_id: (Optional, string) The name of the model to use for the inference task. To review the available models, refer to the Cohere docs. Defaults to embed-english-v2.0.
service_settings for the elser service:

- num_allocations: (Required, integer) The number of model allocations to create. num_allocations must not exceed the number of available processors per node divided by num_threads.
- num_threads: (Required, integer) The number of threads to use by each model allocation. num_threads must not exceed the number of available processors per node divided by the number of allocations. Must be a power of 2. The maximum allowed value is 32.
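The allocation and thread constraints above can be expressed as a small validity check (an illustrative helper, not part of the API; the processor count is whatever your nodes provide):

```python
def validate_allocation_settings(num_allocations: int, num_threads: int,
                                 processors_per_node: int) -> None:
    """Raise ValueError if the documented constraints are violated."""
    # num_threads must be a power of 2, with a maximum allowed value of 32.
    if not 1 <= num_threads <= 32 or num_threads & (num_threads - 1):
        raise ValueError("num_threads must be a power of 2, at most 32")
    # num_allocations must not exceed available processors / num_threads.
    if num_allocations > processors_per_node // num_threads:
        raise ValueError("num_allocations exceeds processors per node / num_threads")

# On a node with 8 processors, num_threads=2 allows at most 4 allocations.
validate_allocation_settings(num_allocations=4, num_threads=2, processors_per_node=8)
```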
service_settings for the hugging_face service:

- api_key: (Required, string) A valid access token for your Hugging Face account. You can find your Hugging Face access tokens, or create a new one, on the settings page.

  You need to provide the API key only once, during inference model creation. The Get inference API does not retrieve your API key. After creating the inference model, you cannot change the associated API key. If you want to use a different API key, delete the inference model and recreate it with the same name and the updated API key.
- url: (Required, string) The URL endpoint to use for the requests.
service_settings for the openai service:

- api_key: (Required, string) A valid API key for your OpenAI account. You can find your OpenAI API keys in your OpenAI account under the API keys section.

  You need to provide the API key only once, during inference model creation. The Get inference API does not retrieve your API key. After creating the inference model, you cannot change the associated API key. If you want to use a different API key, delete the inference model and recreate it with the same name and the updated API key.
- dimensions: (Optional, integer) The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3 and later models. If not set, the OpenAI-defined default for the model is used.
- model_id: (Required, string) The name of the model to use for the inference task. Refer to the OpenAI documentation for the list of available text embedding models.
- organization_id: (Optional, string) The unique identifier of your organization. You can find the Organization ID in your OpenAI account under Settings > Organizations.
- url: (Optional, string) The URL endpoint to use for the requests. Can be changed for testing purposes. Defaults to https://api.openai.com/v1/embeddings.
service_settings for the elasticsearch service:

- model_id: (Required, string) The name of the model to use for the inference task. It can be the ID of either a built-in model (for example, .multilingual-e5-small for E5) or a text embedding model already uploaded through Eland.
- num_allocations: (Required, integer) The number of model allocations to create. num_allocations must not exceed the number of available processors per node divided by num_threads.
- num_threads: (Required, integer) The number of threads to use by each model allocation. num_threads must not exceed the number of available processors per node divided by the number of allocations. Must be a power of 2. The maximum allowed value is 32.
- task_settings: (Optional, object) Settings to configure the inference task. These settings are specific to the <task_type> you specified.

task_settings for the text_embedding task type:

- input_type: (Optional, string) For the cohere service only. Specifies the type of input passed to the model. Valid values are:
  - classification: use it for embeddings passed through a text classifier.
  - clustering: use it for embeddings run through a clustering algorithm.
  - ingest: use it for storing document embeddings in a vector database.
  - search: use it for storing embeddings of search queries run against a vector database to find relevant documents.
- truncate: (Optional, string) For the cohere service only. Specifies how the API handles inputs longer than the maximum token length. Defaults to END. Valid values are:
  - NONE: when the input exceeds the maximum input token length, an error is returned.
  - START: when the input exceeds the maximum input token length, the start of the input is discarded.
  - END: when the input exceeds the maximum input token length, the end of the input is discarded.
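As a sketch, the Cohere-specific task settings above would be supplied alongside the service settings in the request body (endpoint name, API key, and model are placeholders taken from the Cohere example below; the commented-out call requires a connected client):

```python
# Request body combining service_settings with the task_settings options
# documented above (placeholder: <api_key>).
body = {
    "service": "cohere",
    "service_settings": {
        "api_key": "<api_key>",
        "model_id": "embed-english-light-v3.0",
    },
    "task_settings": {
        "input_type": "ingest",  # storing document embeddings in a vector database
        "truncate": "END",       # discard the end of over-long inputs (the default)
    },
}
# With a connected client:
# resp = client.inference.put_model(
#     task_type="text_embedding",
#     inference_id="cohere-embeddings",
#     body=body,
# )
```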
-
Examples

This section contains example API calls for every service type.
Cohere service

The following example shows how to create an inference endpoint called cohere-embeddings to perform a text_embedding task type.
resp = client.inference.put_model(
task_type="text_embedding",
inference_id="cohere-embeddings",
body={
"service": "cohere",
"service_settings": {
"api_key": "<api_key>",
"model_id": "embed-english-light-v3.0",
"embedding_type": "byte",
},
},
)
print(resp)
PUT _inference/text_embedding/cohere-embeddings
{
"service": "cohere",
"service_settings": {
"api_key": "<api_key>",
"model_id": "embed-english-light-v3.0",
"embedding_type": "byte"
}
}
E5 via the elasticsearch service

The following example shows how to create an inference endpoint called my-e5-model to perform a text_embedding task type.
resp = client.inference.put_model(
task_type="text_embedding",
inference_id="my-e5-model",
body={
"service": "elasticsearch",
"service_settings": {
"num_allocations": 1,
"num_threads": 1,
"model_id": ".multilingual-e5-small",
},
},
)
print(resp)
PUT _inference/text_embedding/my-e5-model
{
"service": "elasticsearch",
"service_settings": {
"num_allocations": 1,
"num_threads": 1,
"model_id": ".multilingual-e5-small"
}
}
ELSER service

The following example shows how to create an inference endpoint called my-elser-model to perform a sparse_embedding task type.
resp = client.inference.put_model(
task_type="sparse_embedding",
inference_id="my-elser-model",
body={
"service": "elser",
"service_settings": {"num_allocations": 1, "num_threads": 1},
},
)
print(resp)
PUT _inference/sparse_embedding/my-elser-model
{
"service": "elser",
"service_settings": {
"num_allocations": 1,
"num_threads": 1
}
}
Example response:
{
"inference_id": "my-elser-model",
"task_type": "sparse_embedding",
"service": "elser",
"service_settings": {
"num_allocations": 1,
"num_threads": 1
},
"task_settings": {}
}
Hugging Face service

The following example shows how to create an inference endpoint called hugging-face-embeddings to perform a text_embedding task type.
resp = client.inference.put_model(
task_type="text_embedding",
inference_id="hugging-face-embeddings",
body={
"service": "hugging_face",
"service_settings": {
"api_key": "<access_token>",
"url": "<url_endpoint>",
},
},
)
print(resp)
PUT _inference/text_embedding/hugging-face-embeddings
{
"service": "hugging_face",
"service_settings": {
"api_key": "<access_token>",
"url": "<url_endpoint>"
}
}
The api_key is a valid Hugging Face access token; you can find it on the settings page of your account. The url is the inference endpoint URL you created on Hugging Face.
Create a new inference endpoint on the Hugging Face endpoint page to get an endpoint URL. Select the model you want to use on the new endpoint creation page - for example intfloat/e5-small-v2 - then select the Sentence Embeddings task under the Advanced configuration section. Create the endpoint and copy the URL after the endpoint initialization has finished.
The list of recommended models for the Hugging Face service:
Models uploaded by Eland via the elasticsearch service

The following example shows how to create an inference endpoint called my-msmarco-minilm-model to perform a text_embedding task type.
resp = client.inference.put_model(
task_type="text_embedding",
inference_id="my-msmarco-minilm-model",
body={
"service": "elasticsearch",
"service_settings": {
"num_allocations": 1,
"num_threads": 1,
"model_id": "msmarco-MiniLM-L12-cos-v5",
},
},
)
print(resp)
PUT _inference/text_embedding/my-msmarco-minilm-model
{
"service": "elasticsearch",
"service_settings": {
"num_allocations": 1,
"num_threads": 1,
"model_id": "msmarco-MiniLM-L12-cos-v5"
}
}
OpenAI service

The following example shows how to create an inference endpoint called openai-embeddings to perform a text_embedding task type.
The embeddings created by requests to this endpoint will have 128 dimensions.
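For parity with the other services, the same request through the Python client used in the earlier examples might look as follows (a sketch mirroring those examples; the put_model call needs a connected client, so it is shown commented out):

```python
# Request body for an OpenAI text_embedding endpoint with 128-dimensional
# output (placeholder: <api_key>).
body = {
    "service": "openai",
    "service_settings": {
        "api_key": "<api_key>",
        "model_id": "text-embedding-3-small",
        "dimensions": 128,
    },
}
# With a connected client:
# resp = client.inference.put_model(
#     task_type="text_embedding",
#     inference_id="openai-embeddings",
#     body=body,
# )
```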
PUT _inference/text_embedding/openai-embeddings
{
"service": "openai",
"service_settings": {
"api_key": "<api_key>",
"model_id": "text-embedding-3-small",
"dimensions": 128
}
}