Start trained model deployment API
Starts a new trained model deployment.
This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.
Request
POST _ml/trained_models/<model_id>/deployment/_start
Prerequisites
Requires the manage_ml cluster privilege. This privilege is included in the machine_learning_admin built-in role.
Description
Currently, only PyTorch models are supported for deployment. When deployed, the model attempts allocation to every machine learning node. Once deployed, the model can be used by the inference processor in an ingest pipeline or directly in the infer trained model API.
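For example, once a deployment has started, the model can be referenced from an inference processor in an ingest pipeline. The following is a minimal sketch; the pipeline name ner-pipeline is illustrative and not part of this API:
PUT _ingest/pipeline/ner-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english"
      }
    }
  ]
}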
Path parameters
<model_id> - (Required, string) The unique identifier of the trained model.
Query parameters
number_of_allocations - (Optional, integer) The number of model allocations on each node where the model is deployed. All allocations on a node share the same copy of the model in memory but use a separate set of threads to evaluate the model. Increasing this value generally increases the throughput. If this setting is greater than the number of hardware threads, it is automatically changed to a value less than the number of hardware threads. If the sum of threads_per_allocation and number_of_allocations is greater than the number of hardware threads, the threads_per_allocation value is reduced. Defaults to 1. See the example after this list.
queue_capacity - (Optional, integer) Controls how many inference requests are allowed in the queue at a time. Every machine learning node in the cluster where the model can be allocated has a queue of this size; when the number of requests exceeds the total value, new requests are rejected with a 429 error. Defaults to 1024.
threads_per_allocation - (Optional, integer) Sets the number of threads used by each model allocation during inference. Increasing this value generally increases the inference speed. The inference process is compute-bound, so any number greater than the number of available hardware threads on the machine does not increase the inference speed. If this setting is greater than the number of hardware threads, it is automatically changed to a value less than the number of hardware threads. Must be a power of 2. The maximum allowed value is 32. Defaults to 1.
timeout - (Optional, time) Controls the amount of time to wait for the model to deploy. Defaults to 20 seconds.
wait_for - (Optional, string) Specifies the allocation status to wait for before returning. Defaults to started. The value starting indicates deployment is starting but is not yet on any node. The value started indicates the model has started on at least one node. The value fully_allocated indicates the deployment has started on all valid nodes.
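As a sketch of how these parameters combine, the following request starts a deployment with two allocations of two threads each and a smaller queue; the parameter values are illustrative, not recommendations:
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?number_of_allocations=2&threads_per_allocation=2&queue_capacity=512
On a node with fewer than four hardware threads, threads_per_allocation is reduced as described above.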
Examples
The following example starts a new deployment for the elastic__distilbert-base-uncased-finetuned-conll03-english trained model:
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?wait_for=started&timeout=1m
The API returns the following results:
{
"assignment": {
"task_parameters": {
"model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english",
"model_bytes": 265632637,
"threads_per_allocation" : 1,
"number_of_allocations" : 1,
"queue_capacity" : 1024
},
"routing_table": {
"uckeG3R8TLe2MMNBQ6AGrw": {
"routing_state": "started",
"reason": ""
}
},
"assignment_state": "started",
"start_time": "2022-11-02T11:50:34.766591Z"
}
}
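Once the deployment has started, the model can be called through the infer trained model API. The following is a hedged sketch, assuming a version where inference is exposed at _ml/trained_models/<model_id>/_infer; the input text is illustrative:
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/_infer
{
  "docs": [
    {
      "text_field": "Sasha lives in Berlin"
    }
  ]
}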