
Ingest data with semantic_text fields

Applies to: Serverless, Stack 9.0.0

This page provides instructions for ingesting data into semantic_text fields. Learn how to index pre-chunked content, use copy_to and multi-fields to collect values from multiple fields, and perform updates and partial updates to optimize ingestion costs.

Index pre-chunked content

Applies to: Stack 9.1.0

To index pre-chunked content, provide your text as an array of strings. Each element in the array represents a single chunk that will be sent directly to the inference service without further chunking.

  1. Disable automatic chunking

    Disable automatic chunking in your index mapping by setting chunking_settings.strategy to none:

    PUT test-index
    {
      "mappings": {
        "properties": {
          "my_semantic_field": {
            "type": "semantic_text",
            "chunking_settings": {
              "strategy": "none"
            }
          }
        }
      }
    }

    This setting disables automatic chunking on my_semantic_field.
  2. Index documents

    Index documents with pre-chunked text as an array:

    PUT test-index/_doc/1
    {
        "my_semantic_field": ["my first chunk", "my second chunk", ...]
        ...
    }

    The text is pre-chunked and provided as an array of strings. Each element represents a single chunk.
Important

When providing pre-chunked input:

  • Ensure that you set the chunking strategy to none to avoid additional processing.
  • Size each chunk carefully, staying within the token limit of the inference service and the underlying model.
  • If a chunk exceeds the model's token limit, the behavior depends on the service:
    • Some services (such as OpenAI) will return an error.
    • Others (such as elastic and elasticsearch) will automatically truncate the input.

Collect values from multiple fields

You can use a single semantic_text field to collect values from multiple fields for semantic search. The semantic_text field type can serve as the target of copy_to fields, be part of a multi-field structure, or contain multi-fields internally.

Use copy_to to copy values from source fields to a semantic_text field:

PUT test-index
{
    "mappings": {
        "properties": {
            "source_field": {
                "type": "text",
                "copy_to": "infer_field"
            },
            "infer_field": {
                "type": "semantic_text",
                "inference_id": ".elser-2-elasticsearch"
            }
        }
    }
}
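
Once values are copied in, the combined field can be searched like any other semantic_text field. Here is a minimal sketch of a semantic query against infer_field (the query string is a placeholder):

GET test-index/_search
{
    "query": {
        "semantic": {
            "field": "infer_field",
            "query": "your search text"
        }
    }
}

Note that the search targets infer_field directly; source_field remains a regular text field available for lexical queries.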

Declare semantic_text as a multi-field:

PUT test-index
{
    "mappings": {
        "properties": {
            "source_field": {
                "type": "text",
                "fields": {
                    "infer_field": {
                        "type": "semantic_text",
                        "inference_id": ".elser-2-elasticsearch"
                    }
                }
            }
        }
    }
}
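
In the multi-field layout, the semantic_text field is addressed through its parent field, so the same semantic query sketch targets source_field.infer_field instead:

GET test-index/_search
{
    "query": {
        "semantic": {
            "field": "source_field.infer_field",
            "query": "your search text"
        }
    }
}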

Updates and partial updates

When updating documents that contain semantic_text fields, it's important to understand how inference is triggered:

Full document updates
Full document updates re-run inference on all semantic_text fields, even if their values did not change. This ensures that embeddings remain consistent with the current document state but can increase ingestion costs.
Partial updates using the Bulk API
Partial updates submitted through the Bulk API reuse existing embeddings when you omit semantic_text fields. Inference does not run for omitted fields, which can significantly reduce processing time and cost.
Partial updates using the Update API
Partial updates submitted through the Update API re-run inference on all semantic_text fields, even when you omit them from the doc object. Embeddings are re-generated regardless of whether field values changed.
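
For illustration, a full document update is simply a standard index request that replaces the whole document. In this sketch, which reuses the earlier test-index example (title is a hypothetical extra field), inference re-runs on my_semantic_field even if its text is identical to the stored value:

PUT test-index/_doc/1
{
    "my_semantic_field": "my updated text",
    "title": "my updated title"
}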

To preserve existing embeddings and avoid unnecessary inference costs:

  • Use partial updates with the Bulk API (a sketch follows this list).
  • Omit any semantic_text fields that did not change from the doc object in your request.
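
For example, here is a minimal sketch of a Bulk API partial update that touches only a hypothetical title field; because my_semantic_field is omitted from the doc object, its existing embeddings are reused and no inference runs:

POST _bulk
{ "update": { "_index": "test-index", "_id": "1" } }
{ "doc": { "title": "my updated title" } }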

For indices containing semantic_text fields, updates that use scripts have the following behavior:

  • Supported: Update API (see the sketch below).
  • Not supported: Bulk API. Scripted updates will fail even if the script targets non-semantic_text fields.
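
A minimal sketch of a supported scripted update through the Update API, again using a hypothetical title field:

POST test-index/_update/1
{
    "script": {
        "source": "ctx._source.title = params.new_title",
        "params": {
            "new_title": "my updated title"
        }
    }
}

Keep in mind that, as described above, requests through the Update API re-run inference on the document's semantic_text fields.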