Request Body Search
Specifies search criteria as request body parameters.
GET /twitter/_search
{
  "query": {
    "term": { "user": "kimchy" }
  }
}
Request
GET /<index>/_search
{
  "query": {<parameters>}
}
Description
The search request can be executed with a search DSL, which includes the Query DSL, within its body.
Path parameters
<index>
(Optional, string) Comma-separated list or wildcard expression of index names used to limit the request.
Request body
See the search API's request body parameters.
Fast check for any matching docs
terminate_after is always applied after the post_filter and stops the query as well as the aggregation executions when enough hits have been collected on the shard. Note that the document counts on aggregations may not reflect hits.total in the response, since aggregations are applied before the post filtering.
If you only want to know whether any documents match a specific query, you can set size to 0 to indicate that you are not interested in the search results, and set terminate_after to 1 to indicate that the query execution can be terminated as soon as the first matching document is found (per shard).
GET /_search?q=message:number&size=0&terminate_after=1
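The same check can also be expressed with request body parameters instead of query string parameters. A roughly equivalent sketch (using a match query in place of the q query string parameter):
GET /_search
{
  "size": 0,
  "terminate_after": 1,
  "query": {
    "match": { "message": "number" }
  }
}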
The response will not contain any hits, as size was set to 0. hits.total will be either equal to 0, indicating that there were no matching documents, or greater than 0, meaning that there were at least as many documents matching the query when it was early terminated. Also, if the query was terminated early, the terminated_early flag will be set to true in the response.
{ "took": 3, "timed_out": false, "terminated_early": true, "_shards": { "total": 1, "successful": 1, "skipped" : 0, "failed": 0 }, "hits": { "total" : { "value": 1, "relation": "eq" }, "max_score": null, "hits": [] } }
The took time in the response contains the milliseconds that this request took for processing, beginning quickly after the node received the query, up until all search-related work is done and before the above JSON is returned to the client. This means it includes the time spent waiting in thread pools, executing a distributed search across the whole cluster and gathering all the results.
Doc value fields
See doc value fields.
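For instance, doc value fields can be requested alongside a query. A minimal sketch against the twitter index used elsewhere on this page (assuming user and likes are keyword/numeric fields with doc values enabled, which is the default):
GET /twitter/_search
{
  "query": { "match_all": {} },
  "docvalue_fields": ["user", "likes"]
}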
Field Collapsing
You can collapse search results based on field values. The collapsing is done by selecting only the top sorted document per collapse key. For instance, the query below retrieves the best tweet for each user and sorts them by number of likes.
GET /twitter/_search
{
  "query": {
    "match": { "message": "elasticsearch" }
  },
  "collapse": { "field": "user" },
  "sort": ["likes"],
  "from": 10
}
In this request:
- collapse groups the result set using the "user" field
- sort orders the top docs by number of likes
- from defines the offset of the first collapsed result
The total number of hits in the response indicates the number of matching documents without collapsing. The total number of distinct groups is unknown.
The field used for collapsing must be a single-valued keyword or numeric field with doc_values activated.
The collapsing is applied to the top hits only and does not affect aggregations.
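For reference, a partial mapping sketch that satisfies these requirements for the examples above might look like the following (doc_values is enabled by default for keyword and numeric fields; the other fields of the twitter index are omitted):
PUT /twitter
{
  "mappings": {
    "properties": {
      "user":  { "type": "keyword" },
      "likes": { "type": "integer" }
    }
  }
}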
Expand collapse results
It is also possible to expand each collapsed top hit with the inner_hits option.
GET /twitter/_search
{
  "query": {
    "match": { "message": "elasticsearch" }
  },
  "collapse": {
    "field": "user",
    "inner_hits": {
      "name": "last_tweets",
      "size": 5,
      "sort": [{ "date": "asc" }]
    },
    "max_concurrent_group_searches": 4
  },
  "sort": ["likes"]
}
In this request:
- collapse groups the result set using the "user" field
- name is the name used for the inner hit section in the response
- size is the number of inner_hits to retrieve per collapse key
- sort defines how to sort the documents inside each group
- max_concurrent_group_searches is the number of concurrent requests allowed to retrieve the inner_hits per group
See inner hits for the complete list of supported options and the format of the response.
It is also possible to request multiple inner_hits for each collapsed hit. This can be useful when you want to get multiple representations of the collapsed hits.
GET /twitter/_search
{
  "query": {
    "match": { "message": "elasticsearch" }
  },
  "collapse": {
    "field": "user",
    "inner_hits": [
      { "name": "most_liked", "size": 3, "sort": ["likes"] },
      { "name": "most_recent", "size": 3, "sort": [{ "date": "asc" }] }
    ]
  },
  "sort": ["likes"]
}
In this request:
- collapse groups the result set using the "user" field
- most_liked returns the three most liked tweets for the user
- most_recent returns the three most recent tweets for the user
The expansion of the group is done by sending an additional query for each inner_hit request for each collapsed hit returned in the response. This can significantly slow things down if you have too many groups and/or inner_hit requests.
The max_concurrent_group_searches request parameter can be used to control the maximum number of concurrent searches allowed in this phase. The default is based on the number of data nodes and the default search thread pool size.
collapse cannot be used in conjunction with scroll, rescore or search after.
Second level of collapsing
A second level of collapsing is also supported and is applied to inner_hits. For example, the following request finds the top scored tweets for each country, and within each country finds the top scored tweets for each user.
GET /twitter/_search
{
  "query": {
    "match": { "message": "elasticsearch" }
  },
  "collapse": {
    "field": "country",
    "inner_hits": {
      "name": "by_location",
      "collapse": { "field": "user" },
      "size": 3
    }
  }
}
Response:
{
  ...
  "hits": [
    {
      "_index": "twitter",
      "_type": "_doc",
      "_id": "9",
      "_score": ...,
      "_source": {...},
      "fields": { "country": ["UK"] },
      "inner_hits": {
        "by_location": {
          "hits": {
            ...,
            "hits": [
              { ... "fields": { "user": ["user124"] } },
              { ... "fields": { "user": ["user589"] } },
              { ... "fields": { "user": ["user001"] } }
            ]
          }
        }
      }
    },
    {
      "_index": "twitter",
      "_type": "_doc",
      "_id": "1",
      "_score": ...,
      "_source": {...},
      "fields": { "country": ["Canada"] },
      "inner_hits": {
        "by_location": {
          "hits": {
            ...,
            "hits": [
              { ... "fields": { "user": ["user444"] } },
              { ... "fields": { "user": ["user1111"] } },
              { ... "fields": { "user": ["user999"] } }
            ]
          }
        }
      }
    },
    ...
  ]
}
The second level of collapsing doesn't allow inner_hits.
Highlighting
Highlighters enable you to get highlighted snippets from one or more fields in your search results so you can show users where the query matches are. When you request highlights, the response contains an additional highlight element for each search hit that includes the highlighted fields and the highlighted fragments.
Highlighters don't reflect the boolean logic of a query when extracting terms to highlight. Thus, for some complex boolean queries (e.g. nested boolean queries, queries using minimum_should_match etc.), parts of documents may be highlighted that don't correspond to query matches.
Highlighting requires the actual content of a field. If the field is not stored (the mapping does not set store to true), the actual _source is loaded and the relevant field is extracted from _source.
For example, to get highlights for the content field in each search hit using the default highlighter, include a highlight object in the request body that specifies the content field:
GET /_search
{
  "query": {
    "match": { "content": "kimchy" }
  },
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}
Elasticsearch supports three highlighters: unified, plain, and fvh (fast vector highlighter). You can specify the highlighter type you want to use for each field.
Unified highlighter
The unified highlighter uses the Lucene Unified Highlighter. This highlighter breaks the text into sentences and uses the BM25 algorithm to score individual sentences as if they were documents in the corpus. It also supports accurate phrase and multi-term (fuzzy, prefix, regex) highlighting. This is the default highlighter.
Plain highlighter
The plain highlighter uses the standard Lucene highlighter. It attempts to reflect the query matching logic in terms of understanding word importance and any word positioning criteria in phrase queries.
The plain highlighter works best for highlighting simple query matches in a single field. To accurately reflect query logic, it creates a tiny in-memory index and re-runs the original query criteria through Lucene's query execution planner to get access to low-level match information for the current document. This is repeated for every field and every document that needs to be highlighted.
If you want to highlight a lot of fields in a lot of documents with complex queries, we recommend using the unified highlighter on postings or term_vector fields.
Fast vector highlighter
The fvh highlighter uses the Lucene Fast Vector highlighter. This highlighter can be used on fields with term_vector set to with_positions_offsets in the mapping. The fast vector highlighter:
- Can be customized with a boundary_scanner.
- Requires setting term_vector to with_positions_offsets, which increases the size of the index.
- Can combine matches from multiple fields into one result. See matched_fields.
- Can assign different weights to matches at different positions, allowing for things like phrase matches being sorted above term matches when highlighting a Boosting Query that boosts phrase matches over term matches.
The fvh highlighter does not support span queries. If you need support for span queries, try an alternative highlighter, such as the unified highlighter.
Offsets Strategy
To create meaningful search snippets from the terms being queried, the highlighter needs to know the start and end character offsets of each word in the original text. These offsets can be obtained from:
- The postings list. If index_options is set to offsets in the mapping, the unified highlighter uses this information to highlight documents without re-analyzing the text. It re-runs the original query directly on the postings and extracts the matching offsets from the index, limiting the collection to the highlighted documents. This is important if you have large fields because it doesn't require reanalyzing the text to be highlighted. It also requires less disk space than using term_vectors.
- Term vectors. If term_vector information is provided by setting term_vector to with_positions_offsets in the mapping, the unified highlighter automatically uses the term_vector to highlight the field. It's fast especially for large fields (> 1MB) and for highlighting multi-term queries like prefix or wildcard, because it can access the dictionary of terms for each document. The fvh highlighter always uses term vectors.
- Plain highlighting. This mode is used by the unified highlighter when there is no other alternative. It creates a tiny in-memory index and re-runs the original query criteria through Lucene's query execution planner to get access to low-level match information on the current document. This is repeated for every field and every document that needs highlighting. The plain highlighter always uses plain highlighting.
Plain highlighting for large texts may require a substantial amount of time and memory. To protect against this, the maximum number of text characters that will be analyzed is limited to 1000000 by default. This default limit can be changed for a particular index with the index setting index.highlight.max_analyzed_offset.
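For example, the limit might be raised on an index with the update index settings API; a minimal sketch (the example index name is hypothetical):
PUT /example/_settings
{
  "index.highlight.max_analyzed_offset": 2000000
}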
Highlighting Settings
Highlighting settings can be set on a global level and overridden at the field level.

boundary_chars
A string that contains each boundary character. Defaults to .,!? \t\n.

boundary_max_scan
How far to scan for boundary characters. Defaults to 20.

boundary_scanner
Specifies how to break the highlighted fragments: chars, sentence, or word. Only valid for the unified and fvh highlighters. Defaults to sentence for the unified highlighter. Defaults to chars for the fvh highlighter.
- chars: Use the characters specified by boundary_chars as highlighting boundaries. The boundary_max_scan setting controls how far to scan for boundary characters. Only valid for the fvh highlighter.
- sentence: Break highlighted fragments at the next sentence boundary, as determined by Java's BreakIterator. You can specify the locale to use with boundary_scanner_locale. When used with the unified highlighter, the sentence scanner splits sentences bigger than fragment_size at the first word boundary next to fragment_size. You can set fragment_size to 0 to never split any sentence.
- word: Break highlighted fragments at the next word boundary, as determined by Java's BreakIterator. You can specify the locale to use with boundary_scanner_locale.

boundary_scanner_locale
Controls which locale is used to search for sentence and word boundaries. This parameter takes the form of a language tag, e.g. "en-US", "fr-FR", "ja-JP". More info can be found in the Locale Language Tag documentation. The default value is Locale.ROOT.

encoder
Indicates if the snippet should be HTML encoded: default (no encoding) or html (HTML-escape the snippet text and then insert the highlighting tags).

fields
Specifies the fields to retrieve highlights for. You can use wildcards to specify fields. For example, you could specify comment_* to get highlights for all text and keyword fields that start with comment_ (see the example after this list). Only text and keyword fields are highlighted when you use wildcards. If you use a custom mapper and want to highlight on a field anyway, you must explicitly specify that field name.

force_source
Highlight based on the source even if the field is stored separately. Defaults to false.

fragmenter
Specifies how text should be broken up in highlight snippets: simple or span. Only valid for the plain highlighter. Defaults to span.
- simple: Breaks up text into same-sized fragments.
- span: Breaks up text into same-sized fragments, but tries to avoid breaking up text between highlighted terms. This is helpful when you're querying for phrases. Default.

fragment_offset
Controls the margin from which you want to start highlighting. Only valid when using the fvh highlighter.

fragment_size
The size of the highlighted fragment in characters. Defaults to 100.

highlight_query
Highlight matches for a query other than the search query. This is especially useful if you use a rescore query, because those are not taken into account by highlighting by default. Elasticsearch does not validate that highlight_query contains the search query in any way, so it is possible to define it so legitimate query results are not highlighted. Generally, you should include the search query as part of the highlight_query.

matched_fields
Combine matches on multiple fields to highlight a single field. This is most intuitive for multifields that analyze the same string in different ways. All matched_fields must have term_vector set to with_positions_offsets, but only the field to which the matches are combined is loaded, so only that field benefits from having store set to yes. Only valid for the fvh highlighter.

no_match_size
The amount of text you want to return from the beginning of the field if there are no matching fragments to highlight. Defaults to 0 (nothing is returned).

number_of_fragments
The maximum number of fragments to return. If the number of fragments is set to 0, no fragments are returned. Instead, the entire field contents are highlighted and returned. This can be handy when you need to highlight short texts such as a title or address, but fragmentation is not required. If number_of_fragments is 0, fragment_size is ignored. Defaults to 5.

order
Sorts highlighted fragments by score when set to score. By default, fragments will be output in the order they appear in the field (order: none). Setting this option to score will output the most relevant fragments first. Each highlighter applies its own logic to compute relevancy scores. See How highlighters work internally for more details on how different highlighters find the best fragments.

phrase_limit
Controls the number of matching phrases in a document that are considered. Prevents the fvh highlighter from analyzing too many phrases and consuming too much memory. When using matched_fields, phrase_limit phrases per matched field are considered. Raising the limit increases query time and consumes more memory. Only supported by the fvh highlighter. Defaults to 256.

pre_tags
Use in conjunction with post_tags to define the HTML tags to use for the highlighted text. By default, highlighted text is wrapped in <em> and </em> tags. Specify as an array of strings.

post_tags
Use in conjunction with pre_tags to define the HTML tags to use for the highlighted text. By default, highlighted text is wrapped in <em> and </em> tags. Specify as an array of strings.

require_field_match
By default, only fields that contain a query match are highlighted. Set require_field_match to false to highlight all fields. Defaults to true.

tags_schema
Set to styled to use the built-in tag schema. The styled schema defines post_tags as </em> and the following pre_tags:
<em class="hlt1">, <em class="hlt2">, <em class="hlt3">, <em class="hlt4">, <em class="hlt5">, <em class="hlt6">, <em class="hlt7">, <em class="hlt8">, <em class="hlt9">, <em class="hlt10">
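As mentioned in the fields entry above, wildcards can be used to select the fields to highlight. A small sketch (the comment_* field names are hypothetical):
GET /_search
{
  "query": {
    "match": { "comment_body": "kimchy" }
  },
  "highlight": {
    "fields": {
      "comment_*": {}
    }
  }
}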
Highlighting Examples
- Override global settings
- Specify a highlight query
- Set highlighter type
- Configure highlighting tags
- Highlight source
- Highlight all fields
- Combine matches on multiple fields
- Explicitly order highlighted fields
- Control highlighted fragments
- Highlight using the postings list
- Specify a fragmenter for the plain highlighter
Override global settings
You can specify highlighter settings globally and selectively override them for individual fields.
GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "number_of_fragments": 3,
    "fragment_size": 150,
    "fields": {
      "body": { "pre_tags": ["<em>"], "post_tags": ["</em>"] },
      "blog.title": { "number_of_fragments": 0 },
      "blog.author": { "number_of_fragments": 0 },
      "blog.comment": { "number_of_fragments": 5, "order": "score" }
    }
  }
}
Specify a highlight query
You can specify a highlight_query to take additional information into account when highlighting. For example, the following query includes both the search query and rescore query in the highlight_query. Without the highlight_query, highlighting would only take the search query into account.
GET /_search
{
  "query": {
    "match": {
      "comment": { "query": "foo bar" }
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "comment": { "query": "foo bar", "slop": 1 }
        }
      },
      "rescore_query_weight": 10
    }
  },
  "_source": false,
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "fragment_size": 150,
        "number_of_fragments": 3,
        "highlight_query": {
          "bool": {
            "must": {
              "match": {
                "comment": { "query": "foo bar" }
              }
            },
            "should": {
              "match_phrase": {
                "comment": { "query": "foo bar", "slop": 1, "boost": 10.0 }
              }
            },
            "minimum_should_match": 0
          }
        }
      }
    }
  }
}
Set highlighter type
The type field allows you to force a specific highlighter type. The allowed values are: unified, plain and fvh.
The following is an example that forces the use of the plain highlighter:
GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "fields": {
      "comment": { "type": "plain" }
    }
  }
}
Configure highlighting tags
By default, the highlighting will wrap highlighted text in <em> and </em>. This can be controlled by setting pre_tags and post_tags, for example:
GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "pre_tags": ["<tag1>"],
    "post_tags": ["</tag1>"],
    "fields": {
      "body": {}
    }
  }
}
When using the fast vector highlighter, you can specify additional tags, and the "importance" is ordered.
GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "pre_tags": ["<tag1>", "<tag2>"],
    "post_tags": ["</tag1>", "</tag2>"],
    "fields": {
      "body": {}
    }
  }
}
You can also use the built-in styled tag schema:
GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "tags_schema": "styled",
    "fields": {
      "comment": {}
    }
  }
}
Highlight on source
Forces the highlighting to highlight fields based on the source even if fields are stored separately. Defaults to false.
GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "fields": {
      "comment": { "force_source": true }
    }
  }
}
Highlight in all fields
By default, only fields that contain a query match are highlighted. Set require_field_match to false to highlight all fields.
GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "require_field_match": false,
    "fields": {
      "body": { "pre_tags": ["<em>"], "post_tags": ["</em>"] }
    }
  }
}
Combine matches on multiple fields
This is only supported by the fvh highlighter.
The Fast Vector Highlighter can combine matches on multiple fields to highlight a single field. This is most intuitive for multifields that analyze the same string in different ways. All matched_fields must have term_vector set to with_positions_offsets, but only the field to which the matches are combined is loaded, so only that field would benefit from having store set to yes.
In the following examples, comment is analyzed by the english analyzer and comment.plain is analyzed by the standard analyzer.
GET /_search
{
  "query": {
    "query_string": {
      "query": "comment.plain:running scissors",
      "fields": ["comment"]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "matched_fields": ["comment", "comment.plain"],
        "type": "fvh"
      }
    }
  }
}
The above matches both "run with scissors" and "running with scissors" and would highlight "running" and "scissors" but not "run". If both phrases appear in a large document then "running with scissors" is sorted above "run with scissors" in the fragments list because there are more matches in that fragment.
GET /_search
{
  "query": {
    "query_string": {
      "query": "running scissors",
      "fields": ["comment", "comment.plain^10"]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "matched_fields": ["comment", "comment.plain"],
        "type": "fvh"
      }
    }
  }
}
The above highlights "run" as well as "running" and "scissors" but still sorts "running with scissors" above "run with scissors" because the plain match ("running") is boosted.
GET /_search
{
  "query": {
    "query_string": {
      "query": "running scissors",
      "fields": ["comment", "comment.plain^10"]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "matched_fields": ["comment.plain"],
        "type": "fvh"
      }
    }
  }
}
The above query wouldn't highlight "run" or "scissor" but shows that it is just fine not to list the field to which the matches are combined (comment) in the matched fields.
Technically it is also fine to add fields to matched_fields that don't share the same underlying string as the field to which the matches are combined. The results might not make much sense, and if one of the matches is off the end of the text then the whole query will fail.
There is a small amount of overhead involved with setting matched_fields to a non-empty array, so always prefer
"highlight": { "fields": { "comment": {} } }
to
"highlight": { "fields": { "comment": { "matched_fields": ["comment"], "type": "fvh" } } }
Explicitly order highlighted fields
Elasticsearch highlights the fields in the order that they are sent, but per the JSON spec, objects are unordered. If you need to be explicit about the order in which fields are highlighted, specify the fields as an array:
GET /_search
{
  "highlight": {
    "fields": [
      { "title": {} },
      { "text": {} }
    ]
  }
}
None of the highlighters built into Elasticsearch care about the order that the fields are highlighted but a plugin might.
Control highlighted fragments
Each field highlighted can control the size of the highlighted fragment in characters (defaults to 100), and the maximum number of fragments to return (defaults to 5).
For example:
GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "fields": {
      "comment": { "fragment_size": 150, "number_of_fragments": 3 }
    }
  }
}
On top of this it is possible to specify that highlighted fragments need to be sorted by score:
GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": { "fragment_size": 150, "number_of_fragments": 3 }
    }
  }
}
If the number_of_fragments value is set to 0, no fragments are produced; instead the whole content of the field is returned, and of course it is highlighted. This can be very handy if short texts (like a document title or address) need to be highlighted but no fragmentation is required. Note that fragment_size is ignored in this case.
GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "fields": {
      "body": {},
      "blog.title": { "number_of_fragments": 0 }
    }
  }
}
When using fvh one can use the fragment_offset parameter to control the margin from which to start highlighting.
In the case where there is no matching fragment to highlight, the default is to not return anything. Instead, we can return a snippet of text from the beginning of the field by setting no_match_size (default 0) to the length of the text that you want returned. The actual length may be shorter or longer than specified as it tries to break on a word boundary.
GET /_search
{
  "query": {
    "match": { "user": "kimchy" }
  },
  "highlight": {
    "fields": {
      "comment": {
        "fragment_size": 150,
        "number_of_fragments": 3,
        "no_match_size": 150
      }
    }
  }
}
Highlight using the postings list
Here is an example of setting the comment field in the index mapping to allow for highlighting using the postings:
PUT /example
{
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "index_options": "offsets"
      }
    }
  }
}
Here is an example of setting the comment field to allow for highlighting using the term_vectors (this will cause the index to be bigger):
PUT /example
{
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}
Specify a fragmenter for the plain highlighter
When using the plain highlighter, you can choose between the simple and span fragmenters:
GET twitter/_search
{
  "query": {
    "match_phrase": { "message": "number 1" }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "plain",
        "fragment_size": 15,
        "number_of_fragments": 3,
        "fragmenter": "simple"
      }
    }
  }
}
Response:
{
  ...
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.6011951,
    "hits": [
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.6011951,
        "_source": {
          "user": "test",
          "message": "some message with the number 1",
          "date": "2009-11-15T14:12:12",
          "likes": 1
        },
        "highlight": {
          "message": [
            " with the <em>number</em>",
            " <em>1</em>"
          ]
        }
      }
    ]
  }
}
GET twitter/_search
{
  "query": {
    "match_phrase": { "message": "number 1" }
  },
  "highlight": {
    "fields": {
      "message": {
        "type": "plain",
        "fragment_size": 15,
        "number_of_fragments": 3,
        "fragmenter": "span"
      }
    }
  }
}
Response:
{
  ...
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.6011951,
    "hits": [
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.6011951,
        "_source": {
          "user": "test",
          "message": "some message with the number 1",
          "date": "2009-11-15T14:12:12",
          "likes": 1
        },
        "highlight": {
          "message": [
            " with the <em>number</em> <em>1</em>"
          ]
        }
      }
    ]
  }
}
If the number_of_fragments option is set to 0, the NullFragmenter is used, which does not fragment the text at all. This is useful for highlighting the entire contents of a document or field.
How highlighters work internally
Given a query and a text (the content of a document field), the goal of a highlighter is to find the best text fragments for the query, and highlight the query terms in the found fragments. For this, a highlighter needs to address several questions:
- How to break a text into fragments?
- How to find the best fragments among all fragments?
- How to highlight the query terms in a fragment?
How to break a text into fragments?
Relevant settings: fragment_size, fragmenter, type of highlighter, boundary_chars, boundary_max_scan, boundary_scanner, boundary_scanner_locale.
The plain highlighter begins with analyzing the text using the given analyzer, and creating a token stream from it. The plain highlighter uses a very simple algorithm to break the token stream into fragments. It loops through terms in the token stream, and every time the current term's end_offset exceeds fragment_size multiplied by the number of created fragments, a new fragment is created. A little more computation is done with the span fragmenter to avoid breaking up text between highlighted terms. But overall, since the breaking is done only by fragment_size, some fragments can be quite odd, e.g. beginning with a punctuation mark.
The unified and FVH highlighters do a better job of breaking up a text into fragments by utilizing Java's BreakIterator. This ensures that a fragment is a valid sentence as long as fragment_size allows for this.
How to find the best fragments?
Relevant settings: number_of_fragments.
To find the best, most relevant, fragments, a highlighter needs to score each fragment in respect to the given query. The goal is to score only those terms that participated in generating the hit on the document. For some complex queries, this is still work in progress.
The plain highlighter creates an in-memory index from the current token stream, and re-runs the original query criteria through Lucene's query execution planner to get access to low-level match information for the current text. For more complex queries the original query could be converted to a span query, as span queries can handle phrases more accurately. The obtained low-level match information is then used to score each individual fragment. The scoring method of the plain highlighter is quite simple. Each fragment is scored by the number of unique query terms found in this fragment. The score of an individual term is equal to its boost, which is 1 by default. Thus, by default, a fragment that contains one unique query term will get a score of 1, a fragment that contains two unique query terms will get a score of 2, and so on. The fragments are then sorted by their scores, so the highest scored fragments are output first.
The FVH doesn't need to analyze the text and build an in-memory index, as it uses pre-indexed document term vectors, and finds among them the terms that correspond to the query. The FVH scores each fragment by the number of query terms found in this fragment. Similarly to the plain highlighter, the score of an individual term is equal to its boost value. In contrast to the plain highlighter, all query terms are counted, not only unique terms.
The unified highlighter can use pre-indexed term vectors or pre-indexed term offsets, if they are available. Otherwise, similar to the plain highlighter, it has to create an in-memory index from the text. The unified highlighter uses the BM25 scoring model to score fragments.
How to highlight the query terms in a fragment?
Relevant settings: pre_tags, post_tags.
The goal is to highlight only those terms that participated in generating the hit on the document. For some complex boolean queries, this is still a work in progress, as highlighters don't reflect the boolean logic of a query and only extract leaf (terms, phrases, prefix, etc.) queries.
The plain highlighter, given the token stream and the original text, recomposes the original text to highlight only terms from the token stream that are contained in the low-level match information structure from the previous step.
The FVH and unified highlighters use intermediate data structures to represent fragments in some raw form, and then populate them with actual text.
A highlighter uses pre_tags and post_tags to encode highlighted terms.
An example of the work of the unified highlighter
Let's look in more detail at how the unified highlighter works.
First, we create an index with a text field content, which will be indexed using the english analyzer, and will be indexed without offsets or term vectors.
PUT test_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english"
      }
    }
  }
}
We put the following document into the index:
PUT test_index/_doc/doc1
{
  "content": "For you I'm only a fox like a hundred thousand other foxes. But if you tame me, we'll need each other. You'll be the only boy in the world for me. I'll be the only fox in the world for you."
}
And we run the following query with a highlight request:
GET test_index/_search
{
  "query": {
    "match_phrase": { "content": "only fox" }
  },
  "highlight": {
    "type": "unified",
    "number_of_fragments": 3,
    "fields": {
      "content": {}
    }
  }
}
After doc1 is found as a hit for this query, this hit will be passed to the unified highlighter for highlighting the field content of the document. Since the field content was not indexed either with offsets or term vectors, its raw field value will be analyzed, and an in-memory index will be built from the terms that match the query:
{"token":"onli","start_offset":12,"end_offset":16,"position":3},
{"token":"fox","start_offset":19,"end_offset":22,"position":5},
{"token":"fox","start_offset":53,"end_offset":58,"position":11},
{"token":"onli","start_offset":117,"end_offset":121,"position":24},
{"token":"onli","start_offset":159,"end_offset":163,"position":34},
{"token":"fox","start_offset":164,"end_offset":167,"position":35}
Our complex phrase query will be converted to the span query spanNear([text:onli, text:fox], 0, true), meaning that we are looking for the terms "onli" and "fox" within 0 distance from each other, and in the given order. The span query will be run against the previously created in-memory index to find the following match:
{"term":"onli", "start_offset":159, "end_offset":163},
{"term":"fox", "start_offset":164, "end_offset":167}
In our example, we have got a single match, but there could be several matches. Given the matches, the unified highlighter breaks the text of the field into so-called "passages". Each passage must contain at least one match. The unified highlighter, with the use of Java's BreakIterator, ensures that each passage represents a full sentence as long as it doesn't exceed fragment_size. For our example, we have got a single passage with the following properties (showing only a subset of the properties here):
Passage:
  startOffset: 147
  endOffset: 189
  score: 3.7158387
  matchStarts: [159, 164]
  matchEnds: [163, 167]
  numMatches: 2
Notice how a passage has a score, calculated using the BM25 scoring formula adapted for passages. Scores allow us to choose the best scoring passages if there are more passages available than the number_of_fragments requested by the user. Scores also let us sort passages by order: "score" if requested by the user.
As the final step, the unified highlighter will extract from the field’s text a string corresponding to each passage:
"I'll be the only fox in the world for you."
and will format all matches in this string with the tags <em> and </em>, using the passage's matchStarts and matchEnds information:
I'll be the <em>only</em> <em>fox</em> in the world for you.
Such formatted strings are the final result of the highlighter, returned to the user.
Index Boost
Allows you to configure different boost levels per index when searching across more than one index. This is very handy when hits coming from one index matter more than hits coming from another index (think social graph where each user has an index).
Deprecated in 5.2.0.
This format is deprecated. Please use the array format instead.
GET /_search
{
  "indices_boost": {
    "index1": 1.4,
    "index2": 1.3
  }
}
You can also specify it as an array to control the order of boosts.
GET /_search
{
  "indices_boost": [
    { "alias1": 1.4 },
    { "index*": 1.3 }
  ]
}
This is important when you use aliases or wildcard expressions. If multiple matches are found, the first match will be used. For example, if an index is included in both alias1 and index*, a boost value of 1.4 is applied.
Inner hits
The parent-join and nested features allow the return of documents that have matches in a different scope. In the parent/child case, parent documents are returned based on matches in child documents, or child documents are returned based on matches in parent documents. In the nested case, documents are returned based on matches in nested inner objects.
In both cases, the actual matches in the different scopes that caused a document to be returned are hidden. In many cases, it's very useful to know which inner nested objects (in the case of nested) or children/parent documents (in the case of parent/child) caused certain information to be returned. The inner hits feature can be used for this. For each search hit in the search response, this feature returns the additional nested hits that caused that search hit to match in a different scope.
Inner hits can be used by defining an inner_hits definition on a nested, has_child or has_parent query and filter.
The structure looks like this:
"<query>" : {
  "inner_hits" : {
    <inner_hits_options>
  }
}
If inner_hits is defined on a query that supports it, then each search hit will contain an inner_hits json object with the following structure:
"hits": [
  {
    "_index": ...,
    "_type": ...,
    "_id": ...,
    "inner_hits": {
      "<inner_hits_name>": {
        "hits": {
          "total": ...,
          "hits": [
            {
              "_type": ...,
              "_id": ...,
              ...
            },
            ...
          ]
        }
      }
    },
    ...
  },
  ...
]
Options
Inner hits support the following options:

from
The offset from where the first hit to fetch for each inner_hits in the returned regular search hits.

size
The maximum number of hits to return per inner_hits. By default the top three matching hits are returned.

sort
How the inner hits should be sorted per inner_hits. By default the hits are sorted by the score.

name
The name to be used for the particular inner hit definition in the response. Useful when multiple inner hits have been defined in a single search request. The default depends on which query the inner hit is defined in: for the has_child query and filter this is the child type, for the has_parent query and filter this is the parent type, and for the nested query and filter this is the nested path.

Inner hits also support per-document features such as highlighting, _source filtering, script fields, and doc value fields.
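For instance, several of the options above might be combined on a nested query; a sketch that assumes the test index mapping defined in the next section (the name top_comments is hypothetical):
POST test/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": { "comments.number": 2 }
      },
      "inner_hits": {
        "name": "top_comments",
        "from": 0,
        "size": 3,
        "sort": [{ "comments.number": "desc" }]
      }
    }
  }
}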
Nested inner hits
The nested inner_hits can be used to include nested inner objects as inner hits to a search hit.
PUT test
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested"
      }
    }
  }
}

PUT test/_doc/1?refresh
{
  "title": "Test title",
  "comments": [
    { "author": "kimchy", "number": 1 },
    { "author": "nik9000", "number": 2 }
  ]
}

POST test/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": { "comments.number": 2 }
      },
      "inner_hits": {}
    }
  }
}
An example of a response snippet that could be generated from the above search request:
{
  ...,
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": ...,
        "inner_hits": {
          "comments": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 1.0,
              "hits": [
                {
                  "_index": "test",
                  "_type": "_doc",
                  "_id": "1",
                  "_nested": {
                    "field": "comments",
                    "offset": 1
                  },
                  "_score": 1.0,
                  "_source": {
                    "author": "nik9000",
                    "number": 2
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
Here, comments is the name used in the inner hit definition in the search request. A custom key can be used via the name option.
The _nested metadata is crucial in the above example, because it defines from what inner nested object this inner hit came. The field defines the object array field the nested hit is from, and the offset is relative to its location in the _source. Due to sorting and scoring, the actual location of the hit objects in the inner_hits is usually different than the location at which the nested inner object was defined.
By default the _source is also returned for the hit objects in inner_hits, but this can be changed. Via the _source filtering feature, part of the source can be returned, or it can be disabled entirely. If stored fields are defined on the nested level, these can also be returned via the fields feature.
An important default is that the _source returned in hits inside inner_hits is relative to the _nested metadata. So in the above example only the comment part is returned per nested hit and not the entire source of the top level document that contained the comment.
Nested inner hits and _source
Nested documents don't have a _source field, because the entire source of the document is stored with the root document under its _source field. To include the source of just the nested document, the source of the root document is parsed and just the relevant bit for the nested document is included as source in the inner hit. Doing this for each matching nested document has an impact on the time it takes to execute the entire search request, especially when size and the inner hits' size are set higher than the default. To avoid the relatively expensive source extraction for nested inner hits, one can disable including the source and solely rely on doc values fields. Like this:
PUT test
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested"
      }
    }
  }
}

PUT test/_doc/1?refresh
{
  "title": "Test title",
  "comments": [
    { "author": "kimchy", "text": "comment text" },
    { "author": "nik9000", "text": "words words words" }
  ]
}

POST test/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": { "comments.text": "words" }
      },
      "inner_hits": {
        "_source": false,
        "docvalue_fields": [
          "comments.text.keyword"
        ]
      }
    }
  }
}
Hierarchical levels of nested object fields and inner hits
If a mapping has multiple levels of hierarchical nested object fields, each level can be accessed via a dot notated path. For example, if there is a comments nested field that contains a votes nested field, and votes should directly be returned with the root hits, then the following path can be defined:
PUT test
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested",
        "properties": {
          "votes": {
            "type": "nested"
          }
        }
      }
    }
  }
}

PUT test/_doc/1?refresh
{
  "title": "Test title",
  "comments": [
    {
      "author": "kimchy",
      "text": "comment text",
      "votes": []
    },
    {
      "author": "nik9000",
      "text": "words words words",
      "votes": [
        { "value": 1, "voter": "kimchy" },
        { "value": -1, "voter": "other" }
      ]
    }
  ]
}

POST test/_search
{
  "query": {
    "nested": {
      "path": "comments.votes",
      "query": {
        "match": { "comments.votes.voter": "kimchy" }
      },
      "inner_hits": {}
    }
  }
}
Which would look like:
{
  ...,
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.6931471,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.6931471,
        "_source": ...,
        "inner_hits": {
          "comments.votes": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 0.6931471,
              "hits": [
                {
                  "_index": "test",
                  "_type": "_doc",
                  "_id": "1",
                  "_nested": {
                    "field": "comments",
                    "offset": 1,
                    "_nested": {
                      "field": "votes",
                      "offset": 0
                    }
                  },
                  "_score": 0.6931471,
                  "_source": {
                    "value": 1,
                    "voter": "kimchy"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
This indirect referencing is only supported for nested inner hits.
Parent/child inner hits
The parent/child inner_hits can be used to include parent or child documents:
PUT test
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "my_parent": "my_child"
        }
      }
    }
  }
}

PUT test/_doc/1?refresh
{
  "number": 1,
  "my_join_field": "my_parent"
}

PUT test/_doc/2?routing=1&refresh
{
  "number": 1,
  "my_join_field": {
    "name": "my_child",
    "parent": "1"
  }
}

POST test/_search
{
  "query": {
    "has_child": {
      "type": "my_child",
      "query": {
        "match": { "number": 1 }
      },
      "inner_hits": {}
    }
  }
}
An example of a response snippet that could be generated from the above search request:
{
  ...,
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "number": 1,
          "my_join_field": "my_parent"
        },
        "inner_hits": {
          "my_child": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 1.0,
              "hits": [
                {
                  "_index": "test",
                  "_type": "_doc",
                  "_id": "2",
                  "_score": 1.0,
                  "_routing": "1",
                  "_source": {
                    "number": 1,
                    "my_join_field": {
                      "name": "my_child",
                      "parent": "1"
                    }
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
min_score
Exclude documents which have a _score less than the minimum specified in min_score:
GET /_search
{
  "min_score": 0.5,
  "query": {
    "term": { "user": "kimchy" }
  }
}
Note that most of the time this does not make much sense; it is provided for advanced use cases.
Named Queries
Each filter and query can accept a _name in its top level definition.
GET /_search
{
  "query": {
    "bool": {
      "should": [
        { "match": { "name.first": { "query": "shay", "_name": "first" } } },
        { "match": { "name.last": { "query": "banon", "_name": "last" } } }
      ],
      "filter": {
        "terms": {
          "name.last": ["banon", "kimchy"],
          "_name": "test"
        }
      }
    }
  }
}
The search response will include for each hit the matched_queries it matched on. The tagging of queries and filters only makes sense for the bool query.
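For instance, a hit matching the first should clause and the filter above might carry an entry like the following (an illustrative sketch, not a complete response):
"matched_queries": ["first", "test"]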
Post filter
The post_filter is applied to the search hits at the very end of a search request, after aggregations have already been calculated. Its purpose is best explained by example:
Imagine that you are selling shirts that have the following properties:
PUT /shirts
{
  "mappings": {
    "properties": {
      "brand": { "type": "keyword" },
      "color": { "type": "keyword" },
      "model": { "type": "keyword" }
    }
  }
}

PUT /shirts/_doc/1?refresh
{
  "brand": "gucci",
  "color": "red",
  "model": "slim"
}
Imagine a user has specified two filters: color:red and brand:gucci. You only want to show them red shirts made by Gucci in the search results. Normally you would do this with a bool query:
GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red" } },
        { "term": { "brand": "gucci" } }
      ]
    }
  }
}
However, you would also like to use faceted navigation to display a list of other options that the user could click on. Perhaps you have a model field that would allow the user to limit their search results to red Gucci t-shirts or dress-shirts.
This can be done with a terms aggregation:
GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red" } },
        { "term": { "brand": "gucci" } }
      ]
    }
  },
  "aggs": {
    "models": {
      "terms": { "field": "model" }
    }
  }
}
But perhaps you would also like to tell the user how many Gucci shirts are available in other colors. If you just add a terms aggregation on the color field, you will only get back the color red, because your query returns only red shirts by Gucci.
Instead, you want to include shirts of all colors during aggregation, then apply the colors filter only to the search results. This is the purpose of the post_filter:
GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": { "term": { "brand": "gucci" } }
    }
  },
  "aggs": {
    "colors": {
      "terms": { "field": "color" }
    },
    "color_red": {
      "filter": { "term": { "color": "red" } },
      "aggs": {
        "models": {
          "terms": { "field": "model" }
        }
      }
    }
  },
  "post_filter": {
    "term": { "color": "red" }
  }
}
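In this request:
- The main query now finds all shirts by Gucci, regardless of color.
- The colors agg returns popular colors for shirts by Gucci.
- The color_red agg limits the models sub-aggregation to red Gucci shirts.
- Finally, the post_filter removes colors other than red from the search hits.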
Preference
Controls a preference of the shard copies on which to execute the search. By default, Elasticsearch selects from the available shard copies in an unspecified order, taking the allocation awareness and adaptive replica selection configuration into account. However, it may sometimes be desirable to try and route certain searches to certain sets of shard copies.
A possible use case would be to make use of per-copy caches like the request cache. Doing this, however, runs contrary to the idea of search parallelization and can create hotspots on certain nodes because the load might not be evenly distributed anymore.
The preference is a query string parameter which can be set to:

_only_local
The operation will be executed only on shards allocated to the local node.

_local
The operation will be executed on shards allocated to the local node if possible, and will fall back to other shards if not.

_prefer_nodes:abc,xyz
The operation will be executed on nodes with one of the provided node ids (abc or xyz in this case) if possible.

_shards:2,3
Restricts the operation to the specified shards (2 and 3 in this case). This preference can be combined with other preferences, but it has to appear first: _shards:2,3|_local.

_only_nodes:abc*,x*yz,...
Restricts the operation to nodes specified according to the node specification. If suitable shard copies exist on more than one of the selected nodes then the order of preference between these copies is unspecified.

Custom (string) value
Any value that does not start with _. If two searches both give the same custom string value for their preference, and the underlying cluster state does not change, then the same ordering of shards will be used for the searches.
For instance, use the user's session ID xyzabc123 as follows:
GET /_search?preference=xyzabc123
{
  "query": {
    "match": { "title": "elasticsearch" }
  }
}
This can be an effective strategy to increase usage of e.g. the request cache for unique users running similar searches repeatedly by always hitting the same cache, while requests of different users are still spread across all shard copies.
The _only_local preference guarantees only to use shard copies on the local node, which is sometimes useful for troubleshooting. All other options do not fully guarantee that any particular shard copies are used in a search, and on a changing index this may mean that repeated searches may yield different results if they are executed on different shard copies which are in different refresh states.
Rescoring
Rescoring can help to improve precision by reordering just the top (e.g. 100 - 500) documents returned by the query and post_filter phases, using a secondary (usually more costly) algorithm, instead of applying the costly algorithm to all documents in the index.
A rescore request is executed on each shard before it returns its results to be sorted by the node handling the overall search request.
Currently the rescore API has only one implementation: the query rescorer, which uses a query to tweak the scoring. In the future, alternative rescorers may be made available, for example, a pair-wise rescorer.
An error will be thrown if an explicit sort (other than _score in descending order) is provided with a rescore query.
When exposing pagination to your users, you should not change window_size as you step through each page (by passing different from values), since that can alter the top hits, causing results to confusingly shift as the user steps through pages.
Query rescorer
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
By default the scores from the original query and the rescore query are combined linearly to produce the final _score for each document. The relative importance of the original query and of the rescore query can be controlled with query_weight and rescore_query_weight respectively. Both default to 1.
For example:
POST /_search
{
  "query": {
    "match": {
      "message": {
        "operator": "or",
        "query": "the quick brown"
      }
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "match_phrase": {
          "message": {
            "query": "the quick brown",
            "slop": 2
          }
        }
      },
      "query_weight": 0.7,
      "rescore_query_weight": 1.2
    }
  }
}
The way the scores are combined can be controlled with the score_mode:

total
Add the original score and the rescore query score. The default.

multiply
Multiply the original score by the rescore query score. Useful for function query rescores.

avg
Average the original score and the rescore query score.

max
Take the max of the original score and the rescore query score.

min
Take the min of the original score and the rescore query score.
Multiple Rescores
It is also possible to execute multiple rescores in sequence:
POST /_search
{
  "query": {
    "match": {
      "message": {
        "operator": "or",
        "query": "the quick brown"
      }
    }
  },
  "rescore": [
    {
      "window_size": 100,
      "query": {
        "rescore_query": {
          "match_phrase": {
            "message": {
              "query": "the quick brown",
              "slop": 2
            }
          }
        },
        "query_weight": 0.7,
        "rescore_query_weight": 1.2
      }
    },
    {
      "window_size": 10,
      "query": {
        "score_mode": "multiply",
        "rescore_query": {
          "function_score": {
            "script_score": {
              "script": {
                "source": "Math.log10(doc.likes.value + 2)"
              }
            }
          }
        }
      }
    }
  ]
}
The first one gets the results of the query then the second one gets the results of the first, etc. The second rescore will "see" the sorting done by the first rescore so it is possible to use a large window on the first rescore to pull documents into a smaller window for the second rescore.
Script Fields
Allows you to return a script evaluation (based on different fields) for each hit, for example:
GET /_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "test1": {
      "script": {
        "lang": "painless",
        "source": "doc['price'].value * 2"
      }
    },
    "test2": {
      "script": {
        "lang": "painless",
        "source": "doc['price'].value * params.factor",
        "params": {
          "factor": 2.0
        }
      }
    }
  }
}
Script fields can work on fields that are not stored (price in the above case), and allow custom values to be returned (the evaluated value of the script).
Script fields can also access the actual _source document and extract specific elements to be returned from it by using params['_source']. Here is an example:
GET /_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "test1": {
      "script": "params['_source']['message']"
    }
  }
}
Note the _source keyword here to navigate the json-like model.
It's important to understand the difference between doc['my_field'].value and params['_source']['my_field']. The first, using the doc keyword, will cause the terms for that field to be loaded to memory (cached), which will result in faster execution, but more memory consumption. Also, the doc[...] notation only allows for simple valued fields (you can't return a json object from it) and makes sense only for non-analyzed or single term based fields. However, using doc is still the recommended way to access values from the document, if at all possible, because _source must be loaded and parsed every time it's used. Using _source is very slow.
Scroll
While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.
Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.
The results that are returned from a scroll request reflect the state of the index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.
In order to use scrolling, the initial search request should specify the scroll parameter in the query string, which tells Elasticsearch how long it should keep the “search context” alive (see Keeping the search context alive), e.g. ?scroll=1m.
POST /twitter/_search?scroll=1m
{
  "size": 100,
  "query": {
    "match": { "title": "elasticsearch" }
  }
}
The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.
POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
In this request:
- GET or POST can be used, and the URL should not include the index name; this is specified in the original search request instead.
- The scroll parameter tells Elasticsearch to keep the search context open for another 1m.
- The scroll_id parameter holds the ID returned by the previous request.
The size parameter allows you to configure the maximum number of hits to be returned with each batch of results. Each call to the scroll API returns the next batch of results until there are no more results left to return, i.e. the hits array is empty.
The initial search request and each subsequent scroll request each return a _scroll_id. While the _scroll_id may change between requests, it doesn't always change; in any case, only the most recently received _scroll_id should be used.
If the request specifies aggregations, only the initial search response will contain the aggregations results.
Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:
GET /_search?scroll=1m
{
  "sort": [
    "_doc"
  ]
}
Keeping the search context alive
A scroll returns all the documents which matched the search at the time of the initial search request. It ignores any subsequent changes to these documents. The scroll_id identifies a search context which keeps track of everything that Elasticsearch needs to return the correct documents. The search context is created by the initial request and kept alive by subsequent requests.
The scroll parameter (passed to the search request and to every scroll request) tells Elasticsearch how long it should keep the search context alive. Its value (e.g. 1m, see Time units) does not need to be long enough to process all data; it just needs to be long enough to process the previous batch of results. Each scroll request (with the scroll parameter) sets a new expiry time. If a scroll request doesn't pass in the scroll parameter, then the search context will be freed as part of that scroll request.
Normally, the background merge process optimizes the index by merging together smaller segments to create new, bigger segments. Once the smaller segments are no longer needed they are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted since they are still in use.
Keeping older segments alive means that more disk space and file handles are needed. Ensure that you have configured your nodes to have ample free file handles. See File Descriptors.
Additionally, if a segment contains deleted or updated documents then the search context must keep track of whether each document in the segment was live at the time of the initial search request. Ensure that your nodes have sufficient heap space if you have many open scrolls on an index that is subject to ongoing deletes or updates.
To prevent issues caused by having too many scrolls open, users
are not allowed to open scrolls past a certain limit. By default, the
maximum number of open scrolls is 500. This limit can be updated with the
search.max_open_scroll_context
cluster setting.
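For example, the limit could be raised with a dynamic cluster settings update; this is a sketch, and the value 1024 is only illustrative:
PUT /_cluster/settings { "persistent" : { "search.max_open_scroll_context" : 1024 } }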
You can check how many search contexts are open with the nodes stats API:
GET /_nodes/stats/indices/search
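An abridged response might look like the following sketch, where open_contexts reports the number of open search contexts on that node (the node id is hypothetical and most statistics are omitted):
{ "nodes" : { "node_id" : { "indices" : { "search" : { "open_contexts" : 3, ... } } } } }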
Clear scroll API
Search contexts are automatically removed when the scroll
timeout has been
exceeded. However, keeping scrolls open has a cost, as discussed in the
previous section, so scrolls should be explicitly
cleared as soon as they are no longer being used, with the
clear-scroll
API:
DELETE /_search/scroll { "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" }
Multiple scroll IDs can be passed as an array:
DELETE /_search/scroll { "scroll_id" : [ "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==", "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB" ] }
All search contexts can be cleared with the _all
parameter:
DELETE /_search/scroll/_all
The scroll_id
can also be passed as a query string parameter or in the request body.
Multiple scroll IDs can be passed as comma separated values:
DELETE /_search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==,DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB
Sliced Scroll
For scroll queries that return a lot of documents, it is possible to split the scroll into multiple slices that can be consumed independently:
GET /twitter/_search?scroll=1m { "slice": { "id": 0, "max": 2 }, "query": { "match" : { "title" : "elasticsearch" } } } GET /twitter/_search?scroll=1m { "slice": { "id": 1, "max": 2 }, "query": { "match" : { "title" : "elasticsearch" } } }
The result from the first request returned documents that belong to the first slice (id: 0) and the result from the
second request returned documents that belong to the second slice. Since the maximum number of slices is set to 2,
the union of the results of the two requests is equivalent to the results of a scroll query without slicing.
By default the splitting is done on the shards first and then locally on each shard using the _id field
with the following formula:
slice(doc) = floorMod(hashCode(doc._id), max)
For instance if the number of shards is equal to 2 and the user requested 4 slices then the slices 0 and 2 are assigned
to the first shard and the slices 1 and 3 are assigned to the second shard.
Each scroll is independent and can be processed in parallel like any scroll request.
If the number of slices is bigger than the number of shards, the slice filter is very slow on the first calls: it has a complexity of O(N) and a memory cost of N bits per slice, where N is the total number of documents in the shard. After a few calls the filter should be cached and subsequent calls should be faster, but you should limit the number of sliced queries you perform in parallel to avoid a memory explosion.
To avoid this cost entirely it is possible to use the doc_values
of another field to do the slicing
but the user must ensure that the field has the following properties:
- The field is numeric.
-
doc_values
are enabled on that field
- Every document should contain a single value. If a document has multiple values for the specified field, the first value is used.
- The value for each document should be set once when the document is created and never updated. This ensures that each slice gets deterministic results.
- The cardinality of the field should be high. This ensures that each slice gets approximately the same amount of documents.
GET /twitter/_search?scroll=1m { "slice": { "field": "date", "id": 0, "max": 10 }, "query": { "match" : { "title" : "elasticsearch" } } }
For append-only time-based indices, the timestamp
field can be used safely.
By default the maximum number of slices allowed per scroll is limited to 1024.
You can update the index.max_slices_per_scroll
index setting to bypass this limit.
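For example, the limit can be raised with a dynamic index settings update; this is a sketch, and the value 2048 is only illustrative:
PUT /twitter/_settings { "index.max_slices_per_scroll" : 2048 }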
Search After
Pagination of results can be done by using the from
and size
parameters, but the cost becomes prohibitive when deep pagination is reached.
The index.max_result_window
setting, which defaults to 10,000, is a safeguard: search requests take heap memory and time proportional to from + size
.
The scroll API is recommended for efficient deep scrolling, but scroll contexts are costly and it is not
recommended to use them for real time user requests.
The search_after
parameter circumvents this problem by providing a live cursor.
The idea is to use the results from the previous page to help the retrieval of the next page.
Suppose that the query to retrieve the first page looks like this:
GET twitter/_search { "size": 10, "query": { "match" : { "title" : "elasticsearch" } }, "sort": [ {"date": "asc"}, {"tie_breaker_id": "asc"} ] }
A field with one unique value per document should be used as the tiebreaker
of the sort specification. Otherwise the sort order for documents that have
the same sort values would be undefined and could lead to missing or duplicate
results. The _id
field has a unique value per document
but it is not recommended to use it as a tiebreaker directly.
Beware that search_after
looks for the first document which fully or partially
matches the tiebreaker’s provided value. Therefore, if a document has a tiebreaker value of
"654323"
and you search_after
for "654"
, it would still match that document
and return results found after it.
Doc values are disabled on the _id field, so sorting on it requires
loading a lot of data into memory. Instead it is advised to duplicate (client side
or with a set ingest processor) the content
of the _id
field in another field that has
doc values enabled and to use this new field as the tiebreaker
for the sort.
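As a sketch, such a copy can be created at index time with a set ingest processor. The pipeline name and the tie_breaker_id target field below are illustrative, and this approach assumes documents are indexed with explicit IDs, since auto-generated IDs are not available to the pipeline:
PUT /_ingest/pipeline/tie-breaker { "processors" : [ { "set" : { "field" : "tie_breaker_id", "value" : "{{_id}}" } } ] }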
The result from the above request includes an array of sort values
for each document.
These sort values
can be used in conjunction with the search_after
parameter to start returning results "after" any
document in the result list.
For instance we can use the sort values
of the last document and pass them to search_after
to retrieve the next page of results:
GET twitter/_search { "size": 10, "query": { "match" : { "title" : "elasticsearch" } }, "search_after": [1463538857, "654323"], "sort": [ {"date": "asc"}, {"tie_breaker_id": "asc"} ] }
The parameter from
must be set to 0 (or -1) when search_after
is used.
search_after
is not a solution to jump freely to a random page but rather to scroll many queries in parallel.
It is very similar to the scroll
API but unlike it, the search_after
parameter is stateless: it is always resolved against the latest
version of the searcher. For this reason the sort order may change during a walk, depending on the updates and deletes of your index.
Search Type
There are different execution paths that can be taken when executing a distributed search. The distributed search operation needs to be scattered to all the relevant shards and then all the results are gathered back. When doing this scatter/gather type of execution, there are several possible approaches, specifically with search engines.
One of the questions when executing a distributed search is how many results to retrieve from each shard. For example, if we have 10 shards, the 1st shard might hold the most relevant results from 0 to 10, with the results of other shards ranking below it. For this reason, when executing a request, we will need to get results from 0 to 10 from all shards, sort them, and then return the results if we want to ensure correct results.
Another question, which relates to the search engine, is the fact that each shard stands on its own. When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first gather the term frequencies from all shards to calculate global term frequencies, then execute the query on each shard using these global frequencies.
Also, because of the need to sort the results, getting back a large
document set, or even scrolling it, while maintaining the correct sorting
behavior can be a very expensive operation. For large result set
scrolling, it is best to sort by _doc
if the order in which documents
are returned is not important.
Elasticsearch is very flexible and allows you to control the type of search to execute on a per-request basis. The type can be configured by setting the search_type parameter in the query string. The types are:
Query Then Fetch
Parameter value: query_then_fetch.
The request is processed in two phases. In the first phase, the query
is forwarded to all involved shards. Each shard executes the search
and generates a sorted list of results, local to that shard. Each
shard returns just enough information to the coordinating node
to allow it to merge and re-sort the shard level results into a globally
sorted set of results, of maximum length size
.
During the second phase, the coordinating node requests the document content (and highlighted snippets, if any) from only the relevant shards.
GET twitter/_search?search_type=query_then_fetch
This is the default setting if you do not specify a search_type
in your request.
Dfs, Query Then Fetch
Parameter value: dfs_query_then_fetch.
Same as "Query Then Fetch", except for an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
GET twitter/_search?search_type=dfs_query_then_fetch
Sort
Allows you to add one or more sorts on specific fields. Each sort can be
reversed as well. The sort is defined on a per-field level, with the special
field names _score
to sort by score, and _doc
to sort by index order.
Assuming the following index mapping:
PUT /my_index { "mappings": { "properties": { "post_date": { "type": "date" }, "user": { "type": "keyword" }, "name": { "type": "keyword" }, "age": { "type": "integer" } } } }
GET /my_index/_search { "sort" : [ { "post_date" : {"order" : "asc"}}, "user", { "name" : "desc" }, { "age" : "desc" }, "_score" ], "query" : { "term" : { "user" : "kimchy" } } }
_doc
has no real use-case besides being the most efficient sort order.
So if you don’t care about the order in which documents are returned, then you
should sort by _doc
. This especially helps when scrolling.
Sort Values
The sort values for each returned document are included as part of the response.
Sort Order
The order
option can have the following values:
asc |
Sort in ascending order |
|
desc |
Sort in descending order |
The order defaults to desc
when sorting on the _score
, and defaults
to asc
when sorting on anything else.
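For instance, to return the weakest matches first you can override the default and request ascending order on _score explicitly (a sketch reusing the twitter examples above):
GET /twitter/_search { "query" : { "match" : { "message" : "elasticsearch" } }, "sort" : [ { "_score" : { "order" : "asc" } } ] }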
Sort mode option
Elasticsearch supports sorting by array or multi-valued fields. The mode
option
controls what array value is picked for sorting the document it belongs
to. The mode
option can have the following values:
min |
Pick the lowest value. |
|
max |
Pick the highest value. |
|
sum |
Use the sum of all values as sort value. Only applicable for number based array fields. |
|
avg |
Use the average of all values as sort value. Only applicable for number based array fields. |
|
median |
Use the median of all values as sort value. Only applicable for number based array fields. |
The default sort mode in the ascending sort order is min
— the lowest value
is picked. The default sort mode in the descending order is max
— the highest value is picked.
Sort mode example usage
In the example below the price field has multiple prices per document. In this case the result hits will be sorted by price in ascending order, based on the average price per document.
PUT /my_index/_doc/1?refresh { "product": "chocolate", "price": [20, 4] } POST /_search { "query" : { "term" : { "product" : "chocolate" } }, "sort" : [ {"price" : {"order" : "asc", "mode" : "avg"}} ] }
Sorting numeric fields
For numeric fields it is also possible to cast the values from one type
to another using the numeric_type
option.
This option accepts the following values: ["double", "long", "date", "date_nanos"]
and can be useful for cross-index searches if the sort field is mapped differently in some
indices.
Consider for instance these two indices:
PUT /index_double { "mappings": { "properties": { "field": { "type": "double" } } } }
PUT /index_long { "mappings": { "properties": { "field": { "type": "long" } } } }
Since field
is mapped as a double
in the first index and as a long
in the second index, it is not possible to use this field to sort requests
that query both indices by default. However you can cast the values to one type
or the other with the numeric_type
option in order to enforce a specific
type for all indices:
POST /index_long,index_double/_search { "sort" : [ { "field" : { "numeric_type" : "double" } } ] }
In the example above, values for the index_long
index are cast to
a double in order to be compatible with the values produced by the
index_double
index.
It is also possible to transform a floating point field into a long,
but note that in this case floating point values are replaced by the largest
value that is less than or equal (greater than or equal if the value
is negative) to the argument and is equal to a mathematical integer; in other words the fractional part is dropped, so 12.7 becomes 12 and -12.7 becomes -12.
This option can also be used to convert a date
field that uses millisecond
resolution to a date_nanos
field with nanosecond resolution.
Consider for instance these two indices:
PUT /index_date { "mappings": { "properties": { "field": { "type": "date" } } } }
PUT /index_date_nanos { "mappings": { "properties": { "field": { "type": "date_nanos" } } } }
Values in these indices are stored with different resolutions so sorting on these
fields will always sort the date
before the date_nanos
(ascending order).
With the numeric_type
option it is possible to set a single resolution for
the sort: setting it to date
will convert the date_nanos
values to millisecond resolution,
while date_nanos
will convert the values in the date
field to nanosecond resolution:
POST /index_date,index_date_nanos/_search { "sort" : [ { "field" : { "numeric_type" : "date_nanos" } } ] }
To avoid overflow, the conversion to date_nanos
cannot be applied on dates before
1970 and after 2262 as nanoseconds are represented as longs.
Sorting within nested objects
Elasticsearch also supports sorting by
fields that are inside one or more nested objects. Sorting by a nested
field is supported via a nested
sort option with the following properties:
-
path
- Defines on which nested object to sort. The actual sort field must be a direct field inside this nested object. When sorting by nested field, this field is mandatory.
-
filter
-
A filter that the inner objects inside the nested path
should match in order for their field values to be taken into account
by sorting. A common case is to repeat the query / filter inside the
nested filter or query. By default no
filter
is active.
-
max_children
- The maximum number of children to consider per root document when picking the sort value. Defaults to unlimited.
-
nested
-
Same as top-level
nested
but applies to another nested path within the current nested object.
Nested sort options before Elasticsearch 6.1
The nested_path
and nested_filter
options have been deprecated in
favor of the options documented above.
Nested sorting examples
In the example below, offer
is a field of type nested
.
The nested path
needs to be specified; otherwise, Elasticsearch doesn’t know on what nested level sort values need to be captured.
POST /_search { "query" : { "term" : { "product" : "chocolate" } }, "sort" : [ { "offer.price" : { "mode" : "avg", "order" : "asc", "nested": { "path": "offer", "filter": { "term" : { "offer.color" : "blue" } } } } } ] }
In the example below, the parent
and child
fields are of type nested
.
The nested path
needs to be specified at each level; otherwise, Elasticsearch doesn’t know on what nested level sort values need to be captured.
POST /_search { "query": { "nested": { "path": "parent", "query": { "bool": { "must": {"range": {"parent.age": {"gte": 21}}}, "filter": { "nested": { "path": "parent.child", "query": {"match": {"parent.child.name": "matt"}} } } } } } }, "sort" : [ { "parent.child.age" : { "mode" : "min", "order" : "asc", "nested": { "path": "parent", "filter": { "range": {"parent.age": {"gte": 21}} }, "nested": { "path": "parent.child", "filter": { "match": {"parent.child.name": "matt"} } } } } } ] }
Nested sorting is also supported when sorting by scripts and sorting by geo distance.
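For example, a script-based sort can carry the same nested option; the sketch below reuses the offer nested field from the example above:
POST /_search { "query" : { "term" : { "product" : "chocolate" } }, "sort" : { "_script" : { "type" : "number", "script" : { "lang" : "painless", "source" : "doc['offer.price'].value" }, "order" : "asc", "nested" : { "path" : "offer" } } } }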
Missing Values
The missing
parameter specifies how docs which are missing
the sort field should be treated: the missing
value can be
set to _last
, _first
, or a custom value (that
will be used for missing docs as the sort value).
The default is _last
.
For example:
GET /_search { "sort" : [ { "price" : {"missing" : "_last"} } ], "query" : { "term" : { "product" : "chocolate" } } }
If a nested inner object doesn’t match
the nested filter
then a missing value is used.
Ignoring Unmapped Fields
By default, the search request will fail if there is no mapping
associated with a field. The unmapped_type
option allows you to ignore
fields that have no mapping and not sort by them. The value of this
parameter is used to determine what sort values to emit. Here is an
example of how it can be used:
GET /_search { "sort" : [ { "price" : {"unmapped_type" : "long"} } ], "query" : { "term" : { "product" : "chocolate" } } }
If any of the indices that are queried doesn’t have a mapping for price
then Elasticsearch will handle it as if there was a mapping of type
long
, with all documents in this index having no value for this field.
Geo Distance Sorting
Allows sorting by _geo_distance
. Here is an example, assuming pin.location
is a field of type geo_point
:
GET /_search { "sort" : [ { "_geo_distance" : { "pin.location" : [-70, 40], "order" : "asc", "unit" : "km", "mode" : "min", "distance_type" : "arc", "ignore_unmapped": true } } ], "query" : { "term" : { "user" : "kimchy" } } }
-
distance_type
-
How to compute the distance. Can either be
arc
(default), or plane
(faster, but inaccurate on long distances and close to the poles). -
mode
-
What to do in case a field has several geo points. By default, the shortest
distance is taken into account when sorting in ascending order and the
longest distance when sorting in descending order. Supported values are
min
, max
, median
and avg
. -
unit
-
The unit to use when computing sort values. The default is
m
(meters). -
ignore_unmapped
-
Indicates if the unmapped field should be treated as a missing value. Setting it to
true
is equivalent to specifying an unmapped_type
in the field sort. The default is false
(unmapped fields cause the search to fail).
Geo distance sorting does not support configurable missing values: the
distance will always be considered equal to Infinity
when a document does not
have values for the field that is used for distance computation.
The following formats are supported in providing the coordinates:
Lat Lon as Properties
GET /_search { "sort" : [ { "_geo_distance" : { "pin.location" : { "lat" : 40, "lon" : -70 }, "order" : "asc", "unit" : "km" } } ], "query" : { "term" : { "user" : "kimchy" } } }
Lat Lon as String
Format in lat,lon
.
GET /_search { "sort" : [ { "_geo_distance" : { "pin.location" : "40,-70", "order" : "asc", "unit" : "km" } } ], "query" : { "term" : { "user" : "kimchy" } } }
Geohash
GET /_search { "sort" : [ { "_geo_distance" : { "pin.location" : "drm3btev3e86", "order" : "asc", "unit" : "km" } } ], "query" : { "term" : { "user" : "kimchy" } } }
Multiple reference points
Multiple geo points can be passed as an array containing any geo_point
format, for example
GET /_search { "sort" : [ { "_geo_distance" : { "pin.location" : [[-70, 40], [-71, 42]], "order" : "asc", "unit" : "km" } } ], "query" : { "term" : { "user" : "kimchy" } } }
and so forth.
The final distance for a document will then be the min
/max
/avg
(defined via mode
) distance of all points contained in the document to all points given in the sort request.
Script Based Sorting
Allows sorting based on custom scripts. Here is an example:
GET /_search { "query" : { "term" : { "user" : "kimchy" } }, "sort" : { "_script" : { "type" : "number", "script" : { "lang": "painless", "source": "doc['field_name'].value * params.factor", "params" : { "factor" : 1.1 } }, "order" : "asc" } } }
Track Scores
When sorting on a field, scores are not computed. By setting
track_scores
to true, scores will still be computed and tracked.
GET /_search { "track_scores": true, "sort" : [ { "post_date" : {"order" : "desc"} }, { "name" : "desc" }, { "age" : "desc" } ], "query" : { "term" : { "user" : "kimchy" } } }
Memory Considerations
When sorting, the relevant sorted field values are loaded into memory.
This means that per shard, there should be enough memory to contain
them. For string based types, the field sorted on should not be analyzed
/ tokenized. For numeric types, if possible, it is recommended to
explicitly set the type to narrower types (like short
, integer
and
float
).
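As a sketch, if all age values fit into 16 bits, mapping the field as short rather than integer reduces the memory needed for sorting (the index name my_index_narrow is hypothetical):
PUT /my_index_narrow { "mappings": { "properties": { "age": { "type": "short" } } } }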
Source filtering
See source filtering.
Stored Fields
The stored_fields
parameter is about fields that are explicitly marked as
stored in the mapping, which is off by default and generally not recommended.
Use source filtering instead to select
subsets of the original source document to be returned.
Allows you to selectively load specific stored fields for each document represented by a search hit.
GET /_search { "stored_fields" : ["user", "postDate"], "query" : { "term" : { "user" : "kimchy" } } }
*
can be used to load all stored fields from the document.
An empty array will cause only the _id
and _type
for each hit to be
returned, for example:
GET /_search { "stored_fields" : [], "query" : { "term" : { "user" : "kimchy" } } }
If the requested fields are not stored (store
mapping set to false
), they will be ignored.
Stored field values fetched from the document itself are always returned as an array. By contrast, metadata fields like _routing
are never returned as an array.
Also only leaf fields can be returned via the stored_fields
option. If an object field is specified, it will be ignored.
On its own, stored_fields
cannot be used to load fields in nested
objects — if a field contains a nested object in its path, then no data will
be returned for that stored field. To access nested fields, stored_fields
must be used within an inner_hits
block.
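As a sketch, assuming an index where comments is a nested field and comments.text is mapped with store set to true, stored fields of nested objects can be requested through the inner_hits of a nested query:
GET /_search { "query" : { "nested" : { "path" : "comments", "query" : { "match" : { "comments.text" : "elasticsearch" } }, "inner_hits" : { "stored_fields" : ["comments.text"] } } } }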
Track total hits
Generally the total hit count can’t be computed accurately without visiting all
matches, which is costly for queries that match lots of documents. The
track_total_hits
parameter allows you to control how the total number of hits
should be tracked.
Given that it is often enough to have a lower bound of the number of hits,
such as "there are at least 10000 hits", the default is set to 10,000
.
This means that requests will count the total hits accurately up to 10,000
hits.
This is a good trade-off to speed up searches if you don’t need the accurate number
of hits after a certain threshold.
When set to true
the search response will always track the number of hits that
match the query accurately (e.g. total.relation
will always be equal to "eq"
when track_total_hits
is set to true). Otherwise the "total.relation"
returned
in the "total"
object in the search response determines how the "total.value"
should be interpreted. A value of "gte"
means that the "total.value"
is a
lower bound of the total hits that match the query and a value of "eq"
indicates
that "total.value"
is the accurate count.
GET twitter/_search { "track_total_hits": true, "query": { "match" : { "message" : "Elasticsearch" } } }
... returns:
{ "_shards": ... "timed_out": false, "took": 100, "hits": { "max_score": 1.0, "total" : { "value": 2048, "relation": "eq" }, "hits": ... } }
It is also possible to set track_total_hits
to an integer.
For instance the following query will accurately track the total hit count that match
the query up to 100 documents:
GET twitter/_search { "track_total_hits": 100, "query": { "match" : { "message" : "Elasticsearch" } } }
The hits.total.relation
in the response will indicate if the
value returned in hits.total.value
is accurate ("eq"
) or a lower
bound of the total ("gte"
).
For instance the following response:
{ "_shards": ... "timed_out": false, "took": 30, "hits" : { "max_score": 1.0, "total" : { "value": 42, "relation": "eq" }, "hits": ... } }
... indicates that the number of hits returned in the total
is accurate.
If the total number of hits that match the query is greater than the
value set in track_total_hits
, the total hits in the response
will indicate that the returned value is a lower bound:
{ "_shards": ... "hits" : { "max_score": 1.0, "total" : { "value": 100, "relation": "gte" }, "hits": ... } }
If you don’t need to track the total number of hits at all you can improve query
times by setting this option to false
:
GET twitter/_search { "track_total_hits": false, "query": { "match" : { "message" : "Elasticsearch" } } }
... returns:
{ "_shards": ... "timed_out": false, "took": 10, "hits" : { "max_score": 1.0, "hits": ... } }
Finally you can force an accurate count by setting "track_total_hits"
to true
in the request.