Request Body Search

Specifies search criteria as request body parameters.

GET /twitter/_search
{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Request

GET /<index>/_search
{
    "query": {<parameters>}
}

Description

The search request can be executed with a search DSL, which includes the Query DSL, within its body.

Path parameters

<index>
(Optional, string) Comma-separated list or wildcard expression of index names used to limit the request.

Request body

allow_partial_search_results
(Optional, boolean) Set to false to fail the request if only partial results are available. Defaults to true, which returns partial results in the event of timeouts or partial failures. You can override the default behavior for all requests by setting search.default_allow_partial_results to false in the cluster settings (see the sketch after this list).
batched_reduce_size
(Optional, integer) The number of shard results that should be reduced at once on the coordinating node. This value should be used as a protection mechanism to reduce the memory overhead per search request if the potential number of shards in the request can be large.
ccs_minimize_roundtrips
(Optional, boolean) If true, network round-trips between the coordinating node and the remote clusters will be minimized when executing cross-cluster search requests. See Cross-cluster search reduction for more information. Defaults to true.
from
(Optional, integer) Starting document offset. Defaults to 0.
request_cache
(Optional, boolean) If true, the caching of search results is enabled for requests where size is 0. See Shard request cache.
search_type

(Optional, string) The type of the search operation. Available options:

  • query_then_fetch
  • dfs_query_then_fetch
size
(Optional, integer) The number of hits to return. Defaults to 10.
terminate_after
(Optional, integer) The maximum number of documents to collect for each shard, upon reaching which the query execution will terminate early.
timeout
(Optional, time units) Explicit timeout for each search request. Defaults to no timeout.
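
As noted for allow_partial_search_results, the cluster-wide default can be changed with the search.default_allow_partial_results setting. A minimal sketch using the cluster settings API:

PUT /_cluster/settings
{
    "persistent" : {
        "search.default_allow_partial_results" : false
    }
}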

Out of the above, the search_type, request_cache and allow_partial_search_results settings must be passed as query-string parameters. The rest of the search request should be passed within the body itself. The body content can also be passed as a REST parameter named source.

Both HTTP GET and HTTP POST can be used to execute search with body. Since not all clients support GET with body, POST is allowed as well.
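
For example, a sketch passing search_type as a query-string parameter while the query itself stays in the body, reusing the twitter example from above:

GET /twitter/_search?search_type=dfs_query_then_fetch
{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}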

Examples

GET /twitter/_search
{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

The API returns the following response:

{
    "took": 1,
    "timed_out": false,
    "_shards":{
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
    },
    "hits":{
        "total" : {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.3862944,
        "hits" : [
            {
                "_index" : "twitter",
                "_type" : "_doc",
                "_id" : "0",
                "_score": 1.3862944,
                "_source" : {
                    "user" : "kimchy",
                    "message": "trying out Elasticsearch",
                    "date" : "2009-11-15T14:12:12",
                    "likes" : 0
                }
            }
        ]
    }
}

Fast check for any matching docs

terminate_after is always applied after the post_filter and stops the query as well as the aggregation executions when enough hits have been collected on the shard. Note, though, that the doc count in aggregations may not reflect hits.total in the response, since aggregations are applied before the post filtering.

In case we only want to know if there are any documents matching a specific query, we can set size to 0 to indicate that we are not interested in the search results. We can also set terminate_after to 1 to indicate that the query execution can be terminated whenever the first matching document is found (per shard).

GET /_search?q=message:number&size=0&terminate_after=1

The response will not contain any hits, as size was set to 0. hits.total will be either equal to 0, indicating that there were no matching documents, or greater than 0, meaning that there were at least that many documents matching the query when it was terminated early. Also, if the query was terminated early, the terminated_early flag will be set to true in the response.

{
  "took": 3,
  "timed_out": false,
  "terminated_early": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total" : {
        "value": 1,
        "relation": "eq"
    },
    "max_score": null,
    "hits": []
  }
}

The took time in the response contains the milliseconds that this request took for processing, beginning shortly after the node received the query, up until all search-related work is done and before the above JSON is returned to the client. This means it includes the time spent waiting in thread pools, executing a distributed search across the whole cluster, and gathering all the results.

Doc value Fields

Allows returning the doc value representation of a field for each hit. For example:

GET /_search
{
    "query" : {
        "match_all": {}
    },
    "docvalue_fields" : [
        "my_ip_field", 
        {
            "field": "my_keyword_field" 
        },
        {
            "field": "my_date_field",
            "format": "epoch_millis" 
        }
    ]
}

  • my_ip_field: the name of the field
  • my_keyword_field: an object notation is supported as well
  • my_date_field: the object notation allows specifying a custom format

Doc value fields can work on fields that have doc values enabled, regardless of whether they are stored.

* can be used as a wild card, for example:

GET /_search
{
    "query" : {
        "match_all": {}
    },
    "docvalue_fields" : [
        {
            "field": "*_date_field", 
            "format": "epoch_millis" 
        }
    ]
}

  • *_date_field: match all fields ending with _date_field
  • format: the format to be applied to all matching fields

Note that if the docvalue_fields parameter specifies fields without doc values, it will try to load the value from the fielddata cache, causing the terms for that field to be loaded into memory (cached), which results in more memory consumption.

Custom formats

While most fields do not support custom formats, some of them do:

  • Date fields can take any date format.
  • Numeric fields accept a DecimalFormat pattern.

By default fields are formatted based on a sensible configuration that depends on their mappings: long, double and other numeric fields are formatted as numbers, keyword fields are formatted as strings, date fields are formatted with the configured date format, etc.

On its own, docvalue_fields cannot be used to load fields in nested objects — if a field contains a nested object in its path, then no data will be returned for that docvalue field. To access nested fields, docvalue_fields must be used within an inner_hits block.
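
A minimal sketch of that pattern, assuming a comments field mapped as nested with a comments.text.keyword sub-field (a complete end-to-end example appears in the inner hits section below):

POST /_search
{
    "query" : {
        "nested" : {
            "path" : "comments",
            "query" : { "match_all" : {} },
            "inner_hits" : {
                "docvalue_fields" : [ "comments.text.keyword" ]
            }
        }
    }
}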

Explain

Enables explanation for each hit on how its score was computed.

GET /_search
{
    "explain": true,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Field Collapsing

Allows collapsing search results based on field values. The collapsing is done by selecting only the top sorted document per collapse key. For instance, the query below retrieves the best tweet for each user and sorts them by number of likes.

GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "user" 
    },
    "sort": ["likes"], 
    "from": 10 
}

  • field: collapse the result set using the "user" field
  • sort: sort the top docs by number of likes
  • from: define the offset of the first collapsed result

The total number of hits in the response indicates the number of matching documents without collapsing. The total number of distinct groups is unknown.

The field used for collapsing must be a single valued keyword or numeric field with doc_values activated.

The collapsing is applied to the top hits only and does not affect aggregations.
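
Because collapsing leaves aggregations untouched, one way to estimate the number of distinct groups is to add a cardinality aggregation on the collapse field. A minimal sketch (note that cardinality is an approximate count):

GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "user"
    },
    "aggs" : {
        "distinct_users" : {
            "cardinality" : { "field" : "user" }
        }
    }
}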

Expand collapse results

It is also possible to expand each collapsed top hit with the inner_hits option.

GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "user", 
        "inner_hits": {
            "name": "last_tweets", 
            "size": 5, 
            "sort": [{ "date": "asc" }] 
        },
        "max_concurrent_group_searches": 4 
    },
    "sort": ["likes"]
}

  • field: collapse the result set using the "user" field
  • name: the name used for the inner hit section in the response
  • size: the number of inner_hits to retrieve per collapse key
  • sort: how to sort the documents inside each group
  • max_concurrent_group_searches: the number of concurrent requests allowed to retrieve the inner_hits per group

See inner hits for the complete list of supported options and the format of the response.

It is also possible to request multiple inner_hits for each collapsed hit. This can be useful when you want to get multiple representations of the collapsed hits.

GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "user", 
        "inner_hits": [
            {
                "name": "most_liked",  
                "size": 3,
                "sort": ["likes"]
            },
            {
                "name": "most_recent", 
                "size": 3,
                "sort": [{ "date": "asc" }]
            }
        ]
    },
    "sort": ["likes"]
}

  • field: collapse the result set using the "user" field
  • most_liked: return the three most liked tweets for the user
  • most_recent: return the three most recent tweets for the user

The expansion of the group is done by sending an additional query for each inner_hit request for each collapsed hit returned in the response. This can significantly slow things down if you have too many groups and/or inner_hit requests.

The max_concurrent_group_searches request parameter can be used to control the maximum number of concurrent searches allowed in this phase. The default is based on the number of data nodes and the default search thread pool size.

collapse cannot be used in conjunction with scroll, rescore or search after.

Second level of collapsing

Second level of collapsing is also supported and is applied to inner_hits. For example, the following request finds the top scored tweets for each country, and within each country finds the top scored tweets for each user.

GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "country",
        "inner_hits" : {
            "name": "by_location",
            "collapse" : {"field" : "user"},
            "size": 3
        }
    }
}

Response:

{
    ...
    "hits": [
        {
            "_index": "twitter",
            "_type": "_doc",
            "_id": "9",
            "_score": ...,
            "_source": {...},
            "fields": {"country": ["UK"]},
            "inner_hits":{
                "by_location": {
                    "hits": {
                       ...,
                       "hits": [
                          {
                            ...
                            "fields": {"user" : ["user124"]}
                          },
                          {
                            ...
                            "fields": {"user" : ["user589"]}
                          },
                          {
                            ...
                             "fields": {"user" : ["user001"]}
                          }
                       ]
                    }
                 }
            }
        },
        {
            "_index": "twitter",
            "_type": "_doc",
            "_id": "1",
            "_score": ..,
            "_source": {...},
            "fields": {"country": ["Canada"]},
            "inner_hits":{
                "by_location": {
                    "hits": {
                       ...,
                       "hits": [
                          {
                            ...
                            "fields": {"user" : ["user444"]}
                          },
                          {
                            ...
                            "fields": {"user" : ["user1111"]}
                          },
                          {
                            ...
                             "fields": {"user" : ["user999"]}
                          }
                       ]
                    }
                 }
            }

        },
        ....
    ]
}

Second level of collapsing doesn’t allow inner_hits.

From / Size

Pagination of results can be done by using the from and size parameters. The from parameter defines the offset from the first result you want to fetch. The size parameter allows you to configure the maximum number of hits to be returned.

Though from and size can be set as request parameters, they can also be set within the search body. from defaults to 0, and size defaults to 10.

GET /_search
{
    "from" : 0, "size" : 10,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000.
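
If you genuinely need deeper pagination, index.max_result_window is a dynamic index setting and can be raised at the cost of additional memory. A sketch:

PUT /twitter/_settings
{
    "index.max_result_window" : 20000
}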

Elasticsearch uses Lucene’s internal doc IDs as tie-breakers. As internal doc IDs might be completely different across replicas of the same data, you may occasionally see that documents with the same sort values are not consistently ordered when paginating. For deep scrolling, it is thus more efficient to use the Scroll or Search After APIs.

Highlighting

Highlighters enable you to get highlighted snippets from one or more fields in your search results so you can show users where the query matches are. When you request highlights, the response contains an additional highlight element for each search hit that includes the highlighted fields and the highlighted fragments.

Highlighters don’t reflect the boolean logic of a query when extracting terms to highlight. Thus, for some complex boolean queries (e.g. nested boolean queries, or queries using minimum_should_match), parts of documents may be highlighted that don’t correspond to query matches.

Highlighting requires the actual content of a field. If the field is not stored (the mapping does not set store to true), the actual _source is loaded and the relevant field is extracted from _source.
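
For example, a field can be mapped with store set to true so that the highlighter can use the stored value directly instead of extracting it from _source. A sketch with a hypothetical content field:

PUT /example
{
  "mappings": {
    "properties": {
      "content" : {
        "type": "text",
        "store": true
      }
    }
  }
}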

For example, to get highlights for the content field in each search hit using the default highlighter, include a highlight object in the request body that specifies the content field:

GET /_search
{
    "query" : {
        "match": { "content": "kimchy" }
    },
    "highlight" : {
        "fields" : {
            "content" : {}
        }
    }
}

Elasticsearch supports three highlighters: unified, plain, and fvh (fast vector highlighter). You can specify the highlighter type you want to use for each field.

Unified highlighter

The unified highlighter uses the Lucene Unified Highlighter. This highlighter breaks the text into sentences and uses the BM25 algorithm to score individual sentences as if they were documents in the corpus. It also supports accurate phrase and multi-term (fuzzy, prefix, regex) highlighting. This is the default highlighter.

Plain highlighter

The plain highlighter uses the standard Lucene highlighter. It attempts to reflect the query matching logic in terms of understanding word importance and any word positioning criteria in phrase queries.

The plain highlighter works best for highlighting simple query matches in a single field. To accurately reflect query logic, it creates a tiny in-memory index and re-runs the original query criteria through Lucene’s query execution planner to get access to low-level match information for the current document. This is repeated for every field and every document that needs to be highlighted. If you want to highlight a lot of fields in a lot of documents with complex queries, we recommend using the unified highlighter on postings or term_vector fields.

Fast vector highlighter

The fvh highlighter uses the Lucene Fast Vector highlighter. This highlighter can be used on fields with term_vector set to with_positions_offsets in the mapping. The fast vector highlighter:

  • Can be customized with a boundary_scanner.
  • Requires setting term_vector to with_positions_offsets, which increases the size of the index.
  • Can combine matches from multiple fields into one result. See matched_fields.
  • Can assign different weights to matches at different positions, allowing phrase matches to be sorted above term matches when highlighting a Boosting Query that boosts phrase matches over term matches.

The fvh highlighter does not support span queries. If you need support for span queries, try an alternative highlighter, such as the unified highlighter.

Offsets Strategy

To create meaningful search snippets from the terms being queried, the highlighter needs to know the start and end character offsets of each word in the original text. These offsets can be obtained from:

  • The postings list. If index_options is set to offsets in the mapping, the unified highlighter uses this information to highlight documents without re-analyzing the text. It re-runs the original query directly on the postings and extracts the matching offsets from the index, limiting the collection to the highlighted documents. This is important if you have large fields because it doesn’t require reanalyzing the text to be highlighted. It also requires less disk space than using term_vectors.
  • Term vectors. If term_vector information is provided by setting term_vector to with_positions_offsets in the mapping, the unified highlighter automatically uses the term_vector to highlight the field. It’s fast especially for large fields (> 1MB) and for highlighting multi-term queries like prefix or wildcard because it can access the dictionary of terms for each document. The fvh highlighter always uses term vectors.
  • Plain highlighting. This mode is used by the unified when there is no other alternative. It creates a tiny in-memory index and re-runs the original query criteria through Lucene’s query execution planner to get access to low-level match information on the current document. This is repeated for every field and every document that needs highlighting. The plain highlighter always uses plain highlighting.

Plain highlighting for large texts may require a substantial amount of time and memory. To protect against this, the maximum number of text characters that will be analyzed is limited to 1000000 by default. This limit can be changed for a particular index with the index setting index.highlight.max_analyzed_offset.
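
A sketch of raising that limit for a particular index via the index settings API (this trades memory and time for the ability to highlight larger texts):

PUT /example/_settings
{
  "index.highlight.max_analyzed_offset" : 2000000
}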

Highlighting Settings

Highlighting settings can be set on a global level and overridden at the field level.

boundary_chars
A string that contains each boundary character. Defaults to .,!? \t\n.
boundary_max_scan
How far to scan for boundary characters. Defaults to 20.
boundary_scanner

Specifies how to break the highlighted fragments: chars, sentence, or word. Only valid for the unified and fvh highlighters. Defaults to sentence for the unified highlighter. Defaults to chars for the fvh highlighter.

chars
Use the characters specified by boundary_chars as highlighting boundaries. The boundary_max_scan setting controls how far to scan for boundary characters. Only valid for the fvh highlighter.
sentence

Break highlighted fragments at the next sentence boundary, as determined by Java’s BreakIterator. You can specify the locale to use with boundary_scanner_locale.

When used with the unified highlighter, the sentence scanner splits sentences bigger than fragment_size at the first word boundary next to fragment_size. You can set fragment_size to 0 to never split any sentence.

word
Break highlighted fragments at the next word boundary, as determined by Java’s BreakIterator. You can specify the locale to use with boundary_scanner_locale.
boundary_scanner_locale
Controls which locale is used to search for sentence and word boundaries. This parameter takes the form of a language tag, e.g. "en-US", "fr-FR", "ja-JP". More info can be found in the Locale Language Tag documentation. The default value is Locale.ROOT.
encoder
Indicates if the snippet should be HTML encoded: default (no encoding) or html (HTML-escape the snippet text and then insert the highlighting tags).
fields

Specifies the fields to retrieve highlights for. You can use wildcards to specify fields. For example, you could specify comment_* to get highlights for all text and keyword fields that start with comment_.

Only text and keyword fields are highlighted when you use wildcards. If you use a custom mapper and want to highlight on a field anyway, you must explicitly specify that field name.

force_source
Highlight based on the source even if the field is stored separately. Defaults to false.
fragmenter

Specifies how text should be broken up in highlight snippets: simple or span. Only valid for the plain highlighter. Defaults to span.

simple
Breaks up text into same-sized fragments.
span
Breaks up text into same-sized fragments, but tries to avoid breaking up text between highlighted terms. This is helpful when you’re querying for phrases. This is the default.
fragment_offset
Controls the margin from which you want to start highlighting. Only valid when using the fvh highlighter.
fragment_size
The size of the highlighted fragment in characters. Defaults to 100.
highlight_query

Highlight matches for a query other than the search query. This is especially useful if you use a rescore query because those are not taken into account by highlighting by default.

Elasticsearch does not validate that highlight_query contains the search query in any way so it is possible to define it so legitimate query results are not highlighted. Generally, you should include the search query as part of the highlight_query.

matched_fields
Combine matches on multiple fields to highlight a single field. This is most intuitive for multifields that analyze the same string in different ways. All matched_fields must have term_vector set to with_positions_offsets, but only the field to which the matches are combined is loaded so only that field benefits from having store set to yes. Only valid for the fvh highlighter.
no_match_size
The amount of text you want to return from the beginning of the field if there are no matching fragments to highlight. Defaults to 0 (nothing is returned).
number_of_fragments
The maximum number of fragments to return. If the number of fragments is set to 0, no fragments are returned. Instead, the entire field contents are highlighted and returned. This can be handy when you need to highlight short texts such as a title or address, but fragmentation is not required. If number_of_fragments is 0, fragment_size is ignored. Defaults to 5.
order
Sorts highlighted fragments by score when set to score. By default, fragments will be output in the order they appear in the field (order: none). Setting this option to score will output the most relevant fragments first. Each highlighter applies its own logic to compute relevancy scores. See How highlighters work internally for more details on how different highlighters find the best fragments.
phrase_limit
Controls the number of matching phrases in a document that are considered. Prevents the fvh highlighter from analyzing too many phrases and consuming too much memory. When using matched_fields, phrase_limit phrases per matched field are considered. Raising the limit increases query time and consumes more memory. Only supported by the fvh highlighter. Defaults to 256.
pre_tags
Use in conjunction with post_tags to define the HTML tags to use for the highlighted text. By default, highlighted text is wrapped in <em> and </em> tags. Specify as an array of strings.
post_tags
Use in conjunction with pre_tags to define the HTML tags to use for the highlighted text. By default, highlighted text is wrapped in <em> and </em> tags. Specify as an array of strings.
require_field_match
By default, only fields that contain a query match are highlighted. Set require_field_match to false to highlight all fields. Defaults to true.
tags_schema

Set to styled to use the built-in tag schema. The styled schema defines the following pre_tags and defines post_tags as </em>.

<em class="hlt1">, <em class="hlt2">, <em class="hlt3">,
<em class="hlt4">, <em class="hlt5">, <em class="hlt6">,
<em class="hlt7">, <em class="hlt8">, <em class="hlt9">,
<em class="hlt10">
type
The highlighter to use: unified, plain, or fvh. Defaults to unified.

Highlighting Examples

Override global settings

You can specify highlighter settings globally and selectively override them for individual fields.

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "number_of_fragments" : 3,
        "fragment_size" : 150,
        "fields" : {
            "body" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] },
            "blog.title" : { "number_of_fragments" : 0 },
            "blog.author" : { "number_of_fragments" : 0 },
            "blog.comment" : { "number_of_fragments" : 5, "order" : "score" }
        }
    }
}

Specify a highlight query

You can specify a highlight_query to take additional information into account when highlighting. For example, the following query includes both the search query and rescore query in the highlight_query. Without the highlight_query, highlighting would only take the search query into account.

GET /_search
{
    "stored_fields": [ "_id" ],
    "query" : {
        "match": {
            "comment": {
                "query": "foo bar"
            }
        }
    },
    "rescore": {
        "window_size": 50,
        "query": {
            "rescore_query" : {
                "match_phrase": {
                    "comment": {
                        "query": "foo bar",
                        "slop": 1
                    }
                }
            },
            "rescore_query_weight" : 10
        }
    },
    "highlight" : {
        "order" : "score",
        "fields" : {
            "comment" : {
                "fragment_size" : 150,
                "number_of_fragments" : 3,
                "highlight_query": {
                    "bool": {
                        "must": {
                            "match": {
                                "comment": {
                                    "query": "foo bar"
                                }
                            }
                        },
                        "should": {
                            "match_phrase": {
                                "comment": {
                                    "query": "foo bar",
                                    "slop": 1,
                                    "boost": 10.0
                                }
                            }
                        },
                        "minimum_should_match": 0
                    }
                }
            }
        }
    }
}

Set highlighter type

The type field allows you to force a specific highlighter type. The allowed values are unified, plain, and fvh. The following example forces the use of the plain highlighter:

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "fields" : {
            "comment" : {"type" : "plain"}
        }
    }
}

Configure highlighting tags

By default, the highlighting will wrap highlighted text in <em> and </em>. This can be controlled by setting pre_tags and post_tags, for example:

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "pre_tags" : ["<tag1>"],
        "post_tags" : ["</tag1>"],
        "fields" : {
            "body" : {}
        }
    }
}

When using the fast vector highlighter, you can specify additional tags and the "importance" is ordered.

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "pre_tags" : ["<tag1>", "<tag2>"],
        "post_tags" : ["</tag1>", "</tag2>"],
        "fields" : {
            "body" : {}
        }
    }
}

You can also use the built-in styled tag schema:

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "tags_schema" : "styled",
        "fields" : {
            "comment" : {}
        }
    }
}

Highlight on source

Forces the highlighting to highlight fields based on the source even if fields are stored separately. Defaults to false.

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "fields" : {
            "comment" : {"force_source" : true}
        }
    }
}

Highlight in all fields

By default, only fields that contain a query match are highlighted. Set require_field_match to false to highlight all fields.

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "require_field_match": false,
        "fields": {
                "body" : { "pre_tags" : ["<em>"], "post_tags" : ["</em>"] }
        }
    }
}

Combine matches on multiple fields

This is only supported by the fvh highlighter.

The Fast Vector Highlighter can combine matches on multiple fields to highlight a single field. This is most intuitive for multifields that analyze the same string in different ways. All matched_fields must have term_vector set to with_positions_offsets but only the field to which the matches are combined is loaded so only that field would benefit from having store set to yes.

In the following examples, comment is analyzed by the english analyzer and comment.plain is analyzed by the standard analyzer.

GET /_search
{
    "query": {
        "query_string": {
            "query": "comment.plain:running scissors",
            "fields": ["comment"]
        }
    },
    "highlight": {
        "order": "score",
        "fields": {
            "comment": {
                "matched_fields": ["comment", "comment.plain"],
                "type" : "fvh"
            }
        }
    }
}

The above matches both "run with scissors" and "running with scissors" and would highlight "running" and "scissors" but not "run". If both phrases appear in a large document then "running with scissors" is sorted above "run with scissors" in the fragments list because there are more matches in that fragment.

GET /_search
{
    "query": {
        "query_string": {
            "query": "running scissors",
            "fields": ["comment", "comment.plain^10"]
        }
    },
    "highlight": {
        "order": "score",
        "fields": {
            "comment": {
                "matched_fields": ["comment", "comment.plain"],
                "type" : "fvh"
            }
        }
    }
}

The above highlights "run" as well as "running" and "scissors" but still sorts "running with scissors" above "run with scissors" because the plain match ("running") is boosted.

GET /_search
{
    "query": {
        "query_string": {
            "query": "running scissors",
            "fields": ["comment", "comment.plain^10"]
        }
    },
    "highlight": {
        "order": "score",
        "fields": {
            "comment": {
                "matched_fields": ["comment.plain"],
                "type" : "fvh"
            }
        }
    }
}

The above query wouldn’t highlight "run" or "scissor" but shows that it is just fine not to list the field to which the matches are combined (comment) in the matched fields.

Technically it is also fine to add fields to matched_fields that don’t share the same underlying string as the field to which the matches are combined. The results might not make much sense and if one of the matches is off the end of the text then the whole query will fail.

There is a small amount of overhead involved with setting matched_fields to a non-empty array so always prefer

    "highlight": {
        "fields": {
            "comment": {}
        }
    }

to

    "highlight": {
        "fields": {
            "comment": {
                "matched_fields": ["comment"],
                "type" : "fvh"
            }
        }
    }

Explicitly order highlighted fields

Elasticsearch highlights the fields in the order that they are sent, but per the JSON spec, objects are unordered. If you need to be explicit about the order in which fields are highlighted, specify the fields as an array:

GET /_search
{
    "highlight": {
        "fields": [
            { "title": {} },
            { "text": {} }
        ]
    }
}

None of the highlighters built into Elasticsearch care about the order that the fields are highlighted but a plugin might.

Control highlighted fragments

Each field highlighted can control the size of the highlighted fragment in characters (defaults to 100), and the maximum number of fragments to return (defaults to 5). For example:

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "fields" : {
            "comment" : {"fragment_size" : 150, "number_of_fragments" : 3}
        }
    }
}

On top of this it is possible to specify that highlighted fragments need to be sorted by score:

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "order" : "score",
        "fields" : {
            "comment" : {"fragment_size" : 150, "number_of_fragments" : 3}
        }
    }
}

If the number_of_fragments value is set to 0, no fragments are produced; instead the whole content of the field is returned, and of course it is highlighted. This can be very handy if short texts (like a document title or address) need to be highlighted but no fragmentation is required. Note that fragment_size is ignored in this case.

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "fields" : {
            "body" : {},
            "blog.title" : {"number_of_fragments" : 0}
        }
    }
}

When using the fvh highlighter, you can use the fragment_offset parameter to control the margin from which to start highlighting.
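
A sketch of fragment_offset, assuming the comment field is mapped with term_vector set to with_positions_offsets, as the fvh highlighter requires:

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "fields" : {
            "comment" : {
                "type" : "fvh",
                "fragment_offset" : 10,
                "fragment_size" : 150
            }
        }
    }
}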

In the case where there is no matching fragment to highlight, the default is to not return anything. Instead, we can return a snippet of text from the beginning of the field by setting no_match_size (default 0) to the length of the text that you want returned. The actual length may be shorter or longer than specified as it tries to break on a word boundary.

GET /_search
{
    "query" : {
        "match": { "user": "kimchy" }
    },
    "highlight" : {
        "fields" : {
            "comment" : {
                "fragment_size" : 150,
                "number_of_fragments" : 3,
                "no_match_size": 150
            }
        }
    }
}

Highlight using the postings list

Here is an example of setting the comment field in the index mapping to allow for highlighting using the postings:

PUT /example
{
  "mappings": {
    "properties": {
      "comment" : {
        "type": "text",
        "index_options" : "offsets"
      }
    }
  }
}

Here is an example of setting the comment field to allow for highlighting using the term_vectors (this will cause the index to be bigger):

PUT /example
{
  "mappings": {
    "properties": {
      "comment" : {
        "type": "text",
        "term_vector" : "with_positions_offsets"
      }
    }
  }
}

Specify a fragmenter for the plain highlighter

When using the plain highlighter, you can choose between the simple and span fragmenters:

GET twitter/_search
{
    "query" : {
        "match_phrase": { "message": "number 1" }
    },
    "highlight" : {
        "fields" : {
            "message" : {
                "type": "plain",
                "fragment_size" : 15,
                "number_of_fragments" : 3,
                "fragmenter": "simple"
            }
        }
    }
}

Response:

{
    ...
    "hits": {
        "total" : {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.601195,
        "hits": [
            {
                "_index": "twitter",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.601195,
                "_source": {
                    "user": "test",
                    "message": "some message with the number 1",
                    "date": "2009-11-15T14:12:12",
                    "likes": 1
                },
                "highlight": {
                    "message": [
                        " with the <em>number</em>",
                        " <em>1</em>"
                    ]
                }
            }
        ]
    }
}

GET twitter/_search
{
    "query" : {
        "match_phrase": { "message": "number 1" }
    },
    "highlight" : {
        "fields" : {
            "message" : {
                "type": "plain",
                "fragment_size" : 15,
                "number_of_fragments" : 3,
                "fragmenter": "span"
            }
        }
    }
}

Response:

{
    ...
    "hits": {
        "total" : {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.601195,
        "hits": [
            {
                "_index": "twitter",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.601195,
                "_source": {
                    "user": "test",
                    "message": "some message with the number 1",
                    "date": "2009-11-15T14:12:12",
                    "likes": 1
                },
                "highlight": {
                    "message": [
                        " with the <em>number</em> <em>1</em>"
                    ]
                }
            }
        ]
    }
}

If the number_of_fragments option is set to 0, the NullFragmenter is used, which does not fragment the text at all. This is useful for highlighting the entire contents of a document or field.

How highlighters work internally

Given a query and a text (the content of a document field), the goal of a highlighter is to find the best text fragments for the query, and highlight the query terms in the found fragments. For this, a highlighter needs to address several questions:

  • How to break a text into fragments?
  • How to find the best fragments among all fragments?
  • How to highlight the query terms in a fragment?

How to break a text into fragments?

Relevant settings: fragment_size, fragmenter, type of highlighter, boundary_chars, boundary_max_scan, boundary_scanner, boundary_scanner_locale.

The plain highlighter begins by analyzing the text using the given analyzer and creating a token stream from it. It uses a very simple algorithm to break the token stream into fragments: it loops through the terms in the token stream, and every time the current term’s end_offset exceeds fragment_size multiplied by the number of created fragments, a new fragment is created. A little more computation is done when using the span fragmenter to avoid breaking up text between highlighted terms. But overall, since the breaking is done only by fragment_size, some fragments can be quite odd, e.g. beginning with a punctuation mark.

The unified and fvh highlighters do a better job of breaking up a text into fragments by utilizing Java’s BreakIterator. This ensures that a fragment is a valid sentence as long as fragment_size allows for it.

How to find the best fragments?

Relevant settings: number_of_fragments.

To find the best, most relevant fragments, a highlighter needs to score each fragment with respect to the given query. The goal is to score only those terms that participated in generating the hit on the document. For some complex queries, this is still a work in progress.

The plain highlighter creates an in-memory index from the current token stream, and re-runs the original query criteria through Lucene’s query execution planner to get access to low-level match information for the current text. For more complex queries, the original query may be converted to a span query, as span queries can handle phrases more accurately. The obtained low-level match information is then used to score each individual fragment. The scoring method of the plain highlighter is quite simple: each fragment is scored by the number of unique query terms found in it. The score of an individual term is equal to its boost, which is 1 by default. Thus, by default, a fragment that contains one unique query term will get a score of 1, a fragment that contains two unique query terms will get a score of 2, and so on. The fragments are then sorted by their scores, so the highest scored fragments are output first.

The fvh highlighter doesn’t need to analyze the text and build an in-memory index, as it uses pre-indexed document term vectors, and finds among them the terms that correspond to the query. It scores each fragment by the number of query terms found in that fragment. Similarly to the plain highlighter, the score of an individual term is equal to its boost value. In contrast to the plain highlighter, all query terms are counted, not only unique terms.

The unified highlighter can use pre-indexed term vectors or pre-indexed term offsets, if they are available. Otherwise, similar to the plain highlighter, it has to create an in-memory index from the text. The unified highlighter uses the BM25 scoring model to score fragments.

How to highlight the query terms in a fragment?

Relevant settings: pre_tags, post_tags.

The goal is to highlight only those terms that participated in generating the hit on the document. For some complex boolean queries, this is still work in progress, as highlighters don’t reflect the boolean logic of a query and only extract leaf (terms, phrases, prefix etc) queries.

The plain highlighter, given the token stream and the original text, recomposes the original text to highlight only those terms from the token stream that are contained in the low-level match information structure from the previous step.

The fvh and unified highlighters use intermediate data structures to represent fragments in some raw form, and then populate them with actual text.

A highlighter uses pre_tags and post_tags to encode highlighted terms.

An example of the work of the unified highlighter

Let’s look in more detail at how the unified highlighter works.

First, we create an index with a text field content, which will be indexed using the english analyzer, without offsets or term vectors.

PUT test_index
{
    "mappings": {
        "properties": {
            "content" : {
                "type" : "text",
                "analyzer" : "english"
            }
        }
    }
}

We put the following document into the index:

PUT test_index/_doc/doc1
{
  "content" : "For you I'm only a fox like a hundred thousand other foxes. But if you tame me, we'll need each other. You'll be the only boy in the world for me. I'll be the only fox in the world for you."
}

And we run the following query with a highlight request:

GET test_index/_search
{
    "query": {
        "match_phrase" : {"content" : "only fox"}
    },
    "highlight": {
        "type" : "unified",
        "number_of_fragments" : 3,
        "fields": {
            "content": {}
        }
    }
}

After doc1 is found as a hit for this query, the hit is passed to the unified highlighter for highlighting the field content of the document. Since the field was indexed with neither offsets nor term vectors, its raw field value will be analyzed, and an in-memory index will be built from the terms that match the query:

{"token":"onli","start_offset":12,"end_offset":16,"position":3},
{"token":"fox","start_offset":19,"end_offset":22,"position":5},
{"token":"fox","start_offset":53,"end_offset":58,"position":11},
{"token":"onli","start_offset":117,"end_offset":121,"position":24},
{"token":"onli","start_offset":159,"end_offset":163,"position":34},
{"token":"fox","start_offset":164,"end_offset":167,"position":35}

Our complex phrase query will be converted to the span query spanNear([text:onli, text:fox], 0, true), meaning that we are looking for the terms "onli" and "fox" within 0 distance from each other, and in the given order. The span query will be run against the in-memory index created before, to find the following match:

{"term":"onli", "start_offset":159, "end_offset":163},
{"term":"fox", "start_offset":164, "end_offset":167}

In our example, we have got a single match, but there could be several matches. Given the matches, the unified highlighter breaks the text of the field into so-called "passages". Each passage must contain at least one match. The unified highlighter, with the use of Java’s BreakIterator, ensures that each passage represents a full sentence as long as it doesn’t exceed fragment_size. For our example, we have got a single passage with the following properties (showing only a subset of them here):

Passage:
    startOffset: 147
    endOffset: 189
    score: 3.7158387
    matchStarts: [159, 164]
    matchEnds: [163, 167]
    numMatches: 2

Notice how a passage has a score, calculated using the BM25 scoring formula adapted for passages. Scores allow us to choose the best scoring passages if there are more passages available than the number_of_fragments requested by the user. Scores also let us sort passages by order: "score" if requested by the user.

As the final step, the unified highlighter will extract from the field’s text a string corresponding to each passage:

"I'll be the only fox in the world for you."

and will format all matches in this string with the tags <em> and </em>, using the passage’s matchStarts and matchEnds information:

I'll be the <em>only</em> <em>fox</em> in the world for you.

These formatted strings are the final result of the highlighter, returned to the user.

Index Boost

Allows configuring different boost levels per index when searching across more than one index. This is very handy when hits coming from one index matter more than hits coming from another index (think social graph where each user has an index).

Deprecated in 5.2.0.

This format is deprecated. Please use array format instead.

GET /_search
{
    "indices_boost" : {
        "index1" : 1.4,
        "index2" : 1.3
    }
}

You can also specify it as an array to control the order of boosts.

GET /_search
{
    "indices_boost" : [
        { "alias1" : 1.4 },
        { "index*" : 1.3 }
    ]
}

This is important when you use aliases or wildcard expressions. If multiple matches are found, the first match will be used. For example, if an index is included in both alias1 and index*, a boost value of 1.4 is applied.

Inner hits

The parent-join and nested features allow the return of documents that have matches in a different scope. In the parent/child case, parent documents are returned based on matches in child documents or child documents are returned based on matches in parent documents. In the nested case, documents are returned based on matches in nested inner objects.

In both cases, the actual matches in the different scopes that caused a document to be returned are hidden. In many cases, it’s very useful to know which inner nested objects (in the case of nested) or children/parent documents (in the case of parent/child) caused certain information to be returned. The inner hits feature can be used for this: for each search hit in the response, it returns the additional nested hits that caused the search hit to match in a different scope.

Inner hits can be used by defining an inner_hits definition on a nested, has_child or has_parent query and filter. The structure looks like this:

"<query>" : {
    "inner_hits" : {
        <inner_hits_options>
    }
}

If inner_hits is defined on a query that supports it, then each search hit will contain an inner_hits JSON object with the following structure:

"hits": [
     {
        "_index": ...,
        "_type": ...,
        "_id": ...,
        "inner_hits": {
           "<inner_hits_name>": {
              "hits": {
                 "total": ...,
                 "hits": [
                    {
                       "_type": ...,
                       "_id": ...,
                       ...
                    },
                    ...
                 ]
              }
           }
        },
        ...
     },
     ...
]

Options

Inner hits support the following options:

from

The offset from which to fetch the first hit for each inner_hits in the returned regular search hits.

size

The maximum number of hits to return per inner_hits. By default the top three matching hits are returned.

sort

How the inner hits should be sorted per inner_hits. By default the hits are sorted by the score.

name

The name to be used for the particular inner hit definition in the response. Useful when multiple inner hits have been defined in a single search request. The default depends on which query the inner hit is defined in: for the has_child query and filter it is the child type, for has_parent it is the parent type, and for nested it is the nested path.

Inner hits also support per-document features such as highlighting, explain, source filtering, script fields, and doc value fields.
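
A sketch combining several of these options, assuming the nested comments field used in the examples below:

POST test/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": {"comments.number" : 2}
      },
      "inner_hits": {
        "name": "recent_comments",
        "from": 0,
        "size": 5,
        "sort": [{ "comments.number": "desc" }]
      }
    }
  }
}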

Nested inner hits

The nested inner_hits can be used to include nested inner objects as inner hits to a search hit.

PUT test
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested"
      }
    }
  }
}

PUT test/_doc/1?refresh
{
  "title": "Test title",
  "comments": [
    {
      "author": "kimchy",
      "number": 1
    },
    {
      "author": "nik9000",
      "number": 2
    }
  ]
}

POST test/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": {"comments.number" : 2}
      },
      "inner_hits": {} 
    }
  }
}

The inner hit definition in the nested query. No other options need to be defined.

An example of a response snippet that could be generated from the above search request:

{
  ...,
  "hits": {
    "total" : {
        "value": 1,
        "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 1.0,
        "_source": ...,
        "inner_hits": {
          "comments": { 
            "hits": {
              "total" : {
                  "value": 1,
                  "relation": "eq"
              },
              "max_score": 1.0,
              "hits": [
                {
                  "_index": "test",
                  "_type": "_doc",
                  "_id": "1",
                  "_nested": {
                    "field": "comments",
                    "offset": 1
                  },
                  "_score": 1.0,
                  "_source": {
                    "author": "nik9000",
                    "number": 2
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}

The name used in the inner hit definition in the search request. A custom key can be used via the name option.

The _nested metadata is crucial in the above example, because it defines from what inner nested object this inner hit came. The field defines the object array field the nested hit is from, and the offset is relative to its location in the _source. Due to sorting and scoring, the actual location of the hit objects in inner_hits is usually different from the location where the nested inner object was defined.

By default the _source is also returned for the hit objects in inner_hits, but this can be changed: via the _source filtering feature, part of the source can be returned, or the source can be disabled entirely. If stored fields are defined on the nested level, these can also be returned via the fields feature.

An important default is that the _source returned in hits inside inner_hits is relative to the _nested metadata. So in the above example only the comment part is returned per nested hit and not the entire source of the top level document that contained the comment.

Nested inner hits and _source

Nested documents don’t have a _source field, because the entire source of the document is stored with the root document under its _source field. To include the source of just the nested document, the source of the root document is parsed and just the relevant bit for the nested document is included as source in the inner hit. Doing this for each matching nested document has an impact on the time it takes to execute the entire search request, especially when size and the inner hits' size are set higher than the default. To avoid the relatively expensive source extraction for nested inner hits, one can disable including the source and rely solely on doc value fields, like this:

PUT test
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested"
      }
    }
  }
}

PUT test/_doc/1?refresh
{
  "title": "Test title",
  "comments": [
    {
      "author": "kimchy",
      "text": "comment text"
    },
    {
      "author": "nik9000",
      "text": "words words words"
    }
  ]
}

POST test/_search
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": {"comments.text" : "words"}
      },
      "inner_hits": {
        "_source" : false,
        "docvalue_fields" : [
          "comments.text.keyword"
        ]
      }
    }
  }
}

Hierarchical levels of nested object fields and inner hits

If a mapping has multiple levels of hierarchical nested object fields, each level can be accessed via a dot-notated path. For example, if there is a comments nested field that contains a votes nested field, and the votes should be returned directly with the root hits, then the following path can be defined:

PUT test
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested",
        "properties": {
          "votes": {
            "type": "nested"
          }
        }
      }
    }
  }
}

PUT test/_doc/1?refresh
{
  "title": "Test title",
  "comments": [
    {
      "author": "kimchy",
      "text": "comment text",
      "votes": []
    },
    {
      "author": "nik9000",
      "text": "words words words",
      "votes": [
        {"value": 1 , "voter": "kimchy"},
        {"value": -1, "voter": "other"}
      ]
    }
  ]
}

POST test/_search
{
  "query": {
    "nested": {
      "path": "comments.votes",
        "query": {
          "match": {
            "comments.votes.voter": "kimchy"
          }
        },
        "inner_hits" : {}
    }
  }
}

Which would look like:

{
  ...,
  "hits": {
    "total" : {
        "value": 1,
        "relation": "eq"
    },
    "max_score": 0.6931472,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.6931472,
        "_source": ...,
        "inner_hits": {
          "comments.votes": { 
            "hits": {
              "total" : {
                  "value": 1,
                  "relation": "eq"
              },
              "max_score": 0.6931472,
              "hits": [
                {
                  "_index": "test",
                  "_type": "_doc",
                  "_id": "1",
                  "_nested": {
                    "field": "comments",
                    "offset": 1,
                    "_nested": {
                      "field": "votes",
                      "offset": 0
                    }
                  },
                  "_score": 0.6931472,
                  "_source": {
                    "value": 1,
                    "voter": "kimchy"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}

This indirect referencing is only supported for nested inner hits.

Parent/child inner hits

The parent/child inner_hits can be used to include parent or child documents:

PUT test
{
  "mappings": {
    "properties": {
      "my_join_field": {
        "type": "join",
        "relations": {
          "my_parent": "my_child"
        }
      }
    }
  }
}

PUT test/_doc/1?refresh
{
  "number": 1,
  "my_join_field": "my_parent"
}

PUT test/_doc/2?routing=1&refresh
{
  "number": 1,
  "my_join_field": {
    "name": "my_child",
    "parent": "1"
  }
}

POST test/_search
{
  "query": {
    "has_child": {
      "type": "my_child",
      "query": {
        "match": {
          "number": 1
        }
      },
      "inner_hits": {}    
    }
  }
}

The inner hit definition, as in the nested example.

An example of a response snippet that could be generated from the above search request:

{
    ...,
    "hits": {
        "total" : {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "test",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.0,
                "_source": {
                    "number": 1,
                    "my_join_field": "my_parent"
                },
                "inner_hits": {
                    "my_child": {
                        "hits": {
                            "total" : {
                                "value": 1,
                                "relation": "eq"
                            },
                            "max_score": 1.0,
                            "hits": [
                                {
                                    "_index": "test",
                                    "_type": "_doc",
                                    "_id": "2",
                                    "_score": 1.0,
                                    "_routing": "1",
                                    "_source": {
                                        "number": 1,
                                        "my_join_field": {
                                            "name": "my_child",
                                            "parent": "1"
                                        }
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

min_score

Exclude documents which have a _score less than the minimum specified in min_score:

GET /_search
{
    "min_score": 0.5,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Note that, most of the time, this does not make much sense, but it is provided for advanced use cases.

Named Queries

Each filter and query can accept a _name in its top level definition.

GET /_search
{
    "query": {
        "bool" : {
            "should" : [
                {"match" : { "name.first" : {"query" : "shay", "_name" : "first"} }},
                {"match" : { "name.last" : {"query" : "banon", "_name" : "last"} }}
            ],
            "filter" : {
                "terms" : {
                    "name.last" : ["banon", "kimchy"],
                    "_name" : "test"
                }
            }
        }
    }
}

The search response will include, for each hit, the matched_queries it matched on. The tagging of queries and filters only makes sense for the bool query.
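
To illustrate, here is a heavily abbreviated sketch of how a hit might carry these tags in the response; the index name, score, and tag values are purely illustrative:

{
    ...,
    "hits": {
        "hits": [
            {
                "_index": "test",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.2,
                "_source": ...,
                "matched_queries": ["first", "test"]
            }
        ]
    }
}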

Post filter

The post_filter is applied to the search hits at the very end of a search request, after aggregations have already been calculated. Its purpose is best explained by example:

Imagine that you are selling shirts that have the following properties:

PUT /shirts
{
    "mappings": {
        "properties": {
            "brand": { "type": "keyword"},
            "color": { "type": "keyword"},
            "model": { "type": "keyword"}
        }
    }
}

PUT /shirts/_doc/1?refresh
{
    "brand": "gucci",
    "color": "red",
    "model": "slim"
}

Imagine a user has specified two filters: color:red and brand:gucci. You only want to show them red shirts made by Gucci in the search results. Normally you would do this with a bool query:

GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red"   }},
        { "term": { "brand": "gucci" }}
      ]
    }
  }
}

However, you would also like to use faceted navigation to display a list of other options that the user could click on. Perhaps you have a model field that would allow the user to limit their search results to red Gucci t-shirts or dress-shirts.

This can be done with a terms aggregation:

GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "red"   }},
        { "term": { "brand": "gucci" }}
      ]
    }
  },
  "aggs": {
    "models": {
      "terms": { "field": "model" } 
    }
  }
}

Returns the most popular models of red shirts by Gucci.

But perhaps you would also like to tell the user how many Gucci shirts are available in other colors. If you just add a terms aggregation on the color field, you will only get back the color red, because your query returns only red shirts by Gucci.

Instead, you want to include shirts of all colors during aggregation, then apply the colors filter only to the search results. This is the purpose of the post_filter:

GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": { "brand": "gucci" } 
      }
    }
  },
  "aggs": {
    "colors": {
      "terms": { "field": "color" } 
    },
    "color_red": {
      "filter": {
        "term": { "color": "red" } 
      },
      "aggs": {
        "models": {
          "terms": { "field": "model" } 
        }
      }
    }
  },
  "post_filter": { 
    "term": { "color": "red" }
  }
}

The main query now finds all shirts by Gucci, regardless of color.

The colors agg returns popular colors for shirts by Gucci.

The color_red agg limits the models sub-aggregation to red Gucci shirts.

Finally, the post_filter removes colors other than red from the search hits.

Preference

Controls a preference of the shard copies on which to execute the search. By default, Elasticsearch selects from the available shard copies in an unspecified order, taking the allocation awareness and adaptive replica selection configuration into account. However, it may sometimes be desirable to try and route certain searches to certain sets of shard copies.

A possible use case would be to make use of per-copy caches like the request cache. Doing this, however, runs contrary to the idea of search parallelization and can create hotspots on certain nodes because the load might not be evenly distributed anymore.

The preference is a query string parameter which can be set to:

_only_local

The operation will be executed only on shards allocated to the local node.

_local

The operation will be executed on shards allocated to the local node if possible, and will fall back to other shards if not.

_prefer_nodes:abc,xyz

The operation will be executed on nodes with one of the provided node ids (abc or xyz in this case) if possible. If suitable shard copies exist on more than one of the selected nodes then the order of preference between these copies is unspecified.

_shards:2,3

Restricts the operation to the specified shards (2 and 3 in this case). This preference can be combined with other preferences, but it has to appear first: _shards:2,3|_local

_only_nodes:abc*,x*yz,...

Restricts the operation to nodes specified according to the node specification. If suitable shard copies exist on more than one of the selected nodes then the order of preference between these copies is unspecified.

Custom (string) value

Any value that does not start with _. If two searches both give the same custom string value for their preference and the underlying cluster state does not change then the same ordering of shards will be used for the searches. This does not guarantee that the exact same shards will be used each time: the cluster state, and therefore the selected shards, may change for a number of reasons including shard relocations and shard failures, and nodes may sometimes reject searches causing fallbacks to alternative nodes. However, in practice the ordering of shards tends to remain stable for long periods of time. A good candidate for a custom preference value is something like the web session id or the user name.

For instance, use the user’s session ID xyzabc123 as follows:

GET /_search?preference=xyzabc123
{
    "query": {
        "match": {
            "title": "elasticsearch"
        }
    }
}

This can be an effective strategy to increase usage of e.g. the request cache for unique users running similar searches repeatedly by always hitting the same cache, while requests of different users are still spread across all shard copies.

The _only_local preference guarantees only to use shard copies on the local node, which is sometimes useful for troubleshooting. All other options do not fully guarantee that any particular shard copies are used in a search, and on a changing index this may mean that repeated searches may yield different results if they are executed on different shard copies which are in different refresh states.

Query

The query element within the search request body allows you to define a query using the Query DSL.

GET /_search
{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Rescoring

Rescoring can help to improve precision by reordering just the top (e.g. 100-500) documents returned by the query and post_filter phases, using a secondary (usually more costly) algorithm, instead of applying the costly algorithm to all documents in the index.

A rescore request is executed on each shard before it returns its results to be sorted by the node handling the overall search request.

Currently the rescore API has only one implementation: the query rescorer, which uses a query to tweak the scoring. In the future, alternative rescorers may be made available, for example, a pair-wise rescorer.

An error will be thrown if an explicit sort (other than _score in descending order) is provided with a rescore query.

When exposing pagination to your users, you should not change window_size as you step through each page (by passing different from values), since that can alter the top hits, causing results to confusingly shift as the user steps through pages.

Query rescorer

The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.

By default the scores from the original query and the rescore query are combined linearly to produce the final _score for each document. The relative importance of the original query and of the rescore query can be controlled with the query_weight and rescore_query_weight respectively. Both default to 1.

For example:

POST /_search
{
   "query" : {
      "match" : {
         "message" : {
            "operator" : "or",
            "query" : "the quick brown"
         }
      }
   },
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "match_phrase" : {
               "message" : {
                  "query" : "the quick brown",
                  "slop" : 2
               }
            }
         },
         "query_weight" : 0.7,
         "rescore_query_weight" : 1.2
      }
   }
}

The way the scores are combined can be controlled with the score_mode:

total

Add the original score and the rescore query score. The default.

multiply

Multiply the original score by the rescore query score. Useful for function query rescores.

avg

Average the original score and the rescore query score.

max

Take the max of the original score and the rescore query score.

min

Take the min of the original score and the rescore query score.

Multiple Rescores

It is also possible to execute multiple rescores in sequence:

POST /_search
{
   "query" : {
      "match" : {
         "message" : {
            "operator" : "or",
            "query" : "the quick brown"
         }
      }
   },
   "rescore" : [ {
      "window_size" : 100,
      "query" : {
         "rescore_query" : {
            "match_phrase" : {
               "message" : {
                  "query" : "the quick brown",
                  "slop" : 2
               }
            }
         },
         "query_weight" : 0.7,
         "rescore_query_weight" : 1.2
      }
   }, {
      "window_size" : 10,
      "query" : {
         "score_mode": "multiply",
         "rescore_query" : {
            "function_score" : {
               "script_score": {
                  "script": {
                    "source": "Math.log10(doc.likes.value + 2)"
                  }
               }
            }
         }
      }
   } ]
}

The first one gets the results of the query then the second one gets the results of the first, etc. The second rescore will "see" the sorting done by the first rescore so it is possible to use a large window on the first rescore to pull documents into a smaller window for the second rescore.

Script Fields

Allows you to return a script evaluation (based on different fields) for each hit. For example:

GET /_search
{
    "query" : {
        "match_all": {}
    },
    "script_fields" : {
        "test1" : {
            "script" : {
                "lang": "painless",
                "source": "doc['price'].value * 2"
            }
        },
        "test2" : {
            "script" : {
                "lang": "painless",
                "source": "doc['price'].value * params.factor",
                "params" : {
                    "factor"  : 2.0
                }
            }
        }
    }
}

Script fields can work on fields that are not stored (price in the above case), and allow custom values to be returned (the evaluated value of the script).

Script fields can also access the actual _source document and extract specific elements to be returned from it by using params['_source']. Here is an example:

GET /_search
{
    "query" : {
        "match_all": {}
    },
    "script_fields" : {
        "test1" : {
            "script" : "params['_source']['message']"
        }
    }
}

Note the _source keyword here, used to navigate the JSON-like model.

It’s important to understand the difference between doc['my_field'].value and params['_source']['my_field']. The first, using the doc keyword, will cause the terms for that field to be loaded to memory (cached), which will result in faster execution, but more memory consumption. Also, the doc[...] notation only allows for simple valued fields (you can’t return a json object from it) and makes sense only for non-analyzed or single term based fields. However, using doc is still the recommended way to access values from the document, if at all possible, because _source must be loaded and parsed every time it’s used. Using _source is very slow.

Scroll

While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.

Scrolling is not intended for real time user requests, but rather for processing large amounts of data, e.g. in order to reindex the contents of one index into a new index with a different configuration.

The results that are returned from a scroll request reflect the state of the index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.

In order to use scrolling, the initial search request should specify the scroll parameter in the query string, which tells Elasticsearch how long it should keep the “search context” alive (see Keeping the search context alive), e.g. ?scroll=1m.

POST /twitter/_search?scroll=1m
{
    "size": 100,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}

The result from the above request includes a _scroll_id, which should be passed to the scroll API in order to retrieve the next batch of results.

POST /_search/scroll 
{
    "scroll" : "1m", 
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==" 
}

GET or POST can be used and the URL should not include the index name — this is specified in the original search request instead.

The scroll parameter tells Elasticsearch to keep the search context open for another 1m.

The scroll_id parameter holds the _scroll_id returned by the previous request.

The size parameter allows you to configure the maximum number of hits to be returned with each batch of results. Each call to the scroll API returns the next batch of results until there are no more results left to return, ie the hits array is empty.

The initial search request and each subsequent scroll request each return a _scroll_id. While the _scroll_id may change between requests, it doesn’t always change — in any case, only the most recently received _scroll_id should be used.

If the request specifies aggregations, only the initial search response will contain the aggregations results.

Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option:

GET /_search?scroll=1m
{
  "sort": [
    "_doc"
  ]
}

Keeping the search context alive

A scroll returns all the documents which matched the search at the time of the initial search request. It ignores any subsequent changes to these documents. The scroll_id identifies a search context which keeps track of everything that Elasticsearch needs to return the correct documents. The search context is created by the initial request and kept alive by subsequent requests.

The scroll parameter (passed to the search request and to every scroll request) tells Elasticsearch how long it should keep the search context alive. Its value (e.g. 1m, see Time units) does not need to be long enough to process all data — it just needs to be long enough to process the previous batch of results. Each scroll request (with the scroll parameter) sets a new expiry time. If a scroll request doesn’t pass in the scroll parameter, then the search context will be freed as part of that scroll request.
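
For instance, here is a minimal sketch of a follow-up scroll request that renews the context for only 30 seconds, just long enough to process the next batch (reusing the scroll ID from the earlier example):

POST /_search/scroll
{
    "scroll" : "30s",
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}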

Normally, the background merge process optimizes the index by merging together smaller segments to create new, bigger segments. Once the smaller segments are no longer needed they are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted since they are still in use.

Keeping older segments alive means that more disk space and file handles are needed. Ensure that you have configured your nodes to have ample free file handles. See File Descriptors.

Additionally, if a segment contains deleted or updated documents then the search context must keep track of whether each document in the segment was live at the time of the initial search request. Ensure that your nodes have sufficient heap space if you have many open scrolls on an index that is subject to ongoing deletes or updates.

To protect against issues caused by having too many scrolls open, the user is not allowed to open scrolls past a certain limit. By default, the maximum number of open scrolls is 500. This limit can be updated with the search.max_open_scroll_context cluster setting.
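
For example, here is a sketch of raising this limit with the cluster settings API; the value 1024 is purely illustrative:

PUT /_cluster/settings
{
    "persistent" : {
        "search.max_open_scroll_context": 1024
    }
}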

You can check how many search contexts are open with the nodes stats API:

GET /_nodes/stats/indices/search
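
A heavily abbreviated sketch of the relevant part of such a response (the node id is a placeholder and the exact set of fields may vary by version; open_contexts is the counter of interest):

{
    ...,
    "nodes" : {
        "<node_id>" : {
            "indices" : {
                "search" : {
                    "open_contexts" : 4,
                    ...
                }
            }
        }
    }
}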

Clear scroll API

Search contexts are automatically removed when the scroll timeout is exceeded. However, keeping scrolls open has a cost (as discussed in the previous section), so scrolls should be explicitly cleared as soon as they are no longer needed, using the clear-scroll API:

DELETE /_search/scroll
{
    "scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}

Multiple scroll IDs can be passed as an array:

DELETE /_search/scroll
{
    "scroll_id" : [
      "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==",
      "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB"
    ]
}

All search contexts can be cleared with the _all parameter:

DELETE /_search/scroll/_all

The scroll_id can also be passed as a query string parameter or in the request body. Multiple scroll IDs can be passed as comma separated values:

DELETE /_search/scroll/DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ==,DnF1ZXJ5VGhlbkZldGNoBQAAAAAAAAABFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAAAxZrUllkUVlCa1NqNmRMaUhiQlZkMWFBAAAAAAAAAAIWa1JZZFFZQmtTajZkTGlIYkJWZDFhQQAAAAAAAAAFFmtSWWRRWUJrU2o2ZExpSGJCVmQxYUEAAAAAAAAABBZrUllkUVlCa1NqNmRMaUhiQlZkMWFB

Sliced Scroll

For scroll queries that return a lot of documents, it is possible to split the scroll into multiple slices which can be consumed independently:

GET /twitter/_search?scroll=1m
{
    "slice": {
        "id": 0, 
        "max": 2 
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}
GET /twitter/_search?scroll=1m
{
    "slice": {
        "id": 1,
        "max": 2
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}

The id of the slice

The maximum number of slices

The result from the first request returned documents that belong to the first slice (id: 0) and the result from the second request returned documents that belong to the second slice. Since the maximum number of slices is set to 2, the union of the results of the two requests is equivalent to the results of a scroll query without slicing. By default the splitting is done first on the shards and then locally on each shard using the _id field, with the following formula: slice(doc) = floorMod(hashCode(doc._id), max). For instance, if the number of shards is equal to 2 and the user requested 4 slices, then slices 0 and 2 are assigned to the first shard and slices 1 and 3 are assigned to the second shard.

Each scroll is independent and can be processed in parallel like any scroll request.

If the number of slices is bigger than the number of shards the slice filter is very slow on the first calls: it has a complexity of O(N) and a memory cost equal to N bits per slice, where N is the total number of documents in the shard. After a few calls the filter should be cached and subsequent calls should be faster, but you should limit the number of sliced queries you perform in parallel to avoid a memory explosion.

To avoid this cost entirely it is possible to use the doc_values of another field to do the slicing but the user must ensure that the field has the following properties:

  • The field is numeric.
  • doc_values are enabled on that field.
  • Every document should contain a single value. If a document has multiple values for the specified field, the first value is used.
  • The value for each document should be set once when the document is created and never updated. This ensures that each slice gets deterministic results.
  • The cardinality of the field should be high. This ensures that each slice gets approximately the same amount of documents.
GET /twitter/_search?scroll=1m
{
    "slice": {
        "field": "date",
        "id": 0,
        "max": 10
    },
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    }
}

For append only time-based indices, the timestamp field can be used safely.

By default the maximum number of slices allowed per scroll is limited to 1024. You can update the index.max_slices_per_scroll index setting to bypass this limit.
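
For example, here is a minimal sketch of raising this limit on the twitter index; the value 4096 is purely illustrative:

PUT /twitter/_settings
{
    "index.max_slices_per_scroll": 4096
}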

Search After

Pagination of results can be done by using from and size, but the cost becomes prohibitive when deep pagination is reached. The index.max_result_window setting, which defaults to 10,000, is a safeguard: search requests take heap memory and time proportional to from + size. The scroll API is recommended for efficient deep scrolling, but scroll contexts are costly and it is not recommended to use them for real-time user requests. The search_after parameter circumvents this problem by providing a live cursor. The idea is to use the results from the previous page to help retrieve the next page.

Suppose that the query to retrieve the first page looks like this:

GET twitter/_search
{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "sort": [
        {"date": "asc"},
        {"tie_breaker_id": "asc"}      
    ]
}

A copy of the _id field with doc_values enabled

A field with one unique value per document should be used as the tiebreaker of the sort specification. Otherwise, the sort order for documents that have the same sort values would be undefined and could lead to missing or duplicate results. The _id field has a unique value per document, but it is not recommended to use it as a tiebreaker directly. Beware that search_after looks for the first document which fully or partially matches the tiebreaker’s provided value. Therefore, if a document has a tiebreaker value of "654323" and you search_after for "654", it would still match that document and return results found after it. doc values are disabled on this field, so sorting on it requires loading a lot of data into memory. Instead it is advised to duplicate (client-side or with a set ingest processor) the content of the _id field into another field that has doc values enabled, and to use this new field as the tiebreaker for the sort.
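
As an illustration, here is a minimal sketch of such a duplication using a set ingest processor; the pipeline name and the tie_breaker_id field are hypothetical:

PUT _ingest/pipeline/tie-breaker
{
    "description": "Copy _id into a tiebreaker field that has doc values enabled",
    "processors": [
        {
            "set": {
                "field": "tie_breaker_id",
                "value": "{{_id}}"
            }
        }
    ]
}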

The result from the search request above includes an array of sort values for each document. These sort values can be used in conjunction with the search_after parameter to start returning results "after" any document in the result list. For instance, we can use the sort values of the last document and pass them to search_after to retrieve the next page of results:

GET twitter/_search
{
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [1463538857, "654323"],
    "sort": [
        {"date": "asc"},
        {"tie_breaker_id": "asc"}
    ]
}

The parameter from must be set to 0 (or -1) when search_after is used.

search_after is not a solution to jump freely to a random page but rather to scroll many queries in parallel. It is very similar to the scroll API, but unlike it, the search_after parameter is stateless: it is always resolved against the latest version of the searcher. For this reason the sort order may change during a walk depending on the updates and deletes of your index.

Search Type

There are different execution paths that can be done when executing a distributed search. The distributed search operation needs to be scattered to all the relevant shards and then all the results are gathered back. When doing scatter/gather type execution, there are several ways to do that, specifically with search engines.

One of the questions when executing a distributed search is how many results to retrieve from each shard. For example, if we have 10 shards, the 1st shard might hold the most relevant results from 0 till 10, with other shards results ranking below it. For this reason, when executing a request, we will need to get results from 0 till 10 from all shards, sort them, and then return the results if we want to ensure correct results.

Another question, which relates to the search engine, is the fact that each shard stands on its own. When a query is executed on a specific shard, it does not take into account term frequencies and other search engine information from the other shards. If we want to support accurate ranking, we would need to first gather the term frequencies from all shards to calculate global term frequencies, then execute the query on each shard using these global frequencies.

Also, because of the need to sort the results, getting back a large document set, or even scrolling it, while maintaining the correct sorting behavior can be a very expensive operation. For large result set scrolling, it is best to sort by _doc if the order in which documents are returned is not important.

Elasticsearch is very flexible and allows you to control the type of search to execute on a per search request basis. The type can be configured by setting the search_type parameter in the query string. The types are:

Query Then Fetch

Parameter value: query_then_fetch.

The request is processed in two phases. In the first phase, the query is forwarded to all involved shards. Each shard executes the search and generates a sorted list of results, local to that shard. Each shard returns just enough information to the coordinating node to allow it to merge and re-sort the shard level results into a globally sorted set of results, of maximum length size.

During the second phase, the coordinating node requests the document content (and highlighted snippets, if any) from only the relevant shards.

This is the default setting if you do not specify a search_type in your request.

Dfs, Query Then Fetch

Parameter value: dfs_query_then_fetch.

Same as "Query Then Fetch", except for an initial scatter phase which goes and computes the distributed term frequencies for more accurate scoring.
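
For example, this mode can be requested through the search_type query-string parameter, reusing the earlier twitter example:

GET /twitter/_search?search_type=dfs_query_then_fetch
{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}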

Sequence Numbers and Primary Term

Returns the sequence number and primary term of the last modification to each search hit. See Optimistic concurrency control for more details.

GET /_search
{
    "seq_no_primary_term": true,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Sort

Allows you to add one or more sorts on specific fields. Each sort can be reversed as well. The sort is defined on a per-field level, with the special field name _score to sort by score, and _doc to sort by index order.

Assuming the following index mapping:

PUT /my_index
{
    "mappings": {
        "properties": {
            "post_date": { "type": "date" },
            "user": {
                "type": "keyword"
            },
            "name": {
                "type": "keyword"
            },
            "age": { "type": "integer" }
        }
    }
}
GET /my_index/_search
{
    "sort" : [
        { "post_date" : {"order" : "asc"}},
        "user",
        { "name" : "desc" },
        { "age" : "desc" },
        "_score"
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

_doc has no real use-case besides being the most efficient sort order. So if you don’t care about the order in which documents are returned, then you should sort by _doc. This especially helps when scrolling.

Sort Values

The sort values for each document returned are also returned as part of the response.

Sort Order

The order option can have the following values:

asc

Sort in ascending order

desc

Sort in descending order

The order defaults to desc when sorting on the _score, and defaults to asc when sorting on anything else.

Sort mode option

Elasticsearch supports sorting by array or multi-valued fields. The mode option controls what array value is picked for sorting the document it belongs to. The mode option can have the following values:

min

Pick the lowest value.

max

Pick the highest value.

sum

Use the sum of all values as sort value. Only applicable for number based array fields.

avg

Use the average of all values as sort value. Only applicable for number based array fields.

median

Use the median of all values as sort value. Only applicable for number based array fields.

The default sort mode in the ascending sort order is min — the lowest value is picked. The default sort mode in the descending order is max — the highest value is picked.

Sort mode example usage

In the example below the field price has multiple prices per document. In this case the result hits will be sorted by price ascending based on the average price per document.

PUT /my_index/_doc/1?refresh
{
   "product": "chocolate",
   "price": [20, 4]
}

POST /_search
{
   "query" : {
      "term" : { "product" : "chocolate" }
   },
   "sort" : [
      {"price" : {"order" : "asc", "mode" : "avg"}}
   ]
}

Sorting numeric fields

For numeric fields it is also possible to cast the values from one type to another using the numeric_type option. This option accepts the following values: ["double", "long", "date", "date_nanos"] and can be useful for cross-index search if the sort field is mapped differently on some indices.

Consider for instance these two indices:

PUT /index_double
{
    "mappings": {
        "properties": {
            "field": { "type": "double" }
        }
    }
}
PUT /index_long
{
    "mappings": {
        "properties": {
            "field": { "type": "long" }
        }
    }
}

Since field is mapped as a double in the first index and as a long in the second index, it is not possible by default to use this field to sort requests that query both indices. However, you can use the numeric_type option to force a specific type for all indices:

POST /index_long,index_double/_search
{
   "sort" : [
      {
        "field" : {
            "numeric_type" : "double"
        }
      }
   ]
}

In the example above, values for the index_long index are cast to a double in order to be compatible with the values produced by the index_double index. It is also possible to transform a floating point field into a long, but note that in this case floating points are replaced by the largest value that is less than or equal (greater than or equal if the value is negative) to the argument and is equal to a mathematical integer.

This option can also be used to convert a date field that uses millisecond resolution to a date_nanos field with nanosecond resolution. Consider for instance these two indices:

PUT /index_date
{
    "mappings": {
        "properties": {
            "field": { "type": "date" }
        }
    }
}
PUT /index_date_nanos
{
    "mappings": {
        "properties": {
            "field": { "type": "date_nanos" }
        }
    }
}

Values in these indices are stored with different resolutions, so sorting on these fields will always sort the date before the date_nanos (ascending order). With the numeric_type option it is possible to set a single resolution for the sort: setting it to date will convert the date_nanos values to millisecond resolution, while date_nanos will convert the values in the date field to nanosecond resolution:

POST /index_date,index_date_nanos/_search
{
   "sort" : [
      {
        "field" : {
            "numeric_type" : "date_nanos"
        }
      }
   ]
}

To avoid overflow, the conversion to date_nanos cannot be applied on dates before 1970 and after 2262 as nanoseconds are represented as longs.

Sorting within nested objects

Elasticsearch also supports sorting by fields that are inside one or more nested objects. Sorting by nested fields has a nested sort option with the following properties:

path
Defines on which nested object to sort. The actual sort field must be a direct field inside this nested object. When sorting by nested field, this field is mandatory.
filter
A filter that the inner objects inside the nested path should match with in order for their field values to be taken into account by sorting. A common case is to repeat the query / filter inside the nested filter or query. By default no filter is active.
max_children
The maximum number of children to consider per root document when picking the sort value. Defaults to unlimited.
nested
Same as top-level nested but applies to another nested path within the current nested object.

Nested sort options before Elasticsearch 6.1

The nested_path and nested_filter options have been deprecated in favor of the options documented above.

Nested sorting examples

In the below example offer is a field of type nested. The nested path needs to be specified; otherwise, Elasticsearch doesn’t know on what nested level sort values need to be captured.

POST /_search
{
   "query" : {
      "term" : { "product" : "chocolate" }
   },
   "sort" : [
       {
          "offer.price" : {
             "mode" :  "avg",
             "order" : "asc",
             "nested": {
                "path": "offer",
                "filter": {
                   "term" : { "offer.color" : "blue" }
                }
             }
          }
       }
    ]
}

In the below example parent and child fields are of type nested. The nested path needs to be specified at each level; otherwise, Elasticsearch doesn’t know on what nested level sort values need to be captured.

POST /_search
{
   "query": {
      "nested": {
         "path": "parent",
         "query": {
            "bool": {
                "must": {"range": {"parent.age": {"gte": 21}}},
                "filter": {
                    "nested": {
                        "path": "parent.child",
                        "query": {"match": {"parent.child.name": "matt"}}
                    }
                }
            }
         }
      }
   },
   "sort" : [
      {
         "parent.child.age" : {
            "mode" :  "min",
            "order" : "asc",
            "nested": {
               "path": "parent",
               "filter": {
                  "range": {"parent.age": {"gte": 21}}
               },
               "nested": {
                  "path": "parent.child",
                  "filter": {
                     "match": {"parent.child.name": "matt"}
                  }
               }
            }
         }
      }
   ]
}

Nested sorting is also supported when sorting by scripts and sorting by geo distance.
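
As a sketch, nested sorting combined with geo distance sorting might look like the following, assuming a hypothetical offers nested field that contains a location field of type geo_point:

GET /_search
{
    "sort" : [
        {
            "_geo_distance" : {
                "offers.location" : [-70, 40],
                "order" : "asc",
                "unit" : "km",
                "nested" : {
                    "path" : "offers"
                }
            }
        }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}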

Missing Values

The missing parameter specifies how docs which are missing the sort field should be treated: the missing value can be set to _last, _first, or a custom value (that will be used as the sort value for missing docs). The default is _last.

For example:

GET /_search
{
    "sort" : [
        { "price" : {"missing" : "_last"} }
    ],
    "query" : {
        "term" : { "product" : "chocolate" }
    }
}

If a nested inner object doesn’t match the nested filter then a missing value is used.

Ignoring Unmapped Fields

By default, the search request will fail if there is no mapping associated with a field. The unmapped_type option allows you to ignore fields that have no mapping and not sort by them. The value of this parameter is used to determine what sort values to emit. Here is an example of how it can be used:

GET /_search
{
    "sort" : [
        { "price" : {"unmapped_type" : "long"} }
    ],
    "query" : {
        "term" : { "product" : "chocolate" }
    }
}

If any of the queried indices doesn’t have a mapping for price, then Elasticsearch will handle it as if there were a mapping of type long, with all documents in this index having no value for this field.

Geo Distance Sorting

Allows you to sort by _geo_distance. Here is an example, assuming pin.location is a field of type geo_point:

GET /_search
{
    "sort" : [
        {
            "_geo_distance" : {
                "pin.location" : [-70, 40],
                "order" : "asc",
                "unit" : "km",
                "mode" : "min",
                "distance_type" : "arc",
                "ignore_unmapped": true
            }
        }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}
distance_type
How to compute the distance. Can either be arc (default), or plane (faster, but inaccurate on long distances and close to the poles).
mode
What to do in case a field has several geo points. By default, the shortest distance is taken into account when sorting in ascending order and the longest distance when sorting in descending order. Supported values are min, max, median and avg.
unit
The unit to use when computing sort values. The default is m (meters).
ignore_unmapped
Indicates if the unmapped field should be treated as a missing value. Setting it to true is equivalent to specifying an unmapped_type in the field sort. The default is false (an unmapped field causes the search to fail).

Geo distance sorting does not support configurable missing values: the distance will always be considered equal to Infinity when a document does not have a value for the field that is used for distance computation.

The following formats are supported in providing the coordinates:

Lat Lon as Properties
GET /_search
{
    "sort" : [
        {
            "_geo_distance" : {
                "pin.location" : {
                    "lat" : 40,
                    "lon" : -70
                },
                "order" : "asc",
                "unit" : "km"
            }
        }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}
Lat Lon as String

Format in lat,lon.

GET /_search
{
    "sort" : [
        {
            "_geo_distance" : {
                "pin.location" : "40,-70",
                "order" : "asc",
                "unit" : "km"
            }
        }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}
Geohash
GET /_search
{
    "sort" : [
        {
            "_geo_distance" : {
                "pin.location" : "drm3btev3e86",
                "order" : "asc",
                "unit" : "km"
            }
        }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}
Lat Lon as Array

Format in [lon, lat]. Note the order of lon/lat here, in order to conform with GeoJSON.

GET /_search
{
    "sort" : [
        {
            "_geo_distance" : {
                "pin.location" : [-70, 40],
                "order" : "asc",
                "unit" : "km"
            }
        }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Multiple reference points

Multiple geo points can be passed as an array containing any geo_point format, for example:

GET /_search
{
    "sort" : [
        {
            "_geo_distance" : {
                "pin.location" : [[-70, 40], [-71, 42]],
                "order" : "asc",
                "unit" : "km"
            }
        }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

and so forth.

The final distance for a document will then be min/max/avg (defined via mode) distance of all points contained in the document to all points given in the sort request.

Script Based Sorting

Allows you to sort based on custom scripts. Here is an example:

GET /_search
{
    "query" : {
        "term" : { "user" : "kimchy" }
    },
    "sort" : {
        "_script" : {
            "type" : "number",
            "script" : {
                "lang": "painless",
                "source": "doc['field_name'].value * params.factor",
                "params" : {
                    "factor" : 1.1
                }
            },
            "order" : "asc"
        }
    }
}

Track Scores

When sorting on a field, scores are not computed. By setting track_scores to true, scores will still be computed and tracked.

GET /_search
{
    "track_scores": true,
    "sort" : [
        { "post_date" : {"order" : "desc"} },
        { "name" : "desc" },
        { "age" : "desc" }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Memory Considerations

When sorting, the relevant sorted field values are loaded into memory. This means that per shard, there should be enough memory to contain them. For string based types, the field sorted on should not be analyzed / tokenized. For numeric types, if possible, it is recommended to explicitly set the type to narrower types (like short, integer and float).
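
For instance, here is a minimal sketch of a mapping that prefers narrower numeric types; the index and field names are hypothetical:

PUT /sales
{
    "mappings": {
        "properties": {
            "quantity": { "type": "short" },
            "price": { "type": "float" }
        }
    }
}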

Source filtering

Allows you to control how the _source field is returned with every hit.

By default operations return the contents of the _source field unless you have used the stored_fields parameter or if the _source field is disabled.

You can turn off _source retrieval by setting the _source parameter to false:

GET /_search
{
    "_source": false,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

The _source also accepts one or more wildcard patterns to control what parts of the _source should be returned:

For example:

GET /_search
{
    "_source": "obj.*",
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Or

GET /_search
{
    "_source": [ "obj1.*", "obj2.*" ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Finally, for complete control, you can specify both includes and excludes patterns. If includes is not empty, then only fields that match one of the patterns in includes but none of the patterns in excludes are provided in _source. If includes is empty, then all fields are provided in _source, except for those that match a pattern in excludes.

GET /_search
{
    "_source": {
        "includes": [ "obj1.*", "obj2.*" ],
        "excludes": [ "*.description" ]
    },
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Stored Fields

The stored_fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended. Use source filtering instead to select subsets of the original source document to be returned.

Allows you to selectively load specific stored fields for each document represented by a search hit.

GET /_search
{
    "stored_fields" : ["user", "postDate"],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

* can be used to load all stored fields from the document.

An empty array will cause only the _id and _type for each hit to be returned, for example:

GET /_search
{
    "stored_fields" : [],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

If the requested fields are not stored (store mapping set to false), they will be ignored.

Stored field values fetched from the document itself are always returned as an array. By contrast, metadata fields like _routing are never returned as an array.

Also, only leaf fields can be returned via the stored_fields option; object fields can’t be returned and such requests will fail.

Script fields can also be automatically detected and used as fields, so things like _source.obj1.field1 can be used, though not recommended, as obj1.field1 will work as well.

On its own, stored_fields cannot be used to load fields in nested objects — if a field contains a nested object in its path, then no data will be returned for that stored field. To access nested fields, stored_fields must be used within an inner_hits block.

Disable stored fields entirely

To disable the stored fields (and metadata fields) entirely, use _none_:

GET /_search
{
    "stored_fields": "_none_",
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

The _source and version parameters cannot be activated if _none_ is used.

Track total hits

Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents. The track_total_hits parameter allows you to control how the total number of hits should be tracked. Given that it is often enough to have a lower bound of the number of hits, such as "there are at least 10000 hits", the default is set to 10,000. This means that requests will count the total hits accurately up to 10,000. It is a good trade-off to speed up searches if you don’t need the accurate number of hits after a certain threshold.

When set to true the search response will always track the number of hits that match the query accurately (e.g. total.relation will always be equal to "eq" when track_total_hits is set to true). Otherwise the "total.relation" returned in the "total" object in the search response determines how the "total.value" should be interpreted. A value of "gte" means that the "total.value" is a lower bound of the total hits that match the query and a value of "eq" indicates that "total.value" is the accurate count.

GET twitter/_search
{
    "track_total_hits": true,
     "query": {
        "match" : {
            "message" : "Elasticsearch"
        }
     }
}

... returns:

{
    "_shards": ...
    "timed_out": false,
    "took": 100,
    "hits": {
        "max_score": 1.0,
        "total" : {
            "value": 2048,    
            "relation": "eq"  
        },
        "hits": ...
    }
}

The total number of hits that match the query.

The count is accurate (e.g. "eq" means equals).

It is also possible to set track_total_hits to an integer. For instance the following query will accurately track the total hit count that match the query up to 100 documents:

GET twitter/_search
{
    "track_total_hits": 100,
     "query": {
        "match" : {
            "message" : "Elasticsearch"
        }
     }
}

The hits.total.relation in the response will indicate if the value returned in hits.total.value is accurate ("eq") or a lower bound of the total ("gte").

For instance the following response:

{
    "_shards": ...
    "timed_out": false,
    "took": 30,
    "hits" : {
        "max_score": 1.0,
        "total" : {
            "value": 42,         
            "relation": "eq"     
        },
        "hits": ...
    }
}

42 documents match the query

and the count is accurate ("eq")

... indicates that the number of hits returned in the total is accurate.

If the total number of hits that match the query is greater than the value set in track_total_hits, the total hits in the response will indicate that the returned value is a lower bound:

{
    "_shards": ...
    "hits" : {
        "max_score": 1.0,
        "total" : {
            "value": 100,         
            "relation": "gte"     
        },
        "hits": ...
    }
}

There are at least 100 documents that match the query

This is a lower bound ("gte").

If you don’t need to track the total number of hits at all you can improve query times by setting this option to false:

GET twitter/_search
{
    "track_total_hits": false,
     "query": {
        "match" : {
            "message" : "Elasticsearch"
        }
     }
}

... returns:

{
    "_shards": ...
    "timed_out": false,
    "took": 10,
    "hits" : { 
        "max_score": 1.0,
        "hits": ...
    }
}

The total number of hits is unknown.

Finally you can force an accurate count by setting "track_total_hits" to true in the request.

Version

Returns a version for each search hit.

GET /_search
{
    "version": true,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}