Phrase Suggester
In order to understand the format of suggestions, please read the Suggesters page first.
The term suggester provides a very convenient API to access word alternatives on a per-token basis within a certain string distance. The API allows accessing each token in the stream individually, while suggest selection is left to the API consumer. Often, however, pre-selected suggestions are required to present to the end user. The phrase suggester adds additional logic on top of the term suggester to select entire corrected phrases instead of individual tokens, weighted based on ngram language models. In practice, this suggester is able to make better decisions about which tokens to pick based on co-occurrence and frequencies.
API Example

In general, the phrase suggester requires special mapping up front to work. The phrase suggester examples on this page need the following mapping to work. The reverse analyzer is used only in the last example.
PUT test { "settings": { "index": { "number_of_shards": 1, "analysis": { "analyzer": { "trigram": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase","shingle"] }, "reverse": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase","reverse"] } }, "filter": { "shingle": { "type": "shingle", "min_shingle_size": 2, "max_shingle_size": 3 } } } } }, "mappings": { "properties": { "title": { "type": "text", "fields": { "trigram": { "type": "text", "analyzer": "trigram" }, "reverse": { "type": "text", "analyzer": "reverse" } } } } } } POST test/_doc?refresh=true {"title": "noble warriors"} POST test/_doc?refresh=true {"title": "nobel prize"}
Once you have the analyzers and mappings set up, you can use the phrase suggester in the same spot you'd use the term suggester:
POST test/_search
{
  "suggest": {
    "text": "noble prize",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "gram_size": 3,
        "direct_generator": [ {
          "field": "title.trigram",
          "suggest_mode": "always"
        } ],
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}
The response contains suggestions scored by the most likely spell correction first. In this case we received the expected correction "nobel prize".
{ "_shards": ... "hits": ... "timed_out": false, "took": 3, "suggest": { "simple_phrase" : [ { "text" : "noble prize", "offset" : 0, "length" : 11, "options" : [ { "text" : "nobel prize", "highlighted": "<em>nobel</em> prize", "score" : 0.48614594 }] } ] } }
Basic Phrase suggest API parameters

field
    The name of the field used to do n-gram lookups for the language model; the suggester will use this field to gain statistics to score corrections. This field is mandatory.

gram_size
    Sets the maximum size of the n-grams (shingles) in the field. If the field doesn't contain n-grams (shingles), this should be omitted or set to 1. Note that Elasticsearch tries to detect the gram size based on the specified field. If the field uses a shingle filter, gram_size is set to the max_shingle_size if not explicitly set.

real_word_error_likelihood
    The likelihood of a term being misspelled even if the term exists in the dictionary. The default is 0.95, meaning 5% of the real words are misspelled.

confidence
    The confidence level defines a factor applied to the input phrase's score, which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance, a confidence level of 1.0 will only return suggestions that score higher than the input phrase. If set to 0.0, the top N candidates are returned. The default is 1.0.

max_errors
    The maximum percentage of the terms considered to be misspellings in order to form a correction. This method accepts a float value in the range [0..1) as a fraction of the actual query terms, or a number >= 1 as an absolute number of query terms. The default is set to 1.0, meaning only corrections with at most one misspelled term are returned. Note that setting this too high can negatively impact performance; low values like 1 or 2 are recommended, otherwise the time spent in suggest calls might exceed the time spent in query execution.

separator
    The separator that is used to separate terms in the bigram field. If not set, the whitespace character is used as a separator.

size
    The number of candidates that are generated for each individual query term. Low numbers like 3 or 5 typically produce good results. Raising this can bring up terms with higher edit distances. The default is 5.

analyzer
    Sets the analyzer to analyze the suggest text with. Defaults to the search analyzer of the suggest field passed via field.

shard_size
    Sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce phase, only the top N suggestions are returned based on the size option. Defaults to 5.

text
    Sets the text / query to provide suggestions for.

highlight
    Sets up suggestion highlighting. If not provided, no highlighted field is returned. If provided, it must contain exactly pre_tag and post_tag, which are wrapped around the changed tokens. If multiple tokens in a row are changed, the entire phrase of changed tokens is wrapped rather than each token.

collate
    Checks each suggestion against the specified query to prune suggestions for which no matching documents exist in the index. The collate query for a suggestion is run only on the local shard from which the suggestion has been generated. The query must be specified, and it can be templated. The current suggestion is automatically made available as the {{suggestion}} variable, which should be used in your query. You can still specify your own template params; the suggestion value will be added to the variables you specify. Additionally, you can specify a prune option to control whether all phrase suggestions will be returned: when set to true, the suggestions will have an additional option collate_match, which will be true if matching documents for the phrase were found, false otherwise. The default value for prune is false.
POST _search { "suggest": { "text" : "noble prize", "simple_phrase" : { "phrase" : { "field" : "title.trigram", "size" : 1, "direct_generator" : [ { "field" : "title.trigram", "suggest_mode" : "always", "min_word_length" : 1 } ], "collate": { "query": { "source" : { "match": { "{{field_name}}" : "{{suggestion}}" } } }, "params": {"field_name" : "title"}, "prune": true } } } } }
In this example, the collate query is run once for every suggestion, and the {{suggestion}} variable is replaced by the text of each suggestion. An additional field_name parameter is supplied via params and is used by the match query as the field to collate against. Because prune is set to true, all suggestions are returned with an extra collate_match option indicating whether the generated phrase matched any document.
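As a sketch of how some of these parameters interact (the suggester name tuned_phrase and the parameter values are illustrative, not recommendations from the reference), a request against the test index created above might lower confidence and raise max_errors to surface more candidate corrections:

POST test/_search
{
  "suggest": {
    "text": "noble prize",
    "tuned_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "gram_size": 3,
        "confidence": 0.0,
        "max_errors": 2,
        "direct_generator": [ {
          "field": "title.trigram",
          "suggest_mode": "always"
        } ]
      }
    }
  }
}

With confidence set to 0.0, candidates no longer need to outscore the input phrase, and a max_errors of 2 allows up to two corrected terms per suggestion.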
Smoothing Models

The phrase suggester supports multiple smoothing models to balance weight between infrequent grams (grams (shingles) that do not exist in the index) and frequent grams (that appear at least once in the index). The smoothing model can be selected by setting the smoothing parameter to one of the following options. Each smoothing model supports specific properties that can be configured.
stupid_backoff
    A simple backoff model that backs off to lower order n-gram models if the higher order count is 0 and discounts the lower order n-gram model by a constant factor. The default discount is 0.4. Stupid Backoff is the default model.

laplace
    A smoothing model that uses an additive smoothing where a constant (typically 1.0 or smaller) is added to all counts to balance weights. The default alpha is 0.5.

linear_interpolation
    A smoothing model that takes the weighted mean of the unigrams, bigrams, and trigrams based on user-supplied weights (lambdas). Linear Interpolation doesn't have any default values. All parameters (trigram_lambda, bigram_lambda, unigram_lambda) must be supplied.
POST _search { "suggest": { "text" : "obel prize", "simple_phrase" : { "phrase" : { "field" : "title.trigram", "size" : 1, "smoothing" : { "laplace" : { "alpha" : 0.7 } } } } } }
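For comparison, a minimal linear_interpolation sketch might look as follows; the lambda values are illustrative (they are user-supplied weights and should sum to 1.0):

POST _search
{
  "suggest": {
    "text": "obel prize",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "smoothing": {
          "linear_interpolation": {
            "trigram_lambda": 0.7,
            "bigram_lambda": 0.2,
            "unigram_lambda": 0.1
          }
        }
      }
    }
  }
}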
Candidate Generators

The phrase suggester uses candidate generators to produce a list of possible terms per term in the given text. A single candidate generator is similar to a term suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms for suggestion candidates.

Currently only one type of candidate generator is supported, the direct_generator. The Phrase suggest API accepts a list of generators under the key direct_generator; each of the generators in the list is called per term in the original text.
Direct Generators

The direct generators support the following parameters:
field
    The field to fetch the candidate suggestions from. This is a required option that either needs to be set globally or per suggestion.

size
    The maximum number of corrections to be returned per suggest text token.

suggest_mode
    The suggest mode controls which suggestions are included in the suggestions generated on each shard. All values other than always can be thought of as an optimization to generate fewer suggestions to test on each shard and are not rechecked when combining the suggestions generated on each shard. Thus missing will generate suggestions for terms on shards that do not contain them, even if other shards do contain them; those should be filtered out using confidence. Three possible values can be specified: missing (only generate suggestions for terms that are not in the shard; this is the default), popular (only suggest terms that occur in more documents on the shard than the original term), and always (suggest any matching suggestions based on terms in the suggest text).

max_edits
    The maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2.

prefix_length
    The number of minimal prefix characters that must match in order to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance; usually misspellings don't occur at the beginning of terms. (The old name "prefix_len" is deprecated.)

min_word_length
    The minimum length a suggest text term must have in order to be included. Defaults to 4. (The old name "min_word_len" is deprecated.)

max_inspections
    A factor that is multiplied with the shard_size in order to inspect more candidate spelling corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.

min_doc_freq
    The minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high-frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified, the number cannot be fractional. The shard-level document frequencies are used for this option.

max_term_freq
    The maximum threshold in number of documents in which a suggest text token can exist in order to be included. Can be a relative percentage number (e.g., 0.4) or an absolute number to represent document frequencies. If a value higher than 1 is specified, a fractional value cannot be specified. Defaults to 0.01f. This can be used to exclude high-frequency terms, which are usually spelled correctly, from being spellchecked. This also improves spellcheck performance. The shard-level document frequencies are used for this option.

pre_filter
    A filter (analyzer) that is applied to each of the tokens passed to this candidate generator. This filter is applied to the original token before candidates are generated.

post_filter
    A filter (analyzer) that is applied to each of the generated tokens before they are passed to the actual phrase scorer.
The following example shows a phrase suggest call with two generators: the first one uses a field containing ordinary indexed terms, and the second one uses a field whose terms are indexed with a reverse filter (tokens are indexed in reverse order). This is used to overcome the limitation of the direct generators to require a constant prefix to provide high-performance suggestions. The pre_filter and post_filter options accept ordinary analyzer names.
POST _search
{
  "suggest": {
    "text": "obel prize",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "direct_generator": [ {
          "field": "title.trigram",
          "suggest_mode": "always"
        }, {
          "field": "title.reverse",
          "suggest_mode": "always",
          "pre_filter": "reverse",
          "post_filter": "reverse"
        } ]
      }
    }
  }
}
pre_filter and post_filter can also be used to inject synonyms after candidates are generated. For instance, for the query captain usq we might generate the candidate usa for the term usq, which is a synonym for america. This allows us to present captain america to the user if this phrase scores high enough.
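As a sketch of how such a synonym-injecting filter could be wired up (the index name test-synonyms, the analyzer name usq_synonyms, and the synonym rule are hypothetical, not part of the original example), one might define a synonym analyzer at index creation and reference it as a post_filter:

PUT test-synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        },
        "usq_synonyms_filter": {
          "type": "synonym",
          "synonyms": [ "usa => america" ]
        }
      },
      "analyzer": {
        "trigram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "shingle" ]
        },
        "usq_synonyms": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "usq_synonyms_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "trigram": { "type": "text", "analyzer": "trigram" }
        }
      }
    }
  }
}

POST test-synonyms/_search
{
  "suggest": {
    "text": "captain usq",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "direct_generator": [ {
          "field": "title.trigram",
          "suggest_mode": "always",
          "post_filter": "usq_synonyms"
        } ]
      }
    }
  }
}

Here, a candidate usa generated for usq would be rewritten to america by the post_filter before phrase scoring, so an indexed phrase like captain america could be suggested.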