Phrase Suggester
In order to understand the format of suggestions, please read the Suggesters page first.
The term suggester provides a very convenient API to access word alternatives on a per-token basis within a certain string distance. The API allows accessing each token in the stream individually, while suggest selection is left to the API consumer. Often, however, pre-selected suggestions are required to present to the end user. The phrase suggester adds additional logic on top of the term suggester to select entire corrected phrases instead of individual tokens, weighted based on ngram language models. In practice, this suggester is able to make better decisions about which tokens to pick based on co-occurrence and frequencies.
API Example

In general, the phrase suggester requires special mapping up front to work. The phrase suggester examples on this page need the following mapping to work. The reverse analyzer is used only in the last example.
PUT test { "settings": { "index": { "number_of_shards": 1, "analysis": { "analyzer": { "trigram": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase","shingle"] }, "reverse": { "type": "custom", "tokenizer": "standard", "filter": ["lowercase","reverse"] } }, "filter": { "shingle": { "type": "shingle", "min_shingle_size": 2, "max_shingle_size": 3 } } } } }, "mappings": { "properties": { "title": { "type": "text", "fields": { "trigram": { "type": "text", "analyzer": "trigram" }, "reverse": { "type": "text", "analyzer": "reverse" } } } } } } POST test/_doc?refresh=true {"title": "noble warriors"} POST test/_doc?refresh=true {"title": "nobel prize"}
Once you have the analyzers and mappings set up, you can use the phrase suggester in the same spot you'd use the term suggester:
POST test/_search
{
  "suggest": {
    "text": "noble prize",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "gram_size": 3,
        "direct_generator": [ {
          "field": "title.trigram",
          "suggest_mode": "always"
        } ],
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}
The response contains suggestions scored by the most likely spell correction first. In this case we received the expected correction "nobel prize".
{ "_shards": ... "hits": ... "timed_out": false, "took": 3, "suggest": { "simple_phrase" : [ { "text" : "noble prize", "offset" : 0, "length" : 11, "options" : [ { "text" : "nobel prize", "highlighted": "<em>nobel</em> prize", "score" : 0.48614594 }] } ] } }
Basic Phrase suggest API parameters

field
    The name of the field used to do n-gram lookups for the language model; the suggester will use this field to gain statistics to score corrections. This field is mandatory.

gram_size
    Sets the maximum size of the n-grams (shingles) in the field. If the field doesn't contain n-grams (shingles), this should be omitted or set to 1. Note that Elasticsearch tries to detect the gram size based on the specified field. If the field uses a shingle filter, gram_size is set to the max_shingle_size if not explicitly set.

real_word_error_likelihood
    The likelihood of a term being misspelled even if the term exists in the dictionary. The default is 0.95, meaning 5% of the real words are misspelled.

confidence
    The confidence level defines a factor applied to the input phrase's score, which is used as a threshold for other suggest candidates. Only candidates that score higher than the threshold will be included in the result. For instance, a confidence level of 1.0 will only return suggestions that score higher than the input phrase. If set to 0.0, the top N candidates are returned. The default is 1.0.

max_errors
    The maximum percentage of the terms considered to be misspellings in order to form a correction. This method accepts a float value in the range [0..1) as a fraction of the actual query terms, or a number >= 1 as an absolute number of query terms. The default is set to 1.0, meaning only corrections with at most one misspelled term are returned. Note that setting this too high can negatively impact performance; low values like 1 or 2 are recommended, otherwise the time spent in suggest calls might exceed the time spent in query execution.

separator
    The separator that is used to separate terms in the bigram field. If not set, the whitespace character is used as a separator.

size
    The number of candidates that are generated for each individual query term. Low numbers like 3 or 5 typically produce good results. Raising this can bring up terms with higher edit distances. The default is 5.

analyzer
    Sets the analyzer to analyze the suggest text with. Defaults to the search analyzer of the suggest field passed via field.

shard_size
    Sets the maximum number of suggested terms to be retrieved from each individual shard. During the reduce phase, only the top N suggestions are returned based on the size option. Defaults to 5.

text
    Sets the text / query to provide suggestions for.

highlight
    Sets up suggestion highlighting. If not provided, no highlighted field is returned. If provided, it must contain exactly pre_tag and post_tag, which are wrapped around the changed tokens. If multiple tokens in a row are changed, the entire phrase of changed tokens is wrapped rather than each token.

collate
    Checks each suggestion against the specified query to prune suggestions for which no matching documents exist in the index. The collate query for a suggestion is run only on the local shard from which the suggestion has been generated. The query must be specified, and it can be templated. The current suggestion is automatically made available as the {{suggestion}} variable, which should be used in your query. You can still specify your own template params; the suggestion value will be added to the variables you specify. Additionally, you can specify a prune option to control whether all phrase suggestions will be returned: when set to true, the suggestions will have an additional option collate_match, which will be true if matching documents for the phrase were found, false otherwise. The default value for prune is false.
POST _search { "suggest": { "text" : "noble prize", "simple_phrase" : { "phrase" : { "field" : "title.trigram", "size" : 1, "direct_generator" : [ { "field" : "title.trigram", "suggest_mode" : "always", "min_word_length" : 1 } ], "collate": { "query": { "source" : { "match": { "{{field_name}}" : "{{suggestion}}" } } }, "params": {"field_name" : "title"}, "prune": true } } } } }
In this example, the collate query is run once for every suggestion, and the {{suggestion}} variable is replaced by the text of each suggestion. An additional field_name parameter is supplied via params and is used by the match query as the field to collate against. Because prune is set to true, all suggestions are returned with an extra collate_match option indicating whether the generated phrase matched any document.
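As a sketch of how some of these parameters interact (the suggester name tuned_phrase and the parameter values are illustrative, not recommendations from the reference), a request against the test index created above might lower confidence and raise max_errors to surface more candidate corrections:

POST test/_search
{
  "suggest": {
    "text": "noble prize",
    "tuned_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "gram_size": 3,
        "confidence": 0.0,
        "max_errors": 2,
        "direct_generator": [ {
          "field": "title.trigram",
          "suggest_mode": "always"
        } ]
      }
    }
  }
}

With confidence set to 0.0, candidates no longer need to outscore the input phrase, and a max_errors of 2 allows up to two corrected terms per suggestion.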
Smoothing Models

The phrase suggester supports multiple smoothing models to balance weight between infrequent grams (grams (shingles) that do not exist in the index) and frequent grams (that appear at least once in the index). The smoothing model can be selected by setting the smoothing parameter to one of the following options. Each smoothing model supports specific properties that can be configured.
stupid_backoff
    A simple backoff model that backs off to lower order n-gram models if the higher order count is 0 and discounts the lower order n-gram model by a constant factor. The default discount is 0.4. Stupid Backoff is the default model.

laplace
    A smoothing model that uses an additive smoothing where a constant (typically 1.0 or smaller) is added to all counts to balance weights. The default alpha is 0.5.

linear_interpolation
    A smoothing model that takes the weighted mean of the unigrams, bigrams, and trigrams based on user-supplied weights (lambdas). Linear Interpolation doesn't have any default values. All parameters (trigram_lambda, bigram_lambda, unigram_lambda) must be supplied.
POST _search { "suggest": { "text" : "obel prize", "simple_phrase" : { "phrase" : { "field" : "title.trigram", "size" : 1, "smoothing" : { "laplace" : { "alpha" : 0.7 } } } } } }
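For comparison, a minimal linear_interpolation sketch might look as follows; the lambda values are illustrative (they are user-supplied weights and should sum to 1.0):

POST _search
{
  "suggest": {
    "text": "obel prize",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "smoothing": {
          "linear_interpolation": {
            "trigram_lambda": 0.7,
            "bigram_lambda": 0.2,
            "unigram_lambda": 0.1
          }
        }
      }
    }
  }
}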
Candidate Generators

The phrase suggester uses candidate generators to produce a list of possible terms per term in the given text. A single candidate generator is similar to a term suggester called for each individual term in the text. The output of the generators is subsequently scored in combination with the candidates from the other terms for suggestion candidates.

Currently only one type of candidate generator is supported, the direct_generator. The Phrase suggest API accepts a list of generators under the key direct_generator; each of the generators in the list is called per term in the original text.
Direct Generators

The direct generators support the following parameters:
field
    The field to fetch the candidate suggestions from. This is a required option that either needs to be set globally or per suggestion.

size
    The maximum number of corrections to be returned per suggest text token.

suggest_mode
    The suggest mode controls which suggestions are included in the suggestions generated on each shard. All values other than always can be thought of as an optimization to generate fewer suggestions to test on each shard and are not rechecked when combining the suggestions generated on each shard. Thus missing will generate suggestions for terms on shards that do not contain them, even if other shards do contain them; those should be filtered out using confidence. Three possible values can be specified: missing (only generate suggestions for terms that are not in the shard; this is the default), popular (only suggest terms that occur in more documents on the shard than the original term), and always (suggest any matching suggestions based on terms in the suggest text).

max_edits
    The maximum edit distance candidate suggestions can have in order to be considered as a suggestion. Can only be a value between 1 and 2. Any other value results in a bad request error being thrown. Defaults to 2.

prefix_length
    The number of minimal prefix characters that must match in order to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance; usually misspellings don't occur at the beginning of terms. (The old name "prefix_len" is deprecated.)

min_word_length
    The minimum length a suggest text term must have in order to be included. Defaults to 4. (The old name "min_word_len" is deprecated.)

max_inspections
    A factor that is multiplied with the shard_size in order to inspect more candidate spelling corrections on the shard level. Can improve accuracy at the cost of performance. Defaults to 5.

min_doc_freq
    The minimal threshold in number of documents a suggestion should appear in. This can be specified as an absolute number or as a relative percentage of the number of documents. This can improve quality by only suggesting high-frequency terms. Defaults to 0f and is not enabled. If a value higher than 1 is specified, the number cannot be fractional. The shard-level document frequencies are used for this option.

max_term_freq
    The maximum threshold in number of documents in which a suggest text token can exist in order to be included. Can be a relative percentage number (e.g., 0.4) or an absolute number to represent document frequencies. If a value higher than 1 is specified, a fractional value cannot be specified. Defaults to 0.01f. This can be used to exclude high-frequency terms, which are usually spelled correctly, from being spellchecked. This also improves spellcheck performance. The shard-level document frequencies are used for this option.

pre_filter
    A filter (analyzer) that is applied to each of the tokens passed to this candidate generator. This filter is applied to the original token before candidates are generated.

post_filter
    A filter (analyzer) that is applied to each of the generated tokens before they are passed to the actual phrase scorer.
The following example shows a phrase suggest call with two generators: the first one uses a field containing ordinary indexed terms, and the second one uses a field whose terms are indexed with a reverse filter (tokens are indexed in reverse order). This is used to overcome the limitation of the direct generators to require a constant prefix to provide high-performance suggestions. The pre_filter and post_filter options accept ordinary analyzer names.
POST _search
{
  "suggest": {
    "text": "obel prize",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "direct_generator": [ {
          "field": "title.trigram",
          "suggest_mode": "always"
        }, {
          "field": "title.reverse",
          "suggest_mode": "always",
          "pre_filter": "reverse",
          "post_filter": "reverse"
        } ]
      }
    }
  }
}
pre_filter and post_filter can also be used to inject synonyms after candidates are generated. For instance, for the query captain usq we might generate the candidate usa for the term usq, which is a synonym for america. This allows us to present captain america to the user if this phrase scores high enough.
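As a sketch of how such a synonym-injecting filter could be wired up (the index name test-synonyms, the analyzer name usq_synonyms, and the synonym rule are hypothetical, not part of the original example), one might define a synonym analyzer at index creation and reference it as a post_filter:

PUT test-synonyms
{
  "settings": {
    "analysis": {
      "filter": {
        "shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        },
        "usq_synonyms_filter": {
          "type": "synonym",
          "synonyms": [ "usa => america" ]
        }
      },
      "analyzer": {
        "trigram": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "shingle" ]
        },
        "usq_synonyms": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "usq_synonyms_filter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "trigram": { "type": "text", "analyzer": "trigram" }
        }
      }
    }
  }
}

POST test-synonyms/_search
{
  "suggest": {
    "text": "captain usq",
    "simple_phrase": {
      "phrase": {
        "field": "title.trigram",
        "size": 1,
        "direct_generator": [ {
          "field": "title.trigram",
          "suggest_mode": "always",
          "post_filter": "usq_synonyms"
        } ]
      }
    }
  }
}

Here, a candidate usa generated for usq would be rewritten to america by the post_filter before phrase scoring, so an indexed phrase like captain america could be suggested.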