Edge n-gram tokenizer
The edge_ngram tokenizer first breaks text down into words whenever it
encounters one of a list of specified characters, then emits
N-grams of each word, where the start of
each N-gram is anchored to the beginning of the word.
Edge N-Grams are useful for search-as-you-type queries.
When you need search-as-you-type for text which has a widely known order, such as movie or song titles, the completion suggester is a much more efficient choice than edge N-grams. Edge N-grams have the advantage when trying to autocomplete words that can appear in any order.
Example output
With the default settings, the edge_ngram tokenizer treats the initial text as a
single token and produces N-grams with minimum length 1 and maximum length
2:
POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}
The above sentence would produce the following terms:
[ Q, Qu ]
These default gram lengths are almost entirely useless. You need to
configure the edge_ngram tokenizer before using it.
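To make the default behaviour concrete, here is a minimal Python sketch of how edge N-grams can be derived from a single token. It is an illustration only, not Elasticsearch's implementation; the function name edge_ngrams is hypothetical, and its defaults simply mirror the documented min_gram of 1 and max_gram of 2.

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Return the prefixes of `token` whose lengths lie in [min_gram, max_gram].

    Hypothetical helper for illustration; it only models the idea that every
    gram is anchored to the start of the token.
    """
    # A gram can never be longer than the token itself.
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


# With the defaults, the whole text is one token and only two grams come out.
print(edge_ngrams("Quick Fox"))                          # ['Q', 'Qu']

# Wider bounds give prefixes that are more useful for search-as-you-type.
print(edge_ngrams("Quick Fox", min_gram=2, max_gram=5))  # ['Qu', 'Qui', 'Quic', 'Quick']
```

The second call suggests why the defaults usually need changing: a maximum length of 2 leaves almost nothing to match against once the user has typed a third character.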
Configuration
The edge_ngram tokenizer accepts the following parameters:
- min_gram
  Minimum length of characters in a gram. Defaults to 1.
- max_gram
  Maximum length of characters in a gram. Defaults to 2.
- token_chars
  Character classes that should be included in a token. Elasticsearch will split on characters that don't belong to the classes specified. Defaults to [] (keep all characters).
  Character classes may be any of the following:
  - letter: for example a, b, ï or 京
  - digit: for example 3 or 7
  - whitespace: for example " " or "\n"
  - punctuation: for example ! or "
  - symbol: for example $ or √
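The interaction of the three parameters can be sketched in Python as follows. This is a rough model under stated assumptions, not the tokenizer's actual code: the names char_class and edge_ngram_tokenize are hypothetical, and the mapping from Unicode general categories to the five character classes is an approximation chosen for illustration.

```python
import unicodedata


def char_class(ch):
    """Approximate the character class of one character (assumed mapping)."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        return "letter"
    if cat == "Nd":
        return "digit"
    if ch.isspace():
        return "whitespace"
    if cat.startswith("P"):
        return "punctuation"
    if cat.startswith("S"):
        return "symbol"
    return "other"


def edge_ngram_tokenize(text, min_gram=1, max_gram=2, token_chars=()):
    """Split `text` on characters outside `token_chars`, then emit the
    edge N-grams of each resulting word.

    An empty `token_chars` keeps every character, so the whole text is
    treated as a single token.
    """
    keep = set(token_chars)
    words, current = [], []
    for ch in text:
        if not keep or char_class(ch) in keep:
            current.append(ch)
        else:
            # A character outside the kept classes ends the current word.
            if current:
                words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))

    grams = []
    for word in words:
        # Words shorter than min_gram produce no grams at all.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            grams.append(word[:n])
    return grams


print(edge_ngram_tokenize("2 Quick Foxes.", min_gram=2, max_gram=10,
                          token_chars=("letter", "digit")))
# ['Qu', 'Qui', 'Quic', 'Quick', 'Fo', 'Fox', 'Foxe', 'Foxes']
```

In this sketch the word "2" yields nothing because it is shorter than min_gram, and the trailing full stop is dropped because punctuation is not among the kept classes.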