Tokenizers
A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into the terms [Quick, brown, fox!].
The tokenizer is also responsible for recording the order or position of each term (used for phrase and word proximity queries) and the start and end character offsets of the original word which the term represents (used for highlighting search snippets).
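This output is easy to inspect with the _analyze API, which runs text through a tokenizer and returns the resulting tokens. A minimal sketch using the whitespace example from above:

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}

The response lists each term together with its position and its start_offset and end_offset into the original text.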
Elasticsearch has a number of built-in tokenizers which can be used to build custom analyzers.
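For example, a custom analyzer is defined in the index settings by naming the tokenizer it should use, optionally alongside character filters and token filters. A minimal sketch (the index and analyzer names here are arbitrary) that pairs the standard tokenizer with the lowercase token filter:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}

Fields can then reference my_analyzer through the analyzer mapping parameter.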
Word Oriented Tokenizers
The following tokenizers are usually used for tokenizing full text into individual words:
- Standard Tokenizer
  The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.

- Letter Tokenizer
  The letter tokenizer divides text into terms whenever it encounters a character which is not a letter.

- Lowercase Tokenizer
  The lowercase tokenizer, like the letter tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.

- Whitespace Tokenizer
  The whitespace tokenizer divides text into terms whenever it encounters any whitespace character.

- UAX URL Email Tokenizer
  The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens (see the example after this list).

- Classic Tokenizer
  The classic tokenizer is a grammar-based tokenizer for the English language.

- Thai Tokenizer
  The thai tokenizer segments Thai text into words.
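To illustrate the uax_url_email behaviour described above, the _analyze API can be pointed at that tokenizer directly; a sketch with made-up example text:

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email john.smith@example.com for details"
}

The email address comes back as a single term, whereas the standard tokenizer would split it into several terms.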
Partial Word Tokenizers
These tokenizers break up text or words into small fragments, for partial word matching:
- N-Gram Tokenizer
  The ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word: a sliding window of continuous letters, e.g. quick → [qu, ui, ic, ck].

- Edge N-Gram Tokenizer
  The edge_ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word which are anchored to the start of the word, e.g. quick → [q, qu, qui, quic, quick] (see the sketch after this list).
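Both n-gram tokenizers accept min_gram and max_gram settings. The _analyze API also accepts an inline tokenizer definition, which is convenient for experimenting with these settings; a sketch, assuming inline (transient) definitions as supported by the _analyze API, with values chosen to reproduce the edge_ngram example above:

POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 5
  },
  "text": "quick"
}

This returns [q, qu, qui, quic, quick]; swapping the type to ngram with min_gram and max_gram both set to 2 reproduces the n-gram example above.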
Structured Text Tokenizers
The following tokenizers are usually used with structured text like identifiers, email addresses, zip codes, and paths, rather than with full text:
- Keyword Tokenizer
  The keyword tokenizer is a “noop” tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters like lowercase to normalise the analysed terms.

- Pattern Tokenizer
  The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.

- Simple Pattern Tokenizer
  The simple_pattern tokenizer uses a regular expression to capture matching text as terms. It uses a restricted subset of regular expression features and is generally faster than the pattern tokenizer.

- Simple Pattern Split Tokenizer
  The simple_pattern_split tokenizer uses the same restricted regular expression subset as the simple_pattern tokenizer, but splits the input at matches rather than returning the matches as terms.

- Path Tokenizer
  The path_hierarchy tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree, e.g. /foo/bar/baz → [/foo, /foo/bar, /foo/bar/baz] (see the example after this list).
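As with the other tokenizers, this behaviour can be verified with the _analyze API; a sketch using the path from the example above:

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/foo/bar/baz"
}

The response contains the terms /foo, /foo/bar and /foo/bar/baz.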