Standard Tokenizer

A tokenizer of type standard that provides grammar-based tokenization and works well for most European-language documents. The tokenizer implements the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.

The following settings can be set for a standard tokenizer type:

Setting            Description
max_token_length   The maximum token length. If a token is seen that exceeds this length then it is discarded. Defaults to 255.
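
As a minimal sketch of how this setting is applied, the request below registers a custom tokenizer of type standard with a lowered max_token_length and wires it into an analyzer. The index, analyzer, and tokenizer names (my_index, my_analyzer, my_tokenizer) are illustrative, and the exact request syntax may differ across Elasticsearch versions.

```
curl -XPUT 'http://localhost:9200/my_index' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}'
```

With max_token_length set to 5, a token longer than five characters would, per the description above, be discarded rather than emitted.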
