Standard Tokenizer


A tokenizer of type standard that provides a grammar-based tokenizer, suitable for most European-language documents. The tokenizer implements the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
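
As a quick illustration, the _analyze API can be used to inspect the tokens this tokenizer emits. This is a minimal sketch: the host, port, and sample sentence are assumptions, not part of this reference.

    curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
    {
      "tokenizer": "standard",
      "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog."
    }'

With the default settings, a request like this would be expected to produce the terms [ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog ], splitting on whitespace, punctuation, and the hyphen.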

The following settings can be configured for a tokenizer of type standard:

Setting             Description

max_token_length    The maximum token length. If a token exceeds this length, it is discarded. Defaults to 255.
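
For example, a custom tokenizer based on the standard type can be registered in the index analysis settings and referenced from a custom analyzer. The sketch below is illustrative only: the index name my_index, the names my_analyzer and my_tokenizer, the max_token_length value of 5, and the endpoint are all assumptions.

    curl -X PUT "localhost:9200/my_index" -H 'Content-Type: application/json' -d'
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "standard",
              "max_token_length": 5
            }
          }
        }
      }
    }'

Documents indexed with my_analyzer would then be tokenized by a standard tokenizer configured with the non-default max_token_length shown above.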