IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.

« Simple pattern split tokenizer Thai tokenizer »

› › ›

Standard tokenizer

edit

IMPORTANT: This documentation is no longer updated. Refer to Elastic's version policy and the latest documentation.

Standard tokenizer

edit

The standard tokenizer provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.

Example output

edit

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following terms:

[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]

Configuration

edit

The standard tokenizer accepts the following parameters:

max_token_length

The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.

Example configuration

edit

In this example, we configure the standard tokenizer to have a max_token_length of 5 (for demonstration purposes):

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above example produces the following terms:

[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]

« Simple pattern split tokenizer Thai tokenizer »