WARNING: Version 5.2 of Elasticsearch has passed its EOL date.

This documentation is no longer being maintained and may be removed. If you are running this version, we strongly advise you to upgrade. For the latest information, see the current release documentation.

Custom Analyzer

When the built-in analyzers do not fulfill your needs, you can create a custom analyzer which uses the appropriate combination of:

zero or more character filters
a tokenizer
zero or more token filters.
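
The order in which these components run can be pictured with a small, self-contained Python sketch. It is not Elasticsearch code: the `Analyzer` class and the toy components below are hypothetical stand-ins, chosen only to show that character filters transform the raw string, the single tokenizer splits it into terms, and token filters then rewrite the list of terms.

```python
from dataclasses import dataclass, field
from typing import Callable, List

CharFilter = Callable[[str], str]                 # string in, string out
Tokenizer = Callable[[str], List[str]]            # string in, terms out
TokenFilter = Callable[[List[str]], List[str]]    # terms in, terms out


@dataclass
class Analyzer:
    """Hypothetical model of a custom analyzer (illustration only)."""
    tokenizer: Tokenizer                                        # exactly one, required
    char_filters: List[CharFilter] = field(default_factory=list)   # zero or more
    token_filters: List[TokenFilter] = field(default_factory=list) # zero or more

    def analyze(self, text: str) -> List[str]:
        # 1. Character filters run first, in order, on the raw text.
        for char_filter in self.char_filters:
            text = char_filter(text)
        # 2. The tokenizer turns the filtered text into terms.
        tokens = self.tokenizer(text)
        # 3. Token filters run last, in order, on the term list.
        for token_filter in self.token_filters:
            tokens = token_filter(tokens)
        return tokens


# Toy components, not the Elasticsearch implementations.
def ampersand_to_and(text: str) -> str:
    return text.replace("&", " and ")


def whitespace_tokenizer(text: str) -> List[str]:
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


analyzer = Analyzer(
    tokenizer=whitespace_tokenizer,
    char_filters=[ampersand_to_and],
    token_filters=[lowercase],
)
print(analyzer.analyze("Salt&Pepper Shaker"))
# ['salt', 'and', 'pepper', 'shaker']
```

Only the tokenizer is mandatory in this model, mirroring the list above: `Analyzer(tokenizer=whitespace_tokenizer)` is already a complete, if minimal, analyzer.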

Configuration

The custom analyzer accepts the following parameters:

`tokenizer`	A built-in or customised tokenizer. (Required)
`char_filter`	An optional array of built-in or customised character filters.
`filter`	An optional array of built-in or customised token filters.
`position_increment_gap`	When indexing an array of text values, Elasticsearch inserts a fake "gap" between the last term of one value and the first term of the next value to ensure that a phrase query doesn’t match two terms from different array elements. Defaults to `100`. See `position_increment_gap` for more.
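
The effect of `position_increment_gap` can be illustrated with a simplified Python model of term positions. This is a sketch rather than the actual indexing logic: the `index_positions` and `phrase_matches` helpers are hypothetical, tokenisation is a plain whitespace split, and the exact position arithmetic is an assumption. Only the idea, a large jump in position between array elements, is taken from the description above.

```python
from typing import Dict, List


def index_positions(values: List[str], gap: int = 100) -> Dict[str, List[int]]:
    """Assign a position to every term of a multi-valued text field.

    Simplified model: whitespace tokenisation, and the first term of each
    later value is pushed `gap` positions beyond where it would otherwise go.
    """
    positions: Dict[str, List[int]] = {}
    next_position = 0
    for value_index, value in enumerate(values):
        if value_index > 0:
            next_position += gap  # the fake "gap" between array elements
        for term in value.lower().split():
            positions.setdefault(term, []).append(next_position)
            next_position += 1
    return positions


def phrase_matches(positions: Dict[str, List[int]], first: str, second: str) -> bool:
    """True if `second` occurs at the position immediately after `first`."""
    return any(p + 1 in positions.get(second, []) for p in positions.get(first, []))


names = ["John Abraham", "Lincoln Smith"]  # illustrative array of text values

with_gap = index_positions(names)  # default gap of 100
print(with_gap)
# {'john': [0], 'abraham': [1], 'lincoln': [102], 'smith': [103]}
print(phrase_matches(with_gap, "abraham", "lincoln"))  # False

no_gap = index_positions(names, gap=0)
print(phrase_matches(no_gap, "abraham", "lincoln"))    # True
```

With a gap of `0` the last term of one value and the first term of the next sit at adjacent positions, so the phrase spans two array elements; with the default gap it does not.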

Example configuration

Here is an example that combines the following:

Character Filter

HTML Strip Character Filter

Tokenizer

Standard Tokenizer

Token Filters

Lowercase Token Filter
ASCII-Folding Token Filter

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom",
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}

The above example produces the following terms:

[ is, this, deja, vu ]
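
To see which stage is responsible for which change, the same text can be walked through a rough Python imitation of this analyzer. The functions below are approximations written for illustration, not the Elasticsearch implementations: `html_strip` is a naive tag-removing regex, `standard_tokenizer` simply collects runs of word characters, and `asciifolding` drops combining marks after Unicode decomposition.

```python
import re
import unicodedata
from typing import List


def html_strip(text: str) -> str:
    """Crude stand-in for the html_strip character filter: drop tag-like spans."""
    return re.sub(r"<[^>]+>", "", text)


def standard_tokenizer(text: str) -> List[str]:
    """Rough approximation of the standard tokenizer: runs of word characters."""
    # Normalise to composed form so accented letters count as word characters.
    return re.findall(r"\w+", unicodedata.normalize("NFC", text))


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


def asciifolding(tokens: List[str]) -> List[str]:
    """Approximation of ASCII folding: decompose, then drop combining marks."""
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded


text = "Is this <b>déjà vu</b>?"
stripped = html_strip(text)             # character filter: 'Is this déjà vu?'
tokens = standard_tokenizer(stripped)   # tokenizer: ['Is', 'this', 'déjà', 'vu']
terms = asciifolding(lowercase(tokens)) # token filters, in configured order
print(terms)
# ['is', 'this', 'deja', 'vu']
```

The HTML tags disappear before tokenization, the question mark is dropped by the tokenizer, and the accents are only removed at the very end by the folding filter.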

The previous example used a tokenizer, token filters, and a character filter with their default configurations, but it is possible to create configured versions of each and use them in a custom analyzer.

Here is a more complicated example that combines the following:

Character Filter

Mapping Character Filter, configured to replace :) with _happy_ and :( with _sad_

Tokenizer

Pattern Tokenizer, configured to split on punctuation characters

Token Filters

Lowercase Token Filter
Stop Token Filter, configured to use the pre-defined list of English stop words

Here is an example:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
            "emoticons" 
          ],
          "tokenizer": "punctuation", 
          "filter": [
            "lowercase",
            "english_stop" 
          ]
        }
      },
      "tokenizer": {
        "punctuation": { 
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { 
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { 
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text":     "I'm a :) person, and you?"
}

The `emoticons` character filter, `punctuation` tokenizer and `english_stop` token filter are custom implementations which are defined in the same index settings.

The above example produces the following terms:

[ i'm, _happy_, person, you ]
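
The way these terms come about can be reproduced with a short Python approximation of the configured components. It is a sketch under stated assumptions, not the Elasticsearch implementation: `mapping_char_filter` uses plain sequential string replacement, `pattern_tokenizer` uses Python's `re` module with the same character class, and `ENGLISH_STOP_SUBSET` is a deliberately abbreviated, hypothetical stand-in for the `_english_` stop word list.

```python
import re
from typing import Dict, List

# Same mappings and pattern as in the index settings above.
MAPPINGS = {":)": "_happy_", ":(": "_sad_"}
PUNCTUATION_PATTERN = r"[ .,!?]"
# Abbreviated for illustration; the real _english_ list is longer.
ENGLISH_STOP_SUBSET = {"a", "an", "and", "the", "of", "or", "to"}


def mapping_char_filter(text: str, mappings: Dict[str, str]) -> str:
    """Replace each mapped key with its value (simple sequential replacement)."""
    for source, target in mappings.items():
        text = text.replace(source, target)
    return text


def pattern_tokenizer(text: str, pattern: str) -> List[str]:
    """Split wherever the pattern matches and discard empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


def lowercase(tokens: List[str]) -> List[str]:
    return [token.lower() for token in tokens]


def stop_filter(tokens: List[str], stopwords: set) -> List[str]:
    return [token for token in tokens if token not in stopwords]


def analyze(text: str) -> List[str]:
    text = mapping_char_filter(text, MAPPINGS)             # "emoticons"
    tokens = pattern_tokenizer(text, PUNCTUATION_PATTERN)  # "punctuation"
    tokens = lowercase(tokens)                             # "lowercase"
    return stop_filter(tokens, ENGLISH_STOP_SUBSET)        # "english_stop"


print(analyze("I'm a :) person, and you?"))
# ["i'm", '_happy_', 'person', 'you']
```

Because the character filter runs before the tokenizer, `:)` has already become `_happy_` by the time the text is split, and the apostrophe in `I'm` survives because it is not part of the split pattern.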
