IMPORTANT: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the current release documentation.


ICU normalization token filter




Normalizes characters as described in the ICU normalization documentation. It registers itself as the `icu_normalizer` token filter, which is available to all indices without any further configuration. The type of normalization is set with the `name` parameter, which accepts `nfc`, `nfkc`, and `nfkc_cf` (the default).

The `unicode_set_filter` parameter, which accepts a UnicodeSet, controls which letters are normalized.
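
As a rough sketch of how `unicode_set_filter` might be used, the request below defines a filter that applies the default `nfkc_cf` normalization to everything except the Swedish letters å, ä, and ö. The index name `icu_sample_filtered`, the analyzer name `nfkc_cf_swedish`, the filter name `swedish_normalizer`, and the choice of letters are illustrative assumptions, not names defined by the plugin:

```console
PUT icu_sample_filtered
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_swedish": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "swedish_normalizer"
            ]
          }
        },
        "filter": {
          "swedish_normalizer": {
            "type": "icu_normalizer",
            "name": "nfkc_cf",
            "unicode_set_filter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}
```

The UnicodeSet `[^åäöÅÄÖ]` matches every character except the six listed ones. Characters outside the set are passed through untouched, so with this configuration an uppercase `Å` would be expected to stay uppercase rather than being case-folded.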

In most cases you should prefer the ICU normalization character filter over this token filter.

Here are two examples: the default usage and a customized token filter:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { 
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_normalizer"
            ]
          },
          "nfc_normalized": { 
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfc_normalizer"
            ]
          }
        },
        "filter": {
          "nfc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc"
          }
        }
      }
    }
  }
}

The `nfkc_cf_normalized` analyzer uses the default `nfkc_cf` normalization.
The `nfc_normalized` analyzer uses the customized `nfc_normalizer` token filter, which is set to use `nfc` normalization.
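
One way to check what each analyzer does, assuming the `icu_sample` index above has been created, is the analyze API. The sample text is an arbitrary choice containing full-width letters and the `ﬁ` ligature:

```console
GET icu_sample/_analyze
{
  "analyzer": "nfkc_cf_normalized",
  "text": "Ｅｌａｓｔｉｃ ﬁlter"
}
```

With `nfkc_cf`, compatibility characters such as these should be mapped to their plain equivalents and case-folded. Running the same request with `"analyzer": "nfc_normalized"` should leave them as they are, since `nfc` applies only canonical composition.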
