ICU Normalization Character Filter

Normalizes characters as described in the Unicode Normalization Forms specification. It registers itself as the icu_normalizer character filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default). Set the mode parameter to decompose to convert nfc to nfd or nfkc to nfkd, respectively.
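As a quick check, the filter can be exercised ad hoc through the _analyze API, with no index required (the sample text here is arbitrary). With the default nfkc_cf normalization, something like the following should decompose and case-fold the Roman numeral character Ⅻ (U+216B) to xii:

POST _analyze
{
  "tokenizer": "icu_tokenizer",
  "char_filter": [
    "icu_normalizer"
  ],
  "text": "Ⅻ"
}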

Which letters are normalized can be controlled by specifying the unicode_set_filter parameter, which accepts a UnicodeSet.
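For example, a character filter along these lines (the filter name and the exclusion set are illustrative, not part of the plugin) would apply nfkc_cf normalization to all characters except the Swedish letters å, ä, ö, which the negated UnicodeSet exempts:

PUT icu_swedish_sample
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "swedish_preserving_normalizer": {
            "type": "icu_normalizer",
            "name": "nfkc_cf",
            "unicode_set_filter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}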

Here are two examples: the default usage and a customized character filter:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { 
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "icu_normalizer"
            ]
          },
          "nfd_normalized": { 
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfd_normalizer"
            ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}

The nfkc_cf_normalized analyzer uses the default nfkc_cf normalization.

The nfd_normalized analyzer uses the customized nfd_normalizer character filter, which is set to use nfc normalization with decomposition (that is, nfd).
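To verify the behavior of the second analyzer, a sketch using the _analyze API (the sample word is arbitrary): with nfd_normalized, the precomposed é (U+00E9) in the text is decomposed into e followed by a combining acute accent (U+0301):

GET icu_sample/_analyze
{
  "analyzer": "nfd_normalized",
  "text": "café"
}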
