ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation.
It behaves much like the standard tokenizer,
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
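The following example creates an index with an analyzer that uses the icu_tokenizer: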
PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
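To see the tokens the analyzer produces, you can test it with the _analyze API. This is a minimal sketch; the sample text is illustrative, and the returned tokens depend on the ICU word-break rules for the language of the input:

POST icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "Peter Piper picked a peck of pickled peppers"
}

The response lists each token with its offsets and position, which is a quick way to confirm the icu_tokenizer is applied as expected.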