NOTE: You are looking at documentation for an older release. For the latest information, see the current release documentation.
CJK Bigram Token Filter
The cjk_bigram token filter forms bigrams out of the CJK
terms that are generated by the standard tokenizer
or the icu_tokenizer (see analysis-icu plugin).
By default, when a CJK character has no adjacent characters to form a bigram,
it is output in unigram form. If you always want to output both unigrams and
bigrams, set the output_unigrams flag to true. This can be used for a
combined unigram+bigram approach.
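The unigram and bigram behavior described above can be illustrated with a small Python sketch. This is a simplified model and not the filter's actual implementation: the function name cjk_bigrams is hypothetical, and it assumes its input is a single run of adjacent CJK characters that the tokenizer has already isolated.

```python
def cjk_bigrams(run, output_unigrams=False):
    """Simplified model of bigram formation over one run of adjacent CJK characters.

    `run` is assumed to contain only CJK characters with no gaps between them.
    This is an illustration, not the filter's real code.
    """
    # A character with no neighbour cannot form a bigram, so it is emitted as a unigram.
    if len(run) == 1:
        return [run]

    tokens = []
    for i, ch in enumerate(run):
        # With output_unigrams enabled, every character is also emitted on its own.
        if output_unigrams:
            tokens.append(ch)
        # Emit the overlapping pair starting at this character, if one exists.
        if i + 1 < len(run):
            tokens.append(run[i:i + 2])
    return tokens


print(cjk_bigrams("東京都"))
print(cjk_bigrams("東京都", output_unigrams=True))
print(cjk_bigrams("東"))
```

In this model, a three-character run yields two overlapping bigrams by default, and enabling output_unigrams interleaves each single character with the bigram that starts at it.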
Bigrams are generated for characters in han, hiragana, katakana and
hangul, but bigrams can be disabled for particular scripts with the
ignored_scripts parameter. All non-CJK input is passed through unmodified.
PUT /cjk_bigram_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "han_bigrams" : {
                    "tokenizer" : "standard",
                    "filter" : ["han_bigrams_filter"]
                }
            },
            "filter" : {
                "han_bigrams_filter" : {
                    "type" : "cjk_bigram",
                    "ignored_scripts": [
                        "hiragana",
                        "katakana",
                        "hangul"
                    ],
                    "output_unigrams" : true
                }
            }
        }
    }
}