CJK Bigram Token Filter

The cjk_bigram token filter forms bigrams out of the CJK terms that are generated by the standard tokenizer or the icu_tokenizer (see analysis-icu plugin).

By default, a CJK character that has no adjacent CJK character to form a bigram with is output as a unigram. To always output both unigrams and bigrams, set the output_unigrams flag to true. This supports a combined unigram+bigram approach.
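
The effect of output_unigrams can be illustrated with a small Python sketch. This is a simplified model of the behavior described above, not the filter's actual implementation: the function name cjk_bigrams is hypothetical, the input is assumed to be a single run of adjacent CJK characters, and the real filter's token positions are not modeled.

```python
def cjk_bigrams(chars, output_unigrams=False):
    """Simplified model: `chars` is one run of adjacent CJK characters."""
    # A lone character has no neighbor, so it is emitted as a unigram.
    if len(chars) == 1:
        return [chars]
    tokens = []
    for i, ch in enumerate(chars):
        if output_unigrams:
            tokens.append(ch)  # unigram for every character
        if i + 1 < len(chars):
            tokens.append(ch + chars[i + 1])  # bigram with the next character
    return tokens


print(cjk_bigrams("東京都"))                        # ['東京', '京都']
print(cjk_bigrams("東京都", output_unigrams=True))  # ['東', '東京', '京', '京都', '都']
print(cjk_bigrams("東"))                            # ['東']
```

With output_unigrams left at its default, unigrams appear only for isolated characters. With it set to true, every character is also emitted on its own.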

Bigrams are generated for characters in han, hiragana, katakana and hangul, but bigrams can be disabled for particular scripts with the ignored_scripts parameter. All non-CJK input is passed through unmodified.
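
The following Python sketch models how ignored_scripts and the pass-through of non-CJK input might interact. It rests on assumptions that are not part of the filter's API: the names filter_tokens and bigram_run are hypothetical, each input token is assumed to arrive already labeled with its script, and bigrams are formed only within a single token. How the real filter treats adjacent tokens of different scripts is not modeled here.

```python
# Scripts for which the filter forms bigrams, as listed in the text above.
CJK_SCRIPTS = {"han", "hiragana", "katakana", "hangul"}


def bigram_run(text):
    """Split one run of characters into overlapping bigrams."""
    if len(text) < 2:
        return [text]  # a lone character stays a unigram
    return [text[i:i + 2] for i in range(len(text) - 1)]


def filter_tokens(tokens, ignored_scripts=()):
    """Assumed input: (text, script) pairs; the script labels are hypothetical."""
    enabled = CJK_SCRIPTS - set(ignored_scripts)
    out = []
    for text, script in tokens:
        if script in enabled:
            out.extend(bigram_run(text))  # bigrams for enabled CJK scripts
        else:
            out.append(text)  # ignored scripts and non-CJK pass through
    return out


tokens = [("東京都", "han"), ("カタカナ", "katakana"), ("hello", "latin")]
print(filter_tokens(tokens))
# ['東京', '京都', 'カタ', 'タカ', 'カナ', 'hello']
print(filter_tokens(tokens, ignored_scripts=["hiragana", "katakana", "hangul"]))
# ['東京', '京都', 'カタカナ', 'hello']
```

In the second call only han tokens are turned into bigrams, while the katakana and Latin tokens are returned unchanged.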

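For example, the following index settings define a han_bigrams analyzer that forms bigrams from han characters only and also outputs unigrams:

{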
    "index" : {
        "analysis" : {
            "analyzer" : {
                "han_bigrams" : {
                    "tokenizer" : "standard",
                    "filter" : ["han_bigrams_filter"]
                }
            },
            "filter" : {
                "han_bigrams_filter" : {
                    "type" : "cjk_bigram",
                    "ignored_scripts": [
                        "hiragana",
                        "katakana",
                        "hangul"
                    ],
                    "output_unigrams" : true
                }
            }
        }
    }
}