Compound Word Token Filter

Token filters that decompose compound words into their constituent parts. Two types are available: dictionary_decompounder and hyphenation_decompounder.

The following settings can be configured for a compound word token filter of either type:

word_list
    A list of words to use.

word_list_path
    A path (either relative to the config location, or absolute) to a file containing a list of words.

hyphenation_patterns_path
    A path (either relative to the config location, or absolute) to a FOP XML hyphenation pattern file (see http://offo.sourceforge.net/hyphenation/). Required for the hyphenation_decompounder.

min_word_size
    Minimum word size (integer). Defaults to 5.

min_subword_size
    Minimum subword size (integer). Defaults to 2.

max_subword_size
    Maximum subword size (integer). Defaults to 15.

only_longest_match
    Whether to include only the longest matching subword (boolean). Defaults to false.

Here is an example in the YAML settings format (the file paths are placeholders; note that hyphenation_patterns_path is required for the hyphenation_decompounder):

index :
    analysis :
        analyzer :
            myAnalyzer2 :
                type : custom
                tokenizer : standard
                filter : [myTokenFilter1, myTokenFilter2]
        filter :
            myTokenFilter1 :
                type : dictionary_decompounder
                word_list: [one, two, three]
            myTokenFilter2 :
                type : hyphenation_decompounder
                word_list_path : path/to/words.txt
                hyphenation_patterns_path : path/to/hyphenation_patterns.xml
                max_subword_size : 22
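
To see the dictionary_decompounder in action, the same settings can be supplied through the create index API and tried out with the _analyze API. The sketch below assumes this approach; the index name decompound_example, the sample text, and the expected tokens are illustrative only, and the exact request syntax varies between Elasticsearch versions:

# Define the analyzer and dictionary filter through the create index API
# (decompound_example is a placeholder index name)
curl -XPUT 'localhost:9200/decompound_example' -H 'Content-Type: application/json' -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "myAnalyzer2" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["myTokenFilter1"]
        }
      },
      "filter" : {
        "myTokenFilter1" : {
          "type" : "dictionary_decompounder",
          "word_list" : ["one", "two", "three"]
        }
      }
    }
  }
}'

# Run the analyzer against a sample compound token
curl -XGET 'localhost:9200/decompound_example/_analyze' -H 'Content-Type: application/json' -d '
{
  "analyzer" : "myAnalyzer2",
  "text" : "onetwo"
}'

With the word list above and the default size settings, the token onetwo is expected to come back as the original token plus the subword tokens one and two, since the dictionary_decompounder keeps the original token and adds every dictionary word it finds inside it.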