IMPORTANT: No additional bug fixes or documentation updates
will be released for this version. For the latest information, see the
current release documentation.
Word Delimiter Token Filter
Named word_delimiter, this filter splits words into subwords and performs
optional transformations on subword groups. Words are split into
subwords according to the following rules (a minimal configuration sketch
follows the list):
- split on intra-word delimiters (by default, all non-alphanumeric characters): "Wi-Fi" → "Wi", "Fi"
- split on case transitions: "PowerShot" → "Power", "Shot"
- split on letter-number transitions: "SD500" → "SD", "500"
- leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, dude" → "hello", "there", "dude"
- trailing "'s" are removed for each subword: "O’Neil’s" → "O", "Neil"
Parameters include the following (a configuration sketch follows the list):
- generate_word_parts - If true, causes parts of words to be generated: "PowerShot" ⇒ "Power" "Shot". Defaults to true.
- generate_number_parts - If true, causes number subwords to be generated: "500-42" ⇒ "500" "42". Defaults to true.
- catenate_words - If true, causes maximum runs of word parts to be catenated: "wi-fi" ⇒ "wifi". Defaults to false.
- catenate_numbers - If true, causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Defaults to false.
- catenate_all - If true, causes all subword parts to be catenated: "wi-fi-4000" ⇒ "wifi4000". Defaults to false.
- split_on_case_change - If true, causes "PowerShot" to become two tokens ("Power-Shot" remains two parts regardless). Defaults to true.
- preserve_original - If true, includes original words in subwords: "500-42" ⇒ "500-42" "500" "42". Defaults to false.
- split_on_numerics - If true, causes "j2se" to become three tokens: "j" "2" "se". Defaults to true.
- stem_english_possessive - If true, causes trailing "'s" to be removed for each subword: "O’Neil’s" ⇒ "O", "Neil". Defaults to true.
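For example, a custom filter overriding the defaults might be declared as
follows (the filter and analyzer names are illustrative):

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "catenate_words": true,
          "preserve_original": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["my_word_delimiter"]
        }
      }
    }
  }
}

With these settings, "wi-fi" produces "wi-fi" (the preserved original),
"wi", "fi", and the catenated "wifi".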
Advanced settings include the following (a combined sketch follows the type_table example):
- protected_words - A list of words protected from being split. Either provide the words inline as an array, or set protected_words_path to a file containing one protected word per line. A relative path is resolved against the config/ directory if the file exists there.
- type_table - A custom type mapping table, for example (when configured using type_table_path):
# Map the $, %, '.', and ',' characters to DIGIT
# This might be useful for financial data.
$ => DIGIT
% => DIGIT
. => DIGIT
\\u002C => DIGIT
# in some cases you might not want to split on ZWJ
# this also tests the case where we need a bigger byte[]
# see http://en.wikipedia.org/wiki/Zero-width_joiner
\\u200D => ALPHANUM
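A sketch showing both advanced settings together; the path
analysis/type_table.txt is an assumed location for a file containing
entries like those above:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "protected_words": ["wi-fi", "j2se"],
          "type_table_path": "analysis/type_table.txt"
        }
      }
    }
  }
}

Here "wi-fi" and "j2se" would pass through unsplit, while characters
mapped to DIGIT or ALPHANUM in the type table are treated accordingly.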
NOTE: Using a tokenizer like the standard tokenizer may interfere with
the catenate_* and preserve_original parameters, as the original
string may already have lost punctuation during tokenization. Instead,
you may want to use the whitespace tokenizer, as illustrated below.
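To illustrate: the standard tokenizer already splits "wi-fi" into "wi"
and "fi", so a downstream catenate_words never sees the hyphenated
original and cannot emit "wifi". With the whitespace tokenizer the filter
receives "wi-fi" intact. A sketch using an inline filter definition
(supported by the _analyze API in recent versions; older versions use
URL parameters instead):

GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [{ "type": "word_delimiter", "catenate_words": true }],
  "text": "wi-fi"
}

This returns "wi", "fi", and "wifi"; swapping the tokenizer for
"standard" drops "wifi", because the hyphen is gone before the filter
runs.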