Hunspell token filter
Provides dictionary stemming based on a provided Hunspell dictionary. The
hunspell filter requires configuration of one or more language-specific
Hunspell dictionaries.
This filter uses Lucene’s HunspellStemFilter.
If available, we recommend trying an algorithmic stemmer for your language
before using the hunspell token filter.
In practice, algorithmic stemmers typically outperform dictionary stemmers.
See Dictionary stemmers.
Configure Hunspell dictionaries
Hunspell dictionaries are stored and detected in a dedicated
hunspell directory on the filesystem: <$ES_PATH_CONF>/hunspell. Each dictionary
is expected to have its own directory, named after its associated language and
locale (e.g., pt_BR, en_GB). This dictionary directory is expected to hold a
single .aff and one or more .dic files, all of which will automatically be
picked up. For example, the following directory layout will define the en_US dictionary:
- config
|-- hunspell
| |-- en_US
| | |-- en_US.dic
| | |-- en_US.aff
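The layout rules above can be checked before a node starts. The following is a minimal sketch, not part of Elasticsearch: the helper name find_hunspell_dictionaries is hypothetical, and it relies only on the rules stated above (one directory per locale, holding a single .aff file and one or more .dic files).

```python
from pathlib import Path


def find_hunspell_dictionaries(config_dir):
    """Map each locale directory under <config_dir>/hunspell to its files.

    Hypothetical helper for illustration only; Elasticsearch performs its
    own dictionary discovery and does not expose this function.
    """
    hunspell_dir = Path(config_dir) / "hunspell"
    result = {}
    if not hunspell_dir.is_dir():
        return result
    for locale_dir in sorted(p for p in hunspell_dir.iterdir() if p.is_dir()):
        aff = sorted(f.name for f in locale_dir.glob("*.aff"))
        dic = sorted(f.name for f in locale_dir.glob("*.dic"))
        result[locale_dir.name] = {
            "aff": aff,
            "dic": dic,
            # Valid per the rules above: exactly one .aff, at least one .dic.
            "valid": len(aff) == 1 and len(dic) >= 1,
        }
    return result
```

Calling find_hunspell_dictionaries with the path to the config directory returns one entry per locale directory, so a missing .aff file or an empty directory shows up as "valid": False.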
Each dictionary can be configured with one setting:
- ignore_case
  (Static, Boolean) If true, dictionary matching will be case insensitive. Defaults to false.

  This setting can be configured globally in elasticsearch.yml using indices.analysis.hunspell.dictionary.ignore_case.

  To configure the setting for a specific locale, use the indices.analysis.hunspell.dictionary.<locale>.ignore_case setting (e.g., for the en_US (American English) locale, the setting is indices.analysis.hunspell.dictionary.en_US.ignore_case).

  You can also add a settings.yml file under the dictionary directory which holds these settings. This overrides any other ignore_case settings defined in elasticsearch.yml.
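The way these three sources combine can be sketched as a small lookup. This is an illustration, not Elasticsearch code: resolve_ignore_case is a hypothetical name, the inputs are plain dicts standing in for the parsed files, and the ordering of the locale-specific key ahead of the global key is an assumption. The text above only states that settings.yml overrides elasticsearch.yml.

```python
def resolve_ignore_case(locale, node_settings, dictionary_settings=None):
    """Return the effective ignore_case value for one locale.

    node_settings: flat dict of keys as they would appear in elasticsearch.yml.
    dictionary_settings: dict parsed from the dictionary's settings.yml, or None.
    Hypothetical helper for illustration only.
    """
    prefix = "indices.analysis.hunspell.dictionary"
    # settings.yml in the dictionary directory overrides elasticsearch.yml.
    if dictionary_settings and "ignore_case" in dictionary_settings:
        return bool(dictionary_settings["ignore_case"])
    # Assumption: the locale-specific key takes precedence over the global key.
    locale_key = f"{prefix}.{locale}.ignore_case"
    if locale_key in node_settings:
        return bool(node_settings[locale_key])
    # Fall back to the global key, which defaults to false.
    return bool(node_settings.get(f"{prefix}.ignore_case", False))
```

For example, resolve_ignore_case("en_US", {"indices.analysis.hunspell.dictionary.ignore_case": True}, {"ignore_case": False}) evaluates to False, because the settings.yml value wins.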
Example
The following analyze API request uses the hunspell filter to stem
the foxes jumping quickly to the fox jump quick.
The request specifies the en_US locale, meaning that the
.aff and .dic files in the <$ES_PATH_CONF>/hunspell/en_US directory are used
for the Hunspell dictionary.
resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        {
            "type": "hunspell",
            "locale": "en_US"
        }
    ],
    text="the foxes jumping quickly",
)
print(resp)
const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: [
    {
      type: "hunspell",
      locale: "en_US",
    },
  ],
  text: "the foxes jumping quickly",
});
console.log(response);
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "hunspell",
      "locale": "en_US"
    }
  ],
  "text": "the foxes jumping quickly"
}
The filter produces the following tokens:
[ the, fox, jump, quick ]
Configurable parameters
- dictionary
  (Optional, string or array of strings) One or more .dic files (e.g., en_US.dic, my_custom.dic) to use for the Hunspell dictionary.

  By default, the hunspell filter uses all .dic files in the <$ES_PATH_CONF>/hunspell/<locale> directory specified using the lang, language, or locale parameter.
- dedup
  (Optional, Boolean) If true, duplicate tokens are removed from the filter’s output. Defaults to true.
- lang
  (Required*, string) An alias for the locale parameter.

  If this parameter is not specified, the language or locale parameter is required.
- language
  (Required*, string) An alias for the locale parameter.

  If this parameter is not specified, the lang or locale parameter is required.
- locale
  (Required*, string) Locale directory used to specify the .aff and .dic files for a Hunspell dictionary. See Configure Hunspell dictionaries.

  If this parameter is not specified, the lang or language parameter is required.
- longest_only
  (Optional, Boolean) If true, only the longest stemmed version of each token is included in the output. If false, all stemmed versions of the token are included. Defaults to false.
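To show how these parameters fit together in a filter definition, here is a minimal sketch. The build_hunspell_filter helper is hypothetical and only assembles the JSON body; it emits optional parameters only when they differ from the defaults listed above.

```python
def build_hunspell_filter(locale, dictionary=None, dedup=True, longest_only=False):
    """Assemble a hunspell filter definition as a plain dict.

    Hypothetical helper for illustration; parameter names and defaults
    follow the list above.
    """
    if not locale:
        # One of lang, language, or locale is required; this sketch uses locale.
        raise ValueError("a locale (or its alias lang/language) is required")
    body = {"type": "hunspell", "locale": locale}
    if dictionary is not None:
        # A single file name or a list of .dic file names.
        body["dictionary"] = dictionary
    if dedup is not True:
        body["dedup"] = dedup
    if longest_only:
        body["longest_only"] = True
    return body


# Restrict the filter to two .dic files and keep only the longest stem.
custom_filter = build_hunspell_filter(
    "en_US",
    dictionary=["en_US.dic", "my_custom.dic"],
    longest_only=True,
)
```

The resulting dict can be passed in the filter list of an analyze API request or placed under analysis.filter in index settings.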
Customize and add to an analyzer
To customize the hunspell filter, duplicate it to create the
basis for a new custom token filter. You can modify the filter using its
configurable parameters.
For example, the following create index API request
uses a custom hunspell filter, my_en_US_dict_stemmer, to configure a new
custom analyzer.
The my_en_US_dict_stemmer filter uses a locale of en_US, meaning that the
.aff and .dic files in the <$ES_PATH_CONF>/hunspell/en_US directory are
used. The filter also includes a dedup argument of false, meaning that
duplicate tokens added from the dictionary are not removed from the filter’s
output.
resp = client.indices.create(
    index="my-index-000001",
    settings={
        "analysis": {
            "analyzer": {
                "en": {
                    "tokenizer": "standard",
                    "filter": [
                        "my_en_US_dict_stemmer"
                    ]
                }
            },
            "filter": {
                "my_en_US_dict_stemmer": {
                    "type": "hunspell",
                    "locale": "en_US",
                    "dedup": False
                }
            }
        }
    },
)
print(resp)
response = client.indices.create(
  index: 'my-index-000001',
  body: {
    settings: {
      analysis: {
        analyzer: {
          en: {
            tokenizer: 'standard',
            filter: [
              'my_en_US_dict_stemmer'
            ]
          }
        },
        filter: {
          "my_en_US_dict_stemmer": {
            type: 'hunspell',
            locale: 'en_US',
            dedup: false
          }
        }
      }
    }
  }
)
puts response
const response = await client.indices.create({
  index: "my-index-000001",
  settings: {
    analysis: {
      analyzer: {
        en: {
          tokenizer: "standard",
          filter: ["my_en_US_dict_stemmer"],
        },
      },
      filter: {
        my_en_US_dict_stemmer: {
          type: "hunspell",
          locale: "en_US",
          dedup: false,
        },
      },
    },
  },
});
console.log(response);
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en": {
          "tokenizer": "standard",
          "filter": [ "my_en_US_dict_stemmer" ]
        }
      },
      "filter": {
        "my_en_US_dict_stemmer": {
          "type": "hunspell",
          "locale": "en_US",
          "dedup": false
        }
      }
    }
  }
}
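Once the index exists, one way to check the custom analyzer is to run the analyze API against the index and name the analyzer. The sketch below assumes a client object like the one in the Python examples above and an analyze response that carries a tokens list whose entries have a token field; the helper name analyze_with_index_analyzer is hypothetical, and no output is shown because it depends on the installed dictionary.

```python
def analyze_with_index_analyzer(client, index, analyzer, text):
    """Run the analyze API with an analyzer defined on an index.

    Hypothetical helper for illustration; returns only the token strings
    from the response.
    """
    resp = client.indices.analyze(index=index, analyzer=analyzer, text=text)
    # Each entry in "tokens" describes one emitted token.
    return [entry["token"] for entry in resp["tokens"]]
```

With the index from the request above, the call would be analyze_with_index_analyzer(client, "my-index-000001", "en", "the foxes jumping quickly").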
Settings
In addition to the ignore_case
settings, you can configure the following global settings for the hunspell
filter using elasticsearch.yml:
- indices.analysis.hunspell.dictionary.lazy
  (Static, Boolean) If true, the loading of Hunspell dictionaries is deferred until a dictionary is used. If false, the dictionary directory is checked for dictionaries when the node starts, and any dictionaries are automatically loaded. Defaults to false.
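As a rough illustration of how the node-level keys named on this page look when written to elasticsearch.yml, the sketch below builds the flat setting names and renders them as key: value lines. Both helper names are hypothetical, and the flat key: value form is only one of the ways such settings can be written.

```python
def hunspell_node_settings(ignore_case=None, locale_ignore_case=None, lazy=None):
    """Collect Hunspell node settings under their full key names.

    Hypothetical helper for illustration; only keys named on this page
    are produced, and unset options are omitted.
    """
    prefix = "indices.analysis.hunspell.dictionary"
    settings = {}
    if ignore_case is not None:
        settings[f"{prefix}.ignore_case"] = ignore_case
    for locale, value in (locale_ignore_case or {}).items():
        settings[f"{prefix}.{locale}.ignore_case"] = value
    if lazy is not None:
        settings[f"{prefix}.lazy"] = lazy
    return settings


def to_yaml_lines(settings):
    """Render Boolean settings as flat 'key: value' lines."""
    return [f"{key}: {str(value).lower()}" for key, value in settings.items()]


lines = to_yaml_lines(
    hunspell_node_settings(locale_ignore_case={"en_US": True}, lazy=True)
)
```

Because these settings are static, they take effect only after the node is restarted with the updated file.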