Common grams token filter
Generates bigrams for a specified set of common words.
For example, you can specify is and the as common words. This filter then
converts the tokens [the, quick, fox, is, brown] to [the, the_quick, quick,
fox, fox_is, is, is_brown, brown].
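The conversion described above can be modelled in a few lines of plain Ruby. This is only a sketch of the behaviour, not the Lucene implementation: `common_grams` is a hypothetical helper, and the `_` separator is taken from the example output.

```ruby
# Minimal model of the filter's default behaviour.
# `common_grams` is a hypothetical helper, not part of any client library.
def common_grams(tokens, common_words, separator: '_')
  output = []
  tokens.each_with_index do |token, i|
    output << token # every unigram is kept
    nxt = tokens[i + 1]
    next if nxt.nil?

    # Add a bigram when either word of the pair is a common word.
    if common_words.include?(token) || common_words.include?(nxt)
      output << "#{token}#{separator}#{nxt}"
    end
  end
  output
end

p common_grams(%w[the quick fox is brown], %w[is the])
# => ["the", "the_quick", "quick", "fox", "fox_is", "is", "is_brown", "brown"]
```

Note that `quick` and `fox` do not form a bigram because neither is a common word.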
You can use the common_grams filter in place of the
stop token filter when you don’t want to
completely ignore common words.
This filter uses Lucene’s CommonGramsFilter.
Example
The following analyze API request creates bigrams for is
and the:
response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'common_grams',
        common_words: [
          'is',
          'the'
        ]
      }
    ],
    text: 'the quick fox is brown'
  }
)
puts response
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : [
    {
      "type": "common_grams",
      "common_words": ["is", "the"]
    }
  ],
  "text" : "the quick fox is brown"
}
The filter produces the following tokens:
[ the, the_quick, quick, fox, fox_is, is, is_brown, brown ]
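To get from the analyze API response to the flat token list shown above, you can pull the `token` field out of each entry. The sketch below runs against a hand-written sample hash rather than a live cluster; the sample, and the helpers `token_strings` and `bigrams`, are assumptions made for illustration.

```ruby
# Hand-written sample shaped like an analyze API response body. Only two
# fields per token are shown, and the values are illustrative assumptions,
# not output captured from a live cluster.
sample_response = {
  'tokens' => [
    { 'token' => 'the',       'type' => 'word' },
    { 'token' => 'the_quick', 'type' => 'gram' },
    { 'token' => 'quick',     'type' => 'word' },
    { 'token' => 'fox',       'type' => 'word' },
    { 'token' => 'fox_is',    'type' => 'gram' },
    { 'token' => 'is',        'type' => 'word' },
    { 'token' => 'is_brown',  'type' => 'gram' },
    { 'token' => 'brown',     'type' => 'word' }
  ]
}

# Hypothetical helper: flatten the response into a list of token strings.
def token_strings(response)
  response.fetch('tokens').map { |entry| entry.fetch('token') }
end

# Hypothetical helper: keep only the bigrams the filter added.
def bigrams(response)
  response.fetch('tokens')
          .select { |entry| entry['type'] == 'gram' }
          .map { |entry| entry['token'] }
end

puts "[ #{token_strings(sample_response).join(', ')} ]"
p bigrams(sample_response)
```

Both helpers expect a plain Hash, so a parsed response body can be passed in the same way as the sample.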
Add to an analyzer
The following create index API request uses the
common_grams filter to configure a new
custom analyzer:
response = client.indices.create(
  index: 'common_grams_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          index_grams: {
            tokenizer: 'whitespace',
            filter: [
              'common_grams'
            ]
          }
        },
        filter: {
          common_grams: {
            type: 'common_grams',
            common_words: [
              'a',
              'is',
              'the'
            ]
          }
        }
      }
    }
  }
)
puts response
PUT /common_grams_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams" ]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": [ "a", "is", "the" ]
        }
      }
    }
  }
}
Configurable parameters
common_words
    (Required*, array of strings) A list of tokens. The filter generates bigrams for these tokens.
    Either this or the common_words_path parameter is required.
common_words_path
    (Required*, string) Path to a file containing a list of tokens. The filter generates bigrams for these tokens.
    This path must be absolute or relative to the config location. The file must be UTF-8 encoded. Each token in the file must be separated by a line break.
    Either this or the common_words parameter is required.
ignore_case
    (Optional, Boolean) If true, matching for common words is case-insensitive. Defaults to false.
query_mode
    (Optional, Boolean) If true, the filter excludes the following tokens from the output:
    - Unigrams for common words
    - Unigrams for terms followed by common words
    Defaults to false. We recommend enabling this parameter for search analyzers.
    For example, you can enable this parameter and specify is and the as common words. This filter converts the tokens [the, quick, fox, is, brown] to [the_quick, quick, fox_is, is_brown].
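The effect of ignore_case and query_mode can be seen side by side in a small Ruby model. This is a sketch under stated assumptions, not the filter's actual code: `common_grams` is a hypothetical helper, and it also drops a final unigram that closes a bigram, an assumption made so that the result matches the query_mode example above (which omits `brown`).

```ruby
# Hypothetical model of the filter's parameters; not a client library API.
def common_grams(tokens, common_words, ignore_case: false, query_mode: false)
  fold = ->(word) { ignore_case ? word.downcase : word }
  common = common_words.map(&fold)
  common_word = ->(token) { common.include?(fold.call(token)) }

  output = []
  tokens.each_with_index do |token, i|
    prev = i.zero? ? nil : tokens[i - 1]
    nxt = tokens[i + 1]
    starts_bigram = !nxt.nil? && (common_word.call(token) || common_word.call(nxt))
    ends_bigram = !prev.nil? && (common_word.call(prev) || common_word.call(token))

    # Assumption: in query mode a unigram is dropped when it starts a bigram,
    # or when it is the final token and closes one.
    drop_unigram = query_mode && (starts_bigram || (nxt.nil? && ends_bigram))

    output << token unless drop_unigram
    output << "#{token}_#{nxt}" if starts_bigram
  end
  output
end

tokens = %w[The quick fox is brown]
p common_grams(tokens, %w[is the])
# => ["The", "quick", "fox", "fox_is", "is", "is_brown", "brown"]
p common_grams(tokens, %w[is the], ignore_case: true)
# => ["The", "The_quick", "quick", "fox", "fox_is", "is", "is_brown", "brown"]
p common_grams(tokens, %w[is the], ignore_case: true, query_mode: true)
# => ["The_quick", "quick", "fox_is", "is_brown"]
```

With the default case-sensitive matching, the capitalized `The` is not treated as a common word, so no `The_quick` bigram appears in the first result.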
Customize
To customize the common_grams filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.
For example, the following request creates a custom common_grams filter with
ignore_case and query_mode set to true:
response = client.indices.create(
  index: 'common_grams_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          index_grams: {
            tokenizer: 'whitespace',
            filter: [
              'common_grams_query'
            ]
          }
        },
        filter: {
          common_grams_query: {
            type: 'common_grams',
            common_words: [
              'a',
              'is',
              'the'
            ],
            ignore_case: true,
            query_mode: true
          }
        }
      }
    }
  }
)
puts response
PUT /common_grams_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": [ "common_grams_query" ]
        }
      },
      "filter": {
        "common_grams_query": {
          "type": "common_grams",
          "common_words": [ "a", "is", "the" ],
          "ignore_case": true,
          "query_mode": true
        }
      }
    }
  }
}
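Because query_mode is recommended for search analyzers, one possible layout pairs an index-time filter with a query-time filter in the same index. The sketch below only builds and prints the request body. The names `common_grams_index`, `search_grams`, and the `body` field, along with the helpers `grams_filter` and `grams_analyzer`, are hypothetical and chosen for illustration.

```ruby
require 'json'

COMMON_WORDS = %w[a is the].freeze

# Hypothetical helper: one common_grams filter definition.
def grams_filter(query_mode:)
  { type: 'common_grams', common_words: COMMON_WORDS, query_mode: query_mode }
end

# Hypothetical helper: a custom analyzer that applies a single filter.
def grams_analyzer(filter_name)
  { tokenizer: 'whitespace', filter: [filter_name] }
end

index_body = {
  settings: {
    analysis: {
      analyzer: {
        index_grams: grams_analyzer('common_grams_index'),
        search_grams: grams_analyzer('common_grams_query')
      },
      filter: {
        common_grams_index: grams_filter(query_mode: false),
        common_grams_query: grams_filter(query_mode: true)
      }
    }
  },
  mappings: {
    properties: {
      # Assumed text field; index with unigrams and bigrams, search without
      # the unigrams that query_mode removes.
      body: { type: 'text', analyzer: 'index_grams', search_analyzer: 'search_grams' }
    }
  }
}

puts JSON.pretty_generate(index_body)
```

The resulting hash could be passed as the `body:` argument of a `client.indices.create` call like the one in the request above.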