WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Algorithmic Stemmers
Most of the stemmers available in Elasticsearch are algorithmic in that they apply a series of rules to a word in order to reduce it to its root form, such as stripping the final s or es from plurals. They don’t have to know anything about individual words in order to stem them.
These algorithmic stemmers have the advantage that they are available out of the box, are fast, use little memory, and work well for regular words. The downside is that they don’t cope well with irregular words like be, are, and am, or mice and mouse.
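To make the trade-off concrete, here is a minimal sketch of a rule-based plural stripper. It is a toy written for illustration, not the Porter algorithm or any stemmer shipped with Elasticsearch; the function name `toy_stem` and its rule list are assumptions.

```python
def toy_stem(word: str) -> str:
    """Reduce a regular English plural to its singular using suffix rules only.

    Hypothetical helper for illustration; it knows nothing about
    individual words, which is exactly why irregular forms defeat it.
    """
    w = word.lower()
    # Rule 1: drop "es" after a sibilant ending (foxes -> fox, dishes -> dish).
    for ending in ("ches", "shes", "sses", "xes", "zes"):
        if w.endswith(ending):
            return w[:-2]
    # Rule 2: drop a plain final "s", but leave "ss" and very short words alone.
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w


if __name__ == "__main__":
    for word in ("cats", "foxes", "dishes", "glass", "mice", "mouse"):
        print(word, "->", toy_stem(word))
```

Regular plurals such as cats and foxes collapse onto their singular forms, while mice and mouse pass through unchanged and therefore never match each other.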
One of the earliest stemming algorithms is the Porter stemmer for English, which is still the recommended English stemmer today. Martin Porter subsequently went on to create the Snowball language for creating stemming algorithms, and a number of the stemmers available in Elasticsearch are written in Snowball.
The kstem token filter is a stemmer for English which combines the algorithmic approach with a built-in dictionary. The dictionary contains a list of root words and exceptions in order to avoid conflating words incorrectly. kstem tends to stem less aggressively than the Porter stemmer.
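The idea of guarding rules with a dictionary can be sketched as follows. This is not the kstem implementation: the tiny word list `KNOWN_ROOTS` and both function names are assumptions, chosen only to show how a dictionary lookup can stop a rule from conflating unrelated words.

```python
# Hypothetical miniature dictionary of words that are already root forms.
# A real stemmer dictionary would be far larger.
KNOWN_ROOTS = frozenset({"news", "bus", "lens"})


def strip_final_s(word: str) -> str:
    """Purely algorithmic rule: drop a final "s" unless the word ends in "ss"."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def guarded_stem(word: str, roots: frozenset = KNOWN_ROOTS) -> str:
    """Consult the dictionary first and apply the rule only to unknown words."""
    w = word.lower()
    if w in roots:
        return w  # already a root word, so leave it untouched
    return strip_final_s(w)


if __name__ == "__main__":
    for word in ("news", "cats", "lens"):
        print(word, "| rule only:", strip_final_s(word), "| guarded:", guarded_stem(word))
```

With the rule alone, news becomes new and is conflated with an unrelated word; with the dictionary check in front, it is left as it is, which is one way a stemmer ends up less aggressive.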
Using an Algorithmic Stemmer
While you can use the porter_stem or kstem token filter directly, or create a language-specific Snowball stemmer with the snowball token filter, all of the algorithmic stemmers are exposed via a single unified interface: the stemmer token filter, which accepts the language parameter.
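As a rough sketch of what that unified interface looks like in index settings, the helper below builds a settings body in which only the language value changes from one stemmer to another. The names `stemmer_filter`, `index_settings`, `my_stemmer`, and `my_analyzer` are hypothetical; the filter type and the language values used here are the ones that appear elsewhere on this page.

```python
import json


def stemmer_filter(language: str) -> dict:
    """Return a stemmer token filter definition for the given language value."""
    return {"type": "stemmer", "language": language}


def index_settings(filter_name: str, language: str,
                   analyzer_name: str = "my_analyzer") -> dict:
    """Build an index settings body with one analyzer that uses the stemmer.

    The analyzer and filter names are placeholders chosen by the caller.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {filter_name: stemmer_filter(language)},
                "analyzer": {
                    analyzer_name: {
                        "tokenizer": "standard",
                        "filter": ["lowercase", filter_name],
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Switching stemmers means changing a single string.
    print(json.dumps(index_settings("my_stemmer", "light_english"), indent=2))
```

The printed JSON could serve as the body of an index-creation request; sending it to a cluster is left out so the sketch stays self-contained.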
For instance, perhaps you find the default stemmer used by the english analyzer to be too aggressive and you want to make it less aggressive. The first step is to look up the configuration for the english analyzer in the language analyzers documentation, which shows the following:
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": []
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
Having reviewed the current configuration, we can use it as the basis for a new analyzer, with the following changes:
- Change the english_stemmer from english (which maps to the porter_stem token filter) to light_english (which maps to the less aggressive kstem token filter).
- Add the asciifolding token filter to remove any diacritics from foreign words.
- Remove the keyword_marker token filter, as we don’t need it. (We discuss this in more detail in Controlling Stemming.)
Our new custom analyzer would look like this:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "light_english_stemmer": {
          "type": "stemmer",
          "language": "light_english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "light_english_stemmer",
            "asciifolding"
          ]
        }
      }
    }
  }
}
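The three changes can also be read as a small transformation of the stock configuration. The sketch below applies them to a copy of the english analyzer's analysis section and arrives at the custom settings; the function name `customize` and the constant `ENGLISH_ANALYSIS` are assumptions for illustration, and nothing here talks to a cluster.

```python
import copy

# The "analysis" section of the stock english analyzer, as listed on this page.
ENGLISH_ANALYSIS = {
    "filter": {
        "english_stop": {"type": "stop", "stopwords": "_english_"},
        "english_keywords": {"type": "keyword_marker", "keywords": []},
        "english_stemmer": {"type": "stemmer", "language": "english"},
        "english_possessive_stemmer": {
            "type": "stemmer",
            "language": "possessive_english",
        },
    },
    "analyzer": {
        "english": {
            "tokenizer": "standard",
            "filter": [
                "english_possessive_stemmer",
                "lowercase",
                "english_stop",
                "english_keywords",
                "english_stemmer",
            ],
        }
    },
}


def customize(analysis: dict) -> dict:
    """Return a modified copy of the analysis section; the input is untouched."""
    new = copy.deepcopy(analysis)
    filters = new["filter"]
    chain = new["analyzer"]["english"]["filter"]

    # 1. Swap the english stemmer for the less aggressive light_english one,
    #    keeping its position in the filter chain.
    del filters["english_stemmer"]
    filters["light_english_stemmer"] = {
        "type": "stemmer",
        "language": "light_english",
    }
    chain[chain.index("english_stemmer")] = "light_english_stemmer"

    # 2. Remove the keyword_marker filter from both the definitions and the chain.
    del filters["english_keywords"]
    chain.remove("english_keywords")

    # 3. Append asciifolding (a built-in filter, so it needs no definition).
    chain.append("asciifolding")
    return new


if __name__ == "__main__":
    print(customize(ENGLISH_ANALYSIS)["analyzer"]["english"]["filter"])
```

Working on a deep copy keeps the stock definition available for comparison, which is handy when checking that only the intended filters changed.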