WARNING: The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Fielddata Filtering
Imagine that you are running a website that allows users to listen to their favorite songs. To make it easier for them to manage their music library, users can tag songs with whatever tags make sense to them. You will end up with a lot of tracks tagged with rock, hiphop, and electronica, but also with some tracks tagged with my_16th_birthday_favorite_anthem.
Now imagine that you want to show users the three most popular tags for each song. It is highly likely that tags like rock will show up in the top three, but my_16th_birthday_favorite_anthem is very unlikely to make the grade. However, in order to calculate the most popular tags, you are forced to load all of these one-off terms into memory.
Thanks to fielddata filtering, we can take control of this situation. We know that we’re interested in only the most popular terms, so we can simply avoid loading any terms that fall into the less interesting long tail:
PUT /music/_mapping/song
{
  "properties": {
    "tag": {
      "type": "string",
      "fielddata": {
        "filter": {
          "frequency": {
            "min":              0.01,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}
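fielddata: configures how fielddata is handled for this field.
frequency: filters fielddata loading based on term frequencies.
min: load only terms that occur in at least 1% of documents in this segment.
min_segment_size: ignore any segments that have fewer than 500 documents.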
With this mapping in place, only terms that appear in at least 1% of the documents in that segment will be loaded into memory. You can also specify a max term frequency, which could be used to exclude terms that are too common, such as stopwords.
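As a rough illustration of how min and max fit together, the sketch below builds the same mapping body in Python and adds an upper bound. The helper name frequency_filter_mapping and the max value of 0.5 are assumptions made for this example only. The resulting dictionary would be sent as the body of the PUT /music/_mapping/song request shown above.

```python
import json


def frequency_filter_mapping(field, min_freq=None, max_freq=None,
                             min_segment_size=None):
    """Build a 2.x-style mapping body with a fielddata frequency filter.

    Hypothetical helper: only the keys that are passed in are emitted.
    """
    frequency = {}
    if min_freq is not None:
        frequency["min"] = min_freq
    if max_freq is not None:
        frequency["max"] = max_freq
    if min_segment_size is not None:
        frequency["min_segment_size"] = min_segment_size
    return {
        "properties": {
            field: {
                "type": "string",
                "fielddata": {"filter": {"frequency": frequency}},
            }
        }
    }


# 0.5 is an illustrative upper bound, not a recommended value.
body = frequency_filter_mapping("tag", min_freq=0.01, max_freq=0.5,
                                min_segment_size=500)
print(json.dumps(body, indent=2))
```

A sensible max depends on how common your stopword-like terms really are, so treat the number here as a placeholder.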
Term frequencies, in this case, are calculated per segment. This is a limitation of the implementation: fielddata is loaded per segment, and at that point the only term frequencies that are visible are the frequencies for that segment. However, this limitation has interesting properties: it allows newly popular terms to rise to the top quickly.
Let’s say that a new genre of song becomes popular one day. You would like to include the tag for this new genre in the most popular list, but if you were relying on term frequencies calculated across the whole index, you would have to wait for the new tag to become as popular as rock and electronica.
Because of the way frequency filtering is implemented, the newly added tag will quickly show up as a high-frequency tag within new segments, so it will quickly float to the top.
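The effect can be seen in a small simulation. This is only a sketch of the idea, not how Elasticsearch implements it: segments are modeled as plain Python lists of tag lists, the tag vaporwave and the document counts are invented, and a term's frequency is assumed to be the fraction of documents in the segment that contain it.

```python
from collections import Counter


def term_frequencies(docs):
    """Fraction of documents that contain each tag."""
    counts = Counter(tag for tags in docs for tag in set(tags))
    return {tag: n / len(docs) for tag, n in counts.items()}


def terms_passing(docs, min_freq):
    """Tags whose document frequency is at least min_freq."""
    return {t for t, f in term_frequencies(docs).items() if f >= min_freq}


# An old, large segment and a new, smaller one (made-up data).
old_segment = [["rock"]] * 9000 + [["electronica"]] * 1000
new_segment = [["vaporwave"]] * 50 + [["rock"]] * 950

min_freq = 0.01

# Filtering per segment: vaporwave is 5% of the new segment, so it passes.
per_segment = (terms_passing(old_segment, min_freq)
               | terms_passing(new_segment, min_freq))

# Filtering across the whole index: vaporwave is under 0.5%, so it is dropped.
index_wide = terms_passing(old_segment + new_segment, min_freq)

print(sorted(per_segment))  # ['electronica', 'rock', 'vaporwave']
print(sorted(index_wide))   # ['electronica', 'rock']
```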
The min_segment_size parameter tells Elasticsearch to ignore segments below a certain size. If a segment holds only a few documents, the term frequencies are too coarse to have any meaning. Small segments will soon be merged into bigger segments, which will then be big enough to take into account.
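To see why frequencies in small segments are too coarse, consider a term that appears in just one document. The arithmetic below assumes frequency is measured as matching documents divided by documents in the segment. The function name one_off_term_passes is made up for this sketch.

```python
def one_off_term_passes(segment_docs, min_freq):
    """Would a term found in a single document meet the min threshold?"""
    return 1 / segment_docs >= min_freq


min_freq = 0.01
for size in (10, 50, 500, 5000):
    verdict = "passes" if one_off_term_passes(size, min_freq) else "filtered out"
    print(f"{size:>5} docs: one document = {1 / size:.4f} -> {verdict}")
```

In a 10-document segment a single occurrence already counts as 10%, so a 1% threshold cannot separate one-off tags from popular ones. At 500 documents, the same single occurrence is 0.2% and falls below the threshold.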
Filtering terms by frequency is not the only option. You can also decide to load only those terms that match a regular expression. For instance, you could use a regex filter on tweets to load only hashtags into memory, that is, terms that start with a #. This assumes that you are using an analyzer that preserves punctuation, like the whitespace analyzer.
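The hashtag case can be approximated in a few lines of Python. This is a sketch under several assumptions: str.split stands in for the whitespace analyzer, Python's re module stands in for the regex engine Elasticsearch uses, and the shape of the regex_filter fragment (a pattern key under regex) is recalled from the 2.x reference rather than taken from this page, so check it before use. The sample tweets are invented.

```python
import re

# Assumed shape of the fielddata filter; verify against the 2.x reference.
regex_filter = {"filter": {"regex": {"pattern": "^#.*"}}}

HASHTAG = re.compile(regex_filter["filter"]["regex"]["pattern"])


def whitespace_tokens(text):
    """Rough stand-in for the whitespace analyzer: split on whitespace only."""
    return text.split()


def terms_loaded(texts, pattern=HASHTAG):
    """Distinct terms that the regex filter would keep."""
    terms = {tok for text in texts for tok in whitespace_tokens(text)}
    return {t for t in terms if pattern.fullmatch(t)}


tweets = [
    "Loving the new album #nowplaying",
    "#rock never dies",
    "mail me at a#b",
]
print(sorted(terms_loaded(tweets)))  # ['#nowplaying', '#rock']
```

With an analyzer that strips punctuation, the leading # would be gone before the filter ever saw the term, which is why the whitespace analyzer matters here.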
Fielddata filtering can have a massive impact on memory usage. The trade-off is fairly obvious: you are essentially ignoring data. But for many applications, the trade-off is reasonable since the data is not being used anyway. The memory savings are often more important than including a large and relatively useless long tail of terms.