- Elasticsearch - The Definitive Guide:
- Foreword
- Preface
- Getting Started
- You Know, for Search…
- Installing and Running Elasticsearch
- Talking to Elasticsearch
- Document Oriented
- Finding Your Feet
- Indexing Employee Documents
- Retrieving a Document
- Search Lite
- Search with Query DSL
- More-Complicated Searches
- Full-Text Search
- Phrase Search
- Highlighting Our Searches
- Analytics
- Tutorial Conclusion
- Distributed Nature
- Next Steps
- Life Inside a Cluster
- Data In, Data Out
- What Is a Document?
- Document Metadata
- Indexing a Document
- Retrieving a Document
- Checking Whether a Document Exists
- Updating a Whole Document
- Creating a New Document
- Deleting a Document
- Dealing with Conflicts
- Optimistic Concurrency Control
- Partial Updates to Documents
- Retrieving Multiple Documents
- Cheaper in Bulk
- Distributed Document Store
- Searching—The Basic Tools
- Mapping and Analysis
- Full-Body Search
- Sorting and Relevance
- Distributed Search Execution
- Index Management
- Inside a Shard
- You Know, for Search…
- Search in Depth
- Structured Search
- Full-Text Search
- Multifield Search
- Proximity Matching
- Partial Matching
- Controlling Relevance
- Theory Behind Relevance Scoring
- Lucene’s Practical Scoring Function
- Query-Time Boosting
- Manipulating Relevance with Query Structure
- Not Quite Not
- Ignoring TF/IDF
- function_score Query
- Boosting by Popularity
- Boosting Filtered Subsets
- Random Scoring
- The Closer, The Better
- Understanding the price Clause
- Scoring with Scripts
- Pluggable Similarity Algorithms
- Changing Similarities
- Relevance Tuning Is the Last 10%
- Dealing with Human Language
- Aggregations
- Geolocation
- Modeling Your Data
- Administration, Monitoring, and Deployment
WARNING: This documentation covers Elasticsearch 2.x. The 2.x versions of Elasticsearch have passed their EOL dates. If you are running a 2.x version, we strongly advise you to upgrade.
This documentation is no longer maintained and may be removed. For the latest information, see the current Elasticsearch documentation.
Relevance Is Broken!
editRelevance Is Broken!
editBefore we move on to discussing more-complex queries in Multifield Search, let’s make a quick detour to explain why we created our test index with just one primary shard.
Every now and again a new user opens an issue claiming that sorting by relevance is broken and offering a short reproduction: the user indexes a few documents, runs a simple query, and finds apparently less-relevant results appearing above more-relevant results.
To understand why this happens, let’s imagine that we create an index with two
primary shards and we index ten documents, six of which contain the word foo
.
It may happen that shard 1 contains three of the foo
documents and shard
2 contains the other three. In other words, our documents are well distributed.
In What Is Relevance?, we described the default similarity algorithm used in Elasticsearch, called term frequency / inverse document frequency or TF/IDF. Term frequency counts the number of times a term appears within the field we are querying in the current document. The more times it appears, the more relevant is this document. The inverse document frequency takes into account how often a term appears as a percentage of all the documents in the index. The more frequently the term appears, the less weight it has.
However, for performance reasons, Elasticsearch doesn’t calculate the IDF across all documents in the index. Instead, each shard calculates a local IDF for the documents contained in that shard.
Because our documents are well distributed, the IDF for both shards will be
the same. Now imagine instead that five of the foo
documents are on shard 1,
and the sixth document is on shard 2. In this scenario, the term foo
is
very common on one shard (and so of little importance), but rare on the other
shard (and so much more important). These differences in IDF can produce
incorrect results.
In practice, this is not a problem. The differences between local and global IDF diminish the more documents that you add to the index. With real-world volumes of data, the local IDFs soon even out. The problem is not that relevance is broken but that there is too little data.
For testing purposes, there are two ways we can work around this issue. The
first is to create an index with one primary shard, as we did in the section
introducing the match
query. If you have only one shard, then
the local IDF is the global IDF.
The second workaround is to add ?search_type=dfs_query_then_fetch
to your
search requests. The dfs
stands for Distributed Frequency Search, and it
tells Elasticsearch to first retrieve the local IDF from each shard in order
to calculate the global IDF across the whole index.
Don’t use dfs_query_then_fetch
in production. It really isn’t
required. Just having enough data will ensure that your term frequencies are
well distributed. There is no reason to add this extra DFS step to every query
that you run.