Practical BM25 - Part 3: Considerations for Picking b and k1 in Elasticsearch
This is the final post in the three-part Practical BM25 series about similarity ranking (relevancy). If you're just joining, check out Part 1: How Shards Affect Relevance Scoring in Elasticsearch and Part 2: The BM25 Algorithm and its Variables.
Picking b and k1
It’s worth noting that picking b and k1 is generally not the first thing to do when your users aren’t finding documents quickly. The default values of b = 0.75 and k1 = 1.2 work pretty well for most corpora, so you’re likely fine with the defaults. More likely, you want to start with something like:
- Boosting or adding constant scores for things like exact phrase matches in a bool query (see the sketch just after this list)
- Making use of synonyms to match other terms the user may be interested in
- Adding fuzziness, typeahead, phonetic matching, stemming, and other text/analysis components to help with misspellings, language differences, etc.
- Adding or using a function score to decay the scores of older documents or documents which are further geographically from the end user
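For example, here’s a minimal sketch of the first idea, boosting exact phrase matches in a bool query, run against the people3 index used earlier in this series; the query text and the boost of 10 are arbitrary and only for illustration:
GET /people3/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "title": "shane connelly" } }
      ],
      "should": [
        {
          "match_phrase": {
            "title": {
              "query": "shane connelly",
              "boost": 10
            }
          }
        }
      ]
    }
  }
}
Documents still only need to match the must clause, but anything containing the exact phrase gets a substantially higher score.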
Part of what makes Elasticsearch so powerful is how expressive you can be with these primitives to create a very robust search experience. But let’s say you’ve done everything else and want to look at the last mile of b and k1... how do you choose them?
It’s been fairly well studied that there are no “best” b and k1 values for all data/queries. Users that do change the b and k1 parameters typically do so incrementally by evaluating each increment. The Rank Eval API in Elasticsearch can help with the evaluation stage.
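As a rough sketch of what that evaluation loop can look like, here’s a minimal Rank Eval request; the query, document IDs, and relevance ratings below are made up for illustration. You’d re-run something like this after each incremental change to b or k1 and compare the resulting metric:
GET /people3/_rank_eval
{
  "requests": [
    {
      "id": "shane_query",
      "request": {
        "query": { "match": { "title": "shane connelly" } }
      },
      "ratings": [
        { "_index": "people3", "_id": "4", "rating": 3 },
        { "_index": "people3", "_id": "2", "rating": 1 },
        { "_index": "people3", "_id": "6", "rating": 0 }
      ]
    }
  ],
  "metric": {
    "precision": {
      "k": 5,
      "relevant_rating_threshold": 1
    }
  }
}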
When experimenting with b and k1, you should first consider their bounds. I would also suggest looking into past experiments to give you a rough feel for the type of experimentation you may be interested in doing — especially if you’re just getting into this for the first time:
- b needs to be between 0 and 1. Many experiments test values in increments of around 0.1 and most experiments seem to show the optimal b to be in a range of 0.3-0.9 (Lipani, Lupu, Hanbury, Aizawa (2015); Taylor, Zaragoza, Craswell, Robertson, Burges (2006); Trotman, Puurula, Burgess (2014); etc.)
- k1 is typically evaluated in the 0 to 3 range, though there’s nothing to stop it from being higher. Many experiments have focused on increments of 0.1 to 0.2 and most experiments seem to show the optimal k1 to be in a range of 0.5-2.0
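When you do decide to try a non-default value, b and k1 are set per field through a custom similarity in the index settings. Here’s a rough sketch, assuming a hypothetical people4 index with a title field; the parameter values are arbitrary examples, and on older (pre-7.x) versions the properties would be nested under a mapping type such as _doc:
PUT /people4
{
  "settings": {
    "index": {
      "similarity": {
        "tuned_bm25": {
          "type": "BM25",
          "k1": 1.4,
          "b": 0.85
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "similarity": "tuned_bm25"
      }
    }
  }
}
You’d then index the same documents into each candidate configuration and evaluate each increment before committing to one.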
For k1, you should be asking, “when do we think a term is likely to be saturated?” For very long documents like books, especially fictional or multi-topic books, many different terms are likely to appear several times in a single work, even when a term isn’t highly relevant to the work as a whole. For example, “eye” or “eyes” can appear hundreds of times in a fictional book even when eyes are not one of the primary subjects of the book. A book that mentions “eyes” a thousand times, though, likely has a lot more to do with eyes. You may not want terms to saturate as quickly in this situation, so there’s some suggestion that k1 should generally trend toward larger numbers when the text is longer and more diverse. For the inverse situation, it’s been suggested to set k1 on the lower side: it’s very unlikely that a collection of short news articles would mention “eyes” dozens to hundreds of times without being highly related to eyes as a subject.
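To make that saturation intuition concrete, here’s the term-frequency part of BM25 (the same tfNorm you’ll see in the Explain output below) evaluated at two k1 values, assuming for simplicity that the document is exactly average length so the length term drops out:
\text{tfNorm} = \frac{f\,(k_1 + 1)}{f + k_1}
\qquad
\begin{aligned}
k_1 = 1.2 &: \quad f = 1 \to 1.00,\quad f = 10 \to 1.96,\quad f = 100 \to 2.17 \quad (\text{ceiling } 2.2)\\
k_1 = 5 &: \quad f = 1 \to 1.00,\quad f = 10 \to 4.00,\quad f = 100 \to 5.71 \quad (\text{ceiling } 6.0)
\end{aligned}
With k1 = 1.2 the 100th occurrence of “eyes” adds almost nothing beyond the 10th; with k1 = 5 there is still meaningful headroom, which is why longer, more diverse texts tend to favor a larger k1.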
For b, you should be asking, “when do we think a document is likely to be very long, and when should that hinder its relevance to a term?” Documents which are highly specific, like engineering specifications or patents, are lengthy in order to be more specific about a subject. Their length is unlikely to be detrimental to their relevance, so a lower b may be more appropriate. On the opposite side, documents which touch on several different topics in a broad way, such as news articles (a political article may touch on economics, international affairs, and certain corporations), user reviews, etc., often benefit from a larger b so that topics irrelevant to a user’s search, including spam and the like, are penalized.
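The same kind of back-of-the-envelope check works for b. Taking a single occurrence of a term (f = 1) with the default k1 = 1.2 in a document twice the average length, the full tfNorm expression gives:
\text{tfNorm} = \frac{f\,(k_1 + 1)}{f + k_1\left(1 - b + b\,\frac{\text{fieldLength}}{\text{avgFieldLength}}\right)}
\qquad
b = 0 \to 1.00,\quad
b = 0.75 \to 0.71,\quad
b = 1 \to 0.65
The larger b is, the more that extra length costs the document; these numbers are just a worked illustration of the formula, not tuning advice.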
These are general starting points, but ultimately you should test any parameters you set. This also goes to showcase how relevance is really tightly bound to having similar documents (similar language, similar general structure, etc.) in the same index.
Explain API
Now that you understand how the BM25 algorithm works and what its parameters do, I want to briefly cover one of the handiest tools in the Elasticsearch toolbox for answering the “why” questions that inevitably come up. If you’ve ever had to answer the question “why is document x ranked higher than document y?”, the Explain API can help you significantly. Let’s look at document 4 in people3, this time with a two-term query:
GET /people3/_doc/4/_explain
{
  "query": {
    "match": {
      "title": "shane connelly"
    }
  }
}
This returns a wealth of information about how document 4 was scored against this query:
{ "_index": "people3", "_type": "_doc", "_id": "4", "matched": true, "explanation": { "value": 0.71437943, "description": "sum of:", "details": [ { "value": 0.102611035, "description": "weight(title:shane in 3) [PerFieldSimilarity], result of:", "details": [ { "value": 0.102611035, "description": "score(doc=3,freq=1.0 = termFreq=1.0\n), product of:", "details": [ { "value": 0.074107975, "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:", "details": [ { "value": 6, "description": "docFreq", "details": [] }, { "value": 6, "description": "docCount", "details": [] } ] }, { "value": 1.3846153, "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:", "details": [ { "value": 1, "description": "termFreq=1.0", "details": [] }, { "value": 5, "description": "parameter k1", "details": [] }, { "value": 1, "description": "parameter b", "details": [] }, { "value": 3, "description": "avgFieldLength", "details": [] }, { "value": 2, "description": "fieldLength", "details": [] } ] } ] } ] }, { "value": 0.61176836, "description": "weight(title:connelly in 3) [PerFieldSimilarity], result of:", "details": [ { "value": 0.61176836, "description": "score(doc=3,freq=1.0 = termFreq=1.0\n), product of:", "details": [ { "value": 0.44183275, "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:", "details": [ { "value": 4, "description": "docFreq", "details": [] }, { "value": 6, "description": "docCount", "details": [] } ] }, { "value": 1.3846153, "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:", "details": [ { "value": 1, "description": "termFreq=1.0", "details": [] }, { "value": 5, "description": "parameter k1", "details": [] }, { "value": 1, "description": "parameter b", "details": [] }, { "value": 3, "description": "avgFieldLength", "details": [] }, { "value": 2, "description": "fieldLength", "details": [] } ] } ] } ] } ] } }
We can see the individual values of k1 and b, but also the fieldLength and avgFieldLength and other constituent parts of the score for each term! So with our final score of 0.71437943, we can see that 0.102611035 came from “shane” and 0.61176836 came from “connelly.” Connelly is a much rarer term in our corpus, so it carries a much higher IDF, which seems to have been the primary influencer of the score. But we can also see that the length of the document (2 terms vs an average of 3) has raised the “tfNorm” part of the score as well. If we felt that this was unfair, we could potentially decrease the value of b to compensate. Of course any changes to b or k1 have an effect on more than just the one given query here, so if you do end up changing these, make sure to re-test across many queries and many documents.
Please note that the Explain API is a debug tool and should be treated as such. Just like you wouldn’t run your production application in debug mode under normal circumstances, you should keep _explain calls turned off in your production deployment of Elasticsearch.
Last Words
BM25 isn’t the only scoring algorithm in town! There’s the classic TF/IDF, divergence from randomness, and many, many more, not to mention hyperlink-based modifiers like PageRank, and you can generally combine many of these together! Additionally, a variety of permutations on the core BM25 algorithm have popped up over the years. For example, there has been some academic effort to automatically pick, suggest, or account for optimal values of k1 and b in some of these BM25 permutations. In fact, there is some evidence to suggest that at least k1 would be best optimized on a term-by-term basis (Lv, Zhai (2011)). With this, it’d be natural to ask “why BM25?” or “why BM25 with these particular k1 = 1.2 and b = 0.75 values?”
The short answer is that there doesn’t appear to be any silver bullet in choosing an algorithm or in picking k1 or b values, but BM25 with k1 = 1.2 and b = 0.75 seems to give very good results for most cases. In “Improvements to BM25 and Language Models Examined” (Trotman, Puurula, Burgess (2014)), Trotman et al. searched over b = 0-1 and k1 = 0-3 and applied a number of different relevance algorithms, including ones that attempt to automatically tune the BM25 parameters. I think they state it best in their conclusion:
“This investigation examined 9 ranking functions, 2 relevance feedback methods, 5 stemming algorithms, and 2 stop word lists. It shows that stop words are ineffective, that stemming is effective, that relevance feedback is effective, and that the combination of not stopping, stemming, and feedback typically leads to improvements on a plain ranking function. However, there is no clear evidence that any one of the ranking functions is systematically better than the others.”
So, as we started this blog, so shall we end: most of your tuning efforts are likely best spent making use of the expressive Elasticsearch query language, index/linguistic controls, and incorporating user feedback, all of which can be done via the Elasticsearch APIs. For those that do all of this and then want to dive really deeply, consider varying the k1 and b parameters. For those that want to go even further, Elasticsearch supports pluggable similarity algorithms and ships with a number of the more common ones. But no matter what you do, make sure you test your changes!
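As one last, rough sketch of that pluggable-similarity point: switching a field to another built-in similarity is also just an index setting. The index and field names and the mu value here are purely illustrative:
PUT /people_lm
{
  "settings": {
    "index": {
      "similarity": {
        "my_lm": {
          "type": "LMDirichlet",
          "mu": 2000
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "similarity": "my_lm"
      }
    }
  }
}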
--
References
Lipani, A., Lupu, M., Hanbury, A., Aizawa, A. (2015). Verboseness Fission for BM25 Document Length Normalization. Association for Computing Machinery
Taylor, M., Zaragoza, H., Craswell, N., Robertson, S., Burges, C. (2006). Optimisation methods for ranking functions with multiple parameters. Association for Computing Machinery
Trotman, A., Puurula, A., Burgess, B. (2014). Improvements to BM25 and Language Models Examined. Association for Computing Machinery
Lv, Y., Zhai, C. (2011). Adaptive term frequency normalization for BM25. Association for Computing Machinery