Improving text expansion performance using token pruning

Learn about token pruning and how it boosts the performance of text expansion queries by making them more efficient without sacrificing recall.

This blog talks about token pruning, an exciting enhancement to ELSER performance released with Elasticsearch 8.13.0!

The strategy behind token pruning

We've already talked in great detail about lexical and semantic search in Elasticsearch and text similarity search with vector fields. These articles offer great, in-depth explanations of how vector search works.

We've also talked in the past about reducing retrieval costs by optimizing retrieval with ELSER v2. While ELSER is limited to 512 tokens per inference field, it can still produce a large number of unique tokens for multi-term queries. This results in a very large disjunction query that will return many more documents than an individual keyword search would - in fact, queries that expand to a large number of tokens may match most or all of the documents in an index!

Now, let's take a more detailed look at an example using ELSER v2. Using the infer API, we can view the predicted values for the phrase "Is Pluto a planet?"

POST /_ml/trained_models/.elser_model_2_linux-x86_64/_infer
{
  "docs":[{"text_field": "is Pluto a planet?"}]
}
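
If you prefer to make the same call from code, it looks roughly like this in Python (a minimal sketch assuming the official elasticsearch Python client, 8.x; the connection URL and API key are placeholders):

from elasticsearch import Elasticsearch

# Connect to a cluster (URL and API key are placeholders).
client = Elasticsearch("https://localhost:9200", api_key="...")

# Run ELSER v2 inference on a single phrase.
response = client.ml.infer_trained_model(
    model_id=".elser_model_2_linux-x86_64",
    docs=[{"text_field": "is Pluto a planet?"}],
)

# predicted_value maps each expanded token to its weight.
tokens = response["inference_results"][0]["predicted_value"]
print(sorted(tokens.items(), key=lambda kv: -kv[1])[:5])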

Either way, the request returns the following inference results:

{
  "inference_results": [
    {
      "predicted_value": {
        "pluto": 3.014208,
        "planet": 2.6253395,
        "planets": 1.7399588,
        "alien": 1.1358738,
        "mars": 0.8806293,
        "genus": 0.8014013,
        "europa": 0.6215426,
        "a": 0.5890018,
        "asteroid": 0.5530223,
        "neptune": 0.5525891,
        "universe": 0.5023148,
        "venus": 0.47205976,
        "god": 0.37106854,
        "galaxy": 0.36435634,
        "discovered": 0.3450894,
        "any": 0.3425274,
        "jupiter": 0.3314228,
        "planetary": 0.3290833,
        "particle": 0.30925226,
        "moon": 0.29885328,
        "earth": 0.29008925,
        "geography": 0.27968466,
        "gravity": 0.26251012,
        "astro": 0.2522782,
        "biology": 0.2520054,
        "aliens": 0.25142986,
        "island": 0.25103575,
        "species": 0.2500962,
        "uninhabited": 0.23360424,
        "orbit": 0.2327767,
        "existence": 0.21717428,
        "physics": 0.2001011,
        "nuclear": 0.1603676,
        "space": 0.15076339,
        "asteroids": 0.14343098,
        "astronomy": 0.10858688,
        "ocean": 0.08870865,
        "some": 0.065543786,
        "science": 0.051665734,
        "satellite": 0.042373143,
        "ari": 0.024783766,
        "list": 0.019822711,
        "poly": 0.018234596,
        "sphere": 0.01611787,
        "dino": 0.006902895,
        "rocky": 0.0062791444
      }
    }
  ]
}

These are the inference results that would be sent as input into a text expansion search. When we run a text expansion query, these terms eventually get joined together in one large weighted boolean query, such as:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "ml.tokens": {
              "query": "pluto",
              "boost": 3.014208
            }
          }
        },
        {
          "match": {
            "ml.tokens": {
              "query": "planet",
              "boost": 2.6253395
            }
          }
        },
        ...
        {
          "match": {
            "ml.tokens": {
              "query": "dino",
              "boost": 0.006902895
            }
          }
        },
        {
          "match": {
            "ml.tokens": {
              "query": "rocky",
              "boost": 0.0062791444
            }
          }
        }
      ]
    }
  }
}

Speed it up by removing tokens

Given the large number of tokens produced by ELSER text expansion, the quickest way to realize a performance improvement is to reduce the number of tokens that make it into that final boolean query. This reduces the total work that Elasticsearch invests when performing the search. We can do this by identifying non-significant tokens produced by the text expansion and removing them from the final query.

Non-significant tokens can be defined as tokens that meet both of the following criteria:

  1. The weight/score is so low that the token is likely not very relevant to the original term
  2. The token appears much more frequently than most tokens, indicating that it is a very common word and may not benefit the overall search results much.

We started with some default rules to identify non-significant tokens, based on internal experimentation using ELSER v2:

  • Frequency: More than 5x more frequent than the average token frequency for all tokens in that field
  • Score: Less than 40% of the best scoring token
  • Missing: If a token has a document frequency of 0 in that field, it never appears at all and can be safely pruned

If you're using text expansion with a model other than ELSER, you may need to adjust these values in order to return optimal results.

Both the token frequency threshold and the weight threshold must show the token is non-significant in order for the token to be pruned. This ensures we keep frequent tokens that are very high scoring, as well as very infrequent tokens that may not have as high a score.
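
To make that decision concrete, here's a minimal sketch of the pruning logic in Python. The thresholds mirror the defaults above, but the function shape and token statistics are illustrative assumptions, not Elasticsearch internals (for simplicity, the average frequency here is computed over the query's own tokens rather than over all tokens in the field):

def prune_tokens(token_weights, doc_freqs,
                 freq_ratio_threshold=5.0, weight_threshold=0.4):
    # token_weights: token -> weight (e.g. ELSER's predicted_value)
    # doc_freqs: token -> document frequency in the target field
    avg_freq = sum(doc_freqs.values()) / len(doc_freqs)
    best_weight = max(token_weights.values())
    kept, pruned = {}, {}
    for token, weight in token_weights.items():
        freq = doc_freqs.get(token, 0)
        if freq == 0:
            # Never appears in the field at all: safe to prune.
            pruned[token] = weight
            continue
        too_frequent = freq > freq_ratio_threshold * avg_freq
        low_scoring = weight < weight_threshold * best_weight
        # Both conditions must hold for the token to be pruned.
        if too_frequent and low_scoring:
            pruned[token] = weight
        else:
            kept[token] = weight
    return kept, pruned

Only the kept tokens go into the retrieval query; the pruned ones can optionally be reintroduced in a rescore, as discussed below.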

Performance improvements with token pruning

We benchmarked these changes using the MS MARCO Passage Ranking benchmark. Through this benchmarking, we observed that enabling token pruning with the default values described above resulted in a 3-4x improvement in 99th percentile latency and above!

Relevance impact of token pruning

Once we measured a real performance improvement, we wanted to validate that relevance was still reasonable. We used a small set of queries with relevance judgments against the MS MARCO Passage Ranking dataset. We did observe an impact on relevance when pruning the tokens; however, when we added the pruned tokens back in a rescore block, relevance was close to the original non-pruned results with only a marginal increase in latency. The rescore queries the previously pruned tokens against only the documents returned by the initial query, then updates each score to include the dimensions that were originally left behind.
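
Here's a toy sketch of that two-phase flow in Python (an illustration, not Elasticsearch's actual implementation: documents are modeled as sets of tokens, and a simple weight sum stands in for Lucene scoring):

def weight_sum(doc_tokens, query_tokens):
    # Toy stand-in for Lucene scoring: sum the weights of query
    # tokens that appear in the document.
    return sum(w for t, w in query_tokens.items() if t in doc_tokens)

def search_then_rescore(docs, kept, pruned, window_size):
    # Phase 1: rank all documents using only the kept tokens.
    ranked = sorted(docs, key=lambda d: weight_sum(d, kept), reverse=True)
    # Phase 2: within the rescore window only, add the pruned tokens'
    # contribution back in and re-sort by the combined score.
    window = ranked[:window_size]
    window.sort(key=lambda d: weight_sum(d, kept) + weight_sum(d, pruned),
                reverse=True)
    return window + ranked[window_size:]

In Elasticsearch itself this is expressed with a rescore block, as shown in the full query example below.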

Using a sample of 44 queries with judgments against the MS MARCO Passage Ranking dataset:

Top K | Rescore Window Size | Avg rescored recall vs control | Control NDCG@K | Pruned NDCG@K | Rescored NDCG@K
----- | ------------------- | ------------------------------ | -------------- | ------------- | ---------------
10    | 10                  | 0.956                          | 0.653          | 0.657         | 0.657
10    | 100                 | 1                              | 0.653          | 0.657         | 0.653
10    | 1000                | 1                              | 0.653          | 0.657         | 0.653
100   | 100                 | 0.953                          | 0.51           | 0.372         | 0.514
100   | 1000                | 1                              | 0.51           | 0.372         | 0.51

Now, this is only one dataset - but it's encouraging to see this even at smaller scale!

How to use: Pruning configuration

Pruning configuration launched in 8.13.0 as an experimental feature. It's optional and opt-in: if you perform text expansion queries without specifying pruning, there is no change to how text expansion queries are formulated - and no change in performance.

We have some examples of how to use the new pruning configuration in our text expansion query documentation.

Here's an example text expansion query with both the pruning configuration and rescore:

GET my-index/_search
{
   "query":{
      "text_expansion":{
         "ml.tokens":{
            "model_id":".elser_model_2",
            "model_text":"Is pluto a planet?",
            "pruning_config": {
              "tokens_freq_ratio_threshold": 5,
              "tokens_weight_threshold": 0.4,
              "only_score_pruned_tokens": false
            }
         }
      }
   },
   "rescore": {
      "window_size": 100,
      "query": {
         "rescore_query": {
            "text_expansion": {
               "ml.tokens": {
                  "model_id": ".elser_model_2",
                  "model_text": "Is pluto a planet?",
                  "pruning_config": {
                    "tokens_freq_ratio_threshold": 5,
                    "tokens_weight_threshold": 0.4,
                    "only_score_pruned_tokens": true
                  }
               }
            }
         }
      }
   }
}

Note that the rescore query sets only_score_pruned_tokens to true, so it only adds the tokens that were originally pruned back into the rescore algorithm.

This feature was released as a technical preview in 8.13.0. You can try it out in Cloud today! Be sure to head over to our discuss forums and let us know what you think.
