Save 10% disk space on your logging datasets with match_only_text

Elasticsearch 7.14 introduces match_only_text, a new field type that can be used as a drop-in replacement for the text field type in logging use cases. It has a lower disk footprint, which leads to lower costs.

Elasticsearch is attractive for log analysis thanks to its ability to index log messages. Want to count how many log messages from the last 24 hours contain "access denied"? Elasticsearch can give you the answer in milliseconds thanks to its index structures. However, index structures take CPU time to build and need disk space. You could save this CPU time and disk space by not indexing your message fields, but then you would lose the ability to query your logging data interactively.

To reduce disk space requirements, match_only_text indexes only a subset of the information that text fields index. This comes with the following trade-offs:

  • Relevancy scores are computed as the number of matching terms. This typically doesn't matter for logging use cases, as documents are sorted by descending timestamp rather than by relevance score.
  • Span queries are unsupported. All other types of queries that are supported on text fields are also supported on match_only_text fields.
  • Phrase and intervals queries run slower than with text fields, yet still much faster than a linear scan. Other types of queries run as fast as, if not slightly faster than, on text fields.
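
To make the drop-in nature concrete, here is a minimal sketch of the request bodies involved, written as Python dictionaries. The index name my-app-logs is hypothetical, and the bodies are only printed rather than sent, so how you submit them (Kibana Dev Tools, curl, a client library) is left to you. The first body maps the message field as match_only_text. The second is one possible way to answer the "access denied in the last 24 hours" question from above with the count API.

```python
import json

# Hypothetical index name, used only for illustration.
INDEX_NAME = "my-app-logs"

# Body for creating the index (PUT /my-app-logs).
# The only difference from a classic logging mapping is the type of
# the message field: "match_only_text" instead of "text".
create_index_body = {
    "mappings": {
        "properties": {
            "@timestamp": {"type": "date"},
            "message": {"type": "match_only_text"},
        }
    }
}

# Body for counting matching events (POST /my-app-logs/_count).
# Both clauses sit in filter context, since scores are irrelevant here.
count_body = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"@timestamp": {"gte": "now-24h"}}},
                {"match_phrase": {"message": "access denied"}},
            ]
        }
    }
}

print(f"PUT /{INDEX_NAME}")
print(json.dumps(create_index_body, indent=2))
print(f"POST /{INDEX_NAME}/_count")
print(json.dumps(count_body, indent=2))
```

The phrase query is written exactly as it would be against a text field, which is what makes the new type a drop-in replacement.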

We ran a variety of benchmarks with this new field type and observed an average 10% reduction in the size of indices containing application logs. This is the second significant index size decrease we have introduced in recent months: in 7.10, an improvement to stored fields compression brought another reduction of around 10% in index size.

As of 7.14, Elastic Agent uses match_only_text instead of text for the message field of the application logs it indexes, and we plan to roll out this change to our integrations starting with 7.15. All you have to do to benefit from these space savings is upgrade to a new version of the Elastic Stack.

How does it work under the hood?

match_only_text is a new field type that drops everything that text fields compute and index but that is not crucial for log analysis:

  • length normalization factors
  • term frequencies
  • positions

Length normalization factors and term frequencies are only used to compute scores. Dropping them was an easy win, since relevance scoring rarely matters for logging use cases, where events are sorted by descending timestamp rather than by relevance.
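
The difference can be illustrated with a toy inverted index in Python. This is only a sketch of the idea, not how Lucene actually lays out data on disk: the whitespace tokenizer, the function names and the sample log lines are all assumptions made for illustration. The first builder keeps positions (and therefore term frequencies) plus document lengths, as a text field would. The second keeps document IDs only, and scores documents by the number of matching terms.

```python
from collections import defaultdict


def tokenize(text):
    # Simplistic stand-in for an analyzer: lowercase, split on whitespace.
    return text.lower().split()


def build_text_like_index(docs):
    """Keep positions per term and document, plus document lengths."""
    postings = defaultdict(dict)  # term -> {doc_id: [positions]}
    doc_lengths = {}              # doc_id -> token count (input to norms)
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        doc_lengths[doc_id] = len(tokens)
        for position, token in enumerate(tokens):
            # Term frequency is implied by the length of the position list.
            postings[token].setdefault(doc_id, []).append(position)
    return postings, doc_lengths


def build_match_only_index(docs):
    """Keep only which documents contain each term."""
    postings = defaultdict(set)  # term -> {doc_id, ...}
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[token].add(doc_id)
    return postings


def count_matching_terms(postings, query):
    """Score each document by how many distinct query terms it contains."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in postings.get(term, ()):
            scores[doc_id] += 1
    return dict(scores)


# Made-up sample log messages.
docs = {
    1: "node left the cluster",
    2: "left node node rejoined",
    3: "disk usage high",
}

text_postings, doc_lengths = build_text_like_index(docs)
match_only_postings = build_match_only_index(docs)
scores = count_matching_terms(match_only_postings, "node left")
```

In this toy model, documents 1 and 2 receive the same score for the query "node left", even though "node" occurs twice in document 2 and the words appear in a different order. That is precisely the information the match-only index no longer stores.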

Positions are more interesting, since the text field type uses them to run positional queries such as phrase queries or intervals queries. So how does match_only_text run phrase queries? This new field type borrows ideas from runtime fields. To run a phrase query, it loads the value of your field from the _source of the document and checks whether the terms actually occur at consecutive positions. But it does so only when strictly necessary to verify whether a document matches. For instance, if your query is log.level: warn AND message:"node left" and you have a range filter on your @timestamp field, Elasticsearch will only load the _source of documents that match all required clauses as well as the individual terms of the phrase query. In this case, it will only load the _source of documents that:

  • match the range filter on @timestamp,
  • match log.level: warn,
  • and contain both node and left in their message field.

As a result, while match_only_text performs slower than text on phrase and intervals queries, it still performs much better than a linear scan.
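
The two-step approach described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual Elasticsearch implementation: the whitespace tokenizer, the helper names, the in-memory sources dictionary standing in for _source, and the sample events are all hypothetical. The @timestamp range filter is omitted for brevity and would simply narrow the prefilter set further.

```python
from collections import defaultdict


def tokenize(text):
    # Simplistic stand-in for an analyzer: lowercase, split on whitespace.
    return text.lower().split()


def build_match_only_index(sources):
    """Index only which documents contain each term of the message."""
    postings = defaultdict(set)
    for doc_id, source in sources.items():
        for token in tokenize(source["message"]):
            postings[token].add(doc_id)
    return postings


def phrase_query(postings, sources, phrase, prefilter=None):
    """Return (matching doc IDs, number of sources loaded)."""
    terms = tokenize(phrase)
    if not terms:
        return [], 0
    # Step 1: cheap approximation using the index only. Candidates must
    # contain every term of the phrase and pass the other required clauses.
    candidates = set.intersection(*(postings.get(t, set()) for t in terms))
    if prefilter is not None:
        candidates &= prefilter
    # Step 2: costly verification, run on the candidates only. Load the
    # original message and check the terms at consecutive positions.
    hits, loaded = [], 0
    for doc_id in sorted(candidates):
        tokens = tokenize(sources[doc_id]["message"])
        loaded += 1
        last_start = len(tokens) - len(terms)
        if any(tokens[i:i + len(terms)] == terms for i in range(last_start + 1)):
            hits.append(doc_id)
    return hits, loaded


# Made-up sample events.
sources = {
    1: {"level": "warn", "message": "node left the cluster"},
    2: {"level": "warn", "message": "left over node data removed"},
    3: {"level": "info", "message": "node left the cluster"},
    4: {"level": "warn", "message": "disk usage high"},
}

postings = build_match_only_index(sources)
warn_docs = {doc_id for doc_id, s in sources.items() if s["level"] == "warn"}
hits, loaded = phrase_query(postings, sources, "node left", prefilter=warn_docs)
print(hits, loaded)
```

Only two of the four documents have their source loaded here: document 3 is excluded by the log level clause and document 4 does not contain the terms. Of the two candidates, only document 1 contains the terms at consecutive positions.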

Conclusion

We encourage you to try match_only_text in your existing deployment, or spin up a free trial of Elasticsearch Service on Elastic Cloud, which always has the latest version of Elasticsearch. We’re looking forward to hearing your feedback, so please let us know what you think on Discuss.