Save 10% disk space on your logging datasets with match_only_text
Elasticsearch 7.14 introduces match_only_text, a new field type that can be used as a drop-in replacement for the text field type in logging use cases with a much lower disk footprint, leading to lower costs.
Elasticsearch is attractive for log analysis thanks to its ability to index log messages. Want to count how many log messages contain "access denied" in the last 24 hours? Elasticsearch can give you the answer in milliseconds thanks to its index structures, but index structures take CPU time to build and need disk space. You could save this CPU and disk space by not indexing your message fields, but then you would also lose the ability to query your logging data in an interactive way.
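As a rough illustration of the "access denied in the last 24 hours" question, the sketch below builds a request body that could be sent to the _count API. It only constructs the JSON and makes no HTTP call. The helper name count_recent_phrase and the choice of a match_phrase clause are assumptions made for this example, and the field names message and @timestamp follow the ones used in this post.

```python
import json


def count_recent_phrase(field, phrase, window="24h"):
    """Build a _count request body (hypothetical helper, body only)."""
    return {
        "query": {
            "bool": {
                # Filter clauses do not contribute to scoring, which suits
                # logging use cases where relevance is not needed.
                "filter": [
                    {"match_phrase": {field: phrase}},
                    {"range": {"@timestamp": {"gte": f"now-{window}"}}},
                ]
            }
        }
    }


body = count_recent_phrase("message", "access denied")
print(json.dumps(body, indent=2))
```

The same body should work whether message is mapped as text or as match_only_text, since phrase queries are supported on both.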
In order to reduce disk space requirements, match_only_text only indexes a subset of the information that text fields index. This brings the following downsides:
- Relevancy scores are computed as the number of matching terms. This typically doesn't matter for logging use cases, as documents are sorted by descending timestamp rather than by relevance score.
- Span queries are unsupported. All other types of queries that are supported on text fields are also supported on match_only_text fields.
- Phrase and intervals queries run slower than with text fields, yet still much faster than a linear scan. Other types of queries run as fast if not slightly faster than on text fields.
We ran a variety of benchmarks with this new field type and observed an average 10% reduction in the size of indices containing application logs. This is the second significant index size decrease we have introduced in recent months: in 7.10, an improvement to stored fields compression brought another reduction of around 10% in index size.
As of 7.14, Elastic Agent uses match_only_text instead of text for the message field of application logs, and we plan to roll out this change to our integrations starting with 7.15. All you have to do to benefit from these space savings is upgrade to a new version of the Elastic Stack.
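For indices whose mappings you manage yourself, the equivalent change is switching the type of the message field. The sketch below builds a create-index request body under that assumption. The helper name logs_mapping and the @timestamp and log.level field definitions are illustrative choices for this example, not something Elastic Agent is claimed to produce.

```python
import json


def logs_mapping(message_type="match_only_text"):
    """Build an index-creation body for a simple logs index (illustrative)."""
    return {
        "mappings": {
            "properties": {
                "@timestamp": {"type": "date"},
                "log": {"properties": {"level": {"type": "keyword"}}},
                # The only line that differs from a classic text mapping.
                "message": {"type": message_type},
            }
        }
    }


new_mapping = logs_mapping()
old_mapping = logs_mapping("text")
print(json.dumps(new_mapping, indent=2))
```

Because a field's type cannot be changed in place on an existing index, a change like this would normally take effect on newly created indices, for example at the next rollover.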
How does it work under the hood?
match_only_text is a new field type that trims everything that text fields compute and index that is not crucial for log analysis:
- length normalization factors
- term frequencies
- positions
Length normalization factors and term frequencies are only used to compute scores, so dropping them was an easy win given that relevance scoring is typically useless for logging use cases, since events are sorted by descending timestamp rather than by relevance.
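To make the difference concrete, here is a toy model of the two index layouts. It is not how Lucene stores data on disk: the whitespace analyzer and the function names index_like_text and index_like_match_only_text are assumptions made purely for illustration. The text-like index keeps positions (from which term frequencies follow) and a per-document length, while the match_only_text-like index keeps only which documents contain each term.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def index_like_text(docs):
    """Return (term -> {doc_id: [positions]}, doc_id -> length)."""
    postings = defaultdict(dict)
    norms = {}
    for doc_id, text in docs.items():
        tokens = analyze(text)
        norms[doc_id] = len(tokens)  # stands in for length normalization
        for pos, tok in enumerate(tokens):
            # Term frequency is implied by len(positions).
            postings[tok].setdefault(doc_id, []).append(pos)
    return dict(postings), norms


def index_like_match_only_text(docs):
    """Return term -> sorted doc ids: no norms, frequencies, or positions."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in analyze(text):
            postings[tok].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {1: "node left the cluster", 2: "left node behind node"}
full, norms = index_like_text(docs)
slim = index_like_match_only_text(docs)
print(full["node"])  # {1: [0], 2: [1, 3]}
print(slim["node"])  # [1, 2]
```

The slimmer structure can still answer "which documents contain node?", which is all that term-level queries need.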
Positions are more interesting, since the text field type uses them to run positional queries such as phrase queries or intervals queries. So how does match_only_text run phrase queries? This new field type borrows ideas from runtime fields. In order to run phrase queries, it will load the value of your field from the _source of the document to check whether terms actually occur at consecutive positions. But it does so only when strictly necessary to verify whether a document matches. For instance, if your query is log.level: warn AND message:"node left" and you have a range filter on your @timestamp field, Elasticsearch will only load the _source of documents that match all required clauses as well as the terms of the phrase query. So in this case, it will only load the _source of documents that:
- match the range filter on @timestamp,
- match log.level: warn,
- and contain both node and left in their message field.
As a result, while match_only_text performs slower than text on phrase and intervals queries, it still performs much better than a linear scan.
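The two-step approach described above can be sketched in a few lines. This is a simplified model, not the actual Elasticsearch implementation: the in-memory index, the whitespace analyzer, and the names build_doc_only_index, contains_phrase, and phrase_query are all hypothetical. The prefilter argument stands in for the other required clauses, such as the @timestamp range and log.level: warn.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_doc_only_index(sources):
    """term -> set of doc ids, with no positions (toy model)."""
    index = {}
    for doc_id, source in sources.items():
        for tok in set(analyze(source["message"])):
            index.setdefault(tok, set()).add(doc_id)
    return index


def contains_phrase(tokens, phrase_tokens):
    """True if phrase_tokens occur at consecutive positions in tokens."""
    n = len(phrase_tokens)
    return any(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


def phrase_query(index, sources, phrase, prefilter=None):
    """Return (matching doc ids, doc ids whose source had to be loaded)."""
    phrase_tokens = analyze(phrase)
    if not phrase_tokens:
        return [], []
    # Step 1: the index narrows candidates to docs containing every term.
    candidates = set.intersection(
        *(index.get(tok, set()) for tok in phrase_tokens))
    if prefilter is not None:
        candidates &= prefilter
    # Step 2: load the source of the remaining candidates and verify order.
    hits, loaded = [], []
    for doc_id in sorted(candidates):
        loaded.append(doc_id)
        if contains_phrase(analyze(sources[doc_id]["message"]), phrase_tokens):
            hits.append(doc_id)
    return hits, loaded


sources = {
    1: {"level": "warn", "message": "node left the cluster"},
    2: {"level": "warn", "message": "left the node running"},
    3: {"level": "info", "message": "node left the cluster"},
    4: {"level": "warn", "message": "disk almost full"},
}
index = build_doc_only_index(sources)
warn_docs = {d for d, s in sources.items() if s["level"] == "warn"}
hits, loaded = phrase_query(index, sources, "node left", prefilter=warn_docs)
print(hits, loaded)  # [1] [1, 2]
```

Only two of the four documents have their source loaded here: document 3 is excluded by the log level clause and document 4 does not contain the terms at all. Document 2 contains both terms but not consecutively, so it is rejected during verification.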
Conclusion
We encourage you to try it out in your existing deployment, or spin up a free trial of Elasticsearch Service on Elastic Cloud, which always has the latest version of Elasticsearch. We’re looking forward to hearing your feedback, so please let us know what you think on Discuss.