The true story behind Elasticsearch storage requirements

NOTE: This article now contains outdated information. Check out this updated post about Elasticsearch storage requirements.

UPDATE: The "sequel" to this blog post, titled "Part 2.0: The true story behind Elasticsearch storage requirements", was posted on September 15, 2015 and reruns these tests against the more recent Elasticsearch 2.0beta1. Don't forget to read it after getting through this one!

This blog post was co-written by Christian Dahlqvist (@acdahlqvist) and Peter Kim (@peterkimnyc), Solutions Architects at Elastic based in London and New York City respectively.


Is my data going to get bigger or smaller? Image credit: amazingillusions.blogspot.com

Introduction

One of our responsibilities as Solutions Architects is to help prospective users of the ELK stack figure out how many and what kind of servers they'll need to buy to support their requirements. Production deployments of the ELK stack vary significantly. Some examples of use cases we've spoken to people about include:

  • Collecting and analyzing Apache and Java app server logs that support a major big box retailer's e-commerce site.
  • Security information and event management (SIEM) solution provided as a service by a major telecom/network company for its customers.
  • Full-text search and faceted navigation for an apartment search website.
  • Organization-wide desktop/laptop systems monitoring for a public school district.

You can run a legitimate mission-critical Elasticsearch deployment with just 1 server or 200 servers. You may need the ability to ingest 1 million documents per second and/or support thousands of simultaneous search queries at sub-second latencies. Or your needs may be significantly more modest because you're just getting the website/mobile app for your startup off the ground.

So in response to the question, "How much hardware will I need to run Elasticsearch?", the answer is always, "It depends."

For this blog post, we'll focus on one element of hardware sizing: figuring out the amount of disk required. Also, we'll be using log data as our test data set.

Indexing logs, many different ways

A typical log message can be anywhere between 200 bytes and 2000 bytes or more. This log message can contain various types of data:

  • numbers indicating response time or response size
  • multi-word strings containing details of a Java exception message
  • single-word strings that aren't really words but might be identifiers, such as a computer's hostname
  • something like an IP address that could potentially be used as a lookup key to identify geo-location using geoip

Even if the raw log message is 500 bytes, the amount of space occupied on disk (in its indexed form in Elasticsearch) may be smaller or larger depending on various factors. The best way to start making rough estimates on how much disk you'll need is to do some testing using representative data.

There's word going around that data experiences significant expansion during the indexing process in Elasticsearch. While this can be true, because Elasticsearch performs text analysis at index time, it doesn't have to be; it depends on the types of queries you expect to run and how you configure your indexing accordingly. It's certainly not an "all or nothing" scenario: you can configure certain text fields to be analyzed and others not to be analyzed, and you can tune other parameters that have a significant impact on disk utilization. A common question with regards to disk usage is whether Elasticsearch uses compression. It does, but in a way that minimizes the impact on query latency. One thing to look forward to is an enhancement targeted for Elasticsearch version 2.0 that will allow some configurability in compression.
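
For reference, here is a rough sketch of what that configurability looks like; it assumes the index.codec setting and its best_compression option that arrived with Elasticsearch 2.0, and test_index is just an illustrative index name:

PUT /test_index
{
  "settings": {
    "index.codec": "best_compression"
  }
}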

To analyze or to not_analyze

As mentioned above, the textual analysis performed at index time can have a significant impact on disk space. Text analysis is a key component of full text search because it pre-processes the text to optimize the search user experience at query time.

Fields can be configured to be analyzed, to not be analyzed, to retain both analyzed and not_analyzed versions, and also to be analyzed in different ways. A great introduction to the analysis process in Elasticsearch can be found in Elasticsearch: The Definitive Guide.
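
As a rough sketch of what that configuration looks like in a 1.x mapping (the type name logs and the field names agent and clientip are illustrative, not the exact mapping used in these tests), one string field can be analyzed while another is left not_analyzed:

PUT /test_index
{
  "mappings": {
    "logs": {
      "properties": {
        "agent":    { "type": "string", "index": "analyzed" },
        "clientip": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}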

Are you _all in?

The _all field, by default, contains the values of all the other fields of a document. This is extremely convenient when users don't know which field(s) a value occurs in, since they can search for text without specifying a field. However, there is additional storage overhead when every field of a document is indexed as part of the _all field in addition to being indexed in its own field. More information about the _all field can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-all-field.html.
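
If your users always search specific fields, the _all field can simply be switched off in the mapping. A minimal sketch, again with an illustrative index and type name:

PUT /test_index
{
  "mappings": {
    "logs": {
      "_all": { "enabled": false }
    }
  }
}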

Doc values

One additional lever that can have a significant impact on disk usage is doc values. Doc values are a way to reduce heap memory usage, which is great news for people running memory-hungry aggregation and sorting queries. However, enabling doc values causes additional on-disk data structures to be created at index time, which results in larger index files. More details can be found here: https://www.elastic.co/guide/en/elasticsearch/guide/1.x/doc-values.html.
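
In Elasticsearch 1.x, doc values are enabled per field in the mapping. A sketch, with illustrative field names rather than the exact test mapping (note that string fields need to be not_analyzed to use doc values in 1.x):

PUT /test_index
{
  "mappings": {
    "logs": {
      "properties": {
        "bytes":    { "type": "long", "doc_values": true },
        "response": { "type": "string", "index": "not_analyzed", "doc_values": true }
      }
    }
  }
}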

Replication

Elasticsearch is a distributed system and an assumption in distributed systems design is that hardware will fail. A well-designed distributed system must embrace this assumption and handle failures gracefully. One way in which Elasticsearch ensures resiliency is through the use of replication. Elasticsearch, by default, enables shard-level replication which provides 1 replica copy of each shard located on a different node.

Obviously, if you have an additional copy of your data, this is going to double your storage footprint. Other centralized logging solutions do not enable replication by default (or make it very difficult to set up), so when you're comparing an ELK-based solution to an alternative, you should consider whether replication is factored in.
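
The replica count is just an index setting, so it is easy to account for explicitly when estimating storage. For example, keeping the default of one replica per shard looks like the sketch below (index name illustrative); setting it to 0, as in the single-node tests that follow, removes the extra copy along with the resiliency it provides.

PUT /test_index/_settings
{
  "index": { "number_of_replicas": 1 }
}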

Tests using structured data

The test log file used for this test is a 67644119 byte log file. It contains 300000 Apache HTTP log entries from a colleague's blog that look something like this:

71.212.224.97 - - [28/May/2014:16:27:35 -0500] "GET /images/web/2009/banner.png 
HTTP/1.1" 200 52315 "http://www.semicomplete.com/projects/xdotool/"
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"

The testing process itself is straightforward:

  • Ingest the log file using Logstash with a simple config and a single primary shard
  • Optimize the index to 1 segment (for a consistently comparable size) by calling POST test_index/_optimize?max_num_segments=1
  • Get the index size on disk by calling GET test_index/_stats
  • Remove the index by calling DELETE test_index
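
For reference, those calls map to simple curl commands against a local node, roughly like the following (host, port and index name are assumptions for illustration):

# force-merge down to a single segment for a comparable size
curl -XPOST 'localhost:9200/test_index/_optimize?max_num_segments=1'
# the on-disk size is reported under store.size_in_bytes in the response
curl -XGET 'localhost:9200/test_index/_stats?pretty'
# clean up before the next test run
curl -XDELETE 'localhost:9200/test_index'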

Assumptions:

  • There is no replication in this testing because it's done on a single node. If you are planning on enabling replication in your deployment (which we'd strongly recommend unless you really don't mind potentially losing data), you should increase your expected storage needs by your replication factor.
  • The 'message' field generated by Logstash is removed. In case you aren't familiar with Logstash, it reads each line of input into a single 'message' field from which you ideally parse out all the valuable data elements. We removed the 'message' field because it increases the storage footprint. However, some folks may want to retain the log line in its original form if there is concern that the implemented grok patterns may not necessarily retain all the necessary data.
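
To make that second assumption concrete, the relevant Logstash filter section looks roughly like the sketch below, assuming the stock COMBINEDAPACHELOG grok pattern; this is not the exact config used in these tests:

filter {
  grok {
    # parse the raw Apache log line into individual fields
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  mutate {
    # drop the original log line once it has been parsed
    remove_field => [ "message" ]
  }
}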

Here is a summary of the test results:


Test number | string fields                           | _all     | doc_values | index size (in bytes) | Expansion ratio (index size / raw size)
------------|-----------------------------------------|----------|------------|-----------------------|----------------------------------------
1           | analyzed and not_analyzed               | enabled  | enabled    | 75649594              | 1.118
2           | analyzed and not_analyzed               | disabled | enabled    | 58841748              | 0.869
3           | not_analyzed                            | disabled | enabled    | 47987647              | 0.709
4           | not_analyzed, except 'agent' (analyzed) | disabled | enabled    | 51025522              | 0.754
5           | analyzed and not_analyzed               | enabled  | disabled   | 65640756              | 0.970
6           | analyzed and not_analyzed               | disabled | disabled   | 48834124              | 0.721
7           | not_analyzed                            | disabled | disabled   | 37465442              | 0.553
8           | not_analyzed, except 'agent' (analyzed) | disabled | disabled   | 41480551              | 0.613

Note: In the table above, "analyzed and not_analyzed" means mapping a single source field into multiple indexed fields that reflect different analysis: one analyzed and the other not_analyzed. See more details regarding multi-fields here: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-core-types.html#_multi_fields_3.
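
As a sketch of what such a multi-field mapping looks like in 1.x (the field name verb and sub-field name raw are illustrative), queries can then target verb for analyzed matching or verb.raw for exact values:

PUT /test_index
{
  "mappings": {
    "logs": {
      "properties": {
        "verb": {
          "type": "string",
          "index": "analyzed",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}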

Tests using semi-structured data

The test log file used for this test is a 75037027 byte log file. It contains 100000 Apache HTTP log entries from the file used in the previous tests, enhanced with a text entry at the end, taken from a semi-random selection of questions and answers from a data dump of the serverfault.com web site: https://archive.org/details/stackexchange. The text has been cleaned up and the entries look something like this:

83.149.9.216 - - [28/May/2014:16:13:46 -0500] "GET /presentations/logstash-monitorama-2013/images/Dreamhost_logo.svg HTTP/1.1" 200 2126 "http://semicomplete.com/presentations/logstash-monitorama-2013/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.77 Safari/537.36"  There's a new initialize-from-LSN method but it was only introduced in 2008. There's no way to do the equivalent in earlier versions, including 2000. Sorry.

The testing process and assumptions are the same as the previous tests.

Here is a summary of the test results:


Test number | string fields                           | _all     | doc_values | index size (in bytes) | Expansion ratio (index size / raw size)
------------|-----------------------------------------|----------|------------|-----------------------|----------------------------------------
1           | analyzed and not_analyzed               | enabled  | enabled    | 104939322             | 1.399
2           | analyzed and not_analyzed               | disabled | enabled    | 78859509              | 1.051
3           | not_analyzed                            | disabled | enabled    | 74978314              | 0.999
4           | not_analyzed, except 'agent' (analyzed) | disabled | enabled    | 76049868              | 1.013
5           | analyzed and not_analyzed               | enabled  | disabled   | 101608174             | 1.354
6           | analyzed and not_analyzed               | disabled | disabled   | 75517253              | 1.006
7           | not_analyzed                            | disabled | disabled   | 71425863              | 0.951
8           | not_analyzed, except 'agent' (analyzed) | disabled | disabled   | 72832811              | 0.971


Analysis of the results

As you can see from the tables above, the expansion/contraction ratio ranges from 0.553 to 1.118 for structured data and from 0.951 to 1.399 for semi-structured data, depending on how you configure the Elasticsearch mapping. It is also clear that highly structured data allows for better compression than semi-structured data. For smaller deployments this won't make a huge difference: disk is relatively cheap, and a 1.5x to 2x difference between the best and worst case isn't a significant variance. However, if you're planning a larger deployment, it is certainly worth being intentional about how you configure your mappings.

For example, if you're expecting to ingest 5 TB of structured log data per day and store it for 30 days, that's 150 TB of raw data; at the minimum and maximum ratios above (0.553 and 1.118), you're looking at a difference between roughly 83 TB and 168 TB in total storage needs. Depending on other factors that help define how much data you can host on each node while maintaining reasonable query performance, this could mean 20-30 extra nodes. And that's not even considering replication.

While there are a number of dimensions in which you can make comparisons, I'll focus on a few.

Disabling the _all field reduced the expansion factor from 1.118 to 0.870 for structured data and from 1.399 to 1.051 for semi-structured data. This is a significant reduction in storage footprint which is an easy win if your users are familiar with the fields they want to search against. Even if you can't assume your users know what fields to search, you can customize your search application to take what the user perceives as a non-fielded search and construct a multi-field search query behind the scenes.
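
For example, a search box that looks field-less to the user can be translated into something like the multi_match query below; the field names are assumptions based on typical Apache log parsing, not the exact mapping used in these tests:

GET /test_index/_search
{
  "query": {
    "multi_match": {
      "query": "Chrome",
      "fields": [ "agent", "request", "referrer" ]
    }
  }
}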

Configuring the mapping to index most or all of the string fields as "not_analyzed" reduced the expansion factor from 0.870 to 0.754 or 0.709 for structured data. In the log analysis use case, realistically, many, if not most, of the fields don't contain data that it makes sense to run textual analysis on. There are a lot of fields you'll certainly want to run aggregate analysis on (e.g. histograms, pie charts, heat maps, etc.), but these don't require text analysis.
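
For example, a typical dashboard-style request aggregates over numeric and not_analyzed fields and needs no text analysis at all. A rough sketch, assuming field names like response, bytes and @timestamp from typical Apache log parsing, with bytes mapped as a numeric field:

GET /test_index/_search
{
  "size": 0,
  "aggs": {
    "status_codes": {
      "terms": { "field": "response" }
    },
    "traffic_per_hour": {
      "date_histogram": { "field": "@timestamp", "interval": "hour" },
      "aggs": {
        "total_bytes": { "sum": { "field": "bytes" } }
      }
    }
  }
}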

Finally, the last area of focus is the impact of doc values. Looking at two structured-data mappings that are identical apart from the doc values config, the expansion factor is 1.118 with doc values enabled versus 0.970 with them disabled. Again, the types of queries you expect to run will drive whether you want to enable doc values. Heavy use of aggregations and sorting will certainly benefit from doc values. In most scenarios, JVM heap memory is more precious than disk; the tradeoff of slightly higher disk usage for significantly lower JVM heap utilization is one that most people are glad to make.

Conclusion

There are a lot of misconceptions out there about how much disk space an ELK-based solution requires, but hopefully this blog post sheds some light on the reality that "it depends". Also, figuring out how much hardware you need involves much more than just how much disk is required. We'll save those discussions for future blog posts. :)

You can find the files supporting this testing on GitHub here: https://github.com/elastic/elk-index-size-tests.

UPDATE: And don't forget to read the new blog post which provides an update to the findings above using Elasticsearch 2.0beta1!