Space Savings: A Lesser Known Benefit of Index Sorting in Elasticsearch
In Elasticsearch 6.0, we released a new feature called index sorting. Read up more on this on the linked blog, but in short, what this does is to take the documents at index time and sort them by a key or set of keys in an order of your choosing. This has a few advantages that we’ve talked about:
- If you ask Elasticsearch to return a set of results sorted by the same key you’ve sorted the index on, Elasticsearch doesn’t need to do the sorting of results at query time. They’re already pre-sorted!
- If you don’t need the total hit count and you’re sorting by the sort key, Elasticsearch can actually short circuit the query once it’s found enough hits to meet your request. This can result in unbounded query performance improvements.
- If you have queries that use ANDs across different fields, index sorting on those fields can end up grouping them in such a way that Elasticsearch can skip large blocks of documents that don’t match, which can also make search faster.
In short, index sorting can make search faster in some cases: especially cases where you have a few common ways people are searching and sorting your documents. What isn’t often talked about is that index sorting can also reduce the amount of disk your indices are using. I’ll talk about why and how here.
Caution: Index Sorting Isn’t for Everyone
Before I talk about what happens, I do want to mention once again that index sorting isn’t for everyone. It causes the sort action to happen at index time. Sorting is an expensive operation, so if indexing speed is a primary concern, exercise caution before turning it on. It can slow write performance by as much as 40-50%. So if indexing throughput is a primary concern, as it often is in high-volume logging, metrics, and security analytics use cases, index sorting probably isn’t for you. It can be useful if you have a lower index rate, if query speed is the most important thing to your use case, or if you have a regular reindex process anyway that acts during off-peak-indexing times.
Examining Potential Sort Orders: An Example
Suppose I run an Elasticsearch instance that’s used for product search. Imagine that I have a set of documents that, at index time, look something like the following (I’ll flatten into a matrix just for ease of viewing):
Product ID | Product Category | Product Color | Price |
206f467b-8cfe | Shoes | Red | $97.00 |
4f89fbec-acc3 | Jackets | Black | $120.50 |
47771396-dfe3 | Jackets | Grey | $170.10 |
c6c8fbdf-651b | Hats | Yellow | $15.00 |
dc18c426-0eb3 | Shoes | Red | $107.20 |
ee304259-df57 | Jackets | Black | $88.00 |
9332c0ac-e55e | Shoes | Black | $49.00 |
30e96765-52a1 | Hats | Blue | $11.00 |
811cc8ca-d6bb | Jackets | Blue | $92.99 |
And now suppose we wanted to enable index sorting. What would we sort by? We have some options: product category, product color, and/or price may be interesting. If user searches are almost always just sorted by price and we don’t have filters for category or color, sorting by price may make the most sense for a sort key. However, it’s probably likely that users are at least selecting a category before finding the cheapest item, and they also may have preference for a color. Let’s sort by category ascending, color ascending, and then price ascending.
"sort.field" : ["product_category", "product_color", "price"], "sort.order" : ["asc", "asc", "asc"]
The sorted index looks something like this:
Product ID | Product Category | Product Color | Price |
30e96765-52a1 | Hats | Blue | $11.00 |
c6c8fbdf-651b | Hats | Yellow | $15.00 |
ee304259-df57 | Jackets | Black | $88.00 |
4f89fbec-acc3 | Jackets | Black | $120.50 |
811cc8ca-d6bb | Jackets | Blue | $92.99 |
47771396-dfe3 | Jackets | Grey | $170.10 |
9332c0ac-e55e | Shoes | Black | $49.00 |
206f467b-8cfe | Shoes | Red | $97.00 |
dc18c426-0eb3 | Shoes | Red | $107.20 |
A few interesting things now happen, which I’ll explain by example:
- If I ask Elasticsearch for the top 2 cheapest shoes sorted by price and I don’t ask it for a total hit count of all shoes, it needs to find the block of shoes, which it can do efficiently by skipping past all of the other categories. And once it’s found just 2 results, it can stop processing the rest of the index and return. Note that in order for this to work, you must include every element of the sort order in the index, even if you have filters that match.
- If I ask Elasticsearch for
product_category:Jackets AND product_color:Black
, Elasticsearch can skip past all the hats and all the shoes, and within Jackets, it can find the “Black” ones and once it’s done looking at those, it can skip past all of the other colors efficiently. - Elasticsearch uses compression significantly behind the scenes. The compression works when there are repeated values and it does so most efficiently when the repeated values are near each other in the index. By having all of the “jackets” or “colors” next to one another, those can be efficiently compressed on disk. That means less disk space, but it also means the operating system will be able to fit more into the filesystem cache, which will make things even faster.
In general, it’s best practice to use sort orders of increasing cardinality so that we can get the benefit of as many repeated values in a row as possible.
How Much Disk Space Will I Save?
So how much disk space will you save by turning on index sorting? As with many things in life, the answer is “it depends.” One of the big things it depends on is the cardinality of the field you sort by. However, the disk savings can be substantial. Last weekend, I decided to move some IoT/home automation data I’ve been using for a personal project off an old machine and onto a new one. There are faster ways to do this data migration like snapshot/restore, but I had the time to do a reindex and was interested to see how much savings index sorting may provide. I first did the remote reindex into an unsorted index:
status index pri docs.count docs.deleted pri.store.size open devices-2017 1 33310674 0 4.2gb
It’s a little over 30 devices and each sends its status back about once every 30 seconds, so the total index rate is about 1 document per second. I’m never anywhere close to being throttled on the indexing side, and I’d have to massively increase the index rate or number of devices for that to change. It seems like a reasonable candidate for index sorting. The data consists of hardware IDs, hardware names, times, and various sensor readings like the temperature or whether the device is on or off at the time or what some other sensor level is. I sorted the index by device ID and then by time, my thinking being that it’s likely that for a given device, there’s a reasonably high chance that there will be similar or the same values around the same time, which may lead to better compression. For example, if a switch is turned into an “on” state at 7:00:00, there’s a pretty high chance that it will be “on” at 7:00:30, 7:01:00, and at least several minutes after, and that should compress nicely. The sorted index stats show…
status index pri docs.count docs.deleted pri.store.size open devices-2017 1 3310674 0 2.5gb
About a 40% disk savings!
Caution Again
At this point, I feel compelled to caution everyone again, because almost everyone would appreciate using 40% less disk for the same data. I’ll sum it up into two quick statements:
- Your mileage will vary. I used index sorting on another dataset and found a 20% savings. Be thoughtful about what fields you sort on.
- Your index rate will be slowed. If the speed of indexing is really important to you, e.g. you’re running in a high volume logging or metrics use case, you’re likely sensitive to the number of documents you can get indexed in a short amount of time and it may not be a good choice to turn on index sorting as a result.
If you’re much more sensitive to disk space than index speed or your index volumes are low enough that you’re not limited by index speed, it may be interesting to see if index sorting is worth it for you.