Improving node resiliency with the real memory circuit breaker
You want to trust that Elasticsearch is reliably handling your search traffic even if your site is under significant load. As Elasticsearch is a distributed system, it is designed from the ground up to be resilient to failures of individual nodes. In fact, we have implemented a new and vastly improved cluster coordination algorithm in Elasticsearch 7.0.0.
Also, individual nodes in Elasticsearch are built with resiliency in mind. If you send too many requests to a node or your requests are too large, it will push back. The latter is achieved by circuit breakers. They are placed at certain points in the request handling path, e.g. when a network request enters the node or before an aggregation is executed. The key idea is to avoid an OutOfMemoryError by estimating upfront whether a request will push the node over its configured limit and, if so, rejecting the request instead of falling over. In addition to circuit breakers for individual aspects, such as the in-flight requests circuit breaker or the field data circuit breaker, Elasticsearch also has a "parent circuit breaker" that provides a global view across all circuit breakers. This allows Elasticsearch to reject requests that are within the budget of every individual circuit breaker but that would push the system above its total limit across all circuit breakers.
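The relationship between individual breakers and the parent breaker can be illustrated with a small model. This is a minimal sketch, not Elasticsearch's implementation: the class names, the byte limits, and the `reserve`/`check` methods are all hypothetical and chosen only to show how a request can fit within its own breaker's budget and still be rejected by the parent.

```python
class CircuitBreakingError(Exception):
    """Raised when a reservation would exceed a breaker's limit."""


class ParentBreaker:
    """Hypothetical parent breaker: a global view over all child breakers."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.children = []

    def check(self, new_bytes):
        # Everything already reserved by any child, plus the new reservation.
        total = sum(child.used_bytes for child in self.children) + new_bytes
        if total > self.limit_bytes:
            raise CircuitBreakingError(
                f"[parent] would be {total} bytes, limit is {self.limit_bytes}")


class ChildBreaker:
    """Hypothetical breaker for one aspect, e.g. in-flight requests."""

    def __init__(self, name, limit_bytes, parent):
        self.name = name
        self.limit_bytes = limit_bytes
        self.used_bytes = 0
        self.parent = parent
        parent.children.append(self)

    def reserve(self, estimated_bytes):
        # First check this breaker's own budget, then the global budget.
        if self.used_bytes + estimated_bytes > self.limit_bytes:
            raise CircuitBreakingError(f"[{self.name}] over its own limit")
        self.parent.check(estimated_bytes)
        self.used_bytes += estimated_bytes

    def release(self, num_bytes):
        self.used_bytes -= num_bytes


# Illustrative limits only; real limits are percentages of the heap.
parent = ParentBreaker(limit_bytes=100)
in_flight = ChildBreaker("in_flight_requests", 70, parent)
fielddata = ChildBreaker("fielddata", 60, parent)

fielddata.reserve(50)       # fine: 50 <= 60 and 50 <= 100
try:
    in_flight.reserve(60)   # within its own limit of 70, but 50 + 60 > 100
except CircuitBreakingError as err:
    print("rejected:", err)
```

Note that nothing is reserved when the parent rejects the request, so a later, smaller reservation can still succeed.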
It is impractical to track every allocation, so circuit breakers can only track memory that is explicitly reserved, and sometimes it isn’t possible to estimate the exact memory usage upfront. This means that circuit breakers are only a best-effort mechanism, and while they provide some resiliency against overloading a node, it is still possible that nodes die with an OutOfMemoryError. This is especially problematic with small heaps, where the relative overhead of untracked memory is larger.
Building (and Testing) a Better Circuit Breaker
What if it were possible to know exactly how much memory a node is using when we make a reservation in a circuit breaker? Then we could reject requests based on the actual state of the system at that point instead of an estimation based on current reservations across circuit breakers. We have done exactly that with the new real memory circuit breaker in Elasticsearch 7.0. It is an alternative implementation of the parent circuit breaker that uses functionality in the JVM to measure current memory usage instead of only accounting for the currently tracked memory. While this is more costly than just adding up a few numbers, measuring memory usage is still a very cheap operation: in microbenchmarks we have observed overheads between 400 and 900 nanoseconds.
We ran a variety of experiments to test the effectiveness of the real memory circuit breaker under different conditions. In one scenario, we ran a full-text indexing benchmark against a node that was configured with only 256MB of heap. While earlier versions of Elasticsearch cannot sustain this workload and run out of memory almost immediately, the real memory circuit breaker pushes back and Elasticsearch can sustain the load. Note that Elasticsearch will return an error response in such cases and it is up to the clients to implement proper backoff and retry mechanisms. Of course, this is easy if you are already using one of our official language clients. The .NET, Ruby, Python and Java clients already implement these retry policies and also offer extensions to handle bulk indexing.
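The difference from an estimation-based parent breaker is what gets compared against the limit: measured usage plus the new reservation, rather than the sum of tracked reservations. The sketch below models that idea only. `RealMemoryParentBreaker` and the `measure_used_bytes` callable are hypothetical names; on a JVM a comparable reading is available through `java.lang.management.MemoryMXBean`, but the exact call Elasticsearch uses is not stated in this post.

```python
class CircuitBreakingError(Exception):
    """Raised when measured usage plus the new reservation exceeds the limit."""


class RealMemoryParentBreaker:
    """Hypothetical parent breaker that checks measured memory usage."""

    def __init__(self, limit_bytes, measure_used_bytes):
        self.limit_bytes = limit_bytes
        # Assumption: a cheap callable returning the bytes currently in use.
        self.measure_used_bytes = measure_used_bytes

    def check(self, new_bytes, label="<request>"):
        real_usage = self.measure_used_bytes()
        wanted = real_usage + new_bytes
        if wanted > self.limit_bytes:
            raise CircuitBreakingError(
                f"[parent] data for [{label}] would be [{wanted}], "
                f"which is larger than the limit of [{self.limit_bytes}], "
                f"real usage: [{real_usage}], new bytes reserved: [{new_bytes}]")
        return wanted


# Stand-in measurement for the example; a real node would ask its runtime.
breaker = RealMemoryParentBreaker(limit_bytes=1000,
                                  measure_used_bytes=lambda: 900)
breaker.check(50)            # 950 <= 1000, allowed
try:
    breaker.check(200)       # 1100 > 1000, rejected
except CircuitBreakingError as err:
    print("rejected:", err)
```

Because the measurement includes memory that no breaker tracks, a request can be rejected even when the sum of tracked reservations is small.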
In another experiment, we executed an aggregation that intentionally produced an unrealistically high number of buckets on a node with a 16GB heap. Earlier versions of Elasticsearch also ran out of memory here, but only after the aggregation had been running for almost half an hour. With the real memory circuit breaker, the node instead returned a response after either a little more than a minute or roughly twenty minutes, depending on whether we allowed partial results.
As a result of multiple experiments, we have set the default limit for the new circuit breaker to 95% of the total available heap. This means that Elasticsearch will allow up to 95% of the heap to be used before the real memory circuit breaker trips.
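For reference, the default described above corresponds to the following node configuration. The setting names come from the Elasticsearch circuit breaker settings documentation rather than from this post, so verify them against the version you run before relying on them.

```yaml
# elasticsearch.yml (sketch; these values restate the 7.0 defaults)
# Use measured heap usage in the parent circuit breaker.
indices.breaker.total.use_real_memory: true
# Trip the parent circuit breaker at 95% of the heap.
indices.breaker.total.limit: 95%
```

Since these are the defaults, most clusters should not need to set them explicitly.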
Let’s consider an example where a bulk request is sent that is small enough to pass all other checks but that will trip the real memory circuit breaker because it would push the node over its limit. This node is configured with 128MB of heap, which means the parent circuit breaker’s limit of 95% is 117.5MB. If this request is sent, the node will respond with an HTTP 429 with the following details:
{
  "error": {
    "type": "circuit_breaking_exception",
    "reason": "[parent] Data too large, data for [<http_request>] would be [123848638/118.1mb], which is larger than the limit of [123273216/117.5mb], real usage: [120182112/114.6mb], new bytes reserved: [3666526/3.4mb]",
    "bytes_wanted": 123848638,
    "bytes_limit": 123273216,
    "durability": "TRANSIENT"
  },
  "status": 429
}
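The numbers in this response are internally consistent, which is a useful way to read such errors: the wanted bytes are the measured usage plus the newly reserved bytes, and that sum exceeds the limit. The short check below uses only values copied from the response. The `to_mb` helper is hypothetical, and its truncation to one decimal place is an assumption inferred from the message text.

```python
# Values copied from the error response above.
real_usage = 120182112     # "real usage"
new_bytes = 3666526        # "new bytes reserved"
bytes_wanted = 123848638   # "bytes_wanted"
bytes_limit = 123273216    # "bytes_limit"


def to_mb(num_bytes):
    """Format bytes as MB (1 MB = 1024 * 1024 bytes), truncated to one decimal.

    Truncation rather than rounding is an assumption based on the message.
    """
    tenths = num_bytes * 10 // (1024 * 1024)
    return f"{tenths // 10}.{tenths % 10}mb"


# The request trips the breaker because measured usage plus the new
# reservation is larger than the limit.
assert real_usage + new_bytes == bytes_wanted
assert bytes_wanted > bytes_limit

overshoot = bytes_wanted - bytes_limit
print(to_mb(bytes_wanted), to_mb(bytes_limit), overshoot)
# prints: 118.1mb 117.5mb 575422
```

In other words, this request missed the limit by a little over half a megabyte, which is why a retry shortly afterwards has a reasonable chance of succeeding.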
We can see that the circuit breaker also indicates that this is a transient failure and clients can use this as a hint to retry the request after some time. Whether a circuit breaking exception is permanent or transient is decided based on reserved memory across all circuit breakers. Each circuit breaker type has an associated durability; if the majority of the reserved memory is reserved by circuit breakers that track transient memory usage, the real memory circuit breaker treats this as a transient condition, and otherwise as a permanent one.
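Both halves of this paragraph can be sketched in a few lines: the majority rule that decides durability, and a client that treats `TRANSIENT` as a hint to back off and retry. This is an illustration under stated assumptions, not the behavior of Elasticsearch or of any official client. The function names, the tie-breaking choice (a tie counts as permanent), and the exponential delay schedule are all hypothetical.

```python
import time

TRANSIENT, PERMANENT = "TRANSIENT", "PERMANENT"


def decide_durability(reservations):
    """Pick a durability from (durability, reserved_bytes) pairs.

    Transient wins only with a strict majority of the reserved bytes;
    how ties are handled is an assumption of this sketch.
    """
    transient = sum(b for d, b in reservations if d == TRANSIENT)
    permanent = sum(b for d, b in reservations if d == PERMANENT)
    return TRANSIENT if transient > permanent else PERMANENT


def send_with_retry(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() and retry with exponential backoff on transient trips.

    send is a hypothetical callable returning the parsed response as a dict.
    """
    for attempt in range(max_attempts):
        response = send()
        error = response.get("error", {})
        transient_trip = (
            response.get("status") == 429
            and error.get("type") == "circuit_breaking_exception"
            and error.get("durability") == TRANSIENT
        )
        # Return successes, non-retryable errors, and the final attempt as-is.
        if not transient_trip or attempt == max_attempts - 1:
            return response
        sleep(base_delay * 2 ** attempt)


# Mostly transient reservations lead to a transient classification.
print(decide_durability([(TRANSIENT, 80), (PERMANENT, 20)]))
# prints: TRANSIENT
```

A permanent failure is returned to the caller immediately, since retrying the same request is unlikely to help until the reserved memory is released.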
Wrapping Up
While it is still possible in certain scenarios for an Elasticsearch node to run out of memory, the new real memory circuit breaker greatly improves the resiliency of individual nodes by exercising backpressure based on measured memory usage instead of only accounting for memory tracked by circuit breakers. In our experiments, Elasticsearch could sustain workloads that were far out of reach in previous versions, and the new circuit breaker will also make your production clusters much more resilient to peaks in your workload. To try out the new real memory circuit breaker, download the latest 7.0 beta release, take it for a spin, and give us some feedback.
The image at the top of the post has been provided by Kiran Raja Bahadur SRK under the CC BY-NC-ND 2.0 license (original source).