Upgrading Elasticsearch to a new AWS Java SDK


Elasticsearch integrates with certain Amazon Web Services (AWS) features using the official AWS Software Development Kit (SDK) for Java. These integrations were introduced all the way back in Elasticsearch version 2.0, released almost 10 years ago.

Recently AWS announced that the SDK that Elasticsearch has used for the last decade will reach the end of its supported life on December 31, 2025. At Elastic, we have a responsibility to our users to depend only on supported components, so we were compelled to migrate to the newer AWS SDK for Java v2 before this date.

The newer SDK is not a drop-in replacement for the older one, and it behaves differently in several ways. This article tells the story of the project to migrate to the newer SDK, the steps we took to insulate Elasticsearch users from the differences in behavior, and the actions you might need to take to adapt to the newer SDK.

All Elasticsearch versions in the 8.19 minor series, and all versions from 9.1.0 onward, will use the new AWS SDK for Java v2.

What is an SDK?

Web service providers such as AWS describe their services in terms of the application programming interface (API) that those services expose. The API documentation describes how to construct requests to achieve particular outcomes and how to interpret responses to these requests (whether successful or otherwise). For instance, the API documentation for AWS S3 describes exactly how to construct a request to upload an object and then to download it again.

Most service providers also provide SDKs for common programming languages, allowing developers to interact with their services without needing to understand the low-level details of the APIs. Each SDK implements the common behaviors and conventions needed to interact with a service and adapts them to the programming environment in which developers are working. Rather than constructing requests and interpreting responses, developers call functions in the SDK and leave the details of how these functions work to the API experts at the web service provider.

AWS provides an SDK for Java that encapsulates knowledge about how AWS services all fit together and the environment in which they operate and maps all this functionality onto features familiar to Java developers. For example, it automatically converts API errors into Java exceptions to simplify the programming experience. Moreover, a process running on an AWS EC2 instance and using the AWS SDK for Java to interact with AWS S3 will by default be able to determine most of its configuration automatically, simplifying the end-user experience.

Since version 2.0.0, Elasticsearch has used the AWS SDK for Java in two main areas: storing and retrieving snapshots in AWS S3, and discovering other cluster members using the EC2 DescribeInstances API.

Why did AWS create a second SDK for Java?

AWS’s original SDK for Java was released in March 2010, and the first version of Elasticsearch to use this SDK was released in October 2015. As the years passed, it became clear that some of the design decisions made for the original SDKv1 were not standing up to the test of time and could not be fixed without breaking compatibility with existing client code. AWS announced the release of a brand-new SDK for Java, known as SDKv2, in November 2018. The new SDK was more flexible, supporting asynchronous request execution, a choice of HTTP clients, and the ability to share HTTP clients across SDK client instances for more efficient connection reuse. It was also less lenient, for instance requiring developers to specify the AWS region in which the service operates, in contrast to SDKv1, which would do its best to guess the region if unspecified.

At the time, the Elasticsearch development team evaluated SDKv2 and decided that none of the new functionality was crucial for Elasticsearch’s needs, and some of the differences from SDKv1 would have caused user-visible breaking changes. So in the interests of our users, we decided to continue to use SDKv1.

Why is Elasticsearch moving to SDKv2?

This all changed in 2024 when Amazon announced that SDKv1 would reach the end of its supported life on December 31, 2025. Although this SDK would continue to work after this date, there would be no further releases even if a critical bug or security issue were discovered.

Security and quality are very important to Elastic, and we cannot accept the risk of an unpatched security issue or other bug in an SDK on which we depend, so we decided we had to migrate to SDKv2 despite the user-visible consequences of this migration.

Unfortunately, the end of life of SDKv1 was announced partway through the 8.x series of Elasticsearch releases, a series that we will continue to maintain until mid-2027. We faced a difficult choice: switch to SDKv2 in an 8.x minor release, or else continue to use SDKv1 for the rest of the 8.x series while using SDKv2 in the 9.x series.

The former would mean introducing breaking changes to these widely used components of Elasticsearch in an 8.x minor release. The latter would mean 18 risky months of 8.x support after SDKv1's maintained life had ended, hoping that any newly discovered security issues could be worked around without a new SDKv1 release.

Instead we found a different path, allowing us to migrate to SDKv2 in a minor release and introduce compatibility logic to work around many of the differences in behavior. Thanks to the extra compatibility logic, most users will need to take no action when upgrading to an Elasticsearch version based on SDKv2.

How did Elasticsearch move to SDKv2?

It seems like the migration should have been a simple task: just remove any dependencies on SDKv1, add dependencies on SDKv2, fix up any compile errors, check that the tests all pass, and we’re done.

Of course nothing is ever so simple. Elasticsearch has an extensive test suite that checks the code that interacts with AWS services in great detail. Some of these tests were very closely tied to SDKv1 in ways that simply did not have an equivalent in SDKv2. For example, we were verifying the client configuration was correct by inspecting the contents of a com.amazonaws.ClientConfiguration object, but this object has no counterpart in SDKv2. Likewise, we had tests that verified we were capturing metrics from the SDK correctly, and the metric-collection mechanism is substantially different in SDKv2.

We could have fixed up all the tests at the same time as making the production code changes, but this would have exposed us to the risk of introducing bugs into the production code alongside complementary bugs in the test code that would hide them. Instead we chose to mitigate this risk by first reworking many of the tests to make them as independent as possible of the choice of SDK version.

We already had some tests that verified the behavior of Elasticsearch in terms of the requests it sends over the network to the AWS APIs and its handling of the corresponding responses. These tests worked by running an HTTP server that reimplements the relevant parts of the AWS APIs, such that the tests could control the behavior of the APIs.

We originally developed these tests to allow us to verify that Elasticsearch behaves as expected when the AWS APIs fail, time out, or apply backpressure. We cannot verify these conditions when running tests against the real AWS APIs because the real APIs almost never fail, and they certainly do not fail predictably enough to be useful for our tests.
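The shape of such a fixture can be sketched with the HTTP server built into the JDK. This is an illustrative toy, not Elasticsearch's actual test fixture; the path, error body, and class name are made up. The fixture impersonates an AWS-style API and fails predictably, so a test can verify how the client under test reacts:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Toy stand-in for an AWS API endpoint that always fails with an
// S3-style error response, so tests can exercise error-handling paths.
public class FakeS3Fixture {
    public static int callOnce() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/bucket/object", exchange -> {
            byte[] body = "<Error><Code>InternalError</Code></Error>"
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/xml");
            exchange.sendResponseHeaders(500, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        try {
            // In a real test, the SDK-backed client would be pointed at this
            // endpoint; here we just issue a plain HTTP request to show the shape.
            HttpURLConnection conn = (HttpURLConnection) URI.create(
                    "http://127.0.0.1:" + server.getAddress().getPort() + "/bucket/object")
                    .toURL().openConnection();
            return conn.getResponseCode();
        } finally {
            server.stop(0);
        }
    }
}
```

A test built on this fixture can then assert that the client retries, backs off, or surfaces the failure as expected, regardless of which SDK version produced the request.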

This style of end-to-end testing is naturally independent of the SDK version that Elasticsearch uses, since both SDK versions ultimately send the same network requests to the same AWS APIs. We therefore migrated more of the test suite to this end-to-end style, sending real HTTP requests over the network rather than checking SDK-specific implementation details within Elasticsearch. In fact, even with the end-to-end tests, we found some inconsistencies between the behavior of SDKv1 and SDKv2 to which we had to adapt our test fixtures.

We then opened a long-running feature branch so that the whole team could work on migration tasks in parallel without affecting the production codebase.

Once we had everything compiling and passing the tests, we reviewed the whole changeset and identified several SDK-independent parts of the changes that we could port into the main Elasticsearch codebase while it was still based on SDKv1. This work helped to reduce the size of the overall changeset and keep it as focused as possible on just the changes needed to upgrade to SDKv2.

As development progressed, we kept track of specific incompatibilities between SDKv1 and SDKv2. The end-to-end test suite was particularly good at finding these incompatibilities, and our focus on minimizing changes to the test suite gave us confidence that we were not missing any. After the changes were merged, we released the updated version internally, and a downstream system found one further place, not covered by the Elasticsearch test suite, in which SDKv2 was stricter than SDKv1; we quickly mitigated and fixed it.

What has changed?

Thanks to the compatibility mechanisms we have added to Elasticsearch, we expect most clusters to continue working after upgrading to a version based on SDKv2 without any changes to their configuration.

Although you may not need to adjust your configuration immediately, we have also changed some best-practice recommendations about how to configure your Elasticsearch clusters, particularly about the s3.client.${CLIENT_NAME}.region, s3.client.${CLIENT_NAME}.endpoint, and s3.client.${CLIENT_NAME}.protocol settings. These recommendations may become mandatory in a future release. The rest of this section describes the differences of which you may need to be aware after upgrading to an SDKv2-based version of Elasticsearch.
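For example, an S3 client named default could follow these recommendations with settings along the following lines in elasticsearch.yml. The region and endpoint values here are purely illustrative; substitute the values appropriate to your deployment:

```yaml
# Illustrative settings for an S3 client named "default".
# Explicit region, rather than relying on endpoint-based heuristics:
s3.client.default.region: eu-west-1
# Endpoint given as a full absolute URL including the scheme,
# rather than a bare hostname plus a separate protocol setting:
s3.client.default.endpoint: https://s3.eu-west-1.amazonaws.com
```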

Region auto-detection

AWS is divided into over 30 independent regions. Clients usually send API requests to a region-specific HTTPS endpoint, and each request includes an authentication signature whose value depends on the region name.
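For context on why the region matters to signing: in the Signature Version 4 scheme used by AWS APIs, the region name is embedded in the credential scope of every signed request, so a client cannot produce a valid signature without knowing its target region. A minimal sketch of that scope string, with illustrative values:

```java
// The SigV4 "credential scope" attached to every signed request has the
// form date/region/service/aws4_request; the region appears directly in it.
public class CredentialScope {
    public static String scope(String date, String region, String service) {
        return String.join("/", date, region, service, "aws4_request");
    }

    public static void main(String[] args) {
        System.out.println(scope("20250101", "eu-west-1", "s3"));
    }
}
```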

SDKv1 automatically selects an appropriate region name using heuristics that depend on the environment in which Elasticsearch is running and on the endpoint to which you have configured it to connect. SDKv2 removes most of these heuristics and instead expects the calling code to specify the region.

As part of the migration to SDKv2, Elasticsearch introduces heuristics similar to those in SDKv1 so that you should not need to adjust your configuration to specify the region when upgrading your Elasticsearch clusters. In particular, when using AWS S3 to store snapshots, Elasticsearch will continue to heuristically determine the region from the S3 endpoint you have specified, at least for all regions known at the time of writing.
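The following sketch shows the kind of endpoint-to-region heuristic described here. It is not Elasticsearch's actual implementation, and the pattern is deliberately simplified (it ignores, for instance, dualstack and FIPS endpoint variants):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative heuristic: recover the region name from an S3 endpoint
// hostname such as "s3.eu-west-1.amazonaws.com".
public class S3RegionHeuristic {
    // Matches "s3.<region>.amazonaws.com" and the legacy dashed form
    // "s3-<region>.amazonaws.com".
    private static final Pattern S3_ENDPOINT =
            Pattern.compile("^s3[.-]([a-z0-9-]+)\\.amazonaws\\.com$");

    public static Optional<String> regionFromEndpoint(String endpoint) {
        Matcher m = S3_ENDPOINT.matcher(endpoint);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

When the heuristic cannot determine a region, as in the empty-Optional case above, you must configure the region explicitly.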

Nonetheless, we recommend that you configure the s3.client.${CLIENT_NAME}.region setting for each S3 client, rather than relying on heuristics. There may be some situations in which the new heuristics do not work identically to the ones in SDKv1, and in this case you will have to configure s3.client.${CLIENT_NAME}.region to specify the correct region.

If you are using the s3 repository type to store snapshots in your own S3-compatible storage, then your storage administrator will be able to tell you the correct region name to use. Many such storage installations will use us-east-1 for the region name.

IMDSv1 support

Instances running in AWS EC2 may access metadata about themselves using the Instance Metadata Service (IMDS). This metadata includes details of the region and availability zone in which the instance is running, as well as IAM credentials for roles attached to the instance.

There are two protocols for requesting information from IMDS, known as IMDSv1 and IMDSv2. The newer IMDSv2 was introduced to mitigate some security risks discovered in the older IMDSv1 protocol. SDKv2 does not support IMDSv1, so newer versions of Elasticsearch will only use IMDSv2. All EC2 instances support IMDSv2, so this should cause no problems for instances running on AWS infrastructure. But if you are running Elasticsearch in an environment that provides an independent IMDS which is intended to be compatible with EC2 IMDS, then you must ensure it supports the IMDSv2 protocol.
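The IMDSv2 exchange is a simple two-step HTTP protocol: a PUT request obtains a session token, and every subsequent metadata request carries that token in a header. The sketch below demonstrates the exchange against a toy stand-in for IMDS; the token value, the class names, and the metadata response are made up for illustration:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class Imdsv2Sketch {
    public static String fetchRegion(String base) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Step 1: request a session token; IMDSv2 requires a PUT with a TTL header.
        HttpRequest tokenReq = HttpRequest.newBuilder(URI.create(base + "/latest/api/token"))
                .header("X-aws-ec2-metadata-token-ttl-seconds", "21600")
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
        String token = client.send(tokenReq, HttpResponse.BodyHandlers.ofString()).body();
        // Step 2: present the token on every metadata request.
        HttpRequest metaReq = HttpRequest.newBuilder(
                        URI.create(base + "/latest/meta-data/placement/region"))
                .header("X-aws-ec2-metadata-token", token)
                .GET()
                .build();
        return client.send(metaReq, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Toy IMDS that only answers metadata requests carrying the token.
    public static String demo() throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/latest/api/token", exchange -> {
            byte[] body = "fake-token".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders("PUT".equals(exchange.getRequestMethod()) ? 200 : 405, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.createContext("/latest/meta-data/placement/region", exchange -> {
            boolean authed = "fake-token".equals(
                    exchange.getRequestHeaders().getFirst("X-aws-ec2-metadata-token"));
            byte[] body = (authed ? "us-east-1" : "").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(authed ? 200 : 401, body.length);
            try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
        });
        server.start();
        try {
            return fetchRegion("http://127.0.0.1:" + server.getAddress().getPort());
        } finally {
            server.stop(0);
        }
    }
}
```

An IMDSv1 client would skip step 1 entirely and send unauthenticated GETs, which is exactly what an IMDSv2-only client such as SDKv2 no longer does.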

Protocol choice

SDKv1 permits you to set an endpoint address of the form "hostname.domain.com" or "hostname.domain.com:port" and then separately to choose between using insecure HTTP or secure HTTPS protocols to communicate with the endpoint. Older versions of Elasticsearch expose this choice via the discovery.ec2.protocol and s3.client.${CLIENT_NAME}.protocol settings.

SDKv2 requires endpoint addresses to be full absolute URLs that start with either http:// or https://. As a temporary compatibility measure, Elasticsearch will continue to qualify a bare S3 endpoint according to the s3.client.${CLIENT_NAME}.protocol setting (which defaults to using HTTPS). But this behavior is deprecated, and the next major version of Elasticsearch will require the S3 endpoint address to be an absolute URL that starts with http:// or https://.
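The compatibility rule can be sketched as follows. This is an illustration of the behavior just described, not the actual Elasticsearch code:

```java
// Sketch of the deprecated compatibility behavior: a bare endpoint is
// qualified using the protocol setting (which defaults to "https"), while
// an endpoint that is already an absolute URL is left alone.
public class EndpointCompat {
    public static String qualifyEndpoint(String endpoint, String protocolSetting) {
        if (endpoint.startsWith("http://") || endpoint.startsWith("https://")) {
            return endpoint; // already absolute, as SDKv2 requires
        }
        // Bare "hostname[:port]" form: fall back to the protocol setting.
        return protocolSetting + "://" + endpoint;
    }
}
```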

System property removal

SDKv1 may read the aws.secretKey and com.amazonaws.sdk.ec2MetadataServiceEndpointOverride system properties, but SDKv2 does not. It would be very unusual to be setting these system properties in an Elasticsearch installation, but we mention them here for the sake of completeness.

Feature removal

SDKv2 does not permit control over throttling for retries, so the s3.client.${CLIENT_NAME}.use_throttle_retries setting is deprecated and no longer has any effect; instead, SDKv2 applies by default a throttling policy that encapsulates AWS’s recommended best practices. It is very unlikely that you are using this setting in a production cluster.

SDKv2 requires the use of the V4 signature algorithm, removing support for much older algorithms. So the s3.client.${CLIENT_NAME}.signer_override setting is deprecated and no longer has any effect. It is very unlikely that you are using this setting in a production cluster.

SDKv2 does not support the log-delivery-write canned ACL for objects stored in S3. This ACL applies only to buckets, not objects, so it would not have had the desired effect even with older Elasticsearch versions that use SDKv1. It is very unlikely that you are using this ACL.

Metrics reporting

Elasticsearch collects various metrics about the API calls it makes to S3 endpoints using the functionality built into the SDK. These metrics are mostly used for internal purposes but may be visible in some APIs. SDKv1 counts some responses with a 4xx HTTP status code as if they had not even been received by the API in the first place, whereas SDKv2 counts these responses identically to other error responses. You may see a subtle difference in the S3 API metrics that Elasticsearch collects.

Choice of STS endpoint

SDKv1 could use either a regional endpoint or the global https://sts.amazonaws.com endpoint to obtain short-lived session credentials. SDKv2 will only use regional endpoints.

Other differences

The AWS SDK documentation includes information about many other differences between SDKv1 and SDKv2. The differences that aren’t called out in this blog post should be handled within Elasticsearch and shouldn’t be relevant to end-users. However, given the large number of changes between the two SDK versions and the large variety of environments in which Elasticsearch runs, it is possible that you may need to adapt some other aspects of your configuration when upgrading to an Elasticsearch version that uses the newer SDK.

Upgrade today

Elasticsearch versions 8.19.x and all versions from 9.1.0 onward use the newer AWS SDK for Java v2. Upgrade to one of these versions before the end of 2025 to avoid using the older AWS SDK for Java v1 past the end of its supported life.

Be aware that the newer SDK has some behavioral differences that may require small configuration adjustments in some special cases. Always upgrade your test clusters and verify they work correctly before you upgrade any of your production clusters.

The release and timing of any features or functionality described in this post remain at Elastic's sole discretion. Any features or functionality not currently available may not be delivered on time or at all.