Elasticsearch 6.1.0 released
Today we are pleased to announce the release of Elasticsearch 6.1.0, based on Lucene 7.1.0. This is the latest stable release, and is already available for deployment on Elastic Cloud, our Elasticsearch-as-a-service platform.
You can read about all the changes in the release notes, but there are a few changes which are worth highlighting:
Split API
As a companion to the Shrink Index API, we now have a Split Index API that allows you to split an existing index into a new index, where each original primary shard is split into two or more primary shards in the new index.
The split is done efficiently by hard-linking the data in the source primary shard into multiple primary shards in the new index, then running a fast Lucene Delete-By-Query to mark documents which should belong to a different shard as deleted. These deleted documents will be physically removed over time by the background merge process.
The split API can only be used on indices that have had the index.number_of_routing_shards setting specified at index creation time. From 7.0, we plan to set this setting automatically; until then, this feature will only be available to new indices created on or after Elasticsearch 6.1.0.
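To make the workflow concrete, here is a sketch of the JSON request bodies involved in a split, built as Python dicts. The index names and shard counts are illustrative; the key points are that number_of_routing_shards must be set at creation time, and that the source index must be made read-only before splitting.

```python
# Body for: PUT /my_source_index
# The source index must be created with index.number_of_routing_shards
# for it to be splittable later.
create_source = {
    "settings": {
        "index.number_of_shards": 1,
        "index.number_of_routing_shards": 2,  # must be set at creation time
    }
}

# Body for: PUT /my_source_index/_settings
# The source index must be read-only before it can be split.
block_writes = {"settings": {"index.blocks.write": True}}

# Body for: POST /my_source_index/_split/my_target_index
# Here each of the 1 original primary shards is split in two.
split_body = {"settings": {"index.number_of_shards": 2}}
```

After the split completes, the target index can be used immediately; the deleted documents left behind by the delete-by-query phase are cleaned up by background merges.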
Composite aggregations
Elasticsearch is designed to return the top-10 best search results or the top-50 most accessed destination pages in your web logs as fast as possible. This speed is part of the reason why Elasticsearch is so popular for analytics. However, sometimes you need to get back ALL terms, and the top-N design of aggregations doesn't allow this to happen efficiently on high cardinality fields.
The new composite aggregation is designed to make this possible. The composite agg allows you to create terms, histogram, or date_histogram composite buckets on one or more fields, sorted in "natural order", i.e. alphabetically for terms, and numerically or by date for the histograms.
Because these composite buckets are returned in sorted order, results can be paged through efficiently in a similar manner to a scroll request. The first search request could return the first 100 or 1000 buckets, then the next tranche can be requested by passing the values of the last composite bucket in the after parameter, and so on until all buckets have been retrieved.
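A paging round-trip might look like the following sketch, again as Python dicts for the search request bodies. The aggregation name, field name, and the after value are illustrative; in practice the after value is taken from the key of the last bucket in the previous response.

```python
import copy

# Body for: GET /logs/_search
# A composite aggregation over one terms source, 100 buckets per page.
first_page = {
    "size": 0,
    "aggs": {
        "pages": {
            "composite": {
                "size": 100,  # buckets returned per request
                "sources": [
                    {"dest": {"terms": {"field": "destination.keyword"}}}
                ],
            }
        }
    },
}

# To fetch the next tranche, pass the key of the last bucket of the
# previous response as the "after" parameter (illustrative value here).
after_key = {"dest": "us/boston"}
next_page = copy.deepcopy(first_page)
next_page["aggs"]["pages"]["composite"]["after"] = after_key
```

Repeating this until a response comes back with fewer than size buckets retrieves every bucket exactly once.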
An additional benefit of the composite aggregation is that doc counts and metric aggs directly under the composite aggregation are accurate for the cases where you need non-approximated counts, as we can be sure that we have seen all documents for a particular composite bucket (unlike the top-N model). While you can specify a further terms agg under the composite agg, it will use the standard top-N model and return approximate counts.
Adaptive Replica Selection
Today in Elasticsearch, a series of search requests to the same shard will be forwarded to the primary and each replica in round robin fashion. This can prove problematic if one node starts a long garbage collection — search requests will still be forwarded to the slow node regardless and will have an impact on search latency.
In 6.1, we have added an experimental feature called Adaptive Replica Selection. Each node tracks and compares how long search requests to other nodes take, and uses this information to adjust how frequently to send requests to shards on particular nodes. In our benchmarks, this results in an overall improvement in search throughput and reduced 99th percentile latencies.
This option is disabled by default as we are still fine-tuning how to compare different search requests and how to account for differences due to caching, but the results we are seeing are very promising. You can enable or disable this feature at runtime by updating a dynamic cluster setting, so it is worth trying this out in your environment. If you do so, we would love to hear about your results.
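Since this is a dynamic cluster setting, it can be toggled at runtime. The sketch below shows the request body for the cluster settings API (a transient setting here, so it does not survive a full cluster restart; use "persistent" if you want it to).

```python
# Body for: PUT /_cluster/settings
# Enables adaptive replica selection at runtime; set to False to revert
# to round-robin shard selection.
enable_ars = {
    "transient": {
        "cluster.routing.use_adaptive_replica_selection": True
    }
}
```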
Indexing throughput improvements
Each document indexed in Elasticsearch includes a _field_names metafield, which lists the fields contained in that document. This is needed to support the exists query. It turns out that this simple feature is surprisingly costly.
We have since reworked the exists query to use doc-values or norms as a proxy for _field_names, which limits the need for the _field_names metafield to only those fields that have neither doc-values nor norms. This simple change has resulted in a massive 15% increase in indexing throughput in our benchmarks, with no loss of functionality.
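For reference, the exists query itself is unchanged from the caller's point of view. The field name below is illustrative:

```python
# Body for: GET /my_index/_search
# Matches documents where "user" has at least one non-null value.
exists_query = {"query": {"exists": {"field": "user"}}}

# The inverse ("missing" semantics): documents with no value for "user".
missing_query = {
    "query": {"bool": {"must_not": {"exists": {"field": "user"}}}}
}
```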
Scripted Similarities
Elasticsearch now uses BM25 scoring by default instead of the classic TF/IDF similarity, which is slated for removal. That said, some people still want to use TF/IDF, and some people would like more control over scoring, such as disabling term frequency or inverse document frequency. Previously, the only way to have such control was to write an Elasticsearch plugin. This has become much easier thanks to scripted similarities: now you can write your own custom similarity using Painless. The linked docs demonstrate how to recreate TF/IDF with two simple scripts.
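As a sketch, the index-creation body below defines a scripted similarity and assigns it to a field. The index, type, and field names are illustrative, and the Painless script follows the shape of the TF/IDF recreation shown in the similarity documentation:

```python
# Body for: PUT /my_index
# Defines a "scripted" similarity whose Painless script recomputes a
# TF/IDF-style score, and applies it to the "title" field.
tfidf_similarity = {
    "settings": {
        "index": {
            "similarity": {
                "scripted_tfidf": {
                    "type": "scripted",
                    "script": {
                        "source": (
                            "double tf = Math.sqrt(doc.freq); "
                            "double idf = Math.log((field.docCount+1.0)"
                            "/(term.docFreq+1.0)) + 1.0; "
                            "double norm = 1/Math.sqrt(doc.length); "
                            "return query.boost * tf * idf * norm;"
                        )
                    },
                }
            }
        }
    },
    "mappings": {
        "_doc": {
            "properties": {
                "title": {"type": "text", "similarity": "scripted_tfidf"}
            }
        }
    },
}
```

Because the script runs per matching document, keeping it cheap matters; the docs also describe a weight script that precomputes the query-level part of the score once per term.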
Watcher run_as support
Up until now, watches have been executed as an internal X-Pack user when security is enabled, which allowed the watch to access any index that the X-Pack user has access to. Starting in 6.1.0, search inputs, search transforms, and index actions will instead be run as the user who created (or last updated) the watch. This will limit the watch's privileges to those of the user: if the user can't read index foo, then neither can the watch. Elevated permissions can still be requested in a watch by using the run_as privilege. Existing watches will continue to run as the X-Pack user until they are updated.
Conclusion
Please download Elasticsearch 6.1.0, try it out, and let us know what you think on Twitter (@elastic) or in our forum. You can report any problems on the GitHub issues page.