This Week in Elasticsearch and Apache Lucene - 2020-03-13
Elasticsearch
Async search
Async search APIs have landed in master. They are still in the incubation phase while we address some flaky tests discovered in our CI after the merge, but the feature is still targeted for a first release in 7.7. The Kibana team can now work on the integration of these new APIs from the main development branch. Lukas from the Kibana team is working closely with us to move all the blocking search requests in Kibana to asynchronous calls. The benefits will arrive gradually. Starting in 7.7, users will be able to bypass the default timeout of 30 seconds when visualizing dashboards. In 7.8 and beyond, Kibana will use this API to allow building dashboards in the background; queries will continue to run even if the user navigates away or closes the browser. We will also provide a management API so that Kibana can access and visualize these background tasks.
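As a rough illustration of the submit-then-poll flow that async search enables, here is a minimal Python sketch. It assumes the `_async_search` endpoints and the `wait_for_completion_timeout` and `keep_alive` parameters keep the shape they have on the development branch, which may still change while the feature is in incubation. The `transport` callable and both helper names are hypothetical stand-ins for whatever HTTP client is actually in use.

```python
import time
from typing import Any, Callable, Dict, Optional

# Hypothetical transport: (method, path, params, body) -> parsed JSON response.
Transport = Callable[[str, str, Dict[str, str], Optional[dict]], Dict[str, Any]]


def submit_async_search(transport: Transport, index: str, query: dict,
                        wait: str = "1s", keep_alive: str = "5m") -> Dict[str, Any]:
    """Submit a search and block for at most `wait` before returning."""
    params = {"wait_for_completion_timeout": wait, "keep_alive": keep_alive}
    return transport("POST", f"/{index}/_async_search", params, query)


def poll_until_done(transport: Transport, first: Dict[str, Any],
                    interval_s: float = 1.0, max_polls: int = 60) -> Dict[str, Any]:
    """Poll the stored search by id until it reports that it is no longer running."""
    result = first
    polls = 0
    while result.get("is_running") and polls < max_polls:
        time.sleep(interval_s)
        result = transport("GET", f"/_async_search/{result['id']}", {}, None)
        polls += 1
    return result
```

Because the caller only holds an id between polls, nothing in this flow requires the original HTTP connection to stay open, which is what makes background dashboards possible.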
Data streams
Data Streams are a formalization of the concept of time series data in Elasticsearch. They can be thought of as a first-class grouping or container for time-based indices that generally share a common source. Rather than leveraging aliases and naming conventions to group similar data, as we do today with ILM and indices, we will introduce new configuration, APIs and internal structures to better manage these streams of data. The new user-facing configuration and API will be minimal and familiar, and will make it much easier for users to get started ingesting streams of data. All data stream data will live in hidden indices, and the data stream's internal structures will allow operations to be performed on the set of indices that compose the data stream. ILM will also see a few modifications to work seamlessly with Data Streams.
Promoting time series data to a first-class concept has many benefits, both now and in the future. For example, if we know the timestamp field for a set of indices, we can automatically sort the query results, configure the segments of shards to be sorted, or expose this information to Kibana for its time filter. We can make certain assumptions and abstractions that don't require the use of an alias, which can prevent many misconfigurations. A data stream can correspond directly to a concept the user understands (e.g. their stream of MySQL error logs), which the user or the system can then use to make decisions. Treating this as a first-class concept in Elasticsearch opens up a lot of options.
Index templates v2
Index templates v2 is an evolution of index templates, a longstanding feature of Elasticsearch. The existing index templates have some undesired behavior mostly around how multiple templates can get merged together. Index templates v2's largest differentiating factor will be how multiple templates will get composed together.
A new concept of "component templates" will be introduced. A component template contains settings, mappings and aliases and is not directly associated with an index pattern. An index template (v2) will be able to compose multiple component templates together to form a single set of settings, mappings, and alias configurations. The order in which they are merged is defined by the order in which they are specified. This makes it possible to build libraries of small component templates and then compose them in new and different ways for different index patterns. Data streams will leverage index templates v2, allowing for data stream specific configuration.
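The composition idea can be sketched in a few lines of Python. This is an illustrative model only, not the Elasticsearch implementation: the `compose` and `deep_merge` helpers and the component names are hypothetical, and the sketch assumes that when two components set the same key, the one listed later wins.

```python
from copy import deepcopy
from typing import Any, Dict, List


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict; nested dicts are merged, and `override` wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def compose(components: Dict[str, dict], composed_of: List[str]) -> dict:
    """Merge the named component templates in the order they are listed."""
    result: Dict[str, Any] = {"settings": {}, "mappings": {}, "aliases": {}}
    for name in composed_of:
        result = deep_merge(result, components[name])
    return result


# A small, hypothetical library of component templates.
components = {
    "base_settings": {"settings": {"number_of_shards": 1, "number_of_replicas": 1}},
    "logs_mappings": {"mappings": {"properties": {"@timestamp": {"type": "date"}}}},
    "more_replicas": {"settings": {"number_of_replicas": 2}},
}

# Listed last, "more_replicas" overrides the replica count from "base_settings".
template = compose(components, ["base_settings", "logs_mappings", "more_replicas"])
```

Making the order explicit in the list is the point of the design: the result depends only on what the index template names and in which sequence, rather than on how independently defined templates happen to interact.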
Performance regression analysis
Our performance team spotted several regressions in our nightly benchmarks. After bisecting, they were traced to recent changes in Apache Lucene and to changes in how Elasticsearch cancels requests.
The introduction of compression for binary doc values had a negative effect on the performance of searching range fields. We expected this while developing the feature, since range fields use binary fields internally, but it's really nice to have confirmation through our nightly benchmarks. The nightlies also confirmed the gain: the on-disk size of the index decreased slightly with the introduction of the compression.
The second regression is due to our improvements around query cancellation. Queries that use points (range queries) and the terms dictionary (terms and multi-term queries) now check more eagerly whether the query has been cancelled. This change impacted the 99th percentile latency of our search benchmark, so we are now working on lowering this impact while keeping the benefit of regular checks.
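The trade-off between eager checks and their overhead can be illustrated with a small Python sketch. This is not how Elasticsearch or Lucene implements cancellation; the `cancellable` wrapper, the `QueryCancelled` exception and the `check_every` parameter are all hypothetical, and the sketch shows just one possible way to amortize the cost by checking the flag once every N items instead of on every item.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


class QueryCancelled(Exception):
    """Raised when the (hypothetical) cancellation flag is found to be set."""


def cancellable(items: Iterable[T], is_cancelled: Callable[[], bool],
                check_every: int = 1024) -> Iterator[T]:
    """Yield items, consulting the cancellation flag once every `check_every` items."""
    if check_every < 1:
        raise ValueError("check_every must be >= 1")
    for count, item in enumerate(items):
        # The flag is consulted before items 0, check_every, 2 * check_every, ...
        if count % check_every == 0 and is_cancelled():
            raise QueryCancelled()
        yield item
```

A smaller `check_every` reacts to cancellation sooner but pays for the check more often; a larger value does the opposite. Finding an acceptable point on that curve is the kind of tuning described above.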
Apache Lucene
Geometry queries
We have been working on some speedups to LatLonShape that work by specializing how different shapes calculate their relationships with each other. Previously, all shapes were assumed to be triangles and decoded accordingly. Now, lines and points can save some CPU time by only decoding the dimensions they use.
Deprecating SimpleFSDirectory
We noticed that SimpleFSDirectory essentially duplicates the behaviour of NIOFSDirectory, but with added synchronization because of the way it uses internal state. Historically this state was there to correctly handle concurrent reads on Windows machines. However, synchronization on Windows is now handled directly by the JDK, so there is no performance difference between the two Directory implementations on Windows, and NIOFSDirectory will always perform better on Mac and Linux systems. Given this performance disparity, and the fact that SimpleFSDirectory is not, in fact, simple, we’ve decided to deprecate and remove it.
Changes
Breaking Changes in Elasticsearch
Breaking Changes in 8.0: