This Week in Elasticsearch and Apache Lucene - 2018-07-07
Elasticsearch
Highlights
We have added documentation for Painless script contexts. It lists each place in the Elasticsearch APIs where a script may be used, along with the variables available in each of those contexts.
As part of our ingest node work, we have added a "bytes" processor that converts human-readable byte sizes (e.g., 1kb) to raw byte counts (e.g., 1024). The new processor has been merged and is targeted for the next 6.x minor release.
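To make the conversion concrete, here is a minimal Python sketch of the kind of parsing the "bytes" processor performs. It is not the Elasticsearch implementation: the function name `parse_bytes`, the accepted unit list, and the handling of whitespace and fractional values are assumptions made for illustration only.

```python
import re

# Binary (1024-based) multipliers. The exact set of accepted units is an
# assumption for this sketch, not a statement about the real processor.
_UNITS = {
    "b": 1,
    "kb": 1024,
    "mb": 1024 ** 2,
    "gb": 1024 ** 3,
    "tb": 1024 ** 4,
    "pb": 1024 ** 5,
}

# A number (optionally fractional) followed by a unit suffix.
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*$", re.IGNORECASE)


def parse_bytes(text: str) -> int:
    """Convert a human-readable size such as '1kb' into a raw byte count."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"cannot parse byte size: {text!r}")
    number, unit = match.group(1), match.group(2).lower()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    # Fractional inputs are truncated to whole bytes.
    return int(float(number) * _UNITS[unit])
```

With these assumptions, `parse_bytes("1kb")` yields 1024, matching the example above.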
We have opened a PR which will allow for more flexibility in how fields are selected for inclusion in “all” queries. This will remove the current limitation in which plugins cannot control whether or not fields are searched in an “all” query.
We have undertaken an effort to improve the testability and test coverage of our cloud platform integrations. Examples include cleaning up some repository-s3 tests, merging AwsS3Service and InternalAwsS3Service into a single S3Service class, and merging AzureStorageService and AzureStorageServiceImpl along with a cleanup of their tests.
We recently enhanced our support for AWS session tokens by adding support for 3-part credentials. With MFA-secured AWS access, you use your permanent (2-part) credentials plus the MFA code to obtain a separate set of temporary (3-part) credentials that permit access to the desired resources. Today, Elasticsearch can obtain temporary credentials from the EC2 metadata service, but users cannot supply them directly, which is what is needed outside of EC2. In 6.4.0, Elasticsearch gains support for user-supplied three-part temporary credentials, which means that, via the repository-s3 plugin, it is possible to snapshot to and restore from an MFA-secured S3 bucket from outside of EC2.
Changes in 5.6:
- Propagate mapping.single_type setting on shrinked index #31811
Changes in 6.3:
- SQL: Allow long literals #31777
- JDBC: Fix stackoverflow on getObject and timestamp conversion #31735
- SQL: Fix incorrect message for aliases #31792
- Watcher: Fix check for currently executed watches #31137
Changes in 6.4:
- REST high-level client: add get index API #31703
- Fix handling of points_only with term strategy in geo_shape #31766
- Watcher: Consolidate setting update registration #31762
- Fix not waiting for Netty ThreadDeathWatcher in IT #31758
- Fix not waiting for Netty ThreadDeathWatcher in IT (#31758) #31789
- Add analyze API to high-level rest client #31577
- REST high-level client: add cluster get settings API #31706
- Implemented XContent serialisation for GetIndexResponse #31675
- Fixture for Minio testing #31688
- ingest: Introduction of a bytes processor #31733
- Fix coerce validation_method in GeoBoundingBoxQueryBuilder #31747
- Add support for AWS session tokens #30414
- resolveHasher defaults to NOOP #31723
- Split CircuitBreaker-related tests #31659
- Add write*Blob option to replace existing blob #31729
- Watcher: Fix chain input toXcontent serialization #31721
- Extend allowed characters for grok field names (#21745) (#31653) #31722
Changes in 7.0:
- Remove support for deprecated StoredScript contexts #31394
- Account for XContent overhead in in-flight breaker #31613
- has_parent builder: exception message/param fix #31182
Lucene
Reclaiming deletes through merges
Today, the default merge policy, TieredMergePolicy, exposes an opaque 'reclaimDeletesWeight' parameter to configure how aggressively deletes should be reclaimed. Its value is used in the function that scores merges. Unfortunately, the values of this parameter have no intuitive meaning: all that can be said is that larger values reclaim deleted documents more aggressively, at the expense of more I/O. There is a suggestion to replace it with a new 'indexPctDeletedTarget' parameter that defines the maximum percentage of deleted documents the index may have, which is much easier to reason about.
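As a rough illustration of why a percentage target is easier to reason about than a scoring weight, the Python sketch below computes the share of deleted documents across a set of segments and compares it to a target. This is not Lucene code, and the proposal's exact semantics are not spelled out here: the function names and the `(max_doc, deleted_docs)` segment representation are hypothetical.

```python
from typing import List, Tuple

# Each segment is modeled as (max_doc, deleted_docs); this shape is an
# assumption for the sketch, not Lucene's segment metadata.
Segment = Tuple[int, int]


def pct_deleted(segments: List[Segment]) -> float:
    """Percentage of deleted documents across all segments of an index."""
    total = sum(max_doc for max_doc, _ in segments)
    deleted = sum(deleted_docs for _, deleted_docs in segments)
    return 100.0 * deleted / total if total else 0.0


def exceeds_target(segments: List[Segment], target_pct: float) -> bool:
    """True when the index holds more deletes than the target allows."""
    return pct_deleted(segments) > target_pct
```

A check like `exceeds_target(segments, 20.0)` has a direct operational reading (no more than a fifth of the index should be deletes), which an abstract weight does not offer.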
Other
- Lucene 6.6.5 was released. This release contains no changes: Lucene and Solr must be released at the same time, and Solr needed a bugfix release because of an XXE vulnerability.
- Recent refactorings to TieredMergePolicy introduced some subtle bugs.
- Discussing a suggestion to remove the unused PostingsEnum#attributes API.
- We are exploring how the matches API could be improved to allow for better highlighting by exposing information about matching terms.
- Discussing a proposal to clean up access to slices in IndexSearcher. Each slice is a subset of the segments of an index, and slices are searched concurrently. IndexSearcher then merges the per-slice results using TopDocs#merge.
- We noticed that merges would include hard deletes when counting the number of soft deletes.
- Discussing an old issue about adding a much-needed expansion limit to SpanMultiTermQueryWrapper.
- Added a new helper method to create an iterator over a range of doc ids, which could typically represent the set of matches of a range query over a field that is used for index sorting.
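
The range iterator in the last item can be pictured with a short Python sketch. It mimics the general shape of a Lucene doc id iterator (position, next, advance) but is not the Lucene helper itself: the class name, method names, and the half-open `[min_doc, max_doc)` convention are assumptions for illustration.

```python
# Sentinel meaning "iterator exhausted"; Lucene uses Integer.MAX_VALUE.
NO_MORE_DOCS = 2 ** 31 - 1


class RangeDocIdIterator:
    """Iterates over the doc ids in the half-open range [min_doc, max_doc)."""

    def __init__(self, min_doc: int, max_doc: int) -> None:
        if not 0 <= min_doc <= max_doc:
            raise ValueError("need 0 <= min_doc <= max_doc")
        self._min_doc = min_doc
        self._max_doc = max_doc
        self._doc = -1  # not positioned on any document yet

    def doc_id(self) -> int:
        return self._doc

    def next_doc(self) -> int:
        return self.advance(self._doc + 1)

    def advance(self, target: int) -> int:
        # Jump to the first doc id >= target that lies inside the range.
        self._doc = max(target, self._min_doc)
        if self._doc >= self._max_doc:
            self._doc = NO_MORE_DOCS
        return self._doc

    def cost(self) -> int:
        # Upper bound on the number of documents this iterator can return.
        return self._max_doc - self._min_doc
```

When an index is sorted on a field, the documents matching a range query on that field occupy one contiguous block of doc ids, so an iterator like this needs only two integers instead of a bit set or postings list.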