This Week in Elasticsearch and Apache Lucene - Cluster Cloning in Hosted Elasticsearch
Welcome to This Week in Elasticsearch and Apache Lucene! With this weekly series, we're bringing you an update on all things Elasticsearch and Apache Lucene at Elastic, including the latest on commits, releases and other learning resources.
Top News
Just how easy is it clone your cluster in our hosted #Elasticsearch service? (Hint: very) https://t.co/I3ySOOFz04 pic.twitter.com/M9HoK1KTgw
— elastic (@elastic) January 29, 2016
Elasticsearch Core
Changes in 2.2:
- Closures are once again allowed in the Groovy scripting plugin, and a PR has been submitted to Groovy to remove the need for the suppressAccessChecks permission.
- Translog recovery is fast again.
- Command line options now work correctly on Windows.
- The Tribe node wasn't passing on a custom --path.conf to its node clients, which resulted in security exceptions.
- Query and top-level inner hits results shouldn't overwrite each other.
- Geo-shapes didn't work in the percolator if map_unmapped_fields_as_string was enabled.
Changes in 2.x:
- Disabled fielddata loading was silently ignored on empty indices.
- Throw an exception if the Lucene version is not the expected one.
- Include the exception name for not serializable exceptions.
Changes in master:
- Shard failure requests for no longer existing shards should always be considered successful.
- The term-level fuzzy query is deprecated in favour of the match query with the fuzziness parameter.
- Boolean settings in mappings are now strict.
- The "index" mapping param now accepts only true/false.
- Setting index: false will no longer disable doc values as well.
- Doc values are controlled by the doc_values setting only, not by fielddata_format.
- Many more global settings have been migrated to the new settings infrastructure.
- The new scripting language is called Painless.
- Deep pagination is now possible with the search_after parameter.
- The ingest node should make deep copies of data structures.
- Disabled the ability to fsync on every operation (instead of every request) and only schedule fsync if really needed.
- Shards are marked as active during recovery, to ensure the indexing buffer is big enough.
- TermVector APIs no longer update mappings.
- Tracking of parent tasks now includes master node, replication, and broadcast actions.
- Load average info has been normalised across different OSes.
- Improve exceptions from ingest pipelines.
- Ensure all resources are closed when closing a node.
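
Two of the request-level changes in the list above (the match query with the fuzziness parameter, and deep pagination with search_after) are easier to picture as request bodies. The following is a minimal sketch that only builds the JSON bodies as Python dicts and does not talk to a cluster. The field names `timestamp` and `event_id` and the helper names are hypothetical, and since these features were still in master at the time of writing, the exact request shape is an assumption based on how the parameters are commonly documented.

```python
def fuzzy_match_body(field, text):
    """Build a match query with fuzziness, the suggested replacement
    for the term-level fuzzy query."""
    return {
        "query": {
            "match": {
                field: {"query": text, "fuzziness": "AUTO"}
            }
        }
    }


def first_page_body(size):
    """Build the first page of a sorted search.

    The sort needs a tiebreaker so that every hit has a unique sort key;
    `timestamp` and `event_id` are hypothetical field names.
    """
    return {
        "size": size,
        "query": {"match_all": {}},
        "sort": [{"timestamp": "asc"}, {"event_id": "asc"}],
    }


def next_page_body(previous_body, last_hit):
    """Build the next page from the sort values of the last hit seen.

    Instead of skipping N hits with an offset, the request says
    "continue after this sort key", so the cost does not grow with depth.
    """
    body = dict(previous_body)  # shallow copy, the original stays unchanged
    body["search_after"] = list(last_hit["sort"])
    return body
```

In use, each page is requested with the body returned by `next_page_body`, passing the last hit of the previous response, until a response comes back with no hits.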
Ongoing:
- Work has started on using UUIDs to identify indices on the file system, rather than relying just on index names.
- The reindex API is starting to use the task management framework.
- Search refactoring continues with the suggesters and sorting. A design bug in aggs refactoring (creating one instance per node instead of per shard) will require quite a big change.
- Updating mappings with update_all_types isn't working correctly.
Apache Lucene
- We plan to do a 5.5.0 release soon, to get all backported 5.x features out to the world and to debug the release process with git, and then get the 6.0.0 release started
- Java 9 changes the API for un-mapping previously memory mapped pages, but users still risk a SIGSEGV when they try to use an IndexReader after it's closed
- The switch from subversion to git has a long wiggling tail: lots of build fixes, including detecting if you changed git branches and forcing a clean build if so to prevent scary looking false test failures; our developer resources page now reflects the switch; we share notes on what long series of git commands seem to work; we fixed our shadow maven build to also switch from subversion to git; and we discuss the joy of merge bubbles
- 800+ new top-level-domains have been created since we last fixed StandardTokenizer to detect them!
- Improved test coverage for the new point values (coming soon in Lucene 6.0.0) has uncovered missing heroics in its exception handling
- The new postings-based geo queries are ready to graduate out of the sandbox module, which provides no backwards compatibility
- The new divergence from independence similarity continues to wreak havoc on tests
- Add a more accurate "does polygon intersect rectangle" method to fix recent test failures uncovered by randomized geo tests
- Improve geo tests to confirm that the quantized encoding is stable and its error falls within the claimed tolerance
- MemoryIndex now has sugar methods to directly create a MemoryIndex from a document or fields
- All TermsQuery constructors are now efficient, avoiding creating lots of temporary Term objects
- IndexableField.tokenStream no longer throws IOException
- Fix the build again to detect if running tests incorrectly results in source code changes!
- Scary looking test failures turned out to just be consumers abusing IndexInput by sharing an instance across threads without cloning first
- LuceneTestCase now uses standardized language tags to represent the randomized Locale
- Some nice performance gains are coming to geo point queries by customizing how terms are created from the geohashes
- The points based and postings based geo implementations use different encodings with different quantization errors
- The "exotic" rectangles selected by point values (BKD tree) still cause problems for the lat/lon 2D geo apis
- The complex WordDelimiterFilter sometimes produces incorrect tokens
- Should we enable storing a Lucene index in MongoDB?
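
Several of the geo items above come down to the same two properties of a quantized encoding: decoding and re-encoding a value should land on the same encoded value (stability), and the round-trip error should stay within a claimed tolerance. The sketch below illustrates the kind of check such a test makes. It is a generic fixed-width quantizer written for illustration, not Lucene's actual geo encoding; the 32-bit width, the tolerance of one cell, and all function names are assumptions.

```python
import math

LAT_MIN, LAT_MAX = -90.0, 90.0
BITS = 32                                   # assumed width of the encoded value
CELLS = 2 ** BITS
CELL_WIDTH = (LAT_MAX - LAT_MIN) / CELLS    # degrees covered by one cell
TOLERANCE = CELL_WIDTH                      # assumed claimed error bound


def encode_lat(lat):
    """Quantize a latitude in degrees to an integer cell number."""
    if not LAT_MIN <= lat <= LAT_MAX:
        raise ValueError("latitude out of range: %r" % lat)
    cell = math.floor((lat - LAT_MIN) / CELL_WIDTH)
    return min(cell, CELLS - 1)             # clamp so that 90.0 still fits


def decode_lat(cell):
    """Return the latitude at the centre of the given cell."""
    return LAT_MIN + (cell + 0.5) * CELL_WIDTH


def check_round_trip(lat):
    """Return (stable, error) for one latitude.

    stable: re-encoding the decoded value yields the same cell.
    error:  absolute difference between the input and the decoded value.
    """
    cell = encode_lat(lat)
    decoded = decode_lat(cell)
    stable = encode_lat(decoded) == cell
    return stable, abs(decoded - lat)
```

Decoding to the centre of the cell rather than its lower edge is a deliberate choice in this sketch: a value sitting exactly on a cell boundary can be pushed into the neighbouring cell by floating point rounding, which is one way an encoding ends up unstable.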
Watch This Space
Stay tuned to this blog, where we'll share more news on the whole Elastic ecosystem including news, learning resources and cool use cases!