Indices, types, and parent / child: current status and upcoming changes in Elasticsearch
A year and a few months ago, we blogged about the differences between types and indices, including "when to pick which." If you don't wish to read the full history, here is the TL;DR. The conclusion was something like:
Multiple types in the same index really shouldn't be used all that often, and one of the few use cases for types is parent/child relationships.
Sadly, not everyone has stumbled on all of our blogs (shameless plug!), which means a lot of our users have gone on using types for what they were never really intended for. This raised some very good engineering questions: should we keep this confusing "type" construct? Should we just continue recommending against it? Should we make the real use case, parent/child, a proper first-class citizen in the Elasticsearch hierarchy? The good news is that we're moving toward a much less confusing future. The bad news is that this will eventually be a completely breaking change, which means you will need to make application-side modifications as soon as you can, though we're really, really trying extra hard to make it as easy for you as possible. First, let's talk about why the change has come about.
Why
Elastic (the company behind Elasticsearch) has seen thousands of support cases opened by our users and customers. This puts us in a unique position to see which problems people commonly run into with Elasticsearch. With respect to types, it has come down to three key elements:
- User expectations: This is probably the most prominent issue and Elastic is unfortunately at least partially to blame: once a bad analogy is out there, it is nearly impossible to kill. At several points in the past, we have stated that Elasticsearch indices were like traditional RDBMS databases while Elasticsearch types were similar to tables. This was really an oversimplification, and as a result many people have the mental model that types are equivalent to tables in the relational world. In reality, in Elasticsearch, the underlying data structures are the same for the entire index, not per type. Because of this misconception, one of the most common pitfalls we've seen is that users expect fields to be independent across types. They are not: a field with the same name must have the same data type in every type of the index. So /my-index/type-a/my-field must use the same data type as /my-index/type-b/my-field.
- Sparsity: Sparsity should be avoided! While Lucene 7 (which has recently merged into master) improves the handling of sparsity, it should still be avoided where possible. Types almost always increase the sparsity of your data because different types have different fields. So by removing types there is one less pitfall out there waiting to get you at some point.
- Scoring: Documents are scored by index and not by type so storing different entities in the same index can interfere with the relevance calculation for each entity type. Again, this is a bit counter-intuitive and many users miss this, which is another potential pitfall.
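To make the first two pitfalls concrete, here is a minimal sketch that checks a set of per-type mappings for same-named fields with different data types, and estimates how densely each field would be populated across the whole index. The function names, the simplified mapping shape ({type: {field: data_type}}), and the document counts are all hypothetical and for illustration only; they are not part of any Elasticsearch API.

```python
from collections import defaultdict


def find_field_conflicts(type_mappings):
    """Return {field: {type_name: data_type}} for every field that is
    declared with different data types in different types of one index."""
    seen = defaultdict(dict)
    for type_name, fields in type_mappings.items():
        for field, data_type in fields.items():
            seen[field][type_name] = data_type
    return {
        field: by_type
        for field, by_type in seen.items()
        if len(set(by_type.values())) > 1
    }


def field_density(type_mappings, doc_counts):
    """Estimate the fraction of documents in the index that can carry each
    field, assuming every document fills all fields of its own type."""
    total = sum(doc_counts.values())
    if total == 0:
        return {}
    docs_with_field = defaultdict(int)
    for type_name, fields in type_mappings.items():
        for field in fields:
            docs_with_field[field] += doc_counts.get(type_name, 0)
    return {field: count / total for field, count in docs_with_field.items()}


# Hypothetical index with two types that share one field name.
mappings = {
    "type-a": {"my-field": "keyword", "title": "text"},
    "type-b": {"my-field": "long", "price": "double"},
}
counts = {"type-a": 30, "type-b": 10}

print(find_field_conflicts(mappings))
print(field_density(mappings, counts))
```

In this made-up index, my-field is flagged because the two types disagree on its data type, and the fields that exist in only one type are present on just a fraction of the documents, which is the sparsity the second bullet warns about.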
What
Since the one main use case for types, as opposed to separate indices, is parent/child relationships, we have decided to supersede the "type" with a special field that stores the relationship between documents. We feel this is a much better representation of the data. However, doing so is complicated, which gets us to the "when" element...
When
We want the transition away from "types" to be as smooth as can be, so we're targeting a long, multi-phase deprecation process to get us there. At the time of this writing, the current engineering targets are something like the following:
- In 5.x, add a new feature (index.mapping.single_type set to true/false), which will allow you to preview what the type removal will start to look like. This will be great for anybody that has a separate test environment and/or wants to start testing early.
- In 6.0 (current target), introduce a breaking change so that new indices will only allow a single type to be created, to help you get better prepared for 7.x. Don't worry though: the multi-type indices you created in 5.x will continue to work as before in 6.x. This phased roll-in is intended to give you time to adapt as you upgrade, without client-side breaks. In addition, we plan to:
  - Make _uid consistent with this change by removing the type from it.
  - Add a new feature for "typeless parent/child fields" called "join fields".
  - Provide typeless URLs in preparation for the migration to 7.0.
- In 7.0 (current plan), make the final breaking change related to this by removing types entirely from the Elasticsearch APIs.
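As a rough sketch of what these phases could look like from the client side, the snippet below builds two request bodies as plain Python dictionaries: one that creates an index with index.mapping.single_type enabled, and one that declares a join field. The join mapping follows the shape this feature took when it later shipped in 6.x ("type": "join" with a "relations" map); at the time of writing it was still an engineering target, so treat it as an assumption. The helper names and the question/answer relation are hypothetical.

```python
import json


def single_type_index_body(single_type=True):
    """Body for creating an index that previews single-type behaviour (5.x)."""
    return {"settings": {"index.mapping.single_type": single_type}}


def join_field_mapping(join_field, parent, child, type_name="doc"):
    """Mapping that declares a parent/child relation through a join field.
    Assumption: uses the 'join' field shape as released in 6.x."""
    return {
        "mappings": {
            type_name: {
                "properties": {
                    join_field: {"type": "join", "relations": {parent: child}}
                }
            }
        }
    }


def child_document(join_field, child, parent_id, source):
    """Copy of `source` with the join field pointing at its parent document."""
    doc = dict(source)
    doc[join_field] = {"name": child, "parent": parent_id}
    return doc


# Hypothetical relation: one "question" document has many "answer" documents.
print(json.dumps(single_type_index_body(), indent=2))
print(json.dumps(join_field_mapping("my_join_field", "question", "answer"), indent=2))
print(json.dumps(
    child_document("my_join_field", "answer", "1", {"text": "Use a join field."}),
    indent=2,
))
```

Note that a child document still has to be routed to the same shard as its parent when it is indexed, just as with type-based parent/child.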
What does this mean for me?
The answer to this question depends on what use case you're using the Elastic Stack for:
- Most logging & security analytics users will find the transition completely seamless: Beats and Logstash don't generally use types, and where they do, the teams are aligned on making the transition work without you having to think about it.
- If you're using Elasticsearch as a search and/or document datastore/database, you'll want to review your type usage, especially your parent/child usage.
Of course, we recommend getting ahead of changes like this by adopting them early. If you're looking to do so, you can use Kibana's Console / dev-tools to reindex your data, copying each document's "_type" into a regular "type" field:
POST _reindex
{
  "source": {
    "index": "old"
  },
  "dest": {
    "index": "new"
  },
  "script": {
    "inline": """
      ctx._id = ctx._type + "-" + ctx._id;
      ctx._source.type = ctx._type;
      ctx._type = "doc";
    """
  }
}
This moves the special "_type" field over to "type" which you can then use in subsequent filtering, aggregations, etc.
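To see what the script does to a single document, here is a small Python simulation of the same three assignments, plus a query body that filters on the new field afterwards. This is only an illustration of the transformation, not how reindex is implemented; the function names and the sample document are hypothetical, and the term filter assumes the new "type" field is mapped as a keyword.

```python
def migrate_hit(hit, new_type="doc"):
    """Mirror the reindex script: prefix the id with the old type, copy the
    old type into the document body, and switch to a single shared type."""
    source = dict(hit["_source"])   # copy so the input is left untouched
    source["type"] = hit["_type"]   # overwrites any existing "type" field
    return {
        "_id": hit["_type"] + "-" + hit["_id"],
        "_type": new_type,
        "_source": source,
    }


def type_filter(type_name):
    """Query body restricting a search to documents of one former type.
    Assumes "type" is mapped as a keyword field."""
    return {"query": {"bool": {"filter": {"term": {"type": type_name}}}}}


# Hypothetical document from the "old" index.
old_hit = {"_id": "1", "_type": "user", "_source": {"name": "Alice"}}

print(migrate_hit(old_hit))
print(type_filter("user"))
```

Prefixing the id with the old type keeps documents from different types from colliding when they previously shared the same id.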
I have more questions! Tell me more!
Step 1: don't panic :)
Step 2: please let us know on our forums
We're actively looking for any particular problems this may cause with your use case, so if you think you may have some, please talk to us!