Mapping and Types


As explained in the previous sections, elasticsearch-hadoop integrates closely with the Hadoop ecosystem and performs close introspection of the type information so that the data flow between Elasticsearch and Hadoop is as transparent as possible. This section takes a closer look at how the type conversion takes place and how data is mapped between the two systems.

Converting data to Elasticsearch


By design, elasticsearch-hadoop provides no data transformation or mapping layer itself, simply because there is no need for one: Hadoop is designed to do ETL and some libraries (like Pig and Hive) provide type information themselves; furthermore, Elasticsearch has rich support for mapping out of the box, including automatic detection, dynamic/schema-less mapping, templates and full manual control. Need to split strings into tokens, do data validation or eliminate unneeded data? There are plenty of ways to do that in Hadoop before reading/writing data from/to Elasticsearch. Need control over how data is stored in Elasticsearch? Use the Elasticsearch APIs to define the mapping, update settings or add generic metadata.

Time/Date mapping


When it comes to handling dates, Elasticsearch always uses the ISO 8601 format for date/time. This is the default date format of Elasticsearch - if a custom one is needed, please add it to the default option rather than just replacing it. See the date format section in the elasticsearch-hadoop reference documentation for more information.
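As a minimal sketch of producing ISO 8601 values before documents are handed over, assuming the records are prepared in Python (the `to_iso8601` helper and the sample document are illustrative, not part of elasticsearch-hadoop):

```python
from datetime import datetime, timezone


def to_iso8601(value: datetime) -> str:
    """Render a datetime as an ISO 8601 string (hypothetical helper)."""
    return value.isoformat()


naive = datetime(2009, 11, 15, 14, 12, 12)
aware = datetime(2009, 11, 15, 14, 12, 12, tzinfo=timezone.utc)

# Sample document; the field names are only an illustration.
doc = {"user": "kimchy", "postDate": to_iso8601(naive)}

print(doc["postDate"])    # 2009-11-15T14:12:12
print(to_iso8601(aware))  # 2009-11-15T14:12:12+00:00
```

Emitting dates in this shape means the default date detection can recognize them without a custom format.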

Automatic mapping


By default, Elasticsearch provides automatic index creation and mapping when data is added under an index that has not been created before. In other words, data can be added into Elasticsearch without the index and the mappings being defined a priori. This is quite convenient since Elasticsearch automatically adapts to the data being fed to it - moreover, if certain entries have extra fields, Elasticsearch's schema-less nature allows them to be indexed without any issues.

It is important to remember that automatic mapping uses the payload values to identify the field types, and that the first document containing a field determines its mapping. elasticsearch-hadoop communicates with Elasticsearch through JSON, which does not provide any type information, only the field names and their values. One can think of it as type erasure or information loss; for example, JSON does not differentiate integer numeric types - byte, short, int and long are all placed in the same long bucket. This can have unexpected side effects since the type information is guessed, such as:

numbers mapped only as long/double

Whenever Elasticsearch encounters a number, it will allocate the largest type for it since it does not know the exact number type of the field. Allocating a small type (such as byte, int or float) can lead to problems if a later document contains a larger value, so Elasticsearch uses a safe default. For example, the document:

    {
        "tweet" : {
            "user" : "kimchy",
            "message" : "This is a tweet!",
            "postDate" : "2009-11-15T14:12:12",
            "priority" : 4,
            "rank" : 12.3
        }
    }

triggers the following mapping:

{ "test" : {
    "mappings" : {
      "index" : {
        "properties" : {
          "message" : {
            "type" : "string"
          },
          "postDate" : {
            "type" : "date",
            "format" : "dateOptionalTime"
          },
          "priority" : {
            "type" : "long"
          },
          "rank" : {
            "type" : "double"
          },
          "user" : {
            "type" : "string"
          }
        }
      }
    }
  }
}

The postDate field was recognized as a date in ISO 8601 format (dateOptionalTime)

The integer number (4) was mapped to the largest available type (long)

The fractional number (12.3) was mapped to the largest available type (double)
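The guessing described above can be imitated in a few lines. This is a deliberately simplified sketch, not Elasticsearch's actual detection logic: `guess_type` and the `ISO_DATE` pattern are hypothetical, and only the cases from the example document are covered.

```python
import json
import re

# Rough pattern for ISO 8601 date and date-time strings (illustrative only).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2})?)?$")


def guess_type(value):
    """Guess a field type from a decoded JSON value (simplified imitation)."""
    if isinstance(value, bool):  # checked first because bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"            # JSON carries no width, so pick the widest integer
    if isinstance(value, float):
        return "double"          # likewise for fractional numbers
    if isinstance(value, str):
        return "date" if ISO_DATE.match(value) else "string"
    return "object"


# Serializing to JSON discards any notion of byte/short/int/float.
payload = json.dumps({
    "user": "kimchy",
    "postDate": "2009-11-15T14:12:12",
    "priority": 4,
    "rank": 12.3,
})

guessed = {name: guess_type(value) for name, value in json.loads(payload).items()}
print(guessed)
```

Once the values have passed through JSON, nothing distinguishes a byte from a long, which is why a wide default is the only safe guess.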

incorrect mapping

This happens when a string field contains numbers (say 1234) - Elasticsearch has no information that the number is actually a string and thus maps the field as a number, causing a parsing exception when a string is encountered. For example, this document:

{ "array":[123, "string"] }

causes an exception with automatic mapping:

{"error":"MapperParsingException[failed to parse [array]]; nested: NumberFormatException[For input string: \"string\"]; ","status":400}

because the field array is initially detected as a number (because of 123), so "string" triggers the parsing exception since it clearly is not a number. The same issue tends to occur with strings that might be interpreted as dates.
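The failure mode can be reproduced outside Elasticsearch with a small simulation. The sketch below is an assumption-laden toy (`index_values` is hypothetical and models only "the first value fixes the type"), but it shows why the second element is rejected and one way to avoid the conflict on the Hadoop side:

```python
def index_values(values):
    """Simulate a field whose type is fixed by the first value it receives."""
    field_type = None
    for value in values:
        if field_type is None:
            # The first value decides the type, as with automatic mapping.
            field_type = "long" if isinstance(value, int) else "string"
        if field_type == "long":
            int(value)  # raises ValueError when a later value is not numeric
    return field_type


mixed = [123, "string"]

try:
    index_values(mixed)
    outcome = "indexed"
except ValueError as err:
    outcome = f"rejected: {err}"
print(outcome)

# Normalizing every value to a string up front avoids the conflict.
print(index_values([str(v) for v in mixed]))  # string
```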

Hence, if the defaults need to be overridden and/or if you experience the problems described above, potentially due to a diverse dataset, consider using Explicit mapping.

Disabling automatic mapping


Elasticsearch allows automatic index creation as well as dynamic mapping (for extra fields present in documents) to be disabled through the action.auto_create_index and index.mapper.dynamic settings in the nodes' configuration files. As a safety net, elasticsearch-hadoop provides a dedicated configuration option which allows elasticsearch-hadoop to either create the index or not, without having to modify the Elasticsearch cluster options.
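As an illustration, the job-level settings might look like the following, assuming the option referred to here is `es.index.auto.create` and that the settings are collected in a plain Python dict before being passed to the job (the node address and the `test/tweet` target are placeholders):

```python
# Job-level elasticsearch-hadoop settings collected in a plain dict.
es_settings = {
    "es.nodes": "localhost:9200",   # placeholder cluster address
    "es.resource": "test/tweet",    # placeholder index/type target
    "es.index.auto.create": "no",   # fail rather than create a missing index
}

for key, value in sorted(es_settings.items()):
    print(f"{key}={value}")
```

With the index creation switched off on the job side, a write against a missing index fails early instead of silently producing an automatically mapped index.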

Explicit mapping


Explicit or manual mapping should be considered when the defaults need to be overridden, if the data is detected incorrectly (as explained above) or, in most cases, to customize the index analysis. Refer to the Elasticsearch create index and mapping documentation on how to define an index and its types - note that these need to be present before data is uploaded to Elasticsearch (otherwise automatic mapping will be used by Elasticsearch, if enabled).
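A hedged sketch of what such an explicit definition could look like, built as a Python dict and serialized to the JSON body of a create index request. The type name `tweet` and the narrower `byte` and `float` choices are assumptions made for illustration; the `string` type follows the mapping output shown earlier in this section, and other Elasticsearch versions may use different type names or layout.

```python
import json

# Request body for creating an index with an explicit mapping.
# "tweet" is a hypothetical type name used only for this example.
create_index_body = {
    "mappings": {
        "tweet": {
            "properties": {
                "user": {"type": "string"},
                "message": {"type": "string"},
                "postDate": {"type": "date", "format": "dateOptionalTime"},
                "priority": {"type": "byte"},   # narrower than the guessed long
                "rank": {"type": "float"},      # narrower than the guessed double
            }
        }
    }
}

body = json.dumps(create_index_body, indent=2)
print(body)
```

The body has to be sent to Elasticsearch before the first document is written; how it is sent (HTTP client, command line tool) is left out here on purpose.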

In most cases, templates are quite handy as they are automatically applied to new indices created that match the pattern; in other words instead of defining the mapping per index, one can just define the template once and then have it applied to all indices that match its pattern.
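The idea of "define once, apply to every matching index" can be pictured with a small simulation. It only models wildcard matching on index names; the `templates` table, the `tweets-*` pattern and `mapping_for` are hypothetical, and Elasticsearch's real template resolution does more than this.

```python
from fnmatch import fnmatchcase

# Hypothetical template table: index name pattern -> field types to apply.
templates = {
    "tweets-*": {"priority": "byte", "rank": "float"},
}


def mapping_for(index_name):
    """Return the field types of the first template whose pattern matches."""
    for pattern, field_types in templates.items():
        if fnmatchcase(index_name, pattern):
            return field_types
    return None  # no template applies, automatic mapping would take over


print(mapping_for("tweets-2009.11"))  # picks up the template
print(mapping_for("logs-2009.11"))    # None
```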