Field data formats

The field data format controls how field data should be stored.

Depending on the field type, there might be several field data formats available.

Here is an example of how to configure the tag field to use the fst field data format.

{
    tag: {
        type:      "string",
        fielddata: {
            format: "fst"
        }
    }
}

It is possible to change the field data format (and the field data settings in general) on a live index by using the update mapping API. When doing so, field data which had already been loaded for existing segments will remain alive while new segments will use the new field data configuration. Thanks to the background merging process, all segments will eventually use the new field data format.
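As a rough illustration of the update described above, the following Python sketch only builds the request that would be sent to the update mapping API; it does not contact a cluster. The index name `myindex`, the type name `mytype`, the helper `build_fielddata_update`, and the `/{index}/{type}/_mapping` path layout are all assumptions made for the example, so check them against the release you are running.

```python
import json


def build_fielddata_update(index, doc_type, field, field_type, fielddata):
    """Return (method, path, body) for an update mapping request.

    Hypothetical helper: the path layout /{index}/{type}/_mapping is an
    assumption and may differ between releases.
    """
    mapping = {
        doc_type: {
            "properties": {
                field: {"type": field_type, "fielddata": fielddata}
            }
        }
    }
    path = "/{}/{}/_mapping".format(index, doc_type)
    return "PUT", path, json.dumps(mapping, sort_keys=True)


# Switch the "tag" field to the fst format (index and type names are made up).
method, path, body = build_fielddata_update(
    "myindex", "mytype", "tag", "string", {"format": "fst"}
)
print(method, path)
print(body)
```

The returned triple can be handed to whichever HTTP client you already use. Existing segments keep their loaded field data until they are merged away.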

String field data types

paged_bytes (default)
Stores unique terms sequentially in a large buffer and maps documents to the indices of the terms they contain in this large buffer.
fst
Stores terms in an FST (finite state transducer). Slower to build than paged_bytes but can help lower memory usage if many terms share common prefixes and/or suffixes.
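To make the paged_bytes description more concrete, here is a simplified Python sketch of the idea: unique terms are written one after another into a single buffer, and each document keeps only the indices (ordinals) of its terms. This is a conceptual model, not the actual implementation; the function names, the sorted term order, and the offsets array are assumptions made for the example.

```python
def build_paged_bytes(docs):
    """Build a toy paged_bytes-like structure.

    docs: list of term lists, one list per document.
    Returns (buffer, offsets, doc_ordinals).
    """
    # Assumption: unique terms are stored in sorted order.
    unique_terms = sorted({term for terms in docs for term in terms})

    buffer = bytearray()
    offsets = []
    for term in unique_terms:
        offsets.append(len(buffer))      # start of this term in the buffer
        buffer += term.encode("utf-8")
    offsets.append(len(buffer))          # sentinel marking the end of the last term

    ordinal = {term: i for i, term in enumerate(unique_terms)}
    # Each document stores term indices instead of the terms themselves.
    doc_ordinals = [sorted(ordinal[t] for t in set(terms)) for terms in docs]
    return bytes(buffer), offsets, doc_ordinals


def term_at(buffer, offsets, ordinal):
    """Resolve a term index back to the term text."""
    return buffer[offsets[ordinal]:offsets[ordinal + 1]].decode("utf-8")


docs = [["red", "blue"], ["blue"], [], ["green", "red"]]
buffer, offsets, doc_ordinals = build_paged_bytes(docs)
print(doc_ordinals)                   # [[0, 2], [0], [], [1, 2]]
print(term_at(buffer, offsets, 1))    # green
```

Each unique term is stored once no matter how many documents contain it, which is why the per-document cost is only a few integers.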

Numeric field data types

array (default)
Stores field values in memory using arrays.

Geo point field data types

array (default)
Stores latitudes and longitudes in arrays.

Fielddata loading

By default, field data is loaded lazily, i.e. the first time a query that requires it is executed. However, this can make the first requests that follow a merge operation quite slow, since loading field data is a heavy operation.

It is possible to force field data to be loaded and cached eagerly through the loading setting of fielddata:

{
    category: {
        type:      "string",
        fielddata: {
            loading: "eager"
        }
    }
}

Disabling field data loading

Field data can take a lot of RAM, so it makes sense to disable field data loading on fields that don’t need it, for example those that are used for full-text search only. To disable field data loading, change the field data format to disabled. When it is disabled, any request that tries to load field data, e.g. one that includes facets and/or sorting, will return an error.

{
    text: {
        type:      "string",
        fielddata: {
            format: "disabled"
        }
    }
}

The disabled format is supported by all field types.

Filtering fielddata

It is possible to control which field values are loaded into memory, which is particularly useful for string fields. When specifying the mapping for a field, you can also specify a fielddata filter.

Fielddata filters can be changed using the PUT mapping API. After changing the filters, use the Clear Cache API to reload the fielddata using the new filters.

Filtering by frequency

The frequency filter allows you to load only terms whose frequency falls between a min and max value, which can be expressed as an absolute number or as a percentage (e.g. 0.01 is 1%). Frequency is calculated per segment. Percentages are based on the number of docs which have a value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum number of docs that the segment should contain with min_segment_size:

{
    tag: {
        type:      "string",
        fielddata: {
            filter: {
                frequency: {
                    min:              0.001,
                    max:              0.1,
                    min_segment_size: 500
                }
            }
        }
    }
}
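The per-segment arithmetic can be sketched in Python as follows. This is a simplified model rather than the real implementation, and it rests on several assumptions that the text above does not spell out: thresholds below 1 are treated as fractions and anything else as an absolute count, both bounds are inclusive, and a segment with fewer docs than min_segment_size skips the filter and loads all of its terms. The function names are made up for the example.

```python
def to_absolute(threshold, docs_with_value):
    """Convert a min/max setting to an absolute document count.

    Assumption: values below 1 are fractions of the docs that have a value.
    """
    if threshold < 1:
        return threshold * docs_with_value
    return threshold


def filter_segment_terms(segment_docs, min_freq=0, max_freq=None,
                         min_segment_size=0):
    """Return the set of terms that would be loaded for one segment.

    segment_docs: one set of terms per document; an empty set means the
    document has no value for the field.
    """
    doc_freq = {}
    for terms in segment_docs:
        for term in set(terms):
            doc_freq[term] = doc_freq.get(term, 0) + 1

    # Assumption: segments below min_segment_size are not filtered at all.
    if len(segment_docs) < min_segment_size:
        return set(doc_freq)

    # Percentages use docs that have a value, not all docs in the segment.
    docs_with_value = sum(1 for terms in segment_docs if terms)
    low = to_absolute(min_freq, docs_with_value)
    high = (docs_with_value if max_freq is None
            else to_absolute(max_freq, docs_with_value))
    # Assumption: both bounds are inclusive.
    return {term for term, freq in doc_freq.items() if low <= freq <= high}


# 10 docs, 8 of which have a value: "common" in 7, "mid" in 3, "rare" in 1.
segment = ([{"common", "mid"}] * 3 + [{"common"}] * 4
           + [{"rare"}] + [set()] * 2)

print(filter_segment_terms(segment, min_freq=0.2, max_freq=0.5))  # {'mid'}
```

With min 0.2 and max 0.5, the bounds become 1.6 and 4 docs (20% and 50% of the 8 docs that have a value), so only "mid" is kept. Passing min_segment_size=500 for this 10-doc segment would, under the assumption above, leave all three terms loaded.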

Filtering by regex

Terms can also be filtered by regular expression: only terms which match the regular expression are loaded. Note that the regular expression is applied to each term in the field, not to the whole field value. For instance, to only load hashtags from a tweet, we can use a regular expression which matches terms beginning with #:

{
    tweet: {
        type:      "string",
        analyzer:  "whitespace",
        fielddata: {
            filter: {
                regex: {
                    pattern: "^#.*"
                }
            }
        }
    }
}
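The difference between matching each term and matching the whole field value can be seen in a small Python sketch. It imitates the whitespace analyzer with `str.split` and uses Python's `re` module, whose syntax is not identical to the regular expression engine used by the server. Both choices, the helper names, and the use of a full match against each term are assumptions made for the example.

```python
import re


def whitespace_terms(text):
    """Rough stand-in for the whitespace analyzer: split on whitespace only."""
    return text.split()


def filter_terms_by_regex(terms, pattern):
    """Keep only the terms that match the pattern in full."""
    compiled = re.compile(pattern)
    return [term for term in terms if compiled.fullmatch(term)]


tweet = "Indexing tweets with #elasticsearch and #lucene is fun"

# The pattern is tested against each term, so only the hashtags survive.
hashtags = filter_terms_by_regex(whitespace_terms(tweet), "^#.*")
print(hashtags)  # ['#elasticsearch', '#lucene']

# Tested against the whole field value, the same pattern matches nothing,
# because the tweet as a whole does not start with '#'.
print(filter_terms_by_regex([tweet], "^#.*"))  # []
```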

Combining filters

The frequency and regex filters can be combined:

{
    tweet: {
        type:      "string",
        analyzer:  "whitespace",
        fielddata: {
            filter: {
                regex: {
                    pattern: "^#.*"
                },
                frequency: {
                    min:              0.001,
                    max:              0.1,
                    min_segment_size: 500
                }
            }
        }
    }
}

Settings before v0.90

index.cache.field.type
The default type for the field data cache is resident (because of the cost of rebuilding it). Other types include soft.

index.cache.field.max_size
The maximum size (entry count, not byte size) of the cache, per search segment in a shard. Defaults to not set (-1).

index.cache.field.expire
A time-based setting that expires cache entries after a certain time of inactivity. Defaults to -1. For example, it can be set to 5m for a 5 minute expiry.

Monitoring field data

You can monitor memory usage for field data using the Nodes Stats API.