Tutorial: Data stream retention
In this tutorial, we are going to go over data stream lifecycle retention: we will define it, go over how it can be configured, and how it gets applied. Keep in mind, the following options apply only to data streams that are managed by the data stream lifecycle.
You can verify if a data stream is managed by the data stream lifecycle via the get data stream lifecycle API:
Python:
resp = client.indices.get_data_lifecycle(
    name="my-data-stream",
)
print(resp)

Ruby:
response = client.indices.get_data_lifecycle(
  name: 'my-data-stream'
)
puts response

JavaScript:
const response = await client.indices.getDataLifecycle({
  name: "my-data-stream",
});
console.log(response);

Console:
GET _data_stream/my-data-stream/_lifecycle
The result should look like this:
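The exact response depends on your version and configuration; for a managed data stream with no retention configured yet, it should look something like this (the shape follows the fuller example later in this tutorial):

{
  "data_streams": [
    {
      "name": "my-data-stream",
      "lifecycle": {
        "enabled": true
      }
    }
  ]
}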
What is data stream retention?
We define retention as the minimum amount of time the data of a data stream are going to be kept in Elasticsearch. After this time period has passed, Elasticsearch is allowed to remove these data to free up space and/or manage costs.
Retention does not define the point at which the data will be removed, but the minimum time period they will be kept.
We define 4 different types of retention:

- The data stream retention, or data_retention, which is the retention configured on the data stream level. It can be set via an index template for future data streams or via the PUT data stream lifecycle API for an existing data stream. When the data stream retention is not set, it implies that the data need to be kept forever.
- The global default retention, let's call it default_retention, which is a retention configured via the cluster setting data_streams.lifecycle.retention.default and will be applied to all data streams managed by the data stream lifecycle that do not have data_retention configured. Effectively, it ensures that there will be no data streams keeping their data forever. This can be set via the update cluster settings API.
- The global max retention, let's call it max_retention, which is a retention configured via the cluster setting data_streams.lifecycle.retention.max and will be applied to all data streams managed by the data stream lifecycle. Effectively, it ensures that there will be no data streams whose retention will exceed this time period. This can be set via the update cluster settings API.
- The effective retention, or effective_retention, which is the retention applied to a data stream at a given moment. Effective retention cannot be set; it is derived by taking into account all the configured retention options listed above, and is calculated as described below.
Global default and max retention do not apply to data streams internal to Elastic. Internal data streams are recognised either by having the system flag set to true or by having their name prefixed with a dot (.).
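For example, the exemption rule can be expressed as the following check (an illustrative Python sketch, not Elasticsearch's actual code):

def is_internal(name: str, is_system: bool) -> bool:
    # Internal data streams are exempt from the global default and max
    # retention: either the system flag is set, or the name starts with a dot.
    return is_system or name.startswith(".")

print(is_internal(".fleet-actions", False))  # True: dot-prefixed, exempt
print(is_internal("my-data-stream", False))  # False: subject to global retention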
How to configure retention?
- By setting the data_retention on the data stream level. This retention can be configured in two ways:

  - For new data streams, it can be defined in the index template that will be applied during the data stream's creation. You can use the create index template API, for example:
Python:
resp = client.indices.put_index_template(
    name="template",
    index_patterns=[
        "my-data-stream*"
    ],
    data_stream={},
    priority=500,
    template={
        "lifecycle": {
            "data_retention": "7d"
        }
    },
    meta={
        "description": "Template with data stream lifecycle"
    },
)
print(resp)

JavaScript:
const response = await client.indices.putIndexTemplate({
  name: "template",
  index_patterns: ["my-data-stream*"],
  data_stream: {},
  priority: 500,
  template: {
    lifecycle: {
      data_retention: "7d",
    },
  },
  _meta: {
    description: "Template with data stream lifecycle",
  },
});
console.log(response);

Console:
PUT _index_template/template
{
  "index_patterns": ["my-data-stream*"],
  "data_stream": { },
  "priority": 500,
  "template": {
    "lifecycle": {
      "data_retention": "7d"
    }
  },
  "_meta": {
    "description": "Template with data stream lifecycle"
  }
}

  - For an existing data stream, it can be set via the PUT lifecycle API.
Python:
resp = client.indices.put_data_lifecycle(
    name="my-data-stream",
    data_retention="30d",
)
print(resp)

Ruby:
response = client.indices.put_data_lifecycle(
  name: 'my-data-stream',
  body: {
    data_retention: '30d'
  }
)
puts response

JavaScript:
const response = await client.indices.putDataLifecycle({
  name: "my-data-stream",
  data_retention: "30d",
});
console.log(response);
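For reference, the same request expressed in Console syntax (a sketch based on the PUT data stream lifecycle API used above) would look like:

Console:
PUT _data_stream/my-data-stream/_lifecycle
{
  "data_retention": "30d"
}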
- By setting the global retention via the data_streams.lifecycle.retention.default and/or data_streams.lifecycle.retention.max settings, which are set on a cluster level. These can be set via the update cluster settings API. For example:

Python:
resp = client.cluster.put_settings(
    persistent={
        "data_streams.lifecycle.retention.default": "7d",
        "data_streams.lifecycle.retention.max": "90d"
    },
)
print(resp)

JavaScript:
const response = await client.cluster.putSettings({
  persistent: {
    "data_streams.lifecycle.retention.default": "7d",
    "data_streams.lifecycle.retention.max": "90d",
  },
});
console.log(response);

Console:
PUT /_cluster/settings
{
  "persistent" : {
    "data_streams.lifecycle.retention.default" : "7d",
    "data_streams.lifecycle.retention.max" : "90d"
  }
}
How is the effective retention calculated?
The effective retention is calculated in the following way:
- The effective_retention is the default_retention, when default_retention is defined and the data stream does not have data_retention.
- The effective_retention is the data_retention, when data_retention is defined and, if max_retention is defined, it is less than the max_retention.
- The effective_retention is the max_retention, when max_retention is defined, and the data stream has either no data_retention or its data_retention is greater than the max_retention.
The above is demonstrated in the examples below:
| default_retention | max_retention | data_retention | effective_retention | Retention determined by |
|---|---|---|---|---|
| Not set | Not set | Not set | Infinite | N/A |
| Not relevant | 12 months | 30 days | 30 days | data_retention |
| Not relevant | Not set | 30 days | 30 days | data_retention |
| 30 days | 12 months | Not set | 30 days | default_retention |
| 30 days | 30 days | Not set | 30 days | default_retention |
| Not relevant | 30 days | 12 months | 30 days | max_retention |
| Not set | 30 days | Not set | 30 days | max_retention |
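To make the precedence rules concrete, here is a minimal Python sketch of the calculation (an illustration of the documented rules, not Elasticsearch's implementation; the function name and the use of None for "not set" are assumptions of this sketch):

from datetime import timedelta

def effective_retention(default_retention, max_retention, data_retention):
    """Derive the effective retention; None stands for 'not set',
    and a returned None means the data is kept forever."""
    if data_retention is not None:
        retention = data_retention
    else:
        # No data stream level retention: fall back to the global default,
        # or to the global max if no default is configured.
        retention = default_retention if default_retention is not None else max_retention
    # The global max always caps the result when it is defined.
    if max_retention is not None and (retention is None or retention > max_retention):
        retention = max_retention
    return retention

days = lambda n: timedelta(days=n)

# Cross-checking the rows of the table above
# ('12 months' is approximated as 360 days for illustration):
assert effective_retention(None, None, None) is None               # infinite
assert effective_retention(None, days(360), days(30)) == days(30)  # data_retention wins
assert effective_retention(days(30), days(360), None) == days(30)  # default_retention wins
assert effective_retention(None, days(30), days(360)) == days(30)  # capped by max_retention
assert effective_retention(None, days(30), None) == days(30)       # max_retention as fallback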
Considering our example, if we retrieve the lifecycle of my-data-stream:
Python:
resp = client.indices.get_data_lifecycle(
    name="my-data-stream",
)
print(resp)

Ruby:
response = client.indices.get_data_lifecycle(
  name: 'my-data-stream'
)
puts response

JavaScript:
const response = await client.indices.getDataLifecycle({
  name: "my-data-stream",
});
console.log(response);

Console:
GET _data_stream/my-data-stream/_lifecycle
We see that the effective retention remains the same as what the user configured:
{
"global_retention" : {
"max_retention" : "90d",
"default_retention" : "7d"
},
"data_streams": [
{
"name": "my-data-stream",
"lifecycle": {
"enabled": true,
"data_retention": "30d",
"effective_retention": "30d",
"retention_determined_by": "data_stream_configuration"
}
}
]
}
- max_retention: The maximum retention configured in the cluster.
- default_retention: The default retention configured in the cluster.
- data_retention: The requested retention for this data stream.
- effective_retention: The retention that is applied by the data stream lifecycle on this data stream.
- retention_determined_by: The configuration that determined the effective retention. In this case it's the data stream configuration, because its data_retention (30d) is less than the max_retention (90d).
How is the effective retention applied?
Retention is applied to the remaining backing indices of a data stream as the last step of a data stream lifecycle run. The data stream lifecycle will retrieve the backing indices whose generation_time is longer than the effective retention period and delete them. The generation_time is only applicable to rolled over backing indices, and it is either the time since the backing index got rolled over or the time optionally configured in the index.lifecycle.origination_date setting.
We use the generation_time instead of the creation time because this ensures that all data in the backing index have passed the retention period. As a result, the retention period is not the exact time at which data get deleted, but the minimum time data will be stored.
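As a sketch of that check (again illustrative, not Elasticsearch's actual code; the helper function and its inputs are hypothetical), a rolled over backing index becomes eligible for deletion once its generation_time exceeds the effective retention:

from datetime import datetime, timedelta, timezone

def is_eligible_for_deletion(rollover_time, effective_retention,
                             origination_date=None, now=None):
    """A rolled over backing index is deletable once its generation_time
    (time since rollover, or since the optional
    index.lifecycle.origination_date) exceeds the effective retention."""
    if effective_retention is None:
        return False  # no retention configured: data is kept forever
    now = now or datetime.now(timezone.utc)
    start = origination_date or rollover_time
    generation_time = now - start
    return generation_time > effective_retention

# Example: an index rolled over 31 days ago, with a 30 day effective retention.
rolled_over = datetime.now(timezone.utc) - timedelta(days=31)
print(is_eligible_for_deletion(rolled_over, timedelta(days=30)))  # True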