Multi data path bug in Elasticsearch 5.3.0
If you use a custom data path with Elasticsearch 5.3.0, you may be subject to a bug which could cause data loss unless properly handled.
The bug is triggered as follows:
- default.path.data is configured on the command line, as it is by default in the RPM and Debian packages (see the example after this list).
- path.data is configured in the elasticsearch.yml file as an array containing one or more paths.
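For illustration only: the packaged startup scripts pass this default on the command line with the -E flag, along the lines of the command below. The exact invocation and paths depend on the package and version, so treat this as an assumption rather than the precise command on your system:
/usr/share/elasticsearch/bin/elasticsearch -Edefault.path.data=/var/lib/elasticsearch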
The default.path.data command line setting is used to tell Elasticsearch
which default data path to use unless
path.data is configured either in the
config file or on the command line. The bug occurs because
path.data, when
specified as an array, is merged with
default.path.data instead of replacing
it.
First, this bug affects Elasticsearch 5.3.0 only. You can tell if you are
affected by comparing the expected list of data paths with those returned by
the
_nodes API.
For example, imagine your elasticsearch.yml file contains something like
the following:
path.data:
- /mnt/path_1
- /mnt/path_2
- /mnt/path_3
or
path.data: [ /mnt/path_1, /mnt/path_2, /mnt/path_3 ]
Retrieve the path settings for all nodes with the following request:
curl -XGET "http://localhost:9200/_nodes?pretty&filter_path=nodes.*.settings.path.data"
which returns a response like this:
{
"nodes": {
"GrrMUWcCTlKprhcvROUIoQ": {
"settings": {
"path": {
"data": [
"/var/lib/elasticsearch",
"/mnt/path_1",
"/mnt/path_2",
"/mnt/path_3"
]
}
}
}
}
}
You see the extra /var/lib/elasticsearch entry? That is coming from
default.path.data. If an extra entry is present, then you are affected and
need to take action.
The impact of the bug
There are two possible outcomes from this bug.
- When multiple data paths are specified, Elasticsearch allocates each shard to one of the data paths. This means that you may have one or more shards located in the path specified in default.path.data.
- If you are running more than one node on a single machine, the second node may refuse to start because the default.path.data path is already locked by the first node.
How to fix the problem
The fix can be applied node-by-node with rolling restarts; it does not require a full cluster restart. Do not try to apply this fix on a running node.
Stop the first Elasticsearch node, then:
Change your path.data configuration to use a comma-separated string instead
of an array, as shown in this example:
path.data: /mnt/path_1,/mnt/path_2,/mnt/path_3
This will override the default.path.data setting instead of merging with it.
Next, you will need to move any data from the path specified in default.path.data to one
or more of the other data paths.
For instance, assuming your default.path.data is set to /var/lib/elasticsearch,
you will need to copy any data in that path to one of the other configured
paths:
/mnt/path_1, /mnt/path_2, or /mnt/path_3.
If one of the other paths has sufficient space to hold all of the contents of
/var/lib/elasticsearch then you can copy all the data to a single path as
follows:
cp -vr /var/lib/elasticsearch/. /mnt/path_1/
The trailing /. on the source path is important! It tells cp to copy the contents of /var/lib/elasticsearch rather than the directory itself, so that the nodes directory ends up directly under /mnt/path_1.
If there is too much data in /var/lib/elasticsearch to fit in a single path,
then you can copy individual indices to different paths.
First, list the indices in /var/lib/elasticsearch:
ls /var/lib/elasticsearch/nodes/0/indices/
In our example there are three indices which need to be copied:
lL6xLDIrSfSqysFp-fnk8g/ lW0CidrcR9aIBYdI-wbyBg/ n6EnZ0MMSMmSVl4ktoX9ig/
To copy the lL6xLDIrSfSqysFp-fnk8g/ index to /mnt/path_1, we need to create the correct path in case it doesn’t already exist:
mkdir -p /mnt/path_1/nodes/0/indices/lL6xLDIrSfSqysFp-fnk8g/
Then copy the index:
cp -vr /var/lib/elasticsearch/nodes/0/indices/lL6xLDIrSfSqysFp-fnk8g/ /mnt/path_1/nodes/0/indices/lL6xLDIrSfSqysFp-fnk8g/
Repeat this process for all remaining indices.
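If no single path can hold all of the remaining data, a small shell loop can spread the indices across the configured paths. The following is only a sketch in bash: it assumes a single node directory (nodes/0) and distributes indices round-robin without looking at their sizes, so check that each target path has enough free space before running it.
# Distribute the index directories across the three configured paths.
targets=(/mnt/path_1 /mnt/path_2 /mnt/path_3)
i=0
for index in /var/lib/elasticsearch/nodes/0/indices/*/; do
  target=${targets[$((i % 3))]}/nodes/0/indices/
  mkdir -p "$target"
  cp -vr "$index" "$target"
  i=$((i + 1))
done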
Finally, restart the Elasticsearch node and check that the path.data
settings for this node are correct, using the same _nodes request as above:
curl -XGET "http://localhost:9200/_nodes?pretty&filter_path=nodes.*.settings.path.data"
Check the cluster health and make sure that the status is either yellow or green:
curl -XGET "http://localhost:9200/_cat/health?v"
If the status is yellow, wait for it to turn green before continuing this
same process with the next node. Once cluster health has returned to
green,
you can delete the contents of
/var/lib/elasticsearch.
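Instead of repeatedly polling _cat/health, you can also ask the cluster health API to block until the desired status is reached, for example:
curl -XGET "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty"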
If the status is red, then you have forgotten to copy some data from
default.path.data. You can check which shards are not recovering correctly using the _cat/shards API:
curl -XGET "http://localhost:9200/_cat/shards?v"
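To narrow the output down to the shards that need attention, you can, for example, filter for unassigned shards:
curl -XGET "http://localhost:9200/_cat/shards" | grep UNASSIGNED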
Elasticsearch 5.3.1 and above will include a fix for this configuration merging bug. It will also check whether you may have been affected by this bug in the past by looking at your current settings and at the contents of the default.path.data path to see whether it contains any shard data. If it finds data there, the node will refuse to start.
To solve this issue, you will need to make sure that the path listed in
default.path.data is empty. To be on the safe side, either rename the
directory or move the data to a new directory rather than just deleting the
directory. Once your cluster is
green, you can safely delete the backup copy.
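For instance, assuming default.path.data points at /var/lib/elasticsearch and no node is currently using that directory, you could rename it as a backup (the .backup suffix is just an example name):
mv /var/lib/elasticsearch /var/lib/elasticsearch.backup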