Multi data path bug in Elasticsearch 5.3.0
If you use a custom data path with Elasticsearch 5.3.0, you may be subject to a bug which could cause data loss unless properly handled.
The bug is triggered as follows:
- default.path.data is configured on the command line; it is configured by default in the RPM and Debian packages.
- path.data is configured in the elasticsearch.yml file as an array containing one or more paths.
The default.path.data command line setting is used to tell Elasticsearch which default data path to use unless path.data is configured either in the config file or on the command line. The bug occurs because path.data, when specified as an array, is merged with default.path.data instead of replacing it.
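For illustration, the RPM and Debian packages start the node with the default data path passed on the command line, roughly like this (a sketch; the exact flags come from the package's startup scripts):
/usr/share/elasticsearch/bin/elasticsearch -Edefault.path.data=/var/lib/elasticsearch
With path.data set to [ /mnt/path_1, /mnt/path_2, /mnt/path_3 ] in elasticsearch.yml, a 5.3.0 node ends up using all four paths: /var/lib/elasticsearch plus the three configured ones.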
First, this bug affects Elasticsearch 5.3.0 only. You can tell if you are affected by comparing the expected list of data paths with those returned by the _nodes API.
For example, imagine your elasticsearch.yml file contains something like the following:
path.data:
  - /mnt/path_1
  - /mnt/path_2
  - /mnt/path_3
or
path.data: [ /mnt/path_1, /mnt/path_2, /mnt/path_3 ]
Retrieve the path settings for all nodes with the following request:
curl -XGET "http://localhost:9200/_nodes?pretty&filter_path=nodes.*.settings.path.data"
which returns a response like this:
{ "nodes": { "GrrMUWcCTlKprhcvROUIoQ": { "settings": { "path": { "data": [ "/var/lib/elasticsearch", "/mnt/path_1", "/mnt/path_2", "/mnt/path_3" ] } } } } }
You see the extra /var/lib/elasticsearch entry? That is coming from default.path.data. If an extra entry is present, then you are affected and need to take action.
The impact of the bug
There are two possible outcomes from this bug:
- When multiple data paths are specified, Elasticsearch allocates each shard to one of the data paths. This means that you may have one or more shards located in the path specified in default.path.data.
- If you are running more than one node on a single machine, the second node may refuse to start because the default.path.data path is already locked by the first node.
This fix can be applied node-by-node with rolling restarts. It does not require a full cluster restart. Do not try to apply this fix on a running node.
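If your nodes were installed from the RPM or Debian packages, a node can typically be stopped with the following (a sketch, assuming a systemd-based install; use your init system's equivalent otherwise):
sudo systemctl stop elasticsearch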
Stop the first Elasticsearch node, then:
Change your path.data configuration to use a comma-separated string instead of an array, as shown in this example:
path.data: /mnt/path_1,/mnt/path_2,/mnt/path_3
This will override the default.path.data setting instead of merging with it.
Next, you will need to move any data from the path specified in default.path.data to one or more of the other data paths.
For instance, assuming your default.path.data is set to /var/lib/elasticsearch, you will need to copy any data in that path to one of the other configured paths: /mnt/path_1, /mnt/path_2, or /mnt/path_3.
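To decide where the data can fit, it can help to compare the size of the default path with the free space on each configured path (a sketch using the example paths):
du -sh /var/lib/elasticsearch
df -h /mnt/path_1 /mnt/path_2 /mnt/path_3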
If one of the other paths has sufficient space to hold all of the contents of /var/lib/elasticsearch then you can copy all the data to a single path as follows:
cp -vr /var/lib/elasticsearch/. /mnt/path_1/
The trailing /. on the source is important: it copies the contents of the directory rather than creating an extra elasticsearch/ directory inside the target path.
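If you run the copy as root, also make sure the copied files remain readable and writable by the user Elasticsearch runs as; the RPM and Debian packages use an elasticsearch user and group by default (a sketch, assuming those defaults):
chown -R elasticsearch:elasticsearch /mnt/path_1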
If there is too much data in /var/lib/elasticsearch to fit in a single path, then you can copy individual indices to different paths.
First, list the indices in /var/lib/elasticsearch:
ls /var/lib/elasticsearch/nodes/0/indices/
In our example there are three indices which need to be copied:
lL6xLDIrSfSqysFp-fnk8g/ lW0CidrcR9aIBYdI-wbyBg/ n6EnZ0MMSMmSVl4ktoX9ig/
To copy the lL6xLDIrSfSqysFp-fnk8g/ index to /mnt/path_1, we need to create the correct path in case it doesn't already exist:
mkdir -p /mnt/path_1/nodes/0/indices/lL6xLDIrSfSqysFp-fnk8g/
Then copy the index:
cp -vr /var/lib/elasticsearch/nodes/0/indices/lL6xLDIrSfSqysFp-fnk8g/. /mnt/path_1/nodes/0/indices/lL6xLDIrSfSqysFp-fnk8g/
Repeat this process for all remaining indices.
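If several of the remaining indices are going to the same target path, a small shell loop avoids the repetition (a sketch; the index IDs are the remaining two from this example and /mnt/path_2 is an arbitrary choice):
for idx in lW0CidrcR9aIBYdI-wbyBg n6EnZ0MMSMmSVl4ktoX9ig; do
  mkdir -p /mnt/path_2/nodes/0/indices/"$idx"/
  cp -vr /var/lib/elasticsearch/nodes/0/indices/"$idx"/. /mnt/path_2/nodes/0/indices/"$idx"/
done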
Finally, restart the Elasticsearch node and check that the path.data settings for this node are correct, using the same _nodes request as above:
curl -XGET "http://localhost:9200/_nodes?pretty&filter_path=nodes.*.settings.path.data"
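After the restart, the response should no longer contain the /var/lib/elasticsearch entry. A quick way to confirm is to search the output for it; the following should print nothing (a sketch):
curl -s -XGET "http://localhost:9200/_nodes?filter_path=nodes.*.settings.path.data" | grep /var/lib/elasticsearch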
Check the cluster health and make sure that the status is either yellow or green:
curl -XGET "http://localhost:9200/_cat/health?v"
If the status is yellow, wait for it to turn green before continuing this same process with the next node. Once cluster health has returned to green, you can delete the contents of /var/lib/elasticsearch.
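Rather than polling _cat/health by hand, the cluster health API can also block until the desired status is reached (a sketch; the timeout value is arbitrary):
curl -XGET "http://localhost:9200/_cluster/health?wait_for_status=green&timeout=120s&pretty"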
If the status is red, then you have forgotten to copy some data from default.path.data. You can check which shards are not recovering correctly using the _cat/shards API:
curl -XGET "http://localhost:9200/_cat/shards?v"
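To narrow the output down to the problem shards, you can filter for those that are unassigned (a sketch using grep on the _cat output):
curl -s -XGET "http://localhost:9200/_cat/shards" | grep UNASSIGNED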
Elasticsearch 5.3.1 and above will come with a bug fix for this bad configuration merging. It will also check whether you may have suffered from this bug in the past by looking at your current settings and the contents of the default.path.data path to see whether it contains any shard data. If it finds data there, the node will refuse to start.
To solve this issue, you will need to make sure that the path listed in default.path.data is empty. To be on the safe side, either rename the directory or move the data to a new directory rather than just deleting the directory. Once your cluster is green, you can safely delete the backup copy.
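For example, one way to back up the old data and leave the default path empty of shard data is to move the nodes directory aside (a sketch; the backup location /var/lib/elasticsearch_backup is arbitrary):
mkdir -p /var/lib/elasticsearch_backup
mv /var/lib/elasticsearch/nodes /var/lib/elasticsearch_backup/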