Note: If you follow my process, you are pretty much guaranteed to lose data. Please think carefully before running any commands from this page.
A few days ago Elasticsearch died on one of my servers due to a lack of memory - one of my Python scripts interacting with a headless Chrome instance forgot to close a few tabs… a lesson for another day.
So, a few pkills and a systemctl start elasticsearch later, my node was back up and running, but it wasn't looking too healthy - I had an unassigned shard for one of my indices.
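For the record, the resuscitation itself was nothing more exotic than the following (the pkill pattern is illustrative rather than exactly what I ran):

# Kill the runaway headless Chrome processes that were eating the memory
pkill -f "chrome --headless"
# Bring Elasticsearch back up
sudo systemctl start elasticsearch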
Running curl -XGET http://localhost:9200/_cluster/allocation/explain?pretty gave me the following:
{
  "index" : "myindexname",
  "shard" : 2,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2018-05-21T14:23:38.916Z",
    "failed_allocation_attempts" : 1,
    "details" : "failed recovery, failure RecoveryFailedException[[myindexname][2]: Recovery failed on {irybXIX}{irybXIXBRQ6w3XDiBeIQxA}{Va5KI3wzTAyw6ziIDKa1jQ}{111.222.333.444}{111.222.333.444:9300}]; nested: IndexShardRecoveryException[failed to fetch index version after copying it over]; nested: CorruptIndexException[failed engine (reason: [merge failed]) (resource=preexisting_corruption)]; nested: IOException[failed engine (reason: [merge failed])]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=b3b6ae3d actual=18b61dd7 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/elasticsearch/nodes/0/indices/eUZzli7PQdynUV89X-A7Dw/2/index/_3p5v.cfs\") [slice=_3p5v_Lucene50_0.pos]))]; ",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "irybXIXBRQ6w3XDiBeIQxA",
      "node_name" : "irybXIX",
      "transport_address" : "111.222.333.444:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "Qh7gzK-aR0iq7esB0ztmEQ",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "failed engine (reason: [merge failed]) (resource=preexisting_corruption)",
          "caused_by" : {
            "type" : "i_o_exception",
            "reason" : "failed engine (reason: [merge failed])",
            "caused_by" : {
              "type" : "corrupt_index_exception",
              "reason" : "checksum failed (hardware problem?) : expected=b3b6ae3d actual=18b61dd7 (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/var/lib/elasticsearch/elasticsearch/nodes/0/indices/eUZzli7PQdynUV89X-A7Dw/2/index/_3p5v.cfs\") [slice=_3p5v_Lucene50_0.pos]))"
            }
          }
        }
      }
    }
  ]
}
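The important bit is buried in the nested store_exception: the path of the segment file that failed its checksum. If you happen to have jq installed (an assumption on my part), you can pull it straight out of the explain output:

# Extract the innermost corruption reason, which names the bad segment file
curl -s localhost:9200/_cluster/allocation/explain | jq -r '.node_allocation_decisions[].store.store_exception.caused_by.caused_by.reason'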
So, the checksum on one of my shards doesn't match what Elasticsearch/Lucene expects, which means either my HDD is on its way out (plausible) or Elasticsearch running out of memory meant it didn't shut down cleanly.
A quick rummage through Google turned up a post by Jilles van Gurp, who follows some similar steps but on an older (1.x, maybe?) version of Elasticsearch. Here, Jilles uses the Lucene libraries bundled with Elasticsearch to check the index (and also fix any errors it finds, should we want it to).
# First, move to the directory containing the Lucene jars bundled with Elasticsearch
cd /usr/share/elasticsearch/lib
# We then run CheckIndex on our shard.
# java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex {dir of the shard}
java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/elasticsearch/nodes/0/indices/eUZzli7PQdynUV89X-A7Dw/2/index/
In my instance, it found a few corrupt entries which I was willing to sacrifice in order to get the rest of the index back online. So, we can re-run the command with -exorcise (previously -fix) to tell CheckIndex to fix the issues by dropping the unrecoverable segments - any documents in them are lost for good.
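For reference, that looks something like this (same shard path as before):

# Re-run CheckIndex with -exorcise so it removes the corrupt segments it found
java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/elasticsearch/nodes/0/indices/eUZzli7PQdynUV89X-A7Dw/2/index/ -exorcise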
If it fails to run, chances are it's due to a permissions error. Running it with sudo works but, in retrospect, it would be better to run it as the elasticsearch user using su - elasticsearch -c "command".
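Something along these lines should do it (note: on most package installs the elasticsearch user has a nologin shell, so you may need to hand su a real shell with -s):

# Run CheckIndex as the elasticsearch user so any files it writes keep the correct owner
sudo su -s /bin/bash -c "cd /usr/share/elasticsearch/lib && java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /var/lib/elasticsearch/elasticsearch/nodes/0/indices/eUZzli7PQdynUV89X-A7Dw/2/index/ -exorcise" elasticsearch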
At this point I was still struggling to get the damn shard to join its friends, so I continued my hunt. I considered rewriting the checksum for the shard to what Elasticsearch was expecting, but decided against that.
I then spotted the file corrupted_OC4_2JgCQrW_8MjrZgpDVQ in my shard's index directory and wondered if that was telling Elasticsearch not to even bother trying to initialise the shard because it's corrupt. A swift rm later and we're back on the right path!
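If you want to hunt down such marker files yourself, something like this does the job (the index UUID and shard number are from my cluster, and deleting the marker only makes sense once CheckIndex has dealt with the underlying corruption):

# Find any corruption marker files Elasticsearch has written for this node
find /var/lib/elasticsearch/elasticsearch/nodes/0/indices -name "corrupted_*"
# Remove the marker for the shard we have just repaired
rm /var/lib/elasticsearch/elasticsearch/nodes/0/indices/eUZzli7PQdynUV89X-A7Dw/2/index/corrupted_*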
I restarted my Elasticsearch instance again and watched as the shards came online and began getting assigned with curl -XGET localhost:9200/_cat/shards?v. In another window I was also running tail -f /var/log/elasticsearch/elasticsearch.log, and then something caught my eye: a new error message.
[2018-05-21T14:37:30,840][DEBUG][o.e.a.a.c.a.TransportClusterAllocationExplainAction] [irybXIX] explaining the allocation for [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false], found shard [[myindexname][2], node[null], [P], recovery_source[existing recovery], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2018-05-21T14:37:08.056Z], failed_attempts[1], delayed=false, details[failed recovery, failure RecoveryFailedException[[myindexname][2]: Recovery failed on {irybXIX}{irybXIXBRQ6w3XDiBeIQxA}{RXiDt2hRTDe_DND9gST7qQ}{111.222.333.444}{111.222.333.444:9300}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogException[failed to create new translog file]; nested: AccessDeniedException[/var/lib/elasticsearch/elasticsearch/nodes/0/indices/eUZzli7PQdynUV89X-A7Dw/2/translog/translog.ckp]; ], allocation_status[deciders_throttled]]]
A different error! AccessDeniedException suggests right away that our use of sudo earlier has left some Elasticsearch files owned by root, so a quick chown and we are back to normal:
/var/lib/elasticsearch/elasticsearch/nodes/0/indices/eUZzli7PQdynUV89X-A7Dw/2/index $ ls -la
total 7156392
drwxr-xr-x 2 elasticsearch elasticsearch 12288 May 21 14:37 .
drwxr-xr-x 5 elasticsearch elasticsearch 4096 May 21 14:31 ..
-rw-r--r-- 1 elasticsearch elasticsearch 15548 May 17 18:17 _2laq_69.liv
[... TRIM ...]
-rw-r--r-- 1 elasticsearch elasticsearch 11209 May 18 00:38 _3p6n.cfs
-rw-r--r-- 1 elasticsearch elasticsearch 396 May 18 00:38 _3p6n.si
-rw-r--r-- 1 root root 2086 May 21 12:13 segments_1hio
-rw-r--r-- 1 elasticsearch elasticsearch 0 Apr 28 2017 write.lock
Bingo, segments_1hio is owned by root rather than elasticsearch. Before we chown it, let's see if there is anything else we need to fix:
/var/lib/elasticsearch/elasticsearch/nodes/0/indices/eUZzli7PQdynUV89X-A7Dw/2/translog $ ls -la
total 24
drwxr-xr-x 2 elasticsearch elasticsearch 4096 May 21 14:37 .
drwxr-xr-x 5 elasticsearch elasticsearch 4096 May 21 14:31 ..
-rw-r--r-- 1 elasticsearch elasticsearch 48 May 21 14:37 translog-32730.ckp
-rw-r--r-- 1 root root 43 May 21 13:52 translog-32730.tlog
-rw-r--r-- 1 elasticsearch elasticsearch 43 May 21 14:37 translog-32731.tlog
-rw-r--r-- 1 root root 48 May 21 13:52 translog.ckp
Yup, you too, translog. Simply running sudo chown elasticsearch:elasticsearch * in both directories fixes the issue, and Elasticsearch starts initialising our fixed shard!
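For completeness, the fix amounts to this (shard path as before):

# Restore ownership of the segment and translog files that CheckIndex wrote as root
cd /var/lib/elasticsearch/elasticsearch/nodes/0/indices/eUZzli7PQdynUV89X-A7Dw/2
sudo chown elasticsearch:elasticsearch index/* translog/*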
curl -XGET localhost:9200/_cat/shards?v | grep myindexname
myindexname 3 p STARTED 175380 6.4gb 111.222.333.444 irybXIX
myindexname 4 p STARTED 175704 6.4gb 111.222.333.444 irybXIX
myindexname 2 p INITIALIZING 111.222.333.444 irybXIX
myindexname 1 p STARTED 174592 6.6gb 111.222.333.444 irybXIX
myindexname 0 p STARTED 175485 6.6gb 111.222.333.444 irybXIX
Fixed!
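As a final sanity check, the cluster health endpoint should now report no unassigned shards (on a single-node cluster with replicas configured you will see yellow rather than green, which is expected):

# Confirm there are no unassigned shards left
curl -XGET localhost:9200/_cluster/health?pretty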