etcd: ETCD fails to start after performing alarm list operation and then power off/on
What happened?
With a three-node cluster set up using https://github.com/bitnami/charts/blob/master/bitnami/etcd/README.md it was noticed that the etcd nodes failed to start after a forced reboot of the underlying worker nodes. A graceful shutdown does not result in this issue.
The logs indicated a mismatch in the raft log index between the v2 *.snap files and the v3 db file: the index of the snap files was higher than that of the v3 db file, causing etcd to look for a snap.db file that did not exist (see logs).
The index of the snap file was derived from the file name (e.g. 0000000000000017-0000000000124f8c.snap), while the consistent_index of the v3 db was extracted using bbolt: bbolt get db meta consistent_index | hexdump => 0xb4903.
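For anyone who wants to compare the two indexes programmatically, here is a minimal sketch (not part of etcd) that reads consistent_index directly from the member's bbolt db. The db path, the "meta" bucket/key names and the big-endian decoding are assumptions based on the bbolt inspection above; the v2 index is simply the hex suffix of the *.snap file name (0x124f8c in this case).

package main

import (
	"encoding/binary"
	"fmt"
	"log"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Open read-only so a running etcd process is not affected.
	db, err := bolt.Open("/bitnami/etcd/data/member/snap/db", 0400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("meta"))
		if b == nil {
			return fmt.Errorf("meta bucket not found")
		}
		v := b.Get([]byte("consistent_index"))
		if len(v) != 8 {
			return fmt.Errorf("unexpected consistent_index value: %x", v)
		}
		// Compare this against the hex index suffix of the newest
		// <term>-<index>.snap file in the same directory.
		fmt.Printf("v3 consistent_index: %d (0x%x)\n", binary.BigEndian.Uint64(v), v)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}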
So far the issue looked very much like what is described in #11949. The "fix" described in that issue, removing/moving the *.snap files, also got the cluster up and running again.
Worth mentioning: this cluster had not had any writes for several weeks ahead of the reboot; the data in it is mostly read. Doing a proper write to the cluster sets the consistent_index of the v3 db to an up-to-date value of the raft index.
After some investigation into why this index difference between the snapshots and the v3 store occurred, it was found that the health check executed regularly by Kubernetes was the reason for the index drift.
The health and readiness check regularly executes etcdctl endpoint health to determine whether the cluster is healthy. In etcd 3.4 this was a simple GET on the health key, but since https://github.com/etcd-io/etcd/pull/12150 it also checks that the alarm list is empty. For some reason listing the alarms also triggers a write/apply (see attached logs), and this apply only advances the index in the v2 store, not the v3 store. This causes the index in the v2 store to drift from the v3 store until a proper write is performed. I have not dug into why the write is performed or why it is missing from the v3 store.
The behaviour is only present in this form in 3.5 since the health check in 3.4 does not include listing the alarms.
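To make the check concrete, below is a rough approximation (using clientv3; this is not the actual etcdctl source) of what endpoint health effectively does in 3.5: a range on the health key followed by an alarm list. The endpoint and timeout values are placeholders.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// 1. Read of the "health" key (the 3.4-era check).
	if _, err := cli.Get(ctx, "health"); err != nil {
		log.Fatalf("range on health key failed: %v", err)
	}

	// 2. Alarm list added by PR 12150; per the attached debug logs this is
	// the call that goes through raft and produces an apply.
	resp, err := cli.AlarmList(ctx)
	if err != nil {
		log.Fatalf("alarm list failed: %v", err)
	}
	if len(resp.Alarms) > 0 {
		log.Fatalf("unhealthy: active alarms: %v", resp.Alarms)
	}
	fmt.Println("healthy")
}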
The problem is easy to reproduce locally. See description.
What did you expect to happen?
I would always expect ETCD to be able to start properly regardless of how the shutdown was done.
How can we reproduce it (as minimally and precisely as possible)?
Locally:
# Use very low snapshot count to more quickly trigger the issue
$ ./etcd --snapshot-count=5 --log-level=debug
# In a different terminal, repeat six times to trigger creation of a snapshot file (you'll see a `triggering snapshot` log message in the server log when this happens)
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
$ ./etcdctl endpoint health
# Kill etcd server process
$ kill -9 <etcd PID>
# Try to start ETCD again, this will fail with the attached log message
$ ./etcd --snapshot-count=5 --log-level=debug
Anything else we need to know?
No response
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.5.4
Git SHA: 08407ff76
Go Version: go1.16.15
Go OS/Arch: darwin/amd64
$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5
Etcd configuration (command line flags or environment variables)
See steps to reproduce
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
See steps to reproduce
Relevant log output
# Log printed upon startup failure
{"level":"info","ts":"2022-07-23T18:21:35.360Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/bitnami/etcd/data/member/snap/db","took":"174.548µs"}
{"level":"info","ts":"2022-07-23T18:21:36.259Z","caller":"etcdserver/server.go:508","msg":"recovered v2 store from snapshot","snapshot-index":1200012,"snapshot-size":"25 kB"}
{"level":"warn","ts":"2022-07-23T18:21:36.260Z","caller":"snap/db.go:88","msg":"failed to find [SNAPSHOT-INDEX].snap.db","snapshot-index":1200012,"snapshot-file-path":"/bitnami/etcd/data/member/snap/0000000000124f8c.snap.db","error":"snap: snapshot file doesn't exist"}
{"level":"panic","ts":"2022-07-23T18:21:36.261Z","caller":"etcdserver/server.go:515","msg":"failed to recover v3 backend from snapshot","error":"failed to find database snapshot file (snap: snapshot file doesn't exist)","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.NewServer\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515\ngo.etcd.io/etcd/server/v3/embed.StartEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcd\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228\ngo.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
panic: failed to recover v3 backend from snapshot
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000506300, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x58d
go.uber.org/zap.(*Logger).Panic(0xc000010fa0, 0x1234726, 0x2a, 0xc000532f40, 0x1, 0x1)
/go/pkg/mod/go.uber.org/zap@v1.17.0/logger.go:227 +0x85
go.etcd.io/etcd/server/v3/etcdserver.NewServer(0xc0000460ca, 0xd, 0x0, 0x0, 0x0, 0x0, 0xc0005150e0, 0x2, 0x2, 0xc00053c480, ...)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdserver/server.go:515 +0x1656
go.etcd.io/etcd/server/v3/embed.StartEtcd(0xc000104000, 0xc000104600, 0x0, 0x0)
/go/src/go.etcd.io/etcd/release/etcd/server/embed/etcd.go:245 +0xef8
go.etcd.io/etcd/server/v3/etcdmain.startEtcd(0xc000104000, 0x12089be, 0x6, 0xc0004b7201, 0x2)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:228 +0x32
go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:123 +0x257a
go.etcd.io/etcd/server/v3/etcdmain.Main(0xc00003a1f0, 0x1, 0x1)
/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40 +0x13f
# Server debug logs when running `etcdctl endpoint health`
{"level":"debug","ts":"2022-08-25T15:21:32.425+0200","caller":"v3rpc/interceptor.go:182","msg":"request stats","start time":"2022-08-25T15:21:32.425+0200","time spent":"114.504µs","remote":"127.0.0.1:51728","response type":"/etcdserverpb.KV/Range","request count":0,"request size":8,"response count":0,"response size":28,"request content":"key:\"health\" "}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2122","msg":"Applying entries","num-entries":1}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2125","msg":"Applying entry","index":70,"term":5,"type":"EntryNormal"}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2184","msg":"apply entry normal","consistent-index":60,"entry-index":70,"should-applyV3":true}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"etcdserver/server.go:2211","msg":"applyEntryNormal","raftReq":"header:<ID:7587864799024601626 > alarm:<> "}
{"level":"debug","ts":"2022-08-25T15:21:32.426+0200","caller":"v3rpc/interceptor.go:182","msg":"request stats","start time":"2022-08-25T15:21:32.426+0200","time spent":"241.967µs","remote":"127.0.0.1:51728","response type":"/etcdserverpb.Maintenance/Alarm","request count":-1,"request size":-1,"response count":-1,"response size":-1,"request content":""}
{"level":"info","ts":"2022-08-25T15:21:32.428+0200","caller":"zapgrpc/zapgrpc.go:174","msg":"[transport] transport: loopyWriter.run returning. connection error: desc = \"transport is closing\""}
About this issue
- State: closed
- Created 2 years ago
- Comments: 17 (11 by maintainers)
In case anyone is interested, this is the workaround solution https://github.com/ahrtr/etcd-issues/blob/d134cb8d07425bf3bf530e6bb509c6e6bc6e7c67/etcd/etcd-db-editor/main.go#L16-L28
This solution fixed our problem in a production etcd configuration. Just commenting here so more people can follow. Thanks!
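For readers who do not want to follow the link, the rough idea behind such a db editor is sketched below. This is not the linked tool; the "meta" bucket/key names, the 8-byte big-endian encoding and the CLI shape are assumptions based on the earlier bbolt inspection. Always back up the db file first.

package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"
	"strconv"

	bolt "go.etcd.io/bbolt"
)

func main() {
	if len(os.Args) != 3 {
		log.Fatalf("usage: %s <path-to-member/snap/db> <raft-index>", os.Args[0])
	}
	index, err := strconv.ParseUint(os.Args[2], 10, 64)
	if err != nil {
		log.Fatal(err)
	}

	db, err := bolt.Open(os.Args[1], 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Overwrite consistent_index in the meta bucket so it matches the index
	// recorded in the newest v2 *.snap file name.
	err = db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("meta"))
		if b == nil {
			return fmt.Errorf("meta bucket not found")
		}
		buf := make([]byte, 8)
		binary.BigEndian.PutUint64(buf, index)
		return b.Put([]byte("consistent_index"), buf)
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("set consistent_index to %d", index)
}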