etcd: Inconsistent revision and data occurs

I’m running a 3-member etcd cluster in a testing environment, and a k8s cluster with each kube-apiserver connecting to one etcd server via localhost.

etcd cluster: created and running with v3.5.1
k8s cluster: v1.22

I found a data inconsistency several days ago: some keys exist on nodes x and y but not on node z, and some exist on z but not on the other two, e.g. different pod lists are returned by different servers. Some keys have different values on z than on the others, and each copy can be updated to yet another value via the corresponding etcd endpoint, e.g. the kube-system/kube-controller-manager lease points to different pods on different servers, and both pods can successfully update the lease via their local kube-apiserver and etcd. Other keys, including some newly created ones, are consistent.

Checking with etcdctl endpoint status, raft_term, leader, raftIndex and raftAppliedIndex are the same on all members, but revision and dbSize are not: the revision on node z is about 700,000 lower than on the other two, and both values keep increasing.
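For reference, a minimal sketch of the same per-member comparison using the Go clientv3 API (the endpoint addresses are placeholders and TLS configuration is omitted). Status is answered by the addressed member itself, so a persistent gap in Header.Revision or DbSize between members shows up directly:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoints, one per member; TLS config omitted for brevity.
	endpoints := []string{"node-x:2379", "node-y:2379", "node-z:2379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	for _, ep := range endpoints {
		// Status is served by the addressed member, so raft term/index,
		// store revision and db size can be compared member by member.
		s, err := cli.Status(ctx, ep)
		if err != nil {
			log.Printf("%s: %v", ep, err)
			continue
		}
		fmt.Printf("%s: revision=%d raftTerm=%d raftIndex=%d raftAppliedIndex=%d dbSize=%d leader=%x\n",
			ep, s.Header.Revision, s.RaftTerm, s.RaftIndex, s.RaftAppliedIndex, s.DbSize, s.Leader)
	}
}
```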

Checking with etcd-dump-logs, all three nodes appear to be receiving the same raft log entries.

I’m not sure how and when this happened, but the nodes are sometimes under load and may occasionally run out of memory or disk space. Checking and comparing the db keys with bbolt keys SNAPSHOT key (listing the key bucket of a db snapshot), and searching the nodes’ operating system logs for the revisions near where the difference starts, I found slow-read log entries mentioning those revisions and dozens of leader elections and switches during those few hours. Besides, the disk (including the etcd data dir) of node z may have been full at the time, and its operating system logs were cleared, so I don’t know exactly what happened and I’m not sure whether this is related to the issue.
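The same inspection can be scripted with the bbolt Go package. A minimal sketch, assuming the file passed in is an offline copy or snapshot of the member’s backend db (not the live file under a running etcd), and that etcd’s MVCC data lives in the "key" bucket with revision-encoded keys:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	// Path to an offline copy/snapshot of the member's backend database.
	path := os.Args[1]

	db, err := bolt.Open(path, 0o400, &bolt.Options{ReadOnly: true})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		// etcd stores MVCC key-value data in the "key" bucket.
		b := tx.Bucket([]byte("key"))
		if b == nil {
			return fmt.Errorf("no %q bucket in %s", "key", path)
		}
		var count int
		var lastMain int64
		if err := b.ForEach(func(k, v []byte) error {
			count++
			// Bucket keys are revision-encoded: 8-byte main revision,
			// '_', 8-byte sub revision (plus a tombstone marker for deletes).
			lastMain = int64(binary.BigEndian.Uint64(k[:8]))
			return nil
		}); err != nil {
			return err
		}
		fmt.Printf("%s: %d revisions, highest main revision %d\n", path, count, lastMain)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

Running this against copies from each member and diffing the output is one way to locate the revision where the divergence starts.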

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 45 (27 by maintainers)

Most upvoted comments

I think I found a reproduction; it was surprisingly easy once I knew where to look. Based on @PaulFurtado’s report, I looked into simulating a highly stressed etcd cluster and sending SIGKILL to members one by one. Looking at our functional tests, I found it strange that we don’t already have such a test: we have tests with SIGTERM_* and SIGQUIT_AND_REMOVE_FOLLOWER, but we don’t simply test whether the database is correctly restored after an unrecoverable error.

I have added new tests (SIGKILL_FOLLOWER, SIGKILL_LEADER) and increased the stress-qps. This was enough to cause data inconsistency. As the functional tests run with --experimental-initial-corrupt-check, the killed member fails to rejoin with the message checkInitialHashKV failed. This doesn’t answer the question of how the check was not triggered in @PaulFurtado’s case, but it should be enough to show there is a problem.

To make the results repeatable, I modified the functional tests to inject the failure repeatedly for some time. I managed to get a 100% chance of reproduction for both test scenarios with 8000 qps within 1 minute of running. The issue seems to happen only at higher qps, with a lower chance of reproduction at 4000 qps and no reproductions at 2000 qps.
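Outside the functional-test framework, this kind of failure injection can be approximated with a small loop that hard-kills and restarts one member while a separate stresser keeps writing at high qps. This is only a rough sketch, not the actual functional-test code; the member flags and timings are placeholders:

```go
package main

import (
	"log"
	"os/exec"
	"syscall"
	"time"
)

// startMember launches one etcd member. The flags are placeholders for the
// member's real configuration (peer/client URLs, initial cluster, ...).
func startMember() *exec.Cmd {
	cmd := exec.Command("etcd",
		"--name", "infra1",
		"--data-dir", "/var/lib/etcd-infra1",
		// ... remaining cluster flags elided ...
	)
	if err := cmd.Start(); err != nil {
		log.Fatalf("start etcd: %v", err)
	}
	return cmd
}

func main() {
	// Assumes a separate stresser is writing to the cluster at high qps.
	// Kill and restart the member in a loop for one minute, mimicking
	// repeated crash-recovery with no graceful shutdown.
	cmd := startMember()
	deadline := time.Now().Add(time.Minute)
	for time.Now().Before(deadline) {
		time.Sleep(5 * time.Second)
		_ = cmd.Process.Signal(syscall.SIGKILL)
		_ = cmd.Wait()
		cmd = startMember()
	}
	_ = cmd.Process.Signal(syscall.SIGTERM)
}
```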

With that, I used the same method to test the v3.4.18 release. I didn’t manage to get any corruption even when running for 10 minutes at 8000 qps. I didn’t test with higher qps, as that is the limit of my workstation, but this should be enough to confirm that the issue exists only on v3.5.x.

I’m sharing the code I used for reproduction here: https://github.com/etcd-io/etcd/pull/13838. I will look into root-causing the data inconsistency first, and later into redesigning our functional tests, since they don’t seem to be fulfilling their purpose.

This shows a flaw in the current implementation of the corruption check: we cannot verify hashes if members’ revisions differ too much. So if the corruption happened long before the check was enabled, we are not able to detect it.
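To illustrate the limitation, here is a sketch of a manual cross-member hash comparison using the clientv3 maintenance API (endpoints are placeholders, TLS omitted). HashKV hashes the keyspace up to a given revision, so the comparison only works if every member still has that revision, i.e. has not compacted past it:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"node-x:2379", "node-y:2379", "node-z:2379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Pick a revision that should exist on every member, e.g. the smallest
	// current revision reported by endpoint status.
	var rev int64
	for _, ep := range endpoints {
		s, err := cli.Status(ctx, ep)
		if err != nil {
			log.Fatal(err)
		}
		if rev == 0 || s.Header.Revision < rev {
			rev = s.Header.Revision
		}
	}

	// Hash the keyspace up to that revision on each member. If a member has
	// already compacted past rev, HashKV fails and the members cannot be
	// compared -- the gap described above.
	for _, ep := range endpoints {
		h, err := cli.HashKV(ctx, ep, rev)
		if err != nil {
			log.Printf("%s: cannot hash at revision %d: %v", ep, rev, err)
			continue
		}
		fmt.Printf("%s: hash=%d at revision=%d (compact_revision=%d)\n", ep, h.Hash, rev, h.CompactRevision)
	}
}
```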

git bisect has led me to https://github.com/etcd-io/etcd/commit/50051675f9740a4561e13bc5f00a89982b5202ad as the root cause. It’s not 100% certain, as the reproduction has some flakiness. The PR containing the commit: https://github.com/etcd-io/etcd/pull/12855

better corruption check

Yes, see https://github.com/etcd-io/etcd/issues/13839