etcd: etcd server DB out of sync undetected

I run 3 etcd server instances and frequently observe that their databases get out of sync without the etcd cluster detecting any health issue. Repeated get requests for the same key return different (versions of) values depending on which server the local proxy queries.

The only way to detect this problem is to compare the hashkv of all endpoints; all other health checks return “healthy”. A typical status check looks like this (using the v3 API):

# etcdctl endpoint health --cluster:

http://ceph-03:2379 is healthy: successfully committed proposal: took = 10.594707ms
http://ceph-02:2379 is healthy: successfully committed proposal: took = 942.709µs
http://ceph-01:2379 is healthy: successfully committed proposal: took = 857.871µs

# etcdctl endpoint status --cluster -w table:
+---------------------+------------------+---------+---------+-----------+-----------+------------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------+------------------+---------+---------+-----------+-----------+------------+
| http://ceph-03:2379 | e01b8ec12d1d3b22 |  3.3.11 |  635 kB |      true |       305 |      16942 |
| http://ceph-01:2379 | e13cf5b0920c4769 |  3.3.11 |  627 kB |     false |       305 |      16943 |
| http://ceph-02:2379 | fd3d871bd684ee85 |  3.3.11 |  631 kB |     false |       305 |      16944 |
+---------------------+------------------+---------+---------+-----------+-----------+------------+

# etcdctl endpoint hashkv --cluster:
http://ceph-03:2379, 2236472732
http://ceph-01:2379, 1950304164
http://ceph-02:2379, 494595250

I couldn’t figure out how to check the actual cluster health with etcdctl. All I seem to be able to do is check endpoint health, but the implication “all endpoints healthy” -> “cluster is healthy” clearly does not hold.

The cluster is running etcd 3.3.11 on CentOS 7.7 with the stock packages. I attached a number of files:

  • etcd.info.txt - some info collected according to instructions
  • etcd-ceph-01.conf.txt, etcd-ceph-02.conf.txt, etcd-ceph-03.conf.txt - the 3 config files of the etcd members
  • etcd-gnosis.conf.txt - a config file for a client using the etcd proxy service
  • etcd.log - the result of grep -e "etcd:" /var/log/messages (the etcd log goes to syslog); this log should cover at least one occasion of loss of DB consistency

I cannot attach the databases, because they contain credentials. However, I can - for some time - run commands on the databases and post the results. I have a little bit of time before I need to synchronize the servers again.

In the meantime, I could use some help with recovering. This is the first time all 3 instances are different. Usually only 1 server is out of sync and I can get back to normal by removing and re-adding it. Now, however, I have 3 different instances and it is no longer trivial to decide which copy is the latest. I would therefore be grateful if you could help me with these questions:

  1. How can I print all the keys, with their current revision numbers, that a member holds?
  2. Is there a way to force the members to sync?

Thanks for your help.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 28 (14 by maintainers)

Most upvoted comments

@jonaz The AWS team faced the same data inconsistency problem as well. After log diving, it appeared that the lease_revoke apply failed on one node because the leaseID was not found there. That node’s current MVCC revision is the smallest of the three nodes, and the other two nodes’ logs did not show the lease_revoke failure for this leaseID.

Under this condition, the lessor Revoke fails to delete the associated key-values because it errors out at line 314, so the revisions start to diverge at this point. https://github.com/etcd-io/etcd/blob/a905430d27ec7372267b1cf193f6aa6cda68adb6/lease/lessor.go#L308-L332
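
To make the mechanism concrete, here is a toy Go sketch (my own illustration, not etcd code; member and applyLeaseRevoke are made-up names and the revision model is deliberately simplified): a member that errors out of the revoke before the deletions never removes the attached keys and stops advancing its revision in step with its peers.

	package main

	import "fmt"

	type member struct {
		name string
		rev  int64
		keys map[string]bool
	}

	// applyLeaseRevoke mimics the apply path described above: if the lease is
	// unknown on this member, the revoke errors out before the key deletions
	// that would bump the store revision.
	func (m *member) applyLeaseRevoke(leaseKnown bool, attached []string) error {
		if !leaseKnown {
			return fmt.Errorf("lease not found") // the bad member: keys stay, revision stalls
		}
		for _, k := range attached {
			delete(m.keys, k)
		}
		m.rev++ // one write transaction for the revoke
		return nil
	}

	func main() {
		attached := []string{"/registry/leases/a", "/registry/leases/b"}
		newMember := func(name string) *member {
			return &member{
				name: name,
				rev:  100,
				keys: map[string]bool{attached[0]: true, attached[1]: true},
			}
		}
		good, bad := newMember("good"), newMember("bad")

		_ = good.applyLeaseRevoke(true, attached) // peers that know the lease delete its keys
		_ = bad.applyLeaseRevoke(false, attached) // the affected member silently skips them

		fmt.Println(good.name, good.rev, len(good.keys)) // good 101 0
		fmt.Println(bad.name, bad.rev, len(bad.keys))    // bad 100 2
	}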

Because of how the kube-apiserver uses etcd Txn (specifically the optimistic lock on the ResourceVersion), as in the snippet below, the revision difference gets amplified and causes more serious cascading failures, e.g. kube-scheduler and kube-controller-manager cannot acquire or update their endpoint leases.

txnResp, err := s.client.KV.Txn(ctx).If(
	clientv3.Compare(clientv3.ModRevision(key), "=", origState.rev),
).Then(
	clientv3.OpPut(key, string(newData), opts...),
).Else(
	clientv3.OpGet(key),
).Commit()
metrics.RecordEtcdRequestLatency("update", getTypeName(out), startTime)
if err != nil {
	return err
}
trace.Step("Transaction committed")
if !txnResp.Succeeded {
	getResp := (*clientv3.GetResponse)(txnResp.Responses[0].GetResponseRange())
	klog.V(4).Infof("GuaranteedUpdate of %s failed because of a conflict, going to retry", key)
	origState, err = s.getState(getResp, key, v, ignoreNotFound)
	if err != nil {
		return err
	}
	trace.Step("Retry value restored")
	mustCheckData = false
	continue
}

https://github.com/kubernetes/kubernetes/blob/release-1.22/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L422-L436


From the log you posted in the previous comment:

failed to apply request "header:<ID:9133152155410395282 > lease_revoke:<id:-6f4bd77510edab61>" with response "size:27" took (69.202µs) to execute, err is lease not found

The lease ID -6f4bd77510edab61 should technically never be negative, because the etcd server guarantees that the leaseID is positive when processing LeaseGrant requests. https://github.com/etcd-io/etcd/blob/a905430d27ec7372267b1cf193f6aa6cda68adb6/etcdserver/v3_server.go#L247-L258
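
As far as I can tell from the linked lines, the server only picks an ID itself when the request did not supply one, and the generated value is masked into the positive int64 range. A minimal standalone illustration of that kind of masking (my own sketch using a plain math/rand generator; choosePositiveID is a made-up name, not the etcd source):

	package main

	import (
		"fmt"
		"math/rand"
	)

	// choosePositiveID clears the sign bit of a random 64-bit value, so the
	// resulting int64 is always >= 0 (the same masking idea as in the linked
	// LeaseGrant code, as far as I can tell).
	func choosePositiveID(r *rand.Rand) int64 {
		return int64(r.Uint64() & ((1 << 63) - 1))
	}

	func main() {
		r := rand.New(rand.NewSource(1))
		for i := 0; i < 5; i++ {
			id := choosePositiveID(r)
			fmt.Printf("lease ID=%016x (negative: %v)\n", id, id < 0)
		}
	}

If that reading is right, a client-supplied ID is taken as-is, which is what the simulation further down pokes at with explicitly negative IDs.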


For this “bad” etcd cluster, we also dumped the db file, inspected the lease bucket, and found multiple corrupted leases.

etcd-dump-db iterate-bucket /mnt/xvdb/datadir/member/snap/db lease --limit 3 --decode=true
lease ID=6eae7bcbcf28f962, TTL=15s
lease ID=6eae7bcbcf1cbf04, TTL=15s
panic: proto: illegal wireType 6

goroutine 1 [running]:
main.leaseDecoder(0x7f088db0caae, 0x8, 0x8, 0x7f088db0cab6, 0xd, 0xd)
	/home/ec2-user/go.etcd.io/etcd/tools/etcd-dump-db/backend.go:95 +0x195
main.iterateBucket.func1(0xc0000cc2a0, 0xc000000100, 0xc0000cc2a0)
	/home/ec2-user/go.etcd.io/etcd/tools/etcd-dump-db/backend.go:147 +0x373
go.etcd.io/bbolt.(*DB).View(0xc000118200, 0xc000121c10, 0x0, 0x0)
	/root/go/pkg/mod/go.etcd.io/bbolt@v1.3.3/db.go:725 +0x96
main.iterateBucket(0x7ffc733fc757, 0x20, 0x7ffc733fc778, 0x5, 0x1, 0x1, 0x0, 0x0)
	/home/ec2-user/go.etcd.io/etcd/tools/etcd-dump-db/backend.go:134 +0x1e5
main.iterateBucketCommandFunc(0xdd62a0, 0xc000080e10, 0x2, 0x5)
	/home/ec2-user/go.etcd.io/etcd/tools/etcd-dump-db/main.go:104 +0x11a
github.com/spf13/cobra.(*Command).execute(0xdd62a0, 0xc000080dc0, 0x5, 0x5, 0xdd62a0, 0xc000080dc0)
	/root/go/pkg/mod/github.com/spf13/cobra@v0.0.3/command.go:766 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0xdd5de0, 0xa37d84, 0xc000121f78, 0x407d65)
	/root/go/pkg/mod/github.com/spf13/cobra@v0.0.3/command.go:852 +0x2fe
github.com/spf13/cobra.(*Command).Execute(...)
	/root/go/pkg/mod/github.com/spf13/cobra@v0.0.3/command.go:800
main.main()
	/home/ec2-user/go.etcd.io/etcd/tools/etcd-dump-db/main.go:65 +0x31

After editing etcd-dump-db to print the error instead of panicking right away, it shows that the lease ID (bolt key) is a negative integer and the value is not compatible with the lease proto.

etcd-dump-db iterate-bucket i-0185616eef78879ae-2022-01-30T09-05-00.etcd/member/snap/db lease --decode=true
lease ID=6eae7bcbcf28f962, TTL=15s, RemainingTTL=0s
lease ID=6eae7bcbcf1cbf04, TTL=15s, RemainingTTL=0s
proto: illegal wireType 6
lease ID=-28912891f76d7011, TTL=0s, RemainingTTL=0s
lease ID=-7430661f711f712, TTL=0s, RemainingTTL=0s
unexpected EOF
lease ID=7bcbcf1bc75308d3, TTL=0s, RemainingTTL=0s
lease ID=6eae7bcbd230ab72, TTL=15s, RemainingTTL=0s
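
For reference, a minimal sketch of what such an edit might look like, assuming the function matches the leaseDecoder named in the panic trace above (this only approximates tools/etcd-dump-db/backend.go and relies on encoding/binary, fmt and the leasepb package that file already imports; it is not the exact upstream diff):

	// Approximate, non-panicking lease decoder for etcd-dump-db (assumption:
	// field names follow the output above; not the exact upstream code).
	func leaseDecoder(k, v []byte) {
		// The bucket key is the lease ID as a big-endian int64, so corrupted
		// entries can come back negative, and %x prints them with a minus
		// sign, e.g. -28912891f76d7011 above.
		leaseID := int64(binary.BigEndian.Uint64(k))
		var lpb leasepb.Lease
		if err := lpb.Unmarshal(v); err != nil {
			// Print the decode error and keep iterating instead of panicking,
			// so one corrupted entry does not abort the whole bucket dump.
			fmt.Println(err)
		}
		fmt.Printf("lease ID=%016x, TTL=%ds, RemainingTTL=%ds\n",
			leaseID, lpb.TTL, lpb.RemainingTTL)
	}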

Interestingly, the corrupted leases (lease ID < 0) are ignored when recovering the lessor from the db file, so the key-values associated with those corrupted leases will never be deleted. https://github.com/etcd-io/etcd/blob/a905430d27ec7372267b1cf193f6aa6cda68adb6/lease/lessor.go#L770-L776
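
A toy sketch of the consequence (my illustration with made-up names, not the lessor code): if recovery simply drops entries with a non-positive ID, the keys that were attached to them are orphaned, because no future revoke will ever delete them.

	// Toy illustration (not the lessor code): leases with a corrupted (<= 0)
	// ID are dropped on recovery, so the keys attached to them are orphaned
	// and will never be deleted by a lease revoke.
	type recoveredLease struct {
		id   int64
		keys []string
	}

	func recoverLeases(persisted []recoveredLease) (kept []recoveredLease, orphaned []string) {
		for _, l := range persisted {
			if l.id <= 0 {
				orphaned = append(orphaned, l.keys...)
				continue // ignored on recovery, like the corrupted entries above
			}
			kept = append(kept, l)
		}
		return kept, orphaned
	}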

/cc @wilsonwang371 @ptabor Does the above explanation make sense to you?

I ran a simulation script in a 1.21 Kubernetes cluster:

# Grant leases with explicitly negative IDs through the v3beta HTTP gateway,
# then attach keys to them.
curl -X POST -sL http://localhost:2379/v3beta/lease/grant -d '{"TTL": 360, "ID": -1}'
echo ""
etcdctl put foo bar --lease="-1" -w json
etcdctl get foo -w json

curl -X POST -sL http://localhost:2379/v3beta/lease/grant -d '{"TTL": 360, "ID": -2}'
echo ""
etcdctl put bar foo --lease="-2" -w json
etcdctl get bar -w json

curl -X POST -sL http://localhost:2379/v3beta/lease/grant -d '{"TTL": 360, "ID": -3}'
echo ""
etcdctl put hello world --lease="-3" -w json
etcdctl get hello -w json

curl -X POST -sL http://localhost:2379/v3beta/lease/grant -d '{"TTL": 360, "ID": -4}'
echo ""
etcdctl put world hello --lease="-4" -w json
etcdctl get world -w json

This was followed by the remove-member and add-member runtime reconfiguration process, one member at a time, three times.

The revision gap across the 3 etcd nodes grew larger and larger.

[Screenshot from 2022-02-08 showing the revision gap across the 3 nodes]

The kube-controller-manager on the Kubernetes control plane node cannot acquire and renew its lease, and the state of the world and the spec drift further out of sync over time.

E0209 00:43:37.322749      10 leaderelection.go:361] Failed to update lock: Operation cannot be fulfilled on leases.coordination.k8s.io "kube-controller-manager": the object has been modified; please apply your changes to the latest version and try again

@wilsonwang371 Hi Wilson, we are using etcd version 3.4.18. release-3.4 does not have the issue that #13505 was trying to fix.

https://github.com/etcd-io/etcd/blob/72d3e382e73c1a2a4781f884fb64792af3242f22/mvcc/watchable_store.go#L351-L363

xref. https://github.com/etcd-io/etcd/blob/v3.4.18/mvcc/watchable_store.go#L351-L363

For my education, is the lease bucket data corruption a victim of deep copying during the boltdb re-mmap? How would a read impact the write path?

No, actually I am not sure about this yet. The root cause of the DB-out-of-sync we observed internally is https://github.com/etcd-io/etcd/pull/13505

Hi @jonaz, we use a standard setup and don’t do anything special; leaders are elected by whatever mechanism etcd implements.

It looks like the devs are not going to address this. I plan to move away from etcd; it is simply too unreliable with such a basic health check and self-healing missing. Even if I purposefully corrupted the DB, it should detect that and, if possible, fix it. However, it does not even detect an accidental corruption, and given the lack of interest in fixing this, I cannot rely on it.

The same applies to your situation. It is completely irrelevant how you managed to throw an etcd instance off; the etcd cluster should detect and fix the problem, or err out. Instead, it simply continues to answer requests and happily serves inconsistent data. That is not acceptable for a distributed database.

If the devs’ idea is that users add a cron job for health checks themselves, then the least that should be available is a command to force a re-sync of an instance against the others, without the pain of removing and re-adding it.