etcd: etcd server DB out of sync undetected
I run 3 etcd server instances and I frequently observe that the databases get out of sync without the etcd cluster detecting a cluster health issue. Repeated get requests for a key return different (versions of) values depending on which server the local proxy queries.
The only way to detect this problem is to compare the hashkv of all endpoints. All other health checks return “healthy”. A typical status check looks like this (using the v3 API):
# etcdctl endpoint health --cluster:
http://ceph-03:2379 is healthy: successfully committed proposal: took = 10.594707ms
http://ceph-02:2379 is healthy: successfully committed proposal: took = 942.709µs
http://ceph-01:2379 is healthy: successfully committed proposal: took = 857.871µs
# etcdctl endpoint status --cluster -w table:
+---------------------+------------------+---------+---------+-----------+-----------+------------+
|      ENDPOINT       |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+---------------------+------------------+---------+---------+-----------+-----------+------------+
| http://ceph-03:2379 | e01b8ec12d1d3b22 |  3.3.11 |  635 kB |      true |       305 |      16942 |
| http://ceph-01:2379 | e13cf5b0920c4769 |  3.3.11 |  627 kB |     false |       305 |      16943 |
| http://ceph-02:2379 | fd3d871bd684ee85 |  3.3.11 |  631 kB |     false |       305 |      16944 |
+---------------------+------------------+---------+---------+-----------+-----------+------------+
# etcdctl endpoint hashkv --cluster:
http://ceph-03:2379, 2236472732
http://ceph-01:2379, 1950304164
http://ceph-02:2379, 494595250
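To automate this comparison, something like the following minimal sketch can be run from cron. It uses the Go clientv3 `Maintenance.HashKV` call; the endpoint list is a placeholder, and for a 3.3-era client the import path would be `github.com/coreos/etcd/clientv3` instead of the path shown.

```go
// hashkv-check: compare the key-value hash of every endpoint and flag divergence.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	endpoints := []string{"http://ceph-01:2379", "http://ceph-02:2379", "http://ceph-03:2379"}

	cli, err := clientv3.New(clientv3.Config{Endpoints: endpoints, DialTimeout: 5 * time.Second})
	if err != nil {
		fmt.Fprintln(os.Stderr, "connect:", err)
		os.Exit(1)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	hashes := map[uint32][]string{}
	for _, ep := range endpoints {
		// rev=0 hashes the keyspace at each member's own current revision.
		// On a busy cluster that can differ transiently; pass the same explicit
		// revision to every member for a strict comparison.
		resp, err := cli.HashKV(ctx, ep, 0)
		if err != nil {
			fmt.Fprintf(os.Stderr, "hashkv %s: %v\n", ep, err)
			os.Exit(1)
		}
		fmt.Printf("%s hash=%d revision=%d\n", ep, resp.Hash, resp.Header.Revision)
		hashes[resp.Hash] = append(hashes[resp.Hash], ep)
	}
	if len(hashes) > 1 {
		fmt.Println("WARNING: members report different hashkv values, keyspaces have diverged")
		os.Exit(2)
	}
	fmt.Println("all members report the same hashkv")
}
```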
I couldn’t figure out how to check the actual cluster health with etcdctl. All I seem to be able to do is check endpoint health, but the implication “all endpoints healthy” -> “cluster is healthy” clearly does not hold.
The cluster is running etcd 3.3.11 on CentOS 7.7 with the stock packages. I attached a number of files:
etcd.info.txt - some info collected according to instructions.
etcd-ceph-01.conf.txt, etcd-ceph-02.conf.txt, etcd-ceph-03.conf.txt - the 3 config files of the etcd members
etcd-gnosis.conf.txt - a config file for a client using the etcd proxy service
etcd.log - the result of `grep -e "etcd:" /var/log/messages` (the etcd log goes to syslog); this log should cover at least one occasion of loss of DB consistency
I cannot attach the databases, because they contain credentials. However, I can, for some time, run commands on the databases and post results. I have a little bit of time before I need to synchronize the servers again.
In the meantime, I could use some help with recovering. This is the first time all 3 instances are different. Usually, only 1 server is out of sync and I can get back to normal by removing and re-adding it. Here, however, I have 3 different instances and it is no longer trivial to decide which copy is the latest. I would therefore be grateful if you could help me with these questions:
- How can I print all keys, with their current revision numbers, that a member holds? (See the sketch after this list.)
- Is there a way to force the members to sync?
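Here is a sketch of the kind of per-member dump I have in mind, using the Go clientv3 API (the endpoint is a placeholder; for the 3.3 client the import path would be `github.com/coreos/etcd/clientv3`). It connects to a single member and uses a serializable read so the answer reflects that member's own data:

```go
// Dump every key with its mod revision from one specific member.
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Connect to exactly one member so we see that member's view of the keyspace.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://ceph-01:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "connect:", err)
		os.Exit(1)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Empty key plus WithPrefix covers the whole keyspace; WithSerializable
	// answers from the local member without a quorum read.
	resp, err := cli.Get(ctx, "",
		clientv3.WithPrefix(),
		clientv3.WithKeysOnly(),
		clientv3.WithSerializable())
	if err != nil {
		fmt.Fprintln(os.Stderr, "get:", err)
		os.Exit(1)
	}
	fmt.Printf("header revision: %d\n", resp.Header.Revision)
	for _, kv := range resp.Kvs {
		fmt.Printf("%s mod_revision=%d create_revision=%d version=%d\n",
			kv.Key, kv.ModRevision, kv.CreateRevision, kv.Version)
	}
}
```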
Thanks for your help.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 28 (14 by maintainers)
@jonaz The AWS team faced the same data inconsistency problem as well. After log diving, it appeared that the `lease_revoke` apply failed with `leaseID not found` on one node. That node's mvcc current revision is the smallest compared with the other two nodes. The other two nodes' logs did not have the `lease_revoke` failure for this leaseID.

Under this condition, the lessor `Revoke` will fail to delete the associated key-values because it errors out at line 314, so the revisions start to diverge at this point. https://github.com/etcd-io/etcd/blob/a905430d27ec7372267b1cf193f6aa6cda68adb6/lease/lessor.go#L308-L332

Due to the kube-apiserver's usage of etcd Txn (specifically the optimistic lock on the ResourceVersion), like the following, the revision difference is amplified and causes more serious cascading failures, e.g. `kube-scheduler` and `kube-controller-manager` cannot acquire or update their endpoint leases. https://github.com/kubernetes/kubernetes/blob/release-1.22/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L422-L436
From your log post in the previous comment: the lease ID `-6f4bd77510edab61` technically should never be negative, because the etcd server guarantees that a leaseID is positive when processing `LeaseGrant` requests. https://github.com/etcd-io/etcd/blob/a905430d27ec7372267b1cf193f6aa6cda68adb6/etcdserver/v3_server.go#L247-L258

For this “bad” etcd cluster, we also dumped the db file, inspected the lease bucket, and found multiple corrupted leases.
After editing `etcd-dump-db` to print the error instead of panicking right away, it shows that the lease ID (bolt key) is a negative integer and the value is not compatible with the lease proto.

Interestingly, the corrupted leases (lease ID < 0) are ignored when recovering the lessor from the db file, so the key-values associated with those corrupted leases will never be deleted. https://github.com/etcd-io/etcd/blob/a905430d27ec7372267b1cf193f6aa6cda68adb6/lease/lessor.go#L770-L776
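For anyone who wants to reproduce the inspection without patching `etcd-dump-db`, a rough sketch that walks the lease bucket of a copied db file is below. It assumes etcd's backend layout (a bolt bucket named `lease` with 8-byte big-endian int64 lease IDs as keys) and does not try to decode the lease proto values; run it against a copy of `member/snap/db` while etcd is stopped.

```go
// lease-dump: list lease IDs in an etcd backend db file and flag negative ones.
package main

import (
	"encoding/binary"
	"fmt"
	"os"

	bolt "go.etcd.io/bbolt"
)

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: lease-dump <path-to-db>")
		os.Exit(1)
	}
	db, err := bolt.Open(os.Args[1], 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		fmt.Fprintln(os.Stderr, "open:", err)
		os.Exit(1)
	}
	defer db.Close()

	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("lease"))
		if b == nil {
			return fmt.Errorf("no lease bucket found")
		}
		return b.ForEach(func(k, v []byte) error {
			if len(k) != 8 {
				fmt.Printf("unexpected key length %d: %x\n", len(k), k)
				return nil
			}
			id := int64(binary.BigEndian.Uint64(k))
			flag := ""
			if id < 0 {
				// LeaseGrant never hands out negative IDs, so this entry is suspect.
				flag = "  <-- corrupted?"
			}
			fmt.Printf("lease id=%016x (%d) value=%d bytes%s\n", uint64(id), id, len(v), flag)
			return nil
		})
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "scan:", err)
		os.Exit(1)
	}
}
```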
/cc @wilsonwang371 @ptabor Does the above explanation make sense to you?
We ran a simulation script in a 1.21 Kubernetes cluster, followed by the `Remove member` and `Add member` runtime reconfiguration process, one member at a time, three times. The revision gap across the 3 etcd nodes became larger and larger. The `kube-controller-manager` on the Kubernetes control plane node cannot acquire or renew its lease, and its state of the world and the spec drift out of sync over time.

No, actually I am not sure about this yet. The root cause of our internal observation of the DB going out of sync is due to https://github.com/etcd-io/etcd/pull/13505
Hi @jonaz, we use a standard setup and don’t do anything special. Leaders are elected by whatever election mechanism etcd has implemented.
It looks like the devs are not going to address this. I plan to move away from etcd; it is simply too unreliable with this basic health check and self-healing missing. Even if I purposely corrupted the DB, it should detect and, if possible, fix it. However, it doesn’t even detect an accidental corruption, and given the lack of interest in fixing this, I cannot rely on it.
The same applies to your situation. It is completely irrelevant how you manage to throw an etcd instance off. The etcd cluster should detect and fix the problem or err out. However, it simply continues to answer requests and happily distributes inconsistent data. That’s not acceptable for a distributed database.
If the devs’ idea is that users add a cron job for health checks themselves, then the least that should be available is a command to force a re-sync of an instance against the others, without the pain of removing and re-adding the instance.
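For completeness, the remove/re-add workaround I keep falling back to looks roughly like the following sketch using the Go clientv3 Cluster API (the member ID and peer URLs are placeholders; the removed member also needs its data directory wiped and must be restarted with `--initial-cluster-state=existing`):

```go
// Sketch of the remove/re-add member reconfiguration for one diverged member.
package main

import (
	"context"
	"fmt"
	"time"

	"go.etcd.io/etcd/clientv3"
)

func main() {
	// Talk only to the members that will stay in the cluster.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://ceph-01:2379", "http://ceph-03:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// List members to confirm the ID of the out-of-sync instance.
	members, err := cli.MemberList(ctx)
	if err != nil {
		panic(err)
	}
	for _, m := range members.Members {
		fmt.Printf("member %x name=%s peerURLs=%v\n", m.ID, m.Name, m.PeerURLs)
	}

	var badID uint64 = 0xfd3d871bd684ee85 // placeholder: the member to rebuild

	// Drop the diverged member from the cluster ...
	if _, err := cli.MemberRemove(ctx, badID); err != nil {
		panic(err)
	}
	// ... then wipe its data directory, re-add it, and start etcd on that host
	// with --initial-cluster-state=existing so it replays the leader's data.
	// The peer URL (port 2380 here) is a placeholder for the member's real one.
	if _, err := cli.MemberAdd(ctx, []string{"http://ceph-02:2380"}); err != nil {
		panic(err)
	}
}
```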