kubernetes: apiserver timeouts after rolling-update of etcd cluster
Kubernetes version: 1.6.2
Etcd version: 3.1.8
Environment:
- apiservers are Virtual Machines on KVM
- Debian jessie
- Kernel: 3.16.0-4-amd64
- Others: both etcd and kube-apiserver run on the same machines
What happened: We were upgrading the configuration of etcd (the election_timeout and heartbeat_interval flags). We upgraded all of our etcd servers one at a time (we have 5) and checked that the etcd cluster was healthy with etcdctl cluster-health and etcdctl member list. Then kube-apiserver started to behave erratically, timing out on almost every request. In the apiserver logs we can see lots of lines like this:
E0607 17:45:11.447234 367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.452760 367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.452898 367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.453120 367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
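For anyone debugging the same symptom: the error means the apiserver's watchers are asking etcd for a revision that has already been compacted away. A rough sketch of how to look at this from the etcd side (the endpoint and the /registry prefix are assumptions based on a default setup; add TLS flags as needed):

```
# Member status; the JSON output includes the current revision in the response header.
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint status -w json

# Watching from an already-compacted revision reproduces the same error
# the apiserver logs ("mvcc: required revision has been compacted").
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 watch /registry --prefix --rev=1
```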
In order for the apiservers to start behaving correctly again, we had to restart the kube-apiserver service (just that one service, on all of our apiservers).
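For context, that recovery step was essentially the following (a minimal sketch, assuming kube-apiserver runs as a systemd unit on each master; hostnames are illustrative):

```
# Restart only kube-apiserver on every master; etcd itself is left untouched.
for host in master-1 master-2 master-3; do
  ssh "$host" sudo systemctl restart kube-apiserver
done
```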
We did this twice, and both times the same thing happened. The cluster is in production, so we cannot risk a third outage to reproduce it again, but the two times we tried the behaviour was very consistent.
What you expected to happen: That etcd would just pick up its updated configuration and the apiserver would never stop working.
How to reproduce it (as minimally and precisely as possible):
- Change the election-timeout and heartbeat-interval flags and then do a rolling restart of the etcd cluster
- Query the apiserver … and it should be failing (see the sketch after this list)
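A rough shell sketch of those reproduce steps (assuming etcd runs as a systemd unit with its flags in an environment file; hostnames, paths, and the flag values shown are illustrative, not our exact configuration):

```
# Roll the new flags out to one etcd member at a time.
for host in etcd-1 etcd-2 etcd-3 etcd-4 etcd-5; do
  ssh "$host" 'sudo sed -i \
      -e "s/--election-timeout=[0-9]*/--election-timeout=5000/" \
      -e "s/--heartbeat-interval=[0-9]*/--heartbeat-interval=500/" \
      /etc/default/etcd && sudo systemctl restart etcd'
  # Wait until the cluster reports healthy before touching the next member.
  etcdctl cluster-health
  etcdctl member list
done

# Then query the apiserver; in our case requests started timing out here.
kubectl get nodes
```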
Anything else we need to know:
Just ask 😃
About this issue
- State: closed
- Created 7 years ago
- Reactions: 9
- Comments: 67 (45 by maintainers)
Not to add to the noise, but we’ve encountered this issue in production, and it ended up leading to a complete cluster outage (through a very unfortunate series of events.)
I have gathered all the relevant logs from our 3 k8s masters and 9 etcd nodes. There may not be anything of additional interest there, but if you would like to see them, please let me know and I can share them privately.
Additional details from @obeattie https://community.monzo.com/t/current-account-payments-may-fail-major-outage/26296/95?u=oliver
3.2.10 is available
This problem should be solved or at least mitigated by https://github.com/kubernetes/kubernetes/pull/57160. The PR bumps both gRPC and the etcd client to fix the timeout problem caused by connection resets and balancing.
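For anyone wanting to verify whether the build they run already carries the bumped client, one option is to inspect the vendored etcd client in the Kubernetes source tree for that release (a hedged sketch; the Godeps manifest applies to the 1.10-era tree and the path may differ in other releases):

```
# In a checkout of the kubernetes release you are running:
grep -A 2 'github.com/coreos/etcd/clientv3"' Godeps/Godeps.json | head
```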
I am closing out this issue. If anyone still sees the timeout problem in a release (after 1.10) with https://github.com/kubernetes/kubernetes/pull/57160, please create a new issue with reproduction steps.
I’m hitting this error on 1.7.5 and etcd 3.1.8, but not due to an etcd rollout; it simply started occurring on my cluster, to the point that it’s actively affecting usability. Are there any known workarounds?
@obeattie I’m sooo sorry. I’ll update the client tomorrow, and I’m going to poke folks about getting the next rev in line for release. /cc @luxas @roberthbailey @jbeda
@jpbetz
It would be great if we could get https://github.com/coreos/etcd/issues/8711 done to rule out etcd issues.
etcdctl ls / uses the etcd v2 API, which is not what kubernetes 1.6+ uses by default. ETCDCTL_API=3 etcdctl get --prefix / uses the etcd v3 API, which will show you your kubernetes data.
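To make the difference concrete, a short sketch (the endpoint is an assumption; Kubernetes 1.6+ with the etcd3 backend keeps its data under the /registry prefix by default):

```
# v2 API: does not show keys written through the v3 API.
etcdctl --endpoints=http://127.0.0.1:2379 ls /

# v3 API: lists the keys Kubernetes actually writes.
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 get /registry --prefix --keys-only
```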
Similar problem (see log below).
In short: I installed a new K8s master node based on this guide: https://coreos.com/kubernetes/docs/latest/getting-started.html. When trying to start the kubelet service on it, everything starts up, but the apiserver keeps crashing (?). Because of that, the worker node can’t register itself, and the master node isn’t fully working either.
CoreOS Linux: 1492.4.0, etcd version: 3.2.5 (cluster of 4 nodes), kubernetes version: 1.6.1_coreos.0
Occasionally it runs further: