kubernetes: apiserver timeouts after rolling-update of etcd cluster
Kubernetes version: 1.6.2
Etcd version: 3.1.8
Environment:
- apiservers are Virtual Machines on KVM
- Debian jessie
- Kernel: 3.16.0-4-amd64
- Others: both etcd and kube-apiserver run on the same machines
What happened: We were upgrading the configuration of etcd (the election_timeout and heartbeat_interval flags). We upgraded all of our etcd servers one at a time (we have 5) and checked that the etcd cluster was healthy with etcdctl cluster-health and etcdctl member list. Then kube-apiserver started to behave erratically, timing out on almost every request. In the apiserver logs we can see lots of lines like this:
E0607 17:45:11.447234 367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.452760 367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.452898 367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
E0607 17:45:11.453120 367 watcher.go:188] watch chan error: etcdserver: mvcc: required revision has been compacted
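For anyone debugging the same symptom: the error means the apiserver's watchers are asking etcd for a revision that has already been compacted away. A rough sketch of how to look at this from the etcd side (the endpoint and the /registry prefix are assumptions based on a default setup; add TLS flags as needed):

```
# Member status; the JSON output includes the current revision in the response header.
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint status -w json

# Watching from an already-compacted revision reproduces the same error
# the apiserver logs ("mvcc: required revision has been compacted").
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 watch /registry --prefix --rev=1
```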
In order for the apiservers to start behaving correctly again, we had to restart the kube-apiserver service (just that one service, on all of our apiservers).
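For context, that recovery step was essentially the following (a minimal sketch, assuming kube-apiserver runs as a systemd unit on each master; hostnames are illustrative):

```
# Restart only kube-apiserver on every master; etcd itself is left untouched.
for host in master-1 master-2 master-3; do
  ssh "$host" sudo systemctl restart kube-apiserver
done
```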
We did this twice, and both times the same thing happened. The cluster is in production, so we cannot risk a third outage to reproduce it again, but the two times we tried the behaviour was very consistent.
What you expected to happen: That etcd would just pick up its updated configuration and the apiserver would never stop working.
How to reproduce it (as minimally and precisely as possible):
- Change the election-timeout and heartbeat-interval flags and then do a rolling restart of the etcd cluster
- Query the apiserver … and it should be failing (see the sketch after this list)
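A rough shell sketch of those reproduce steps (assuming etcd runs as a systemd unit with its flags in an environment file; hostnames, paths, and the flag values shown are illustrative, not our exact configuration):

```
# Roll the new flags out to one etcd member at a time.
for host in etcd-1 etcd-2 etcd-3 etcd-4 etcd-5; do
  ssh "$host" 'sudo sed -i \
      -e "s/--election-timeout=[0-9]*/--election-timeout=5000/" \
      -e "s/--heartbeat-interval=[0-9]*/--heartbeat-interval=500/" \
      /etc/default/etcd && sudo systemctl restart etcd'
  # Wait until the cluster reports healthy before touching the next member.
  etcdctl cluster-health
  etcdctl member list
done

# Then query the apiserver; in our case requests started timing out here.
kubectl get nodes
```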
Anything else we need to know:
Just ask 😃
About this issue
- State: closed
- Created 7 years ago
- Reactions: 9
- Comments: 67 (45 by maintainers)
Not to add to the noise, but we’ve encountered this issue in production, and it ended up leading to a complete cluster outage (through a very unfortunate series of events.)
I have gathered all the relevant logs from our 3 k8s masters and 9 etcd nodes. There may not be anything of additional interest there, but if you would like to see them, please let me know and I can share them privately.
Additional details from @obeattie https://community.monzo.com/t/current-account-payments-may-fail-major-outage/26296/95?u=oliver
3.2.10 is available
This problem should be solved or at least mitigated by https://github.com/kubernetes/kubernetes/pull/57160. The PR bumps both gRPC and the etcd client to fix the timeout problem caused by connection resets and balancing.
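For anyone wanting to verify whether the build they run already carries the bumped client, one option is to inspect the vendored etcd client in the Kubernetes source tree for that release (a hedged sketch; the Godeps manifest applies to the 1.10-era tree and the path may differ in other releases):

```
# In a checkout of the kubernetes release you are running:
grep -A 2 'github.com/coreos/etcd/clientv3"' Godeps/Godeps.json | head
```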
I am closing out this issue. If anyone still sees the timeout problem in a release (after 1.10) with https://github.com/kubernetes/kubernetes/pull/57160, please create a new issue with reproduction steps.
I’m hitting this error on 1.7.5 and etcd 3.1.8, but not due to an etcd rollout; it simply started occurring on my cluster, to the point that it’s actively affecting usability. Are there any known workarounds?
@obeattie I’m sooo sorry. I’ll update the client tomorrow, and I’m going to poke folks about getting the next rev in line for release. /cc @luxas @roberthbailey @jbeda
@jpbetz
It would be great if we could get https://github.com/coreos/etcd/issues/8711 done to rule out etcd issues.
etcdctl ls / uses the etcd v2 API, which is not what kubernetes 1.6+ uses by default. ETCDCTL_API=3 etcdctl get --prefix / uses the etcd v3 API, which will show you your kubernetes data.
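To make the difference concrete, a short sketch (the endpoint is an assumption; Kubernetes 1.6+ with the etcd3 backend keeps its data under the /registry prefix by default):

```
# v2 API: does not show keys written through the v3 API.
etcdctl --endpoints=http://127.0.0.1:2379 ls /

# v3 API: lists the keys Kubernetes actually writes.
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 get /registry --prefix --keys-only
```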
Similar problem (see log below).
In short: I installed a new K8s master node based on this guide: https://coreos.com/kubernetes/docs/latest/getting-started.html. When trying to start the kubelet service on it, everything starts up, but the apiserver keeps crashing (?). Because of that, the worker node can’t register itself, and the master node isn’t fully working either.
CoreOS Linux: 1492.4.0, etcd version: 3.2.5 (cluster of 4 nodes), kubernetes version: 1.6.1_coreos.0
Occasionally it runs further: