kubernetes: apiserver's etcd client can get into a non-operational state, all calls on that resource 504

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened: The API server can get into a state where its etcd client can no longer process connections, so every call to an endpoint that uses that client returns a 504 (timeout). I observed this specifically in the following situation:

We turned on datadog-agent v6, which does a WATCH on the Events resource to collect events. Shortly afterwards, we noticed that the events API would stop processing requests and would always return 504. This is the offending code in datadog-agent: https://github.com/DataDog/datadog-agent/blob/7820efa6b45722ce141e6af46cb0d4ef966e4e7d/pkg/util/kubernetes/apiserver/events.go

After spending some time debugging this, I was able to reproduce the problem with a straightforward program that mimics the same behavior; see this gist: https://gist.github.com/mahmoudm/287824642c963011d49413fd3b9552da

Specifically, if you issue WATCH calls and then stop the watch before the API server has had a chance to respond, you leak connections in the corresponding etcd client. Eventually the client runs out of gRPC streams and all calls get stuck waiting for new ones. See this thread dump of an API server in that state: https://gist.github.com/mahmoudm/94ad32718b54f86f3ecfc1f9fe02bff2

Note the 287 goroutines stuck at k8s.io/kubernetes/vendor/google.golang.org/grpc/transport.wait, waiting for a stream.
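
For illustration, here is a minimal sketch of the general pattern (not the linked gist itself), assuming client-go with the pre-context signatures from the 1.9/1.10 era; the kubeconfig path and timings are placeholders:

```go
// Illustrative sketch only (not the linked gist): hammer the Events endpoint
// with WATCH requests that are given up on before the API server has had a
// chance to respond. Kubeconfig path and timings are assumptions.
package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	// Client-side request timeout shorter than the apiserver's time-to-first-byte
	// under load, so the WATCH is abandoned before the server responds.
	config.Timeout = 100 * time.Millisecond
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	for {
		w, err := client.CoreV1().Events(metav1.NamespaceAll).Watch(metav1.ListOptions{})
		if err == nil {
			// If the server did answer in time, stop the watch immediately anyway.
			w.Stop()
		}
		time.Sleep(50 * time.Millisecond) // pause between watch attempts
	}
}
```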

What you expected to happen:

I expect that terminating WATCHes in the pattern described above should not leave the API server in a state where the corresponding resource stops working entirely.

How to reproduce it (as minimally and precisely as possible):

  1. Build the Go program in https://gist.github.com/mahmoudm/287824642c963011d49413fd3b9552da

  2. Run the binary with values such that your WATCH call times out before it receives a response. In other words, there needs to be enough load on your etcd cluster that you don’t receive a response before calling Stop() on the watcher. I am able to reproduce it with these values:

`./watchs-spam ~/.kube/config 100ms 50ms`, with a 2000-pod deployment and 100 nodes on my cluster.

  3. Observe that after a few minutes, the events API will no longer respond and will consistently return 504s for any event call. If you look at the goroutines, you’ll see that they are all stuck waiting for streams since they are no longer able to acquire any.
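
To make step 3 easier to observe, here is a hedged sketch of a probe (again illustrative client-go code, not part of the gist) that lists Events with a short client-side timeout, so the stuck state shows up as repeated fast failures rather than hung calls:

```go
// Illustrative probe: list Events with a short client-side timeout so the
// stuck state shows up as fast, repeated errors instead of hung calls.
package main

import (
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	config.Timeout = 10 * time.Second // fail fast instead of hanging
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	for {
		start := time.Now()
		_, err := client.CoreV1().Events(metav1.NamespaceAll).List(metav1.ListOptions{Limit: 1})
		log.Printf("list events: took=%v err=%v", time.Since(start), err)
		time.Sleep(5 * time.Second)
	}
}
```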

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): I was able to repro on both 1.9.8 and 1.10.5
  • Cloud provider or hardware configuration: aws
  • OS (e.g. from /etc/os-release): CentOS Linux 7
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

Most upvoted comments

I am also facing this issue while doing etcd negative testing. I brought down the etcd leader node and ran `kubectl get pods --all-namespaces`.

It took the API server around 20 minutes to resume its connection with the next etcd member. etcd itself was working fine, since I could see leader election happening properly. So the conclusion is that the connection between the API server and etcd takes far too long to be retried.

After more experimentation, it appears that the etcd client in the API server does not honor keepalives, and the connection goes through roughly 15 TCP retries with exponential backoff before the existing connection to etcd is torn down.

By varying the number of retries via the TCP retry sysctls (e.g. `sysctl -w net.ipv4.tcp_retries2=1` and `sysctl -w net.ipv4.tcp_syn_retries=2`), I can vary the time the API server takes to re-establish its connection with etcd.

I ran experiments to determine how the number of retries affects the connection reset between etcd and the API server:

  • retries = 1: immediate, no delay in the API server's connection to etcd
  • retries = 5: a few seconds (~2-3 seconds)
  • retries = 10: ~5 minutes
  • retries = 15: Kubernetes response time of ~20 minutes

Similarly, setting the keepalive to 10 seconds and reducing the time between keepalive probes at the TCP level also enabled the API server to talk to the next available etcd member.
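
For reference, the etcd v3 client does expose gRPC-level keepalive knobs; below is a minimal sketch of setting them on a standalone clientv3 client. The import path, endpoints, and values are assumptions, and this is not necessarily how the apiserver wires up its own etcd client:

```go
// Illustrative sketch: configure gRPC keepalives on an etcd clientv3 client so
// a dead endpoint is detected quickly instead of waiting out the kernel's TCP
// retransmission backoff. Endpoints and timing values are assumptions.
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/clientv3" // "github.com/coreos/etcd/clientv3" on older etcd versions
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd-0:2379", "https://etcd-1:2379", "https://etcd-2:2379"},
		DialTimeout: 5 * time.Second,
		// Ping the server every 10s; if no response within 5s, consider the
		// connection dead and fail over to another endpoint.
		DialKeepAliveTime:    10 * time.Second,
		DialKeepAliveTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}
```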

What is the update on this issue? When can we expect it to be fixed?