kubernetes: Slow connects (TCP retransmits) via NodePort
What happened: In our architecture we have an external (out-of-cluster) ingress controller based on HAProxy plus self-written scripts for Kubernetes service discovery (2 instances). About 60 Kubernetes services are exposed via NodePort on 8 Kubernetes nodes. Each node runs kube-proxy in iptables mode.
Everything worked fine, but after the cluster took on more load (HTTP requests per second / concurrent connections) we started experiencing slow connects to services exposed via NodePort because of TCP retransmits.
We currently peak at 4k RPS / 80k concurrent connections. TCP retransmits start at ~1k RPS / 30k concurrent connections.
The strangest thing is that the retransmit count is not the same for every haproxy/kube-node pair. For example, haproxy1 sees retransmits from kube-nodes 1, 2, 3 and 8, while haproxy2 sees almost none from those nodes; instead, haproxy2 sees retransmits from kube-nodes 4, 5, 6 and 7. The pattern is almost mirrored.
See attachments for clarification. HAProxy is configured with a 100 ms connect timeout, so it redispatches the connection on timeout.
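For reference, the relevant HAProxy settings look roughly like this; backend and server names, addresses and the NodePort are placeholders, and only the 100 ms connect timeout and redispatch behaviour come from our actual setup:
```
# Sketch of the backend settings described above (names/addresses are placeholders).
backend nodeport-service
    balance roundrobin
    timeout connect 100ms   # give up on a connect attempt after 100 ms
    retries 3               # retry the TCP connect a few times
    option redispatch       # on retry, try a different kube-node
    server kube-node1 10.0.0.1:30080 check
    server kube-node2 10.0.0.2:30080 check
```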
What you expected to happen: No TCP retransmits, fast connects.
How to reproduce it (as minimally and precisely as possible): Create 50-60 deployments plus NodePort-exposed services on a few nodes. Load them with 1k+ RPS and 30k+ concurrent connections. Observe slow connects (1s, 3s, 6s…). A reproduction along these lines should work (the image, names and load generator below are illustrative placeholders, not our exact setup):
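```
# Create ~60 NodePort-exposed deployments (image and names are placeholders).
for i in $(seq 1 60); do
  kubectl create deployment "echo-$i" --image=k8s.gcr.io/echoserver:1.10
  kubectl expose deployment "echo-$i" --type=NodePort --port=8080
done

# Look up one allocated NodePort and drive load against a node, e.g. with wrk.
PORT=$(kubectl get svc echo-1 -o jsonpath='{.spec.ports[0].nodePort}')
wrk -t8 -c30000 -d120s --latency "http://<kube-node-ip>:${PORT}/"
```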
Anything else we need to know?: In-cluster communication goes through flannel (without CNI) in host-gw mode.
We tried different sysctls on the nodes and on the haproxies. We tried IPVS mode and got far more TCP retransmits. We also tried iptables 1.6.2 with the latest flanneld to fix the NAT bugs described in this article: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02
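The NAT race in that article shows up as conntrack insertion failures, so a quick check along these lines (a sketch, run on the kube-nodes) shows whether the nodes are hitting it:
```
# A growing insert_failed counter points at the SNAT port-allocation race
# described in the article above.
conntrack -S                      # per-CPU stats; watch the insert_failed column
cat /proc/net/stat/nf_conntrack   # same counters without the conntrack tool (hex)

# Show which masquerade rules kube-proxy / flannel / docker actually installed.
iptables-save -t nat | grep -E 'MASQUERADE|KUBE-POSTROUTING'
```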
As a test we installed an out-of-cluster reverse proxy on the kube-nodes to pass traffic from outside to Kubernetes services and pods: no problems. There are also no problems with HostPort-exposed services.
Environment:
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:39:52Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.6", GitCommit:"b1d75deca493a24a2f87eb1efde1a569e52fc8d9", GitTreeState:"clean", BuildDate:"2018-12-16T04:30:10Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: Masters: 5 x KVM VMs, Ubuntu 16.04.6 LTS (Xenial Xerus) / 4.15.0-45-generic. Nodes: bare-metal Supermicro, Intel® Xeon® CPU E5-2695 / 128 GB RAM
- OS (e.g. cat /etc/os-release):
  Prod nodes: Ubuntu 16.04.6 LTS (Xenial Xerus)
  Test node: Debian GNU/Linux 9 (stretch)
- Kernel (e.g. uname -a):
  Prod nodes: Linux hw-kube-n1.alaps.kz.prod.bash.kz 4.15.0-45-generic #48~16.04.1-Ubuntu SMP Tue Jan 29 18:03:48 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  Test node: Linux hw-kube-n8.--- 4.9.0-8-amd64 #1 SMP Debian 4.9.144-3.1 (2019-02-19) x86_64 GNU/Linux
- Install tools: Mix of "the hard way" and Ansible.
- Others: I don't know whether this is a kube-proxy / iptables problem or whether I'm just missing some sysctls / kernel parameters.
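For anyone comparing setups, these are the kinds of limits and counters worth inspecting on the nodes (a sketch; names only, not the values we used):
```
# Backlog and conntrack limits that commonly matter for delayed SYN-ACKs.
sysctl net.core.somaxconn \
       net.ipv4.tcp_max_syn_backlog \
       net.netfilter.nf_conntrack_max \
       net.netfilter.nf_conntrack_count

# Listen-queue overflows/drops on the node would also show up here.
nstat -az TcpExtListenOverflows TcpExtListenDrops
```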
About this issue
- State: closed
- Created 5 years ago
- Reactions: 6
- Comments: 54 (26 by maintainers)
We've been experiencing a nearly identical problem with a nearly identical haproxy setup to @zigmund's. From the haproxy side it manifests as a high Tc value, which we can mitigate somewhat by setting a low timeout connect plus a high number of retries. It manifests both for ingress into NodePort services and for egress from the Pod network to outside of the cluster. In both cases we see a delay between the SYN and SYN-ACK when opening connections (both into and out of the cluster). The egress is more difficult to mitigate and results in failed connections to external services such as databases. In a packet capture, it looks like this:
Note: there is a 750ms delta between the SYN and SYN-ACK packet.
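The gap is easy to see with a capture filtered to SYN and SYN-ACK packets, for example (interface and port are placeholders):
```
# -ttt prints the time delta between packets, so the SYN -> SYN-ACK gap is
# visible directly; interface and NodePort are placeholders.
tcpdump -ni eth0 -ttt 'tcp port 30080 and (tcp[tcpflags] & tcp-syn != 0)'
```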
This occurs on all of our clusters regardless of Kubernetes, Calico (our CNI), or kernel version. Our clusters range from Kubernetes 1.10 up to 1.14.1 w/ kernel version 4.9 or 4.19. More heavily utilized clusters see this problem more frequently, but even mostly idle clusters demonstrate this behavior.
We did identify a single workload that seemed to be exacerbating this problem, though we don't know why. Removing the application improved performance but did not eliminate the behavior. The only defining characteristic of this application was that it:
To quantify this, we have a DaemonSet that does nothing but make tiny HTTP calls out to an external load balancer with a low TCP connect timeout and lots of retries with exponential backoff. We find that, frequently, even after 10 retries the TCP connection will still time out.
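The probe is essentially equivalent to something like this (URL and timeout values are placeholders; curl's --retry already doubles the wait between attempts):
```
# Tiny HTTP call with a short connect timeout and several retries; the URL
# is a placeholder for the external load balancer.
curl -sS -o /dev/null --connect-timeout 1 --retry 10 --retry-connrefused \
     http://external-lb.example.com/healthz
```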
I’m happy to provide packet captures, settings, and communicate anything we find along the way.
If it's helpful, https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/ describes the approach we took at GitHub to find and mitigate most of these network stalls.
That is our normal load, which varies over the day. I added the kube-proxy rule at ~9:30 and the docker rule at ~13:00.
Since Docker has its own masquerade rule, I decided to add --random-fully to that rule too. That rule handles outgoing traffic from containers.
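For context, masquerade rules with fully random SNAT port allocation look roughly like this; the chain names, the 0x4000 mark and the 172.17.0.0/16 subnet are the usual kube-proxy/Docker defaults and may differ in other setups (needs iptables >= 1.6.2 and a kernel that supports fully random SNAT):
```
# Sketch of the kube-proxy and Docker MASQUERADE rules with --random-fully added;
# chains, mark and subnet are common defaults, not necessarily this cluster's values.
iptables -t nat -A KUBE-POSTROUTING -m mark --mark 0x4000/0x4000 \
         -j MASQUERADE --random-fully
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 \
         -j MASQUERADE --random-fully
```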