kubernetes: kube-proxy iptables rules block connections to a NodePort Service with externalTrafficPolicy: Cluster

What happened:

  1. There is a 5-node k8s cluster with two masters. The master taints are set up so that workloads also run on the masters.

  2. We deploy Kong as a Kubernetes Service of type NodePort (config here). External applications open TCP connections (HTTP 1.0, no keep-alive) to perform health checks every 15 seconds. These health checks pass and fail randomly because of TCP connection issues; sometimes the 3-way handshake never completes. See more details below; a sketch of an equivalent probe appears after this list.

  3. This is not a problem with other proxies; it only happens when Kong is running. One would suspect something is wrong with Kong, but we have stripped the health-check endpoint down to a simple Nginx instance that returns a 200 on location /health.
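
For reference, the external probe can be emulated with something along these lines (a minimal sketch only; the node IP 10.0.0.11 and NodePort 30080 are hypothetical placeholders, not our actual values):

# HTTP/1.0 with "Connection: close": a fresh TCP connection for every probe, no keep-alive
while true; do
  curl --http1.0 -H 'Connection: close' --max-time 5 -s -o /dev/null \
       -w '%{http_code} %{time_total}\n' http://10.0.0.11:30080/health
  sleep 15
done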

What you expected to happen:

TCP connections should succeed.

How to reproduce it (as minimally and precisely as possible):

This is extremely hard to reproduce.

Anything else we need to know?:

  • In this setup, the issue above happens only on the specific node on which the pod is running. Connections to the pod via other k8s worker/master nodes always succeed.
  • On the worker node on which the pod runs, the connection succeeds when hitting the Docker container directly, but it fails when the connection is made to the IP of the worker node. That is to say, as soon as the iptables rules kick in, things go wrong.
  • Connections do succeed sporadically, but it is totally random. There are no errors in the kernel logs.
  • tcpdump and the conntrack table show that the SYN arrives on the host network, but the connection then times out (commands along the lines of the sketch after this list).
  • externalTrafficPolicy: Local works fine and has no issues at all in this setup.
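
The tcpdump/conntrack observations above can be reproduced with commands along these lines (a sketch; 30080 is again a hypothetical NodePort, and the commands are run on the node that hosts the pod):

$ tcpdump -ni any 'tcp port 30080 and tcp[tcpflags] & (tcp-syn|tcp-ack) != 0'
$ conntrack -L -p tcp --orig-port-dst 30080     # list conntrack entries for the NodePort
$ conntrack -E -p tcp --orig-port-dst 30080     # follow conntrack events live

The sysctl settings in place: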
$ sysctl -p
net.core.somaxconn = 50000
net.ipv4.tcp_max_syn_backlog = 50000
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_time = 2500
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_forward = 1
net.ipv4.ip_local_reserved_ports = 30000-32767
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-arptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
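
For completeness, the applied values and the conntrack table occupancy can be checked at runtime, e.g.:

$ sysctl net.ipv4.tcp_tw_recycle net.ipv4.tcp_tw_reuse
$ conntrack -C                            # current number of tracked connections
$ sysctl net.netfilter.nf_conntrack_max   # conntrack table capacity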

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:27:17Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.0", GitCommit:"2bd9643cee5b3b3a5ecbd3af49d09018f0773c77", GitTreeState:"clean", BuildDate:"2019-09-18T14:27:17Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: Bare-metal cluster
  • OS (e.g: cat /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"
CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a):
Linux nsd-on-hood-k8s-master-01 3.10.0-957.27.2.el7.x86_64 #1 SMP Mon Jul 29 17:46:05 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Network plugin and version (if this is a network-related bug): Calico and Flannel; both give the same result
  • Others:

We’ve been trying to debug this for weeks but have not been able to make much progress. Any clue as to what could be going wrong here? We’re happy to run other tests in the cluster or provide more details as necessary.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 18 (13 by maintainers)

Most upvoted comments

I’m sorry about all the confusion in this thread. The externalTrafficPolicy values are named in such a way that it is almost impossible to figure out which one does what, and cloud provider implementations (except Google’s) not respecting them makes the whole thing even more difficult.

To clear up the confusion between the two traffic policies:

  • Local: In this case, only the worker nodes that are running the pod respond to requests. The local kube-proxy sends traffic to the local pod and everything is fine. We currently run a DaemonSet with Local to work around the bug in this issue.
  • Cluster: This is what we would like to use, but we can’t. Suppose two pods are running in a cluster of 5 worker nodes. In this case, a request can be sent to any of the worker nodes, and I expect kube-proxy to forward it to a node that is actually running the pod. This is where we observe the bug. (Switching between the two policies is sketched below.)
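
Switching a Service between the two policies can be sketched as follows (the namespace kong and Service name kong-proxy are hypothetical placeholders, not necessarily our real names):

# Inspect the current policy
$ kubectl -n kong get svc kong-proxy -o jsonpath='{.spec.externalTrafficPolicy}'

# Switch to Local (the workaround we currently run) ...
$ kubectl -n kong patch svc kong-proxy -p '{"spec":{"externalTrafficPolicy":"Local"}}'

# ... or back to Cluster (where we observe the bug)
$ kubectl -n kong patch svc kong-proxy -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'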

The bug is that if we send a request to the worker node that is actually running the pod, it doesn’t work. The first TCP connection succeeds (no keepalives), but subsequent connections then fail for some time; a connection succeeds only sporadically. The same behavior persists even when the source IP is different.
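
A minimal handshake-only loop that shows the pattern (again using the hypothetical node IP 10.0.0.11 and NodePort 30080):

for i in $(seq 1 20); do
  # bash's /dev/tcp performs just the TCP handshake; the connection is closed when the subshell exits
  if timeout 3 bash -c 'exec 3<>/dev/tcp/10.0.0.11/30080' 2>/dev/null; then
    echo "attempt $i: ok"
  else
    echo "attempt $i: FAILED"
  fi
  sleep 1
done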

> convinced that conntrack is at fault here

Correct. We used conntrack for visibility and observations only.

> Do you have any network policies installed on these clusters?

No

@aojea Already tested that but no luck.