kubernetes: kube-proxy: "Drop packets in INVALID state" rule drops packets from outside the pod range

What happened: When kube-proxy is configured with a cluster-cidr range, it drops packets with an INVALID conntrack state from all sources on the KUBE-FORWARD chain. This is meant to prevent spurious retransmits in long-lived TCP connections to a service IP, but it prevents clusters from using asymmetric routing on private subnets with services outside of the cluster.

What you expected to happen: Packets on the KUBE-FORWARD chain with an INVALID conntrack state should be dropped only for sources inside the cluster-cidr, if one is set.

How to reproduce it (as minimally and precisely as possible): N/A

Anything else we need to know?: The change to drop INVALID conntrack packets was introduced in https://github.com/kubernetes/kubernetes/pull/74840. Backing out that commit and recompiling is my current workaround.
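For reference, the rule that PR adds via the iptables proxier looks roughly like the following (a sketch; the exact comments and rule ordering on a real node may differ):

# appended by kube-proxy to the KUBE-FORWARD chain in the filter table (sketch)
# any forwarded packet whose conntrack state is INVALID is dropped,
# regardless of its source or destination
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP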

Environment:

  • Kubernetes version (use kubectl version): 1.17.2
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
  • Kernel (e.g. uname -a): Linux 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 4
  • Comments: 67 (45 by maintainers)

Most upvoted comments

The same issue affects outgoing traffic if packets are not masqueraded to the node IP, which should be a common setup for IPv6; please see https://github.com/kubernetes/kubernetes/issues/94861#issuecomment-899119715 @telmich

Update

For Calico I think you can set up BGP for the cluster CIDRs and route incoming packets for pods to the correct K8s node, instead of using ECMP. This would solve the problem and save the extra hop.

I did not understand how you could get asymmetric routing in K8s, but reading up more carefully I saw:

using a custom ingress solution (HAProxy w/an outside process configuring it for pod ips / mesos ip-ports) for both

If I understand this correctly, this is a load-balancer outside K8s that load-balances directly to pod addresses. To get asymmetric routing, some kind of Direct Server Return (DSR) must also be used. So, to get asymmetric routing:

  1. Traffic from an external source is directed to an external LB
  2. The LB is configured with pod IPs as targets and selects one
  3. The packet is routed to a K8s node, e.g. with ECMP. But it may not be the node where the target pod is running
  4. The packet is handled by the pod, but the response packet may be directed to some other gateway

This will give asymmetric routing. But it’s certainly not the everyday K8s network setup.

Given the above it’s fairly easy to reproduce the problem. You don’t even have to set up a load-balancer. The biggest problem may be that you need two ways into the cluster. Example:

(figure: net-setup — diagram of the test network)

I set up direct routes that ensure asymmetric routing, which can be verified with tcpdump on vm-201 and vm-202. From vm-221 I can now ping pod IPs directly, but due to the problem in this issue, I can’t make a TCP connection:

vm-221 ~ # ping -c1 11.0.1.2
PING 11.0.1.2 (11.0.1.2) 56(84) bytes of data.
64 bytes from 11.0.1.2: icmp_seq=1 ttl=62 time=1.42 ms

--- 11.0.1.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.416/1.416/1.416/0.000 ms
vm-221 ~ # nc 11.0.1.2 5001
(hangs...)

The POD with address 11.0.1.2 runs on vm-001.

A similar setup as above can be used to verify a PR.
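For anyone trying to reproduce this without the exact topology above: the asymmetry can be created with plain static routes. A hypothetical sketch (the addresses and interface below are placeholders, not taken from the diagram):

# on the external client (vm-221): send pod-bound traffic in via one gateway
ip route add 11.0.1.0/24 via <address-of-vm-201>

# on the node hosting the pod (vm-001): leave the default route pointing at
# the other gateway, so replies leave via a different path than the requests
ip route add default via <address-of-vm-202>

# confirm the asymmetry by watching both paths while running nc
tcpdump -ni eth0 host 11.0.1.2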

Now, there is a simple work-around: use proxy-mode=ipvs

The update with the DROP rule never made it to the ipvs proxier (a rather common case 😢 ), but in this case it may save your day. The same as above, but with proxy-mode=ipvs:

vm-221 ~ # ping -c1 11.0.1.2
PING 11.0.1.2 (11.0.1.2) 56(84) bytes of data.
64 bytes from 11.0.1.2: icmp_seq=1 ttl=62 time=2.27 ms

--- 11.0.1.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.273/2.273/2.273/0.000 ms
vm-221 ~ # nc 11.0.1.2 5001
tserver-55b644d88b-5lsq5

The server on 5001 responds with the POD name.
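For completeness, switching the proxy mode is a single flag (or the equivalent mode field in a KubeProxyConfiguration file). A sketch, assuming kube-proxy is run directly and the pod CIDR is 11.0.0.0/16 as implied by the example; note that the ipvs proxier also needs the ip_vs kernel modules loaded on the nodes:

# run kube-proxy with the ipvs proxier instead of the iptables proxier
kube-proxy --proxy-mode=ipvs --cluster-cidr=11.0.0.0/16 ...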

OK, so “drop if INVALID and to a pod” helps us identify cluster traffic, right? We would only drop cluster traffic if it’s invalid, but this will not solve the problem with external IPs. So do we need a flag to disable the “drop if INVALID” rule?

So in the absence of a way to fix the problem with external IPs, can we have a flag to decide whether to create this rule or not?

If true (the default), create the “drop if INVALID and to a pod” rule; otherwise, do not create the rule. I believe that even without this rule we can solve this problem via a sysctl, right?
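Purely as an illustration of the proposal, the behaviour could be summarised like this (the flag name below is made up and does not exist in kube-proxy):

# hypothetical flag, shown only to illustrate the proposal above
#   --drop-invalid-state-packets=true  (default) -> install the rule
#       -A KUBE-FORWARD -d <cluster-cidr> -m conntrack --ctstate INVALID -j DROP
#   --drop-invalid-state-packets=false           -> install no such rule;
#       the admin relies on the sysctl workaround instead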

I think we should do the following things:

  • We need a flag to disable the rule
  • We should change the “drop if INVALID” rule to “drop if INVALID and to a pod”
  • We document the sysctl as a way to solve the problem (see the sketch after this list).
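Presumably the sysctl in question is net.netfilter.nf_conntrack_tcp_be_liberal, one of the mitigations discussed around this bug: with it set, conntrack stops marking out-of-window TCP packets as INVALID, so the DROP rule no longer matches them. A sketch of setting it on every node:

# apply at runtime on each node
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1

# persist across reboots
echo 'net.netfilter.nf_conntrack_tcp_be_liberal = 1' > /etc/sysctl.d/90-conntrack-liberal.conf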

There’s related discussion in #117924. I think we should create whatever iptables rules we can that will fix up our own traffic without interfering with other traffic, and also document the sysctl that admins can use to change conntrack behavior to work around the problem.

So probably, we should change the “drop if INVALID” rule to “drop if INVALID and to a pod”, and also address the problem in #117924, but not try to solve the problem with external IPs I mentioned in https://github.com/kubernetes/kubernetes/issues/94861#issuecomment-1548020178 because we don’t have a good way of fixing that without potentially breaking other things. (And it’s kind of way more of an edge case anyway.)

Note that LocalTrafficDetector can currently output rules for “from a pod”, but not for “to a pod”, so that will need some tweaking. (FWIW LocalTrafficDetector will also eventually need the ability to output nftables rules rather than iptables rules so that’s another thing to think about if someone is going to change its API.)
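In raw iptables terms the proposed change would presumably look something like this (a sketch, using 11.0.0.0/16 as an example cluster CIDR; a real implementation would go through LocalTrafficDetector rather than hard-coding the CIDR):

# today (simplified): drop every forwarded packet in INVALID state
-A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP

# proposed (sketch): only drop it when it is headed to a pod,
# i.e. when the destination is inside the cluster CIDR
-A KUBE-FORWARD -d 11.0.0.0/16 -m conntrack --ctstate INVALID -j DROP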

I think that is only “INVALID and to a pod”: https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/

So, based on that, the bug happens when a reply packet fails to get un-DNAT’ed, causing the client to then send a RST to the server, which matches the source/destination IP that the server knows of.

But this could happen if you had a service with external IPs too; in that case the reply packet would be from the external endpoint IP to the node IP (because the packet would have been masqueraded), and if conntrack didn’t unmap it correctly then the node would send a RST to the external IP, breaking the connection in just the same way.

In this case there is no good way kube-proxy can recognize the packet as being its own, other than by using conntrack, which presumably won’t work if the packet is INVALID?

@aojea I don’t understand how that will help me here. The packets with a ctstate INVALID are not generated due to conntrack saturation (as described here), but due to asymmetric routing in our network. See the blog post I linked earlier for an example of a similar setup.
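As a side note, a quick way to confirm that this rule really is what eats the traffic is to watch its packet counters, and optionally log the INVALID packets before they reach it; a sketch:

# show per-rule packet/byte counters for the chain kube-proxy manages
iptables -nvL KUBE-FORWARD

# optionally log forwarded INVALID packets for inspection (diagnostic only)
iptables -I FORWARD 1 -m conntrack --ctstate INVALID -j LOG --log-prefix "ct-invalid: "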