kubernetes: "Connection reset by peer" due to invalid conntrack packets

What happened?

When packets with an out-of-window sequence number arrive at a k8s node, conntrack marks them as INVALID. kube-proxy then ignores them and does not apply the DNAT rewrite. Because the host has no corresponding connection for these packets, it replies with a TCP RST, and the connection is interrupted.
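
A minimal sketch, assuming the conntrack CLI from conntrack-tools is installed, of how one might confirm that conntrack is flagging these packets (exact counter output varies by kernel version):

# Log packets that conntrack considers invalid for protocol 6 (TCP);
# the verdicts show up in the kernel log (dmesg/journalctl).
sysctl -w net.netfilter.nf_conntrack_log_invalid=6

# Show conntrack statistics; a growing per-CPU "invalid" counter is a hint
# that out-of-window packets are being flagged.
conntrack -S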

What did you expect to happen?

The connection should not be reset.

How can we reproduce it (as minimally and precisely as possible)?

https://github.com/kubernetes/kubernetes/issues/74839

Anything else we need to know?

This problem can be worked around with the following command:

iptables -t filter -I INPUT -p tcp -m conntrack --ctstate INVALID -j DROP

Similar issue: https://github.com/kubernetes/kubernetes/issues/74839. In that issue the DROP rule is placed on the FORWARD chain; in our scenario it needs to be placed on the INPUT chain.
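
For reference, the two workaround variants side by side (the FORWARD-chain rule is presumably the form used in #74839; adjust to your environment):

# Variant from #74839: drop INVALID packets that would be forwarded to a pod,
# so the reply path cannot trigger a RST.
iptables -t filter -I FORWARD -p tcp -m conntrack --ctstate INVALID -j DROP

# Variant for this issue: drop INVALID packets addressed to the host itself,
# before the kernel answers them with a RST.
iptables -t filter -I INPUT -p tcp -m conntrack --ctstate INVALID -j DROP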

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T18:03:20Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:57:25Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, …) and versions (if applicable)

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 46 (45 by maintainers)

Most upvoted comments

(The regression test in the e2e suite creates a packet with an intentionally-bad sequence number, but it’s not clear to me what sort of real-world issue that’s supposed to be representing.)

https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/

We can document that we think it’s a good idea for users/distros to set ip_conntrack_tcp_be_liberal, and if we think it’s almost always a better idea than not, we can warn at startup if it’s not set, but (IMO) we shouldn’t set it ourselves.

Agree in general. We have done it in the past (e.g. route-localnet) and mostly it seems like a not-great idea.

If we can get the same effect with an iptables rule that only affects kube-proxy’s own traffic, then I think that’s better than setting a sysctl that will also affect non-kube-proxy traffic.

Also agree.
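
For completeness, a sketch of what the administrator-side sysctl approach looks like (the /etc/sysctl.d/ file name below is just one common convention):

# Tell conntrack to be liberal about TCP window tracking, so out-of-window
# packets are not marked INVALID in the first place.
sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1

# One common way to persist the setting across reboots:
echo 'net.netfilter.nf_conntrack_tcp_be_liberal = 1' > /etc/sysctl.d/90-conntrack-liberal.conf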

IMO kube-proxy should not set any sysctls that are not literally required for functionality that the user has explicitly opted into (eg net.ipv4.ip_forward for most network plugins). Kube-proxy does not own the host network namespace, and it should not be doing things that will affect other people’s host-network traffic, because if we do it’s going to break some users. (See also: #94861.)

If I understand the situation here correctly, if kube-proxy added a drop rule for the invalid conntrack packets, but then the administrator set ip_conntrack_tcp_be_liberal, then the result would be that conntrack would not mark some packets as invalid, and so our drop rule would just not get hit, and so our drop rule wouldn’t interfere with the “better” sysctl-based solution?
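
A rough way to check that interaction on a node (a sketch; rule position and listing format vary):

# If conntrack is in liberal mode, out-of-window packets are no longer
# flagged INVALID...
sysctl net.netfilter.nf_conntrack_tcp_be_liberal

# ...so the DROP rule's packet counter should stay at zero.
iptables -t filter -L INPUT -v -n --line-numbers | grep INVALID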

Kernel ppl here concur:

"I say : if kernel can not be fixed, set tcp_be_liberal to one "

“having Kubernetes set net.netfilter.nf_conntrack_tcp_be_liberal=1 always SGTM, since you want it to work on any kernel version”

@aojea argues for a flag, default to true (or a negative flag, default false), which seems a bit paranoid, but probably smart 😃

Are we all in agreement? Who wants to do the PR?

There are three potential “fixes” here:

  1. If this is unambiguously a conntrack bug, then we should figure out the details, and get the kernel devs to fix it. However, getting kernel bug fixes into all k8s clusters in the world takes “a long time”, so even if it is a kernel bug, we should still think about the other fixes.
  2. Individual cluster admins can use nf_conntrack_tcp_be_liberal. We should advertise this better, but we feel that it would be dubious to have kube-proxy set this flag itself (https://github.com/kubernetes/kubernetes/issues/117924#issuecomment-1548163880).
  3. As per https://github.com/kubernetes/kubernetes/issues/94861#issuecomment-1626025213 we could probably tweak the existing rule so that it didn’t interfere with non-k8s packets ~and as suggested by the OP of this issue, we could add a similar rule to the INPUT chain (though this would need a bit of further thinking about to make sure we weren’t introducing any new conflicts with non-k8s traffic)~. A rough sketch of one possible scoping follows below.
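
As an illustration only, not kube-proxy's actual implementation: one hypothetical way to narrow the INPUT-chain drop so it only touches traffic kube-proxy would have DNAT'd is to restrict it to the cluster's NodePort range. The 30000:32767 range below is the kube-proxy default (--service-node-port-range) and is an assumption; clusters with a custom range would need to adjust it.

# HYPOTHETICAL sketch: drop INVALID packets only when they target the
# NodePort range, leaving unrelated host traffic alone.
iptables -t filter -I INPUT -p tcp --dport 30000:32767 \
    -m conntrack --ctstate INVALID -j DROP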