kubernetes: IPVS de-DNAT bug: a Pod (or container) often cannot access two virtual server IPs (VIPs) with the same endpoint IP (real server) at the same time

What happened?

A Pod cannot reliably access two VIPs that share the same endpoint IP at the same time. Likewise, a Pod cannot reliably access a VIP and its endpoint IP at the same time.

VIP1: TCP 10.85.128.213:80 rr -> 10.244.1.132:9376 Masq 1 0 0
VIP2: TCP 10.85.128.215:80 rr -> 10.244.1.132:9376 Masq 1 0 0

  1. Access two VIPs at the same time
Enter a pod and run the following four commands in parallel:
while true;do nc -w 1 10.85.128.213 80 < /dev/null;done
while true;do nc -w 1 10.85.128.213 80 < /dev/null;done
while true;do nc -w 1 10.85.128.215 80 < /dev/null;done
while true;do nc -w 1 10.85.128.215 80 < /dev/null;done

A few minutes later there are lots of errors:
Ncat: Connection timed out.
  2. Access a VIP and its endpoint IP at the same time
Enter a pod and run the following two commands in parallel:
while true;do nc -w 1 10.85.128.213 80 < /dev/null;done
while true;do nc -w 1 10.244.1.132 9376 < /dev/null;done

A few minutes later there are lots of errors:
Ncat: Connection timed out.

This problem does not happen with proxy-mode iptables.

I read the kernel TCP connect port-assignment code, the netfilter conntrack code, and the IPVS code, and found a bug in IPVS de-DNAT: the IPVS connection tracking table matches the wrong entry during de-DNAT, while the conntrack table used by iptables has no such problem.
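
To make the ambiguity concrete, here is a small standalone C model (a sketch, not kernel code): `match_out()` mirrors the comparison that `ip_vs_conn_out_get()` performs on the reply path in 5.4, which checks only the client and real-server address/port and never consults the VIP stored in the entry. The struct and function names are illustrative, and the pod address 10.244.1.10 with source port 43021 is made up for illustration; the VIPs and endpoint come from the setup above.

```c
/* Standalone model (not kernel code) of the reply-direction lookup. */
#include <stdio.h>
#include <string.h>

struct conn_entry {                 /* simplified stand-in for ip_vs_conn   */
	const char *caddr; int cport;   /* client (the pod)                     */
	const char *vaddr; int vport;   /* virtual server (VIP)                 */
	const char *daddr; int dport;   /* real server (endpoint)               */
};

struct reply_pkt {                  /* reply from the real server            */
	const char *saddr; int sport;   /* real server                          */
	const char *daddr; int dport;   /* client (the pod)                     */
};

/* Mirrors the match condition of ip_vs_conn_out_get() in 5.4:
 * only client and real-server address/port are compared; the VIP
 * stored in the entry is never consulted. */
static int match_out(const struct conn_entry *cp, const struct reply_pkt *pkt)
{
	return  pkt->dport == cp->cport && pkt->sport == cp->dport &&
		strcmp(pkt->daddr, cp->caddr) == 0 &&
		strcmp(pkt->saddr, cp->daddr) == 0;
}

int main(void)
{
	/* The pod reuses local port 43021 for both VIPs; the TCP stack
	 * allows this because the destinations differ.                  */
	struct conn_entry via_vip1 = { "10.244.1.10", 43021,
	                               "10.85.128.213", 80,
	                               "10.244.1.132", 9376 };
	struct conn_entry via_vip2 = { "10.244.1.10", 43021,
	                               "10.85.128.215", 80,
	                               "10.244.1.132", 9376 };

	/* One reply from the real server to that client socket.         */
	struct reply_pkt reply = { "10.244.1.132", 9376, "10.244.1.10", 43021 };

	printf("reply matches entry for VIP1: %d\n", match_out(&via_vip1, &reply));
	printf("reply matches entry for VIP2: %d\n", match_out(&via_vip2, &reply));
	return 0;
}
```

Both entries match the same reply, so whichever one the hash-bucket walk finds first is used for de-DNAT; when it is the entry for the other VIP, the reply's source is rewritten to a VIP the client never contacted on that socket, the client's TCP stack discards it, and nc reports the timeouts shown above.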

I modified the IPVS connection tracking table matching code. The principle: during the IPVS connection lookup, also consult the information stored in the conntrack connection table. This solved the problem.


Two related issues:

  • Kube-proxy/ipvs; Frequently cannot simultaneous access clusterip and endpoint ip (real server) in pod: https://github.com/kubernetes/kubernetes/issues/90042
  • connection time out for cluster ip of api-server by accident: https://github.com/kubernetes/kubernetes/issues/90258

kernel-5.4.54 patch

diff -pur linux-5.4.54.orig/include/net/ip_vs.h linux-5.4.54/include/net/ip_vs.h
--- linux-5.4.54.orig/include/net/ip_vs.h	2021-12-09 14:23:21.483405600 +0800
+++ linux-5.4.54/include/net/ip_vs.h	2021-12-27 20:16:09.085029500 +0800
@@ -1204,6 +1204,7 @@ struct ip_vs_conn * ip_vs_conn_in_get_pr
 					    const struct ip_vs_iphdr *iph);
 
 struct ip_vs_conn *ip_vs_conn_out_get(const struct ip_vs_conn_param *p);
+struct ip_vs_conn *ip_vs_conn_out_new_get(const struct ip_vs_conn_param *p, const struct sk_buff *skb);
 
 struct ip_vs_conn * ip_vs_conn_out_get_proto(struct netns_ipvs *ipvs, int af,
 					     const struct sk_buff *skb,
diff -pur linux-5.4.54.orig/net/netfilter/ipvs/ip_vs_conn.c linux-5.4.54/net/netfilter/ipvs/ip_vs_conn.c
--- linux-5.4.54.orig/net/netfilter/ipvs/ip_vs_conn.c	2021-12-09 14:23:22.264155000 +0800
+++ linux-5.4.54/net/netfilter/ipvs/ip_vs_conn.c	2021-12-29 16:03:09.109127000 +0800
@@ -35,7 +35,7 @@
 
 #include <net/net_namespace.h>
 #include <net/ip_vs.h>
-
+#include <net/netfilter/nf_conntrack.h>
 
 #ifndef CONFIG_IP_VS_TAB_BITS
 #define CONFIG_IP_VS_TAB_BITS	12
@@ -436,6 +436,57 @@ struct ip_vs_conn *ip_vs_conn_out_get(co
 	return ret;
 }
 
+struct ip_vs_conn *ip_vs_conn_out_new_get(const struct ip_vs_conn_param *p, const struct sk_buff *skb)
+{
+	unsigned int hash;
+	struct ip_vs_conn *cp, *ret=NULL;
+	struct nf_conn *ct;
+	struct nf_conntrack_tuple *tuple;
+	enum ip_conntrack_info ctinfo;
+
+	/*
+	 *	Check for "full" addressed entries
+	 */
+	hash = ip_vs_conn_hashkey_param(p, true);
+
+	rcu_read_lock();
+
+	hlist_for_each_entry_rcu(cp, &ip_vs_conn_tab[hash], c_list) {
+		if (p->vport == cp->cport && p->cport == cp->dport &&
+		    cp->af == p->af &&
+		    ip_vs_addr_equal(p->af, p->vaddr, &cp->caddr) &&
+		    ip_vs_addr_equal(p->af, p->caddr, &cp->daddr) &&
+		    p->protocol == cp->protocol &&
+		    cp->ipvs == p->ipvs) {
+
+			ct = nf_ct_get(skb, &ctinfo);
+			if (likely(ct)) {
+				tuple = &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple;
+				if (!ip_vs_addr_equal(p->af, &tuple->dst.u3, &cp->vaddr) ||
+					tuple->dst.u.all != cp->vport) {
+					continue;
+				}
+			}
+
+			if (!__ip_vs_conn_get(cp))
+				continue;
+			/* HIT */
+			ret = cp;
+			break;
+		}
+	}
+
+	rcu_read_unlock();
+
+	IP_VS_DBG_BUF(9, "lookup/out %s %s:%d->%s:%d %s\n",
+		      ip_vs_proto_name(p->protocol),
+		      IP_VS_DBG_ADDR(p->af, p->caddr), ntohs(p->cport),
+		      IP_VS_DBG_ADDR(p->af, p->vaddr), ntohs(p->vport),
+		      ret ? "hit" : "not hit");
+
+	return ret;
+}
+
 struct ip_vs_conn *
 ip_vs_conn_out_get_proto(struct netns_ipvs *ipvs, int af,
 			 const struct sk_buff *skb,
@@ -446,7 +497,7 @@ ip_vs_conn_out_get_proto(struct netns_ip
 	if (ip_vs_conn_fill_param_proto(ipvs, af, skb, iph, &p))
 		return NULL;
 
-	return ip_vs_conn_out_get(&p);
+	return ip_vs_conn_out_new_get(&p, skb);
 }
 EXPORT_SYMBOL_GPL(ip_vs_conn_out_get_proto);

What did you expect to happen?

Stable access to two VIPs

How can we reproduce it (as minimally and precisely as possible)?

Enter a pod and run the following four commands in parallel:
while true;do nc -w 1 10.85.128.213 80 < /dev/null;done
while true;do nc -w 1 10.85.128.213 80 < /dev/null;done
while true;do nc -w 1 10.85.128.215 80 < /dev/null;done
while true;do nc -w 1 10.85.128.215 80 < /dev/null;done

Anything else we need to know?

No response

Kubernetes version

$ kubectl version
v1.22.4

Cloud provider

OS version

$ uname -a
5.4.54


### Install tools



### Container runtime (CRI) and version (if applicable)



### Related plugins (CNI, CSI, ...) and versions (if applicable)


About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 15 (7 by maintainers)

Most upvoted comments

This problem can be avoided by using a newer kernel.

That is the main drawback of kernel-based implementations: hard to upgrade, hard to fix, …

The VIP1 reply packet (skb) matched the wrong conntrack entry, which causes the access timeouts. The kernel-5.4.54 patch ensures the wrong IPVS connection is not matched, by matching on src ip:port + dst ip:port + virtual ip:port. For Kubernetes this problem can be avoided by using a newer kernel version, but on lower kernel versions we need to work around this bug manually. What can we do on the kube-proxy side?
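
To restate that check concretely, here is a small standalone C model (a sketch, not kernel code) of the extra comparison the kernel-5.4.54 patch above adds: the conntrack entry's ORIGINAL-direction tuple still records the VIP the pod actually dialed (conntrack keys on the full pre-DNAT 5-tuple), so comparing it against the candidate entry's vaddr/vport rejects the wrong IPVS connection. The names and the pod address/port (10.244.1.10:43021) are illustrative, not kernel identifiers.

```c
/* Standalone model (not kernel code) of the check added by the patch. */
#include <stdio.h>
#include <string.h>

struct conn_entry {                 /* simplified stand-in for ip_vs_conn */
	const char *caddr; int cport;   /* client (pod)                       */
	const char *vaddr; int vport;   /* VIP                                */
	const char *daddr; int dport;   /* real server                        */
};

struct ct_original {                /* conntrack ORIGINAL tuple: pod->VIP, */
	const char *src; int sport;     /* recorded with the destination the  */
	const char *dst; int dport;     /* pod dialed, before IPVS's DNAT     */
};

/* The patch keeps the old client/real-server comparison and additionally
 * requires the conntrack original destination to equal the VIP stored in
 * the candidate entry. */
static int vip_check(const struct conn_entry *cp, const struct ct_original *ct)
{
	return strcmp(ct->dst, cp->vaddr) == 0 && ct->dport == cp->vport;
}

int main(void)
{
	struct conn_entry via_vip1 = { "10.244.1.10", 43021, "10.85.128.213", 80,
	                               "10.244.1.132", 9376 };
	struct conn_entry via_vip2 = { "10.244.1.10", 43021, "10.85.128.215", 80,
	                               "10.244.1.132", 9376 };

	/* The reply skb belongs to the conntrack entry of the VIP2 connection. */
	struct ct_original ct = { "10.244.1.10", 43021, "10.85.128.215", 80 };

	printf("entry for VIP1 accepted: %d\n", vip_check(&via_vip1, &ct)); /* 0 */
	printf("entry for VIP2 accepted: %d\n", vip_check(&via_vip2, &ct)); /* 1 */
	return 0;
}
```

Note that the patch only applies this extra check when an nf_conn is attached to the skb (the `if (likely(ct))` branch), so behaviour is unchanged when no conntrack information is available.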