kubernetes: kube-proxy ipvs conn_reuse_mode setting causes errors with high load from single client

What happened: kube-proxy sets /proc/sys/net/ipv4/vs/conn_reuse_mode to zero (https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/proxier.go#L340).

This has the following effect when:

  • kube-proxy uses ipvs mode and an ipvs virtual server is configured with only a few real servers (pod replicas)
  • a client sends a lot of requests, so that source ports are reused before the MSL expires
  • a pod gets removed (e.g. via an upgrade) while it receives traffic from this client

What happens in this situation is that kube-proxy correctly sets the weight of the realserver to 0 and creates a new realserver for the new pod that replaces the removed one.

The problem is that kube-proxy will not remove the weight-zero realserver until its connections drop to zero. This never happens, because conn_reuse_mode is set to 0 and the client reuses its source ports, so the kernel constantly reuses the existing connections and keeps sending traffic to the weight-0 realserver.
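For context, a minimal sketch (not part of kube-proxy) of checking the conn_reuse_mode value that kube-proxy has applied on a node, using the procfs path linked above:

package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	// kube-proxy in ipvs mode sets this to 0; 1 avoids the stale-connection
	// reuse described above but has the performance cost discussed later.
	b, err := os.ReadFile("/proc/sys/net/ipv4/vs/conn_reuse_mode")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read conn_reuse_mode:", err)
		os.Exit(1)
	}
	fmt.Println("conn_reuse_mode =", strings.TrimSpace(string(b)))
}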

What you expected to happen: When the old pod is removed, its realserver no longer receives traffic and is removed by kube-proxy.

How to reproduce it (as minimally and precisely as possible): Deploy a service with only a few replica pods as endpoints.

Start sending lots of traffic to that service from a single client, so that the client reuses its source ports sooner than the maximum segment lifetime (MSL) of its TCP connections. For example with fortio:

fortio load -t 0 -qps 1000 -c 16 -keepalive=0 http://10.86.6.96

Delete a pod on the node.

Fortio will now occasionally produce the following errors:

08:50:10 E http_client.go:558> Unable to connect to 10.86.6.96:80 : dial tcp 10.86.6.96:80: connect: no route to host
08:50:11 E http_client.go:558> Unable to connect to 10.86.6.96:80 : dial tcp 10.86.6.96:80: connect: no route to host
08:50:12 E http_client.go:558> Unable to connect to 10.86.6.96:80 : dial tcp 10.86.6.96:80: connect: no route to host
08:50:13 E http_client.go:558> Unable to connect to 10.86.6.96:80 : dial tcp 10.86.6.96:80: connect: no route to host

ipvs will look like this:

ipvsadm -l -t 10.86.6.96:80 -n
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.86.6.96:80 rr
  -> 100.66.61.116:8000           Masq    1	 0          5134
  -> 100.69.146.151:8000          Masq    0	 76         3956
  -> 100.69.146.185:8000          Masq    1	 0          967

kube-proxy will never remove the backends due to existing connections:

I0822 09:17:35.283821       1 graceful_termination.go:172] Not deleting, RS 10.86.6.96:80/TCP/100.69.146.151:8000: 33 ActiveConn, 38296 InactiveConn
I0822 09:18:35.283996       1 graceful_termination.go:161] Trying to delete rs: 10.86.6.96:80/TCP/100.69.146.151:8000
I0822 09:18:35.284111       1 graceful_termination.go:172] Not deleting, RS 10.86.6.96:80/TCP/100.69.146.151:8000: 76 ActiveConn, 11271 InactiveConn
I0822 09:19:35.284242       1 graceful_termination.go:161] Trying to delete rs: 10.86.6.96:80/TCP/100.69.146.151:8000
I0822 09:19:35.284324       1 graceful_termination.go:172] Not deleting, RS 10.86.6.96:80/TCP/100.69.146.151:8000: 58 ActiveConn, 0 InactiveConn

Anything else we need to know?:

Setting conn_reuse_mode to 1 fixes the problem, though that has a performance impact (see #70747 and https://marc.info/?l=linux-virtual-server&m=151706660530133&w=2). Maybe #81308 helps too. Using keepalive on the client also avoids the problem (see the sketch below), though one may not be able to control all clients.
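For illustration, a minimal sketch of the client-side keepalive workaround, assuming a Go client (the target URL is the service address from the reproduction above): pooled, reused connections keep the client from cycling through source ports within the MSL:

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			MaxIdleConns:        100,
			MaxIdleConnsPerHost: 16,
			IdleConnTimeout:     90 * time.Second,
			DisableKeepAlives:   false, // the default; spelled out to make the intent explicit
		},
		Timeout: 5 * time.Second,
	}
	resp, err := client.Get("http://10.86.6.96/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
	fmt.Println("status:", resp.Status)
}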

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.5", GitCommit:"0e9fcb426b100a2aea5ed5c25b3d8cfbb01a8acf", GitTreeState:"clean", BuildDate:"2019-08-05T09:13:08Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

  • Hardware:

  • OS: Container Linux by CoreOS 2135.6.0 (Rhyolite)

  • Kernel: 4.19.56-coreos-r1

  • Install tools:

  • calico 3.8.2

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 14
  • Comments: 59 (44 by maintainers)

Most upvoted comments

Hello everyone: We are very fortunate to tell you that this bug has been fixed by us and verified to work very well. The patch (ipvs: avoid drop first packet by reusing conntrack) is being submitted to the Linux kernel community. You can also apply this patch to your own kernel; then you only need to set net.ipv4.vs.conn_reuse_mode=1 (default) and net.ipv4.vs.conn_reuse_old_conntrack=1 (default). Since the net.ipv4.vs.conn_reuse_old_conntrack sysctl is newly added, kube-proxy can be adapted to check whether net.ipv4.vs.conn_reuse_old_conntrack exists; if it does, the running kernel is a version that contains this fix (a minimal sketch of such a check follows this comment). This can solve the following problems:

  1. Rolling update, IPVS keeps scheduling traffic to the destroyed Pod
  2. Unbalanced IPVS traffic scheduling after scaled up or rolling update
  3. fix IPVS low throughput issue #71114 https://github.com/kubernetes/kubernetes/pull/71114
  4. One second connection delay in masque https://marc.info/?t=151683118100004&r=1&w=2
  5. IPVS low throughput #70747 https://github.com/kubernetes/kubernetes/issues/70747
  6. Apache Bench can fill up ipvs service proxy in seconds #544 https://github.com/cloudnativelabs/kube-router/issues/544
  7. Additional 1s latency in host -> service IP -> pod when upgrading from 1.15.3 -> 1.18.1 on RHEL 8.1 #90854 https://github.com/kubernetes/kubernetes/issues/90854
  8. kube-proxy ipvs conn_reuse_mode setting causes errors with high load from single client #81775 https://github.com/kubernetes/kubernetes/issues/81775

Thank you. By Yang Yuxi (TencentCloudContainerTeam)
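A minimal sketch (assuming the patched kernel exposes the new sysctl at the usual procfs path) of the detection the comment above describes: if net.ipv4.vs.conn_reuse_old_conntrack exists, the running kernel carries the fix:

package main

import (
	"fmt"
	"os"
)

const connReuseOldConntrackPath = "/proc/sys/net/ipv4/vs/conn_reuse_old_conntrack"

// kernelHasConnReuseFix reports whether the sysctl introduced by the proposed
// patch is present on this node.
func kernelHasConnReuseFix() bool {
	_, err := os.Stat(connReuseOldConntrackPath)
	return err == nil
}

func main() {
	if kernelHasConnReuseFix() {
		fmt.Println("kernel appears to carry the fix; conn_reuse_mode=1 can stay at its default")
	} else {
		fmt.Println("kernel lacks the new sysctl; the conn_reuse_mode=0 behavior discussed here still applies")
	}
}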

We also hit this problem during a deployment rolling update; after a deep dive, I found the root cause.

What We Met

A client in the cluster sends a lot of requests to the service (ipvs mode). When the deployment behind the service is rolling updated, "No route to host" happens sometimes, even with a preStop hook (to leave time for kube-proxy to sync the ipvs rules).

Deep Dive

"No route to host" means the IP no longer exists. Could the request have been sent to a pod that has already been destroyed?
I did some experiments on this and found that when the new pod becomes ready, its pod ip:port is added to the ipvs rules with weight 1, while the old (terminating) pod's ip:port gets weight 0:

root@VM-0-3-ubuntu:~# ipvsadm -Ln -t 172.16.255.241:80
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.16.255.241:80 rr
  -> 172.16.8.106:80              Masq    0      10         14048
  -> 172.16.8.107:80              Masq    1      0          243

Most of the InActConn entries are in TIME_WAIT state, which expires after 2 minutes: the client requests the service over short-lived connections and closes each connection after the request completes, so the connection goes into TIME_WAIT and is kept for 2 minutes (2*MSL) before being deleted. The old rs will not be completely kicked off until all its connections are cleaned up (InActConn + ActiveConn = 0).

All of the above is expected, but then I found something that is not: some new connections were still being forwarded to the old rs 172.16.8.106:80, whose corresponding pod had already been destroyed and whose weight was 0:

root@VM-0-3-ubuntu:~# tcpdump -i eth0 host 172.16.8.106 -n -tttt
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
2019-12-13 11:49:47.319093 IP 10.0.0.3.36708 > 172.16.8.106.80: Flags [S], seq 3988339656, win 29200, options [mss 1460,sackOK,TS val 3751111666 ecr 0,nop,wscale 9], length 0
2019-12-13 11:49:47.319133 IP 10.0.0.3.36706 > 172.16.8.106.80: Flags [S], seq 109196945, win 29200, options [mss 1460,sackOK,TS val 3751111666 ecr 0,nop,wscale 9], length 0
2019-12-13 11:49:47.319144 IP 10.0.0.3.36704 > 172.16.8.106.80: Flags [S], seq 1838682063, win 29200, options [mss 1460,sackOK,TS val 3751111666 ecr 0,nop,wscale 9], length 0
2019-12-13 11:49:47.319153 IP 10.0.0.3.36702 > 172.16.8.106.80: Flags [S], seq 1591982963, win 29200, options [mss 1460,sackOK,TS val 3751111666 ecr 0,nop,wscale 9], length 0
2019-12-13 11:50:10.158452 IP 10.0.0.3.37878 > 172.16.8.106.80: Flags [S], seq 4124445126, win 29200, options [mss 1460,sackOK,TS val 3751134506 ecr 0,nop,wscale 9], length 0

I captured SYN packets that will never receive an ACK, because the old pod was already gone and its pod IP no longer exists; I could also capture ICMP packets reporting that the IP was unreachable ("No route to host").

So the question is: why does the kernel try to establish a connection with an rs whose weight is 0? Is there a bug in the scheduling algorithm of the ipvs kernel module? After reading the scheduling logic in the Linux ipvs source code, I did not find any path that schedules to an rs with weight 0.

Maybe it is not a scheduling problem? I read more code, mainly the ip_vs_in function in net/netfilter/ipvs/ip_vs_core.c, and found that it checks whether the packet belongs to an existing connection entry; if it does, the packet is simply forwarded to the rs of that existing connection, otherwise ipvs tries to schedule it.

Maybe the new connection matched an existing connection for some reason? I tried modifying the source code: if the weight of the rs corresponding to the matched connection is 0, reschedule instead. I compiled and reloaded the kernel module and retested, and the problem was solved! "No route to host" never happened again. But with this change graceful termination is no longer supported, because no packets are forwarded to an rs once its weight is 0. So this is not a good solution.

But why did the new connection match the old connection? Does the client use the same source port for both the new and the old connection? After some thought, I think it is possible: the client sends a lot of requests, using a large number of source ports; when source ports run short, the client reuses a port that belongs to a connection in TIME_WAIT state. The five-tuple of the new SYN packet then matches the old connection, ipvs treats it as that old connection, and forwards it to the old rs which has already been destroyed (see the sketch below).
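To illustrate the matching described above, a toy sketch (not the kernel code) of a connection table keyed by the five-tuple: a new SYN that reuses the same source port finds the stale entry and is forwarded to the already-destroyed real server instead of being scheduled:

package main

import "fmt"

type fiveTuple struct {
	proto            string
	srcIP, dstIP     string
	srcPort, dstPort int
}

type connEntry struct {
	realServer string // backend the entry is pinned to
	state      string // e.g. TIME_WAIT
}

func main() {
	connTable := map[fiveTuple]connEntry{}

	// Earlier request: scheduled to the pod that later gets terminated.
	key := fiveTuple{"TCP", "10.0.0.3", "172.16.255.241", 36708, 80}
	connTable[key] = connEntry{realServer: "172.16.8.106:80", state: "TIME_WAIT"}

	// New SYN reusing the same source port: the lookup hits the stale entry,
	// so no scheduling happens and traffic goes to the dead backend.
	if entry, ok := connTable[key]; ok {
		fmt.Println("matched existing entry; forwarding to", entry.realServer)
	} else {
		fmt.Println("no entry; a healthy real server would be scheduled")
	}
}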

How to Avoid

This problem often happens in the following situation: serviceA receives requests from outside the cluster and calls serviceB as a client over rpc or http. When the request volume grows, serviceA's calls to serviceB also grow, and "No route to host" happens when serviceA calls serviceB. We can increase the number of replicas for serviceA (the client) and add podAntiAffinity to spread its pods across different nodes, avoiding the source port exhaustion that leads to port reuse (a sketch follows).
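As a sketch of the podAntiAffinity suggestion, assuming the client deployment is labeled app=service-a (a hypothetical label), a preferred anti-affinity term spreading replicas across nodes could be built like this:

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildAntiAffinity spreads replicas of the hypothetical "service-a" client
// across nodes, reducing per-node source port pressure.
func buildAntiAffinity() *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			PreferredDuringSchedulingIgnoredDuringExecution: []corev1.WeightedPodAffinityTerm{
				{
					Weight: 100,
					PodAffinityTerm: corev1.PodAffinityTerm{
						LabelSelector: &metav1.LabelSelector{
							MatchLabels: map[string]string{"app": "service-a"},
						},
						TopologyKey: "kubernetes.io/hostname",
					},
				},
			},
		},
	}
}

func main() {
	fmt.Printf("%+v\n", buildAntiAffinity())
}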

How to Solve it Completely

"support configuration of kube-proxy IPVS tcp,tcpfin,udp timeout" will not solve it completely, and it also has no effect on connections that are in TIME_WAIT state.

Should IPVS graceful termination ignore "inactive" connections? I think it has both advantages and disadvantages, but the advantages outweigh the disadvantages. The advantage is that we can remove an unavailable rs faster and avoid the "No route to host" errors; the disadvantage is that we are no longer quite as "graceful", but I think that is acceptable. A rough sketch of the relaxed check follows.
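A minimal standalone sketch of that relaxed check (using a hypothetical stand-in type, not the actual kube-proxy code): deletion is blocked only while active connections remain, and inactive connections (mostly TIME_WAIT) are ignored:

package main

import "fmt"

// realServerStats is a hypothetical stand-in for the fields the
// graceful-termination manager inspects.
type realServerStats struct {
	Protocol     string
	ActiveConn   int
	InactiveConn int
}

// canDelete ignores InactiveConn and only waits for active connections to drain.
func canDelete(rs realServerStats) bool {
	if rs.Protocol != "UDP" && rs.ActiveConn != 0 {
		return false
	}
	return true
}

func main() {
	// Example: a TCP real server with only inactive (TIME_WAIT) connections
	// left could now be deleted right away.
	fmt.Println(canDelete(realServerStats{Protocol: "TCP", ActiveConn: 0, InactiveConn: 3956}))
}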

Following up on @yyx's comment above, for posterity.

The above patch mentioned in https://github.com/kubernetes/kubernetes/issues/81775#issuecomment-642704084 didn’t make it to the kernel but there are two recently merged patches worth highlighting. One of them fixes the 1 second delay issue when a conntrack entry is reused and the other fixes an issue where packets are dropped when stale connection entries in the IPVS table are used:

  1. http://patchwork.ozlabs.org/project/netfilter-devel/patch/20200701151719.4751-1-ja@ssi.bg/
  2. http://patchwork.ozlabs.org/project/netfilter-devel/patch/20200708161638.13584-1-kim.andrewsy@gmail.com/

The 2nd patch in particular should help in cases where there is high load from a single client as described in the original issue description.

Once KEP-1672 in https://github.com/kubernetes/enhancements/pull/1607 is approved, we should be able to start tracking the terminating state of endpoints, which means we can delete a real server as soon as the pod is terminated. This should mean the number of "inactive" conntrack entries (mostly connections in TIME_WAIT) will be significantly lower. I'm hoping this will alleviate most of the issues we're seeing with source port conflicts without losing the graceful termination functionality. I'll start working on a PR once the KEP is approved.

hi @njuicsgz
I have a doubt about your test in which you resolved this problem just by deleting the rs immediately when the pod enters the Terminating state. Can you share your ipvs sysctl settings?

According to my understanding, kube-proxy needs to set net.ipv4.vs.conntrack=1 to use masquerade and sets net.ipv4.vs.conn_reuse_mode=0. BTW, if net.ipv4.vs.conn_reuse_mode=1 is set, this failover issue does not occur, but the 1s delayed response does.

I have done the following test:

  1. Set up an ipvs node, and set net.ipv4.vs.conntrack=1 and net.ipv4.vs.conn_reuse_mode=0.

  2. Configure two real servers r1 and r2 in ipvs with NAT mode, and verify they both work.

  3. From a client node, run "curl http://vip:vip_port --local-port 999". The request gets a correct response, I can see the TCP session stay in TIME_WAIT on the ipvs node, and ipvsadm -ln shows the request was handled by r1.

  4. Within the TIME_WAIT period, set rs1 to weight 0 on the ipvs node (I also tried removing rs1, and tried both keeping and destroying the server behind r1).

  5. Within the TIME_WAIT period, run "curl http://vip:vip_port --local-port 999" again from the same client node. The request gets a response, but on the ipvs node the ActiveConn and InActConn of rs2 did not increase, and the timeout counter of the original TCP session between the client and rs1 was bumped to a large value. This proves that the new request from the same client was still handled by the removed real server r1.

So, in a word, my test did not verify your solution.

According to my understanding, when net.ipv4.vs.conntrack=1 and net.ipv4.vs.conn_reuse_mode=0 (which effectively disables expire_nodest_conn), if the client reuses the port and hits the TIME_WAIT socket, ipvs will not schedule a new real server, even when the real server was removed or set to weight 0.

Your solution just removes the pod from ipvs earlier on deletion, but the removed pod can continue serving the ipvs client during the graceful termination period. If the client keeps reusing the port after the graceful termination period, it will get no response. So the timing of removing the pod from ipvs does not seem to address the root cause.

If my words are wrong, please tell me, thanks.

I met this problem in my production environment: my envoy sends short-lived HTTP requests to a k8s service address at 2000 qps.

I have tried net.ipv4.vs.conn_reuse_mode=1, which resolves the problem but adds a 1s delay to all requests since envoy reuses ports, as mentioned by @juliantaylor.

#81962 (hard remove rs) only works after 5min (or 15min), but within this period requests are still routed to the rs whose ipvs weight is 0.

So, I think we could just clear the inactive conntrack entries bound to the rs address? That lets a new request from envoy (reusing a port) reach a normal rs, and it has no 1s-delay performance issue. #81962 is also needed, considering that a new request may arrive before we clear the inactive conntrack entries, which leaves ActiveConn non-zero (it stays in SYN_RCVD state).

I have resolved this problem just by deleting the rs immediately when the pod enters the Terminating state.

The code diff looks like this:

@@ -168,10 +168,12 @@ func (m *GracefulTerminationManager) deleteRsFunc(rsToDelete *listItem) (bool, e
                        // For UDP traffic, no graceful termination, we immediately delete the RS
                        //     (existing connections will be deleted on the next packet because sysctlExpireNoDestConn=1)
                        // For other protocols, don't delete until all connections have expired)
-                       if strings.ToUpper(rsToDelete.VirtualServer.Protocol) != "UDP" && rs.ActiveConn+rs.InactiveConn != 0 {
+                       /*
+                       if strings.ToUpper(rsToDelete.VirtualServer.Protocol) != "UDP" && rs.ActiveConn+rs.InactiveConn != 0 {
                                klog.Infof("Not deleting, RS %v: %v ActiveConn, %v InactiveConn", rsToDelete.String(), rs.ActiveConn, rs.InactiveConn)
                                return false, nil
                        }
+                       */
                        klog.V(2).Infof("Deleting rs: %s", rsToDelete.String())
                        err := m.ipvs.DeleteRealServer(rsToDelete.VirtualServer, rs)
                        if err != nil {

I tested a long request which needs 60s to respond; the response comes back correctly even if the endpoint is deleted from the ipvs service. It seems everything works well, though I want to know what the disadvantage is. If there is any, please let me know, thanks. @juliantaylor @andrewsykim @lbernail

My test

t:
# ipvsadm -ln | grep 192.168.0.113:40080 -A2
TCP  192.168.0.113:40080 rr
  -> 172.16.94.5:80               Masq        1      0          0
# curl http://100.94.28.59:40080/sleep?second=60  -- will respond after 60s

t+10s:
# ipvsadm -ln | grep 192.168.0.113:40080 -A2
TCP  192.168.0.113:40080 rr
  -> 172.16.94.5:80               Masq        1      1          0
# kubectl apply -f rc.yaml  -- create new pods, and delete the old ones

t+30s:
# ipvsadm -ln | grep 192.168.0.113:40080 -A2
TCP  192.168.0.113:40080 rr
  -> 172.16.28.7:80               Masq        1      0          0

t+60s:
# curl http://100.94.28.59:40080/sleep?second=60
{"kind": "Status", "code": 200, "message": ""}

// of course the pod's graceful deletion period should be more than 60s, which makes sure it is still alive

Please test with PR https://github.com/kubernetes/kubernetes/pull/102122 applied and kernel >= 5.9 if possible.

I found the same problem when rolling updating coredns. A connection which has no destination will be expired when the "first" packet is received, and that packet will be dropped.

	/* Check the server status */
	if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
		/* the destination server is not available */

		__u32 flags = cp->flags;

		/* when timer already started, silently drop the packet.*/
		if (timer_pending(&cp->timer))
			__ip_vs_conn_put(cp);
		else
			ip_vs_conn_put(cp);

		if (sysctl_expire_nodest_conn(ipvs) &&
		    !(flags & IP_VS_CONN_F_ONE_PACKET)) {
			/* try to expire the connection immediately */
			ip_vs_conn_expire_now(cp);
		}

		return NF_DROP;
	}

I think the "first" packet should be re-sent to another real server instead of just being dropped.

https://github.com/kubernetes/kubernetes/pull/102122 is now merged and should be included in v1.22.

This should be fixed with kernel >= 5.9. References to the relevant kernel patches are in https://github.com/kubernetes/kubernetes/issues/93297.

Please reopen this issue if the problem persists with https://github.com/kubernetes/kubernetes/pull/102122 and kernel >= 5.9.

/close

It sounds like the desired behavior for IPVS graceful termination is largely dependent on how applications handle graceful termination. Wondering if we should just make the graceful period configurable and call it a day. My question then would be whether we need to make it configurable at the kube-proxy level or per Service. Hard-coding it at 15m is clearly problematic (see the sketch below).
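A minimal sketch of what such a knob could look like, using a hypothetical --ipvs-graceful-termination-period flag (this flag does not exist in kube-proxy; it only illustrates a kube-proxy-level setting):

package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// Hypothetical flag for illustration; the comment above mentions a hard-coded 15m.
	gracePeriod := flag.Duration("ipvs-graceful-termination-period", 15*time.Minute,
		"how long to wait for IPVS real server connections to drain before deleting the real server")
	flag.Parse()
	fmt.Println("graceful termination period:", *gracePeriod)
}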

@weizhouBlue Yes, net.ipv4.vs.expire_nodest_conn=1 is exactly the key factor.

So I agree that we should just remove the RS from the ipvs service.

@lbernail @andrewsykim what do you think about this? If there are no more comments, I'd like to make a new PR.

hi @njuicsgz
I have found the key. Although the IPVS doc writes that net.ipv4.vs.conn_reuse_mode=0 will effectively disable expire_nodest_conn, it behaves differently in the two cases below:

  1. When setting the RS to weight=0, the problem exists regardless of the value of net.ipv4.vs.expire_nodest_conn.
  2. When deleting the RS from ipvs, setting net.ipv4.vs.expire_nodest_conn=1 solves the issue, but setting net.ipv4.vs.expire_nodest_conn=0 does not help. BTW, after deleting the RS, the port-reused request has the 1s delay issue because a new RS has to be scheduled.

So the best configuration is net.ipv4.vs.conntrack=1, net.ipv4.vs.conn_reuse_mode=0, net.ipv4.vs.expire_nodest_conn=1, and where possible to actually remove the RS from ipvs rather than set its weight to 0. A rough sketch of applying these sysctls follows.
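A minimal sketch (assuming direct root access on the IPVS node and the standard procfs paths) of applying that sysctl combination; whether it is right for a given kernel still depends on the caveats discussed above:

package main

import (
	"fmt"
	"os"
)

// setSysctl writes a value to a procfs sysctl path.
func setSysctl(path, value string) error {
	return os.WriteFile(path, []byte(value), 0644)
}

func main() {
	settings := map[string]string{
		"/proc/sys/net/ipv4/vs/conntrack":          "1",
		"/proc/sys/net/ipv4/vs/conn_reuse_mode":    "0",
		"/proc/sys/net/ipv4/vs/expire_nodest_conn": "1",
	}
	for path, value := range settings {
		if err := setSysctl(path, value); err != nil {
			fmt.Fprintf(os.Stderr, "failed to set %s: %v\n", path, err)
		}
	}
}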

@andrewsykim yes, we definitely need a hard timeout. 5min sounds good, but some users may have pods remaining in Terminating status for a long time?

@juliantaylor I assume TIME_WAIT in the fortio pod is not used or has been decreased a lot? In that case it can definitely interact badly with the IPVS fin_wait state. I'm not sure conn_reuse_mode=0 is the best default: it increases performance at the cost of possibly creating weird TCP situations such as this one.