kubernetes: kube-proxy ipvs conn_reuse_mode setting causes errors with high load from single client
What happened:
The kube-proxy sets /proc/sys/net/ipv4/vs/conn_reuse_mode to zero (see https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/proxier.go#L340).
This has the following effect when:
- kube-proxy uses ipvs mode and an ipvs virtualserver is configured with only a few realservers (pod replicas)
- a client sends so many requests that its source ports are reused before the MSL (maximum segment lifetime) expires
- a pod gets removed (e.g. via an upgrade) while it receives traffic from this client
In this situation kube-proxy correctly sets the weight of the old realserver to 0 and creates a new realserver for the pod that replaces the removed one.
The problem is that kube-proxy will not remove the weight-zero realserver until its connection count drops to zero. This never happens, because conn_reuse_mode is set to 0 and the client keeps reusing its source ports, so the kernel constantly reuses the existing connection entries and keeps sending traffic to the weight-0 realserver.
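A quick way to observe this state on an affected node (a sketch: the sysctl paths are the standard IPVS ones, <old-pod-ip> is a placeholder for the IP of the removed pod):
cat /proc/sys/net/ipv4/vs/conn_reuse_mode      # kube-proxy in ipvs mode sets this to 0
cat /proc/sys/net/ipv4/vs/expire_nodest_conn
ipvsadm -Lcn | grep <old-pod-ip>               # connection entries still pinned to the weight-0 realserver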
What you expected to happen: When the old pod is removed, its realserver receives no more traffic and is removed by kube-proxy.
How to reproduce it (as minimally and precisely as possible): Deploy a service with only a few replica pods as endpoints.
Start sending lots of traffic to that service from a single client, so that the client reuses its source ports sooner than the maximum segment lifetime of TCP connections. For example with fortio:
fortio load -t 0 -qps 1000 -c 16 -keepalive=0 http://10.86.6.96
Delete a pod on the node.
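For example (a sketch; the pod and deployment names are placeholders, any endpoint of the service will do):
kubectl delete pod <one-of-the-service-endpoint-pods>
# or trigger a rolling update of the backing deployment:
kubectl rollout restart deployment/<backing-deployment>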
Fortio will now occasionally produce the following errors:
08:50:10 E http_client.go:558> Unable to connect to 10.86.6.96:80 : dial tcp 10.86.6.96:80: connect: no route to host
08:50:11 E http_client.go:558> Unable to connect to 10.86.6.96:80 : dial tcp 10.86.6.96:80: connect: no route to host
08:50:12 E http_client.go:558> Unable to connect to 10.86.6.96:80 : dial tcp 10.86.6.96:80: connect: no route to host
08:50:13 E http_client.go:558> Unable to connect to 10.86.6.96:80 : dial tcp 10.86.6.96:80: connect: no route to host
ipvs will look like this:
ipvsadm -l -t 10.86.6.96:80 -n
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.86.6.96:80 rr
-> 100.66.61.116:8000 Masq 1 0 5134
-> 100.69.146.151:8000 Masq 0 76 3956
-> 100.69.146.185:8000 Masq 1 0 967
kube-proxy will never remove the backends due to existing connections:
I0822 09:17:35.283821 1 graceful_termination.go:172] Not deleting, RS 10.86.6.96:80/TCP/100.69.146.151:8000: 33 ActiveConn, 38296 InactiveConn
I0822 09:18:35.283996 1 graceful_termination.go:161] Trying to delete rs: 10.86.6.96:80/TCP/100.69.146.151:8000
I0822 09:18:35.284111 1 graceful_termination.go:172] Not deleting, RS 10.86.6.96:80/TCP/100.69.146.151:8000: 76 ActiveConn, 11271 InactiveConn
I0822 09:19:35.284242 1 graceful_termination.go:161] Trying to delete rs: 10.86.6.96:80/TCP/100.69.146.151:8000
I0822 09:19:35.284324 1 graceful_termination.go:172] Not deleting, RS 10.86.6.96:80/TCP/100.69.146.151:8000: 58 ActiveConn, 0 InactiveConn
Anything else we need to know?:
Setting conn_reuse_mode to 1 fixes the problem, though that has a performance impact (see #70747 and https://marc.info/?l=linux-virtual-server&m=151706660530133&w=2); maybe #81308 helps too. Using keepalive on the client also avoids this problem, though one may not be able to control all clients.
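Both workarounds can be tried directly (a sketch; the sysctl change has the performance cost discussed in the linked threads, and the fortio flag simply re-enables client-side keepalive):
sysctl -w net.ipv4.vs.conn_reuse_mode=1
# or rerun the load test with keepalive enabled on the client:
fortio load -t 0 -qps 1000 -c 16 -keepalive=true http://10.86.6.96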
Environment:
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:40:16Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
  Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.5", GitCommit:"0e9fcb426b100a2aea5ed5c25b3d8cfbb01a8acf", GitTreeState:"clean", BuildDate:"2019-08-05T09:13:08Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
- Hardware
- OS: Container Linux by CoreOS 2135.6.0 (Rhyolite)
- Kernel: 4.19.56-coreos-r1
- Install tools:
- calico 3.8.2
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 14
- Comments: 59 (44 by maintainers)
Commits related to this issue
- ipvs: avoid drop first packet by reusing conntrack Since 'commit f719e3754ee2 ("ipvs: drop first packet to redirect conntrack")', when a new TCP connection meet the conditions that need reschedule, t... — committed to 0day-ci/linux by yyx 4 years ago
- ipvs: avoid drop first packet by reusing conntrack Since 'commit f719e3754ee2 ("ipvs: drop first packet to redirect conntrack")', when a new TCP connection meet the conditions that need reschedule, t... — committed to Tencent/TencentOS-kernel by yyx 4 years ago
- TencentOS-kernel: ipvs: avoid drop first packet by reusing conntrack fix #29256237 commit a01a9445c00eca3e37523eb6b0d87f494eceeb4b TencentOS-kernel Since 'commit f719e3754ee2 ("ipvs: drop first pac... — committed to alibaba/cloud-kernel by yyx 4 years ago
Hello everyone: we are glad to report that this bug has been fixed on our side and verified to work well. The patch ("ipvs: avoid drop first packet by reusing conntrack") is being submitted to the Linux kernel community. You can also apply this patch to your own kernel; then you only need to set net.ipv4.vs.conn_reuse_mode=1 (default) and net.ipv4.vs.conn_reuse_old_conntrack=1 (default). Since the net.ipv4.vs.conn_reuse_old_conntrack sysctl switch is newly added, kube-proxy can adapt by checking whether net.ipv4.vs.conn_reuse_old_conntrack exists; if it does, the running kernel is a version that fixes this bug. This solves the following problems:
- host -> service IP -> pod
- upgrading from 1.15.3 -> 1.18.1 on RHEL 8.1: #90854 (https://github.com/kubernetes/kubernetes/issues/90854)
Thank you. By Yang Yuxi (TencentCloudContainerTeam)
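A minimal sketch of the detection this comment suggests (conn_reuse_old_conntrack is part of the proposed patch, not a mainline sysctl, so the path only exists on patched kernels):
if [ -f /proc/sys/net/ipv4/vs/conn_reuse_old_conntrack ]; then
    # patched kernel: keep both sysctls at their defaults
    sysctl -w net.ipv4.vs.conn_reuse_mode=1
    sysctl -w net.ipv4.vs.conn_reuse_old_conntrack=1
fi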
We also hit this problem during a deployment rolling update; after a deep dive, I found the root cause.
What We Met
A client in the cluster sends a lot of requests to the service (ipvs mode). When the deployment backing the service is rolling-updated, "No route to host" sometimes happens, even with a preStop hook (which leaves time for kube-proxy to sync the ipvs rules).
Deep Dive
"No route to host" means the IP no longer exists. Could the request have been sent to a pod that had already been destroyed?
I ran some experiments and found that when the new pod is ready, its pod ip:port is added to the ipvs rules with weight 1, while the old (terminating) pod ip:port has its weight set to 0.
Most of the InActConn entries are in TIME_WAIT state, which expires after 2 minutes: the client talks to the service with short-lived connections and closes each connection after the request completes, so the connection goes into TIME_WAIT and waits 2 minutes (2*MSL) before the entry is deleted. The old rs is not completely removed until all of its connections are cleaned up (InActConn + ActiveConn = 0). All of the above is expected, but then I found something that is not: some new connections were still being forwarded to the old rs 172.16.8.106:80, whose corresponding pod had already been destroyed and whose weight was 0. I captured SYN packets that never receive an ACK, because the old pod was gone and its pod IP no longer existed, and I could also capture the ICMP packets reporting that the IP was unreachable ("No route to host").
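A hedged sketch of that capture (the pod IP is the one quoted above; the interface name is a placeholder):
tcpdump -ni <node-iface> 'host 172.16.8.106 and (tcp[tcpflags] & tcp-syn != 0 or icmp)'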
So the question is: why does the kernel try to establish a connection with an rs whose weight is 0? Is there a bug in the scheduling algorithm of the ipvs kernel module? After reading the scheduling logic in the Linux ipvs source code, I did not find any path that would schedule to an rs with weight 0.
Maybe it is not a scheduling problem at all? So I read more code, mainly the ip_vs_in function in net/netfilter/ipvs/ip_vs_core.c, and found that it checks whether the packet belongs to an existing connection entry; if it does, the packet is simply forwarded to the rs of that existing connection, otherwise ipvs tries to schedule it. Maybe the new connection matched an existing connection for some reason? I tried modifying the source code: if the weight of the rs corresponding to the matched connection is 0, reschedule. After compiling and reloading the kernel module and retesting, the problem was solved: "No route to host" never happened again. But with this change graceful termination is no longer supported, because no more packets are forwarded to an rs once its weight is 0. So this is not a good solution.
But why did the new connection match the old connection? Does the client use the same source port for both? After some thought, I think it is possible: the client sends a lot of requests and uses a large number of source ports; when it runs out of free ports it reuses a source port that belongs to a connection in TIME_WAIT state. The five-tuple of the new SYN packet then matches the old connection, ipvs treats it as an existing connection, and forwards it to the old rs that has already been destroyed.
How to Avoid
This problem often happens in the following situation: serviceA receives requests from outside the cluster and, as a client, calls serviceB over rpc or http. When the request volume becomes large, the call volume from serviceA to serviceB grows as well, and "No route to host" happens on the serviceA -> serviceB calls. We can increase the number of replicas for serviceA (the client) and add podAntiAffinity to spread its pods across different nodes, avoiding the source port exhaustion that leads to port reuse.
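A quick way to check how close a client node is to source-port exhaustion for a given destination (a sketch; port 80 matches the examples above):
sysctl net.ipv4.ip_local_port_range                   # ephemeral port range available to the client
ss -tan state time-wait '( dport = :80 )' | wc -l     # sockets in TIME_WAIT towards port 80 (count includes one header line)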
How to Solve it Completely
Supporting configuration of the kube-proxy IPVS tcp, tcpfin and udp timeouts will not solve it completely, and it also has no effect on connections that are in TIME_WAIT state.
Should IPVS graceful termination ignore "inactive" connections? It has both advantages and disadvantages, but I think the advantages outweigh the disadvantages: we can remove an unavailable rs faster and avoid the "No route to host" errors, at the cost of being less "graceful", which I think is acceptable.
Following up on @yyx's comment above for posterity.
The above patch mentioned in https://github.com/kubernetes/kubernetes/issues/81775#issuecomment-642704084 didn’t make it to the kernel but there are two recently merged patches worth highlighting. One of them fixes the 1 second delay issue when a conntrack entry is reused and the other fixes an issue where packets are dropped when stale connection entries in the IPVS table are used:
The 2nd patch in particular should help in cases where there is high load from a single client as described in the original issue description.
Once KEP-1672 in https://github.com/kubernetes/enhancements/pull/1607 is approved, we should be able to start tracking terminating state of endpoints which means we can delete a real server as soon as the pod is terminated. This should mean the # of “inactive” conntrack entries (mostly connections in TIME_WAIT) will be significantly less. I’m hoping this should alleviate most of the issues we’re seeing with source port conflicts without losing the graceful termination functionality. I’ll start working on a PR once the KEP is approved.
hi @njuicsgz
I have doubts about your test showing that this problem is resolved just by deleting the rs immediately when the pod enters the Terminating state. Can you share your ipvs sysctl settings?
According to my understanding, kube-proxy needs to set net.ipv4.vs.conntrack=1 to use masquerade and sets net.ipv4.vs.conn_reuse_mode=0. BTW, if net.ipv4.vs.conn_reuse_mode=1 is set, this failover issue does not occur, but the 1s delayed response does.
I have done the following test (a consolidated command sketch follows this comment):
1. Set up an ipvs node and set net.ipv4.vs.conntrack=1 and net.ipv4.vs.conn_reuse_mode=0.
2. Configure two real servers r1 and r2 in ipvs with NAT mode and verify they both work.
3. From a client node, run "curl http://vip:vip_port --local-port 999". This request gets a response correctly, I can see the TCP session stay in TIME_WAIT on the ipvs node, and with ipvsadm -ln I can see the request was handled by r1.
4. Within the TIME_WAIT period, I set r1 to weight 0 on the ipvs node (I also tried removing r1, and tried both keeping and destroying the real server behind r1).
5. Still within the TIME_WAIT period, I use the same client node to run "curl http://vip:vip_port --local-port 999" again. This request gets a response, but on the ipvs node the ActiveConn and InActConn of r2 did not increase, and the expire timer of the original TCP session between the client and r1 was reset to a large value. This proves that the new request from the same client was still handled by the removed real server r1.
So, in a word, my test did not confirm your solution.
According to my understanding, when net.ipv4.vs.conntrack=1 and net.ipv4.vs.conn_reuse_mode=0 (which effectively disables expire_nodest_conn), if the client reuses the port and hits the TIME_WAIT entry, ipvs will not schedule a new real server, even when the real server was removed or set to weight 0.
Your solution seems to just remove the pod from ipvs earlier when deleting, but the removed pod can continue to serve the ipvs client during the graceful termination period. If the client keeps reusing the port after the graceful termination period, it will get no response. The timing of removing the pod from ipvs does not seem to address the root cause.
If I am wrong, please tell me, thanks.
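A consolidated sketch of the manual test described above (the VIP and real-server IPs are placeholders; -m selects NAT/masquerade forwarding):
sysctl -w net.ipv4.vs.conntrack=1
sysctl -w net.ipv4.vs.conn_reuse_mode=0
ipvsadm -A -t 10.0.0.100:80 -s rr                  # virtual server
ipvsadm -a -t 10.0.0.100:80 -r 10.0.0.11:80 -m     # r1
ipvsadm -a -t 10.0.0.100:80 -r 10.0.0.12:80 -m     # r2
# from the client node, pin the source port so it gets reused:
curl http://10.0.0.100:80 --local-port 999
# within the TIME_WAIT window, drop r1 to weight 0 (or delete it), then retry:
ipvsadm -e -t 10.0.0.100:80 -r 10.0.0.11:80 -m -w 0
curl http://10.0.0.100:80 --local-port 999
ipvsadm -Lcn                                       # the entry towards r1 is still matched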
I met this problem in my production environment: my envoy sends short-lived HTTP requests to a k8s service address at about 2000 qps.
I have tried net.ipv4.vs.conn_reuse_mode=1; it resolves the problem but causes a 1s delay for requests since envoy reuses ports, as mentioned by @juliantaylor.
#81962 (hard removal of the rs) only takes effect after 5 min (or 15 min), and within this period requests are still routed to the rs whose ipvs weight is 0.
So I wonder whether we could just clear the inactive connection entries bound to the rs address? That would make new requests from envoy (reusing ports) reach a healthy rs, without the 1s delay performance issue. #81962 is still needed as well, because a new request may arrive before we clear the inactive entries, which leaves ActiveConn non-zero (the connection stays in SYN_RCVD state).
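For reference, the inactive entries in question can be listed from the IPVS connection table (a sketch; <old-rs-ip> is a placeholder, and the state/destination column positions match the default ipvsadm -Lcn output):
ipvsadm -Lcn | awk '$6 == "<old-rs-ip>:8000" && $3 == "TIME_WAIT"'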
I have resolved this problem just by deleting the rs immediately when the pod enters the Terminating state.
Code diff like this:
I tested a long request that needs 60s to respond; the response still comes back correctly after the endpoint is deleted from the ipvs service. Everything seems to work well, though I want to know what the disadvantages are. If there are any, please let me know, thanks. @juliantaylor @andrewsykim @lbernail
My test
Please test with PR https://github.com/kubernetes/kubernetes/pull/102122 applied and kernel >= 5.9 if possible.
I found the same problem when rolling-updating coredns. The connection that has no destination is expired when the "first" packet is received, and that packet is dropped.
I think the "first" packet should be resent to another real server rather than just dropped.
https://github.com/kubernetes/kubernetes/pull/102122 is now merged and should be included in v1.22.
This should be fixed with kernel >= 5.9. References to the relevant kernel patches are in https://github.com/kubernetes/kubernetes/issues/93297.
Please reopen this issue if the problem persists with https://github.com/kubernetes/kubernetes/pull/102122 and kernel >= 5.9.
/close
It sounds like the desired behavior for IPVS graceful termination is largely dependent on how applications handle graceful termination. Wondering if we should just make the graceful period configurable and call it a day. My question then would be whether we need to make it configurable at the kube-proxy level or per Service. Hard-coding it at 15m is clearly problematic.
@weizhouBlue Yes, net.ipv4.vs.expire_nodest_conn=1 is exactly the key factor.
So I agree that we should just remove the RS from the ipvs service.
@lbernail @andrewsykim what do you think about this? If there are no more comments I'd like to make a new PR.
hi @njuicsgz
I have found the key. Although the IPVS documentation says that net.ipv4.vs.conn_reuse_mode=0 will effectively disable expire_nodest_conn, it behaves differently in the two cases below:
1. When the RS is set to weight 0, the problem exists regardless of the value of net.ipv4.vs.expire_nodest_conn.
2. When the RS is deleted from ipvs, setting net.ipv4.vs.expire_nodest_conn=1 solves the issue, but net.ipv4.vs.expire_nodest_conn=0 does not help. BTW, when the RS is deleted, the port-reusing request hits the 1s delay issue because a new RS has to be scheduled.
So the best configuration is net.ipv4.vs.conntrack=1, net.ipv4.vs.conn_reuse_mode=0 and net.ipv4.vs.expire_nodest_conn=1, and to remove the RS from ipvs outright instead of setting its weight to 0.
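For reference, the sysctl combination this comment recommends, applied on the node running kube-proxy in ipvs mode:
sysctl -w net.ipv4.vs.conntrack=1
sysctl -w net.ipv4.vs.conn_reuse_mode=0
sysctl -w net.ipv4.vs.expire_nodest_conn=1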
@andrewsykim yes, we definitely need a hard timeout. 5 min sounds good, but some users may have pods remaining in Terminating status for a long time?
@juliantaylor I assume TIME_WAIT in the fortio pod is not used or is decreased a lot? In that case it can definitely interact badly with the IPVS fin_wait state. I'm not sure conn_reuse_mode=0 is the best default: it increases performance at the cost of possibly creating weird TCP situations such as this one.