kubernetes: UDP traffic to LoadBalancer IP fails after node reboot (ipvs mode)
What happened?
We have a scaled setup with 159 nodes, ~11k pods, and ~1500 services (Kubernetes in ipvs mode). On every node restart we see many client UDP requests to multiple LBIP:port combinations serving UDP traffic getting black-holed because of an invalid conntrack entry. We are running Kubernetes 1.23.3, so I assume all the latest fixes for conntrack issues are already included. The UDP clients that access the LB ingress are pods in the cluster itself. The backend pod serving UDP was not restarted (it was not on the rebooted node) and hence was always READY; the client pod accessing the LB IP was restarted because of the node restart. The client UDP pods keep retrying the connection with the same UDP source port.
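For context, this is roughly how the stale entries can be inspected on the rebooted node with conntrack-tools (the LB IP and port are the ones from the logs below; adjust for other services):

$ conntrack -L -p udp --orig-dst 64.64.64.59 --dport 8890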
I am attaching the kube-proxy log (verbosity 7) for the restarted node and the timestamped conntrack events showing this issue.
From the logs, it looks like the packet arrived well after the rules were programmed:
Invalid conntrack:
[1644409107.142928] [NEW] udp 17 30 src=50.117.79.3 dst=64.64.64.59 sport=42467 dport=8890 [UNREPLIED] src=64.64.64.59 dst=50.117.79.3 sport=8890 dport=42467
The timestamp above translates to Wednesday, February 9, 2022 12:18:27.142 PM GMT.
LB IP: 64.64.64.59, UDP client pod: 50.117.79.3
Note that the second tuple (reply direction) has not been NAT'ed to a backend pod IP, even though the proxy logs show the IPVS rules were already written by that time.
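For comparison, a healthy DNAT'ed entry for this flow would have the reply tuple rewritten to one of the backend pod IPs, roughly like the illustrative entry below (192.168.88.125 is just one of the endpoints from the proxy log; this line is not taken from the capture):

[NEW] udp 17 30 src=50.117.79.3 dst=64.64.64.59 sport=42467 dport=8890 [UNREPLIED] src=192.168.88.125 dst=50.117.79.3 sport=8890 dport=42467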
From the proxy log:
I0209 12:17:52.516740 1 endpointslicecache.go:358] "Setting endpoints for service port name" portName="bat-t2/cnf-cp-sfs-t2-46-net-udp:tapp-udp" endpoints=[192.168.11.237:8890 192.168.110.139:8890 192.168.135.77:8890 192.168.183.209:8890 192.168.200.199:8890 192.168.214.234:8890 192.168.215.32:8890 192.168.219.117:8890 192.168.219.222:8890 192.168.220.125:8890 192.168.247.165:8890 192.168.248.97:8890 192.168.250.78:8890 192.168.255.76:8890 192.168.37.61:8890 192.168.55.131:8890 192.168.57.24:8890 192.168.59.101:8890 192.168.7.62:8890 192.168.88.125:8890]
I0209 12:17:56.325097 1 proxier.go:1972] "Adding new service" serviceName="bat-t2/cnf-cp-sfs-t2-46-net-udp:tapp-udp" virtualServer="64.64.64.59:8890/UDP"
I0209 12:17:56.325126 1 proxier.go:1996] "Bind address" address="64.64.64.59"
I0209 12:18:01.339193 1 ipset.go:176] "Successfully added ip set entry to ip set" ipSetEntry="64.64.64.59,udp:8890" ipSet="KUBE-LOAD-BALANCER"
I0209 12:18:04.413562 1 conntrack.go:66] Clearing conntrack entries [-D --orig-dst 64.64.64.59 -p udp]
I0209 12:18:11.338453 1 proxier.go:1008] "syncProxyRules complete" elapsed="18.876838575s"
So the first packet arrived at 12:18:27.142, but the rules for this service and its endpoints were programmed well before, at 12:18:11.338. There were no other events for this service/endpoint until the first packet arrived.
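The 12:18:04 conntrack.go line shows kube-proxy flushing entries whose original destination is the LB IP; the equivalent manual command (same flags as in the log), which can be run on the node if an entry gets stuck again, is:

$ conntrack -D --orig-dst 64.64.64.59 -p udp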
In /var/log/messages we saw the message below, but we are not sure whether it has an impact, since some other services were working:
2022-02-09T13:17:52.573177+01:00 pool08-n108-wk08-n053 systemd-udevd[14685]: Could not generate persistent MAC address for kube-ipvs0: No such file or directory
Output of the ipvsadm command:
UDP 05:00 UDP 50.117.79.3:42467 64.64.64.59:8890 192.168.88.125:8890
UDP 64.64.64.59:8890 rr
-> 192.168.7.62:8890 Masq 1 0 1
-> 192.168.11.237:8890 Masq 1 0 0
-> 192.168.37.61:8890 Masq 1 0 1
-> 192.168.55.131:8890 Masq 1 0 1
-> 192.168.57.24:8890 Masq 1 0 1
-> 192.168.59.101:8890 Masq 1 0 1
-> 192.168.88.125:8890 Masq 1 0 1
-> 192.168.110.139:8890 Masq 1 0 0
-> 192.168.135.77:8890 Masq 1 0 0
-> 192.168.183.209:8890 Masq 1 0 0
-> 192.168.200.199:8890 Masq 1 0 0
-> 192.168.214.234:8890 Masq 1 0 0
-> 192.168.215.32:8890 Masq 1 0 0
-> 192.168.219.117:8890 Masq 1 0 0
-> 192.168.219.222:8890 Masq 1 0 0
-> 192.168.220.125:8890 Masq 1 0 0
-> 192.168.247.165:8890 Masq 1 0 0
-> 192.168.248.97:8890 Masq 1 0 0
-> 192.168.250.78:8890 Masq 1 0 0
-> 192.168.255.76:8890 Masq 1 0 0
The output above shows that the IPVS VIP had one hit matching the packet with the invalid conntrack entry (InActConn is 1 for 192.168.88.125, the backend shown in the connection entry above).
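For completeness, the two outputs above appear to come from the IPVS connection table and the virtual server table; they can be re-collected on the node with something like the following (the grep and the service-address filter are just to narrow the output to this LB IP):

$ ipvsadm -L -n -c | grep 64.64.64.59
$ ipvsadm -L -n -u 64.64.64.59:8890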
Attached logs (kube-proxy log for the rebooted node and timestamped conntrack event log for the rebooted node): reboot_proxy.log, conntrack_event_log_worker.log
What did you expect to happen?
After a node reboot, all UDP traffic should reach the backend pods successfully through the LB ingress.
How can we reproduce it (as minimally and precisely as possible)?
We are not sure of the actual steps to reproduce, but we are seeing this in the scaled environment after every node restart.
- Below is our setup:
~> kubectl get nodes | wc -l
159
~> kubectl get pods -A | wc -l
11197
~> kubectl get svc -A | wc -l
1492
- Out of these, around 800 are UDP services.
- The client pods access the UDP servers through the LB ingress IP.
- Restart one of the nodes; once it is back up and running, there will be many invalid conntrack entries blackholing the traffic (a capture sketch follows below).
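A minimal way to catch these entries as they are created, assuming conntrack-tools is available on the rebooted node (this is roughly how a capture like conntrack_event_log_worker.log can be reproduced; the exact flags used for the attached log may differ):

$ conntrack -E -e NEW -p udp --orig-dst 64.64.64.59 -o timestamp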
Anything else we need to know?
No response
Kubernetes version
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"60a539cdd7ac8ea7a62b7c3bd1d3c374529788cb", GitTreeState:"clean", BuildDate:"2022-01-26T06:28:20Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"60a539cdd7ac8ea7a62b7c3bd1d3c374529788cb", GitTreeState:"clean", BuildDate:"2022-01-26T06:18:30Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}
Cloud provider
OS version
# On Linux:
$ cat /etc/os-release
NAME="SLES"
VERSION="15-SP2"
VERSION_ID="15.2"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP2"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp2"
$ uname -a
Linux control-plane-n108-mast-n057 5.3.18-24.99-default #1 SMP Sun Jan 23 19:03:51 UTC 2022 (712a8e6) x86_64 x86_64 x86_64 GNU/Linux
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, …) and versions (if applicable)
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 21 (12 by maintainers)
I will be getting access to the setup to reproduce this and will add more details after debugging. I guess triaging can be paused until then, as there should be more clarity on the reproduction steps by that point.