cilium: cilium bpf conntrack issue, packets routed to wrong service endpoint
Is there an existing issue for this?
- I have searched the existing issues
What happened?
TL;DR: on DigitalOcean Kubernetes (on which I am a product engineer), I have observed a condition where packets intended for a Kubernetes Service endpoint (10.245.0.1:443) actually get routed to a different service endpoint (in my experience, to one of two CoreDNS pods at port 53).
The issue is tricky to reproduce, and impossible without access to DigitalOcean Kubernetes backend administrative tools. Essentially, the way to reproduce it is:
- Create a DigitalOcean Kubernetes cluster.
- Run pods on worker nodes (the more worker nodes, the more likely the condition is to reproduce) that continuously try to establish TLS connections with the apiserver. It's not necessary to actually make any HTTP requests once the TLS connection is established: at least in the scenarios I have observed, the problem is not that no connection can be made at all, but that a TCP connection is established and in some cases the TLS negotiation then fails as if the remote server doesn't actually speak TLS. (A minimal sketch of such a tester is shown after this list.)
- "Recycle" (i.e. shut down, destroy, and replace) the control plane VM where the apiserver runs.
- Check all reproduction pod logs for messages indicating failed TLS negotiation.
- If, 5-10 minutes after the control plane VM finishes recycling and the new apiserver is up, there are no connection failures indicated in the pod logs, assume the issue hasn't been reproduced and recycle the control plane VM again.
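For concreteness, here is a minimal sketch of the kind of connection tester I run in the pods (not the actual tool, which also exports metrics; the log path and timings are placeholders):

```sh
#!/bin/sh
# Minimal in-pod tester sketch: repeatedly open a TLS connection to the
# apiserver service IP and log any connect/handshake failure. A 403 response
# body (shown later in this issue) still counts as success, because the TLS
# handshake completed; curl only exits non-zero on connection/TLS errors here.
while true; do
  if ! curl --insecure --silent --show-error --max-time 5 \
       --output /dev/null https://10.245.0.1:443/ 2>>/tmp/tls-failures.log; then
    echo "$(date) connection/TLS failure" >>/tmp/tls-failures.log
  fi
  sleep 1
done
```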
diagnosing the issue once it’s reproduced
There are two ways we have determined that what’s happening here is incorrect:
- By enabling verbose cilium datapath logging via `cilium monitor` on worker nodes with affected pods, then observing the datapath logs as we manually attempt to reproduce the connection failure using `curl` or `openssl s_client` (a sketch of the commands involved follows this list).
- By observing packets using `tcpdump` on both the public and the VPC network interfaces of a worker node with an affected pod (again, while we manually attempt to reproduce the connection failure using `curl` or `openssl s_client`).
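For reference, this is roughly how the first method is set up. This is only a sketch: the per-endpoint `Debug` option name is what I believe the cilium CLI exposes, so treat the exact option/flag names as an assumption that may vary by Cilium version.

```sh
# Sketch: enable datapath debug for the affected endpoint, then stream the
# verbose monitor output related to that endpoint.
kubectl -n kube-system exec -ti <cilium-agent-pod> -c cilium-agent -- \
  cilium endpoint config <endpoint-id> Debug=true
kubectl -n kube-system exec -ti <cilium-agent-pod> -c cilium-agent -- \
  cilium monitor -v --related-to <endpoint-id>
```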
I'll break this down into four demonstrations:
- first, I'll show what a failed connection looks like in the `tcpdump` output
- second, I'll show what a failed connection looks like in the `cilium monitor` output
- third, I'll show what a "normal" apiserver connection via the `10.245.0.1` endpoint looks like in the `tcpdump` output
- fourth, I'll show what a "normal" apiserver connection via the `10.245.0.1` endpoint looks like in the `cilium monitor` output
For all the demonstrations I'll be running `curl --insecure https://10.245.0.1`. When it succeeds, the curl output looks like the following:
root@connection-tester-zjrgv:~# curl --insecure https://10.245.0.1:443
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
"reason": "Forbidden",
"details": {
},
"code": 403
}
When connections fail in the way described below, the output looks like this:
root@connection-tester-zjrgv:~# curl --insecure https://10.245.0.1:443
curl: (35) error:1408F10B:SSL routines:ssl3_get_record:wrong version number
My interpretation of this particular error is that whatever curl is connecting to is not a proper TLS server (which makes sense, given that it is erroneously connecting to CoreDNS).
failed connection, tcpdump output
In this demo I will run the following in a broken pod’s network namespace
curl --insecure https://10.245.0.1
while running the following in the worker node’s network namespace:
tcpdump -i eth1 '(port 53 or 9153) and not (port 43698 or 43694 or 43692 or 43690 or 43706 or 43696 or 43704 or 43700 or 43702 or 43668)'
(The additional excluded ports belong to connections established by my connection-tester tool, which exports connection fail/success metrics to help me visualize the problem. I used `kill -STOP <pid>` to pause that process in order to reduce noise on the network interfaces, but the kernel and the remote endpoint keep its TCP connections alive, and the remote endpoint is clearly a CoreDNS pod even though it should be 10.245.0.1.)
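For completeness, the curl itself runs inside the affected pod's network namespace. A couple of ways to do that (a sketch; the pod name and container PID are placeholders):

```sh
# Easiest: exec into the affected pod (if its image ships curl), as in the
# transcripts above.
kubectl exec -ti connection-tester-zjrgv -- curl --insecure https://10.245.0.1

# Alternatively, from the worker node, enter only the pod's network namespace
# using the container's host PID (e.g. found via crictl/docker inspect).
nsenter -t <container-pid> -n -- curl --insecure https://10.245.0.1
```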
The tcpdump output below shows the packets erroneously routed over the internal/VPC interface, targeting a CoreDNS pod at port 9153:
06:25:16.925354 IP 10.244.4.203.44748 > 10.244.0.58.9153: Flags [S], seq 1349972120, win 64240, options [mss 1460,sackOK,TS val 1975424518 ecr 0,nop,wscale 7], length 0
06:25:16.926987 IP 10.244.0.58.9153 > 10.244.4.203.44748: Flags [S.], seq 573747032, ack 1349972121, win 65160, options [mss 1460,sackOK,TS val 1682153908 ecr 1975424518,nop,wscale 7], length 0
06:25:16.927252 IP 10.244.4.203.44748 > 10.244.0.58.9153: Flags [.], ack 1, win 502, options [nop,nop,TS val 1975424520 ecr 1682153908], length 0
06:25:16.968220 IP 10.244.4.203.44748 > 10.244.0.58.9153: Flags [P.], seq 1:518, ack 1, win 502, options [nop,nop,TS val 1975424561 ecr 1682153908], length 517
06:25:16.968938 IP 10.244.0.58.9153 > 10.244.4.203.44748: Flags [.], ack 518, win 506, options [nop,nop,TS val 1682153950 ecr 1975424561], length 0
06:25:16.969067 IP 10.244.0.58.9153 > 10.244.4.203.44748: Flags [P.], seq 1:104, ack 518, win 506, options [nop,nop,TS val 1682153950 ecr 1975424561], length 103
06:25:16.969118 IP 10.244.0.58.9153 > 10.244.4.203.44748: Flags [F.], seq 104, ack 518, win 506, options [nop,nop,TS val 1682153950 ecr 1975424561], length 0
06:25:16.969280 IP 10.244.4.203.44748 > 10.244.0.58.9153: Flags [.], ack 104, win 502, options [nop,nop,TS val 1975424562 ecr 1682153950], length 0
06:25:17.011367 IP 10.244.4.203.44748 > 10.244.0.58.9153: Flags [.], ack 105, win 502, options [nop,nop,TS val 1975424604 ecr 1682153950], length 0
06:25:17.060411 IP 10.244.4.203.44748 > 10.244.0.58.9153: Flags [R.], seq 518, ack 105, win 502, options [nop,nop,TS val 1975424653 ecr 1682153950], length 0
This output immediately follows running curl --insecure https://10.245.0.1.
To verify this is indeed a coredns pod:
> kubectl get -n kube-system ciliumendpoints | grep 10.244.0.58
coredns-85d9ccbb46-p889v 1500 61951 ready 10.244.0.58
failed connection, cilium monitor output
To demonstrate this issue using `cilium monitor` with datapath logging enabled, I run the following command:
kubectl -n kube-system exec -ti cilium-hws2v -c cilium-agent -- cilium monitor -v --related-to 3581
- `cilium-hws2v` is the cilium agent pod on the worker node where I have reproduced this issue
- `3581` is the endpoint id of the pod in whose network namespace I am running `curl --insecure https://10.245.0.1`
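For anyone reproducing this, the endpoint id can be looked up from the agent on that node. A sketch, using the affected pod's IP (10.244.4.203):

```sh
# The first column of `cilium endpoint list` is the numeric endpoint id that
# `cilium monitor --related-to` expects; grep for the affected pod's IP.
kubectl -n kube-system exec -ti cilium-hws2v -c cilium-agent -- \
  cilium endpoint list | grep 10.244.4.203
```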
The output when I run curl --insecure https://10.245.0.1 looks like:
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.245.0.1:443
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT entry found lifetime=672861, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT verdict: Reply, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.244.0.58:9153
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT verdict: New, revnat=0
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Successfully mapped addr=10.244.0.58 to identity=61951
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack create: proxy-port=0 revnat=1 src-identity=20254 lb=10.244.0.58
-> stack flow 0x414828d6 identity 20254->61951 state new ifindex 0 orig-ip 0.0.0.0: 10.244.4.203:44896 -> 10.244.0.58:9153 tcp SYN
-> 0: 10.244.4.203:44896 -> 10.244.0.58:9153 tcp SYN
CPU 01: MARK 0x0 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.0.58:9153 dst=10.244.4.203:44896
CPU 01: MARK 0x0 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=0
CPU 01: MARK 0x0 FROM 3581 DEBUG: CT entry found lifetime=653712, revnat=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: CT verdict: Reply, revnat=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: Reverse NAT lookup, index=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: Performing reverse NAT, address=10.245.0.1 port=443
-> endpoint 3581 flow 0x0 identity 61951->20254 state reply ifindex lxc39b32c9654a8 orig-ip 10.244.0.58: 10.245.0.1:443 -> 10.244.4.203:44896 tcp SYN, ACK
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.245.0.1:443
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT entry found lifetime=675252, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT verdict: Reply, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.244.0.58:9153
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT entry found lifetime=653712, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT verdict: Established, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Successfully mapped addr=10.244.0.58 to identity=61951
-> stack flow 0x414828d6 identity 20254->61951 state established ifindex 0 orig-ip 0.0.0.0: 10.244.4.203:44896 -> 10.244.0.58:9153 tcp ACK
-> 0: 10.244.4.203:44896 -> 10.244.0.58:9153 tcp ACK
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.245.0.1:443
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT entry found lifetime=675252, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT verdict: Reply, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.244.0.58:9153
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT entry found lifetime=675252, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT verdict: Established, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Successfully mapped addr=10.244.0.58 to identity=61951
-> stack flow 0x414828d6 identity 20254->61951 state established ifindex 0 orig-ip 0.0.0.0: 10.244.4.203:44896 -> 10.244.0.58:9153 tcp ACK
-> 0: 10.244.4.203:44896 -> 10.244.0.58:9153 tcp ACK
CPU 01: MARK 0x0 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.0.58:9153 dst=10.244.4.203:44896
CPU 01: MARK 0x0 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=0
CPU 01: MARK 0x0 FROM 3581 DEBUG: CT entry found lifetime=675253, revnat=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: CT verdict: Reply, revnat=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: Reverse NAT lookup, index=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: Performing reverse NAT, address=10.245.0.1 port=443
CPU 01: MARK 0x0 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.0.58:9153 dst=10.244.4.203:44896
CPU 01: MARK 0x0 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=0
CPU 01: MARK 0x0 FROM 3581 DEBUG: CT entry found lifetime=675253, revnat=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: CT verdict: Reply, revnat=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: Reverse NAT lookup, index=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: Performing reverse NAT, address=10.245.0.1 port=443
-> endpoint 3581 flow 0x0 identity 61951->20254 state reply ifindex lxc39b32c9654a8 orig-ip 10.244.0.58: 10.245.0.1:443 -> 10.244.4.203:44896 tcp ACK
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.245.0.1:443
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT entry found lifetime=675253, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT verdict: Reply, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.244.0.58:9153
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT entry found lifetime=675253, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT verdict: Established, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Successfully mapped addr=10.244.0.58 to identity=61951
-> 0: 10.244.4.203:44896 -> 10.244.0.58:9153 tcp ACK
CPU 01: MARK 0x0 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.0.58:9153 dst=10.244.4.203:44896
CPU 01: MARK 0x0 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=0
CPU 01: MARK 0x0 FROM 3581 DEBUG: CT entry found lifetime=675253, revnat=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: CT verdict: Reply, revnat=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: Reverse NAT lookup, index=1
CPU 01: MARK 0x0 FROM 3581 DEBUG: Performing reverse NAT, address=10.245.0.1 port=443
-> endpoint 3581 flow 0x0 identity 61951->20254 state reply ifindex lxc39b32c9654a8 orig-ip 10.244.0.58: 10.245.0.1:443 -> 10.244.4.203:44896 tcp ACK, FIN
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.245.0.1:443
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT entry found lifetime=675253, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT verdict: Reply, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.244.0.58:9153
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT entry found lifetime=675253, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT verdict: Established, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Successfully mapped addr=10.244.0.58 to identity=61951
-> stack flow 0x414828d6 identity 20254->61951 state established ifindex 0 orig-ip 0.0.0.0: 10.244.4.203:44896 -> 10.244.0.58:9153 tcp ACK, RST
-> 0: 10.244.4.203:44896 -> 10.244.0.58:9153 tcp ACK, RST
Things appear to go wrong in the first couple of conntrack lookups, where 10.245.0.1:443 is somehow translated to 10.244.0.58:9153 (rather than to the actual public IP of the apiserver droplet):
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.245.0.1:443
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT entry found lifetime=672861, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: CT verdict: Reply, revnat=1
CPU 01: MARK 0x414828d6 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:44896 dst=10.244.0.58:9153
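To dig into that, the relevant conntrack entries on the node can be dumped directly from the agent (a sketch; the exact output format differs between Cilium versions):

```sh
# Dump the global BPF conntrack table and look for the service VIP and the
# CoreDNS backend it is being (mis)translated to.
kubectl -n kube-system exec -ti cilium-hws2v -c cilium-agent -- \
  cilium bpf ct list global | grep -E '10\.245\.0\.1|10\.244\.0\.58'
```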
normal apiserver connection, tcpdump output
In this demo I'll run the following (port 1279 is the kubelet's outbound port on my test cluster):
tcpdump -i eth0 port 443 and not port 1279
The result when I run curl --insecure https://10.245.0.1 is the following:
06:43:51.238439 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [S], seq 1772157021, win 64240, options [mss 1460,sackOK,TS val 1976538787 ecr 0,nop,wscale 7], length 0
06:43:51.240101 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [S.], seq 706137170, ack 1772157022, win 65160, options [mss 1460,sackOK,TS val 1097293368 ecr 1976538787,nop,wscale 7], length 0
06:43:51.240240 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [.], ack 1, win 502, options [nop,nop,TS val 1976538789 ecr 1097293368], length 0
06:43:51.265261 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [P.], seq 1:518, ack 1, win 502, options [nop,nop,TS val 1976538813 ecr 1097293368], length 517
06:43:51.265688 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [.], ack 518, win 506, options [nop,nop,TS val 1097293394 ecr 1976538813], length 0
06:43:51.329529 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [P.], seq 1:1957, ack 518, win 506, options [nop,nop,TS val 1097293458 ecr 1976538813], length 1956
06:43:51.329749 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [.], ack 1957, win 493, options [nop,nop,TS val 1976538878 ecr 1097293458], length 0
06:43:51.331801 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [P.], seq 518:628, ack 1957, win 501, options [nop,nop,TS val 1976538880 ecr 1097293458], length 110
06:43:51.332126 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [.], ack 628, win 506, options [nop,nop,TS val 1097293460 ecr 1976538880], length 0
06:43:51.332349 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [P.], seq 628:674, ack 1957, win 501, options [nop,nop,TS val 1976538881 ecr 1097293460], length 46
06:43:51.332577 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [P.], seq 674:723, ack 1957, win 501, options [nop,nop,TS val 1976538881 ecr 1097293460], length 49
06:43:51.332644 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [.], ack 674, win 506, options [nop,nop,TS val 1097293461 ecr 1976538881], length 0
06:43:51.332771 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [.], ack 723, win 506, options [nop,nop,TS val 1097293461 ecr 1976538881], length 0
06:43:51.332894 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [P.], seq 723:758, ack 1957, win 501, options [nop,nop,TS val 1976538881 ecr 1097293461], length 35
06:43:51.369652 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [P.], seq 1957:2125, ack 816, win 506, options [nop,nop,TS val 1097293497 ecr 1976538881], length 168
06:43:51.369774 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [P.], seq 2125:2180, ack 816, win 506, options [nop,nop,TS val 1097293498 ecr 1976538881], length 55
06:43:51.369807 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [P.], seq 2180:2224, ack 816, win 506, options [nop,nop,TS val 1097293498 ecr 1976538881], length 44
06:43:51.370351 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [P.], seq 816:847, ack 2224, win 501, options [nop,nop,TS val 1976538919 ecr 1097293497], length 31
06:43:51.376665 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [P.], seq 2224:2478, ack 847, win 506, options [nop,nop,TS val 1097293505 ecr 1976538919], length 254
06:43:51.376791 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [P.], seq 2478:2742, ack 847, win 506, options [nop,nop,TS val 1097293505 ecr 1976538919], length 264
06:43:51.377039 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [.], ack 2742, win 501, options [nop,nop,TS val 1976538925 ecr 1097293505], length 0
06:43:51.377304 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [P.], seq 847:871, ack 2742, win 501, options [nop,nop,TS val 1976538926 ecr 1097293505], length 24
06:43:51.379755 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [P.], seq 2742:2766, ack 871, win 506, options [nop,nop,TS val 1097293508 ecr 1976538926], length 24
06:43:51.379837 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [F.], seq 2766, ack 871, win 506, options [nop,nop,TS val 1097293508 ecr 1976538926], length 0
06:43:51.379954 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [F.], seq 871, ack 2742, win 501, options [nop,nop,TS val 1976538928 ecr 1097293505], length 0
06:43:51.380001 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [R], seq 1772157892, win 0, length 0
06:43:51.380034 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [R], seq 1772157892, win 0, length 0
06:43:51.380835 IP <apiserver-public-ip>.https > <worker-node-public-ip>.45222: Flags [.], ack 872, win 506, options [nop,nop,TS val 1097293509 ecr 1976538928], length 0
06:43:51.380952 IP <worker-node-public-ip>.45222 > <apiserver-public-ip>.https: Flags [R], seq 1772157893, win 0, length 0
This is the normal/expected behavior when the bug is not affecting a given pod.
normal apiserver connection, cilium monitor output
I won't bother showing the full `cilium monitor` output, since I think only the first few lines are relevant:
CPU 01: MARK 0x8005cec1 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:45368 dst=10.245.0.1:443
CPU 01: MARK 0x8005cec1 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=4
CPU 01: MARK 0x8005cec1 FROM 3581 DEBUG: CT verdict: New, revnat=0
CPU 01: MARK 0x8005cec1 FROM 3581 DEBUG: Service backend slot lookup: slot=1, dport=443
CPU 01: MARK 0x8005cec1 FROM 3581 DEBUG: Conntrack create: proxy-port=0 revnat=1 src-identity=0 lb=0.0.0.0
CPU 01: MARK 0x8005cec1 FROM 3581 DEBUG: Conntrack lookup 1/2: src=10.244.4.203:45368 dst=<apiserver-public-ip>:443
CPU 01: MARK 0x8005cec1 FROM 3581 DEBUG: Conntrack lookup 2/2: nexthdr=6 flags=1
CPU 01: MARK 0x8005cec1 FROM 3581 DEBUG: CT verdict: New, revnat=0
CPU 01: MARK 0x8005cec1 FROM 3581 DEBUG: Successfully mapped addr=<apiserver-public-ip> to identity=2
CPU 01: MARK 0x8005cec1 FROM 3581 DEBUG: Conntrack create: proxy-port=0 revnat=1 src-identity=20254 lb=<apiserver-public-ip>
-> stack flow 0x8005cec1 identity 20254->world state new ifindex 0 orig-ip 0.0.0.0: 10.244.4.203:45368 -> <apiserver-public-ip>:443 tcp SYN
-> 0: 10.244.4.203:45368 -> <apiserver-public-ip>:443 tcp SYN
Here we can see that, in contrast with the failure-mode cilium monitor output, the conntrack result points the packets to the correct destination: the public IP address of the apiserver droplet.
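As a sanity check, the agent's current view of the apiserver service and its datapath load-balancer map can be inspected like this (a sketch):

```sh
# Both should list only the apiserver address as the backend for
# 10.245.0.1:443, not a CoreDNS pod.
kubectl -n kube-system exec -ti cilium-hws2v -c cilium-agent -- \
  cilium service list | grep -A2 '10.245.0.1'
kubectl -n kube-system exec -ti cilium-hws2v -c cilium-agent -- \
  cilium bpf lb list | grep -A2 '10.245.0.1'
```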
things that fix it
In the course of debugging this, I have identified a couple of approaches that seem to resolve the issue:
- At the worker node level, running `cilium bpf ct flush global` on the node's cilium agent seems to do the trick (a concrete invocation is shown after this list). This makes sense, given that from the `cilium monitor` logs this appears to be a conntrack bug of some sort.
- At the cluster level, deleting the CoreDNS pods and allowing their associated Deployment to re-provision them seems to clear the problem up entirely (likewise shown below).
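A sketch of both workarounds; the CoreDNS selector assumes the standard `k8s-app=kube-dns` label:

```sh
# Per-node workaround: flush the BPF conntrack tables on the affected node's
# cilium agent.
kubectl -n kube-system exec -ti <cilium-agent-pod> -c cilium-agent -- \
  cilium bpf ct flush global

# Cluster-level workaround: delete the CoreDNS pods and let the Deployment
# recreate them.
kubectl -n kube-system delete pod -l k8s-app=kube-dns
```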
I wanted to include these here in case they provide any clues for maintainers who may have a better idea of what’s going on.
what is going on?
Is this a bug in Cilium? Or could it be some kind of race condition or cluster misconfiguration? I’m happy to share any additional details that may be useful.
One thing I can't do is tell you which Helm flags were used to set up cilium, since we generated manifests from Helm a long time ago and have been evolving those manifests over time as we release new versions of our product. If anyone thinks it would be helpful, I can include the cilium ConfigMap and pod specs.
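If it helps, this is what I would gather (assuming the standard `cilium-config` ConfigMap name and `k8s-app=cilium` pod label from the upstream manifests):

```sh
# Cilium agent configuration and pod specs as currently deployed.
kubectl -n kube-system get configmap cilium-config -o yaml
kubectl -n kube-system get pod -l k8s-app=cilium -o yaml
```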
Cilium Version
- 1.10.4
- 1.10.12
Kernel Version
- 4.19.194-3
- 5.10.120-1
Kubernetes Version
- 1.21.3
- 1.22.8
Sysdump
No response
Relevant log output
No response
Anything else?
I have no idea if this is really a conntrack issue or a datapath issue or something else entirely so please forgive my naive issue title 😃
Code of Conduct
- I agree to follow this project’s Code of Conduct