cilium: Services are unreachable after node is restarted

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

When I install Cilium version >= 1.14.0 and restart a Kubernetes node, pods on that node cannot reach any Service.

Reproduction:

  • Install Cilium 1.14.1 on Kubernetes 1.28
  • Restart a Kubernetes node, e.g. via kubectl debug node/<node-name> --image=ubuntu -- bash -c "echo reboot > reboot.sh && chroot /host < reboot.sh"
  • Look at any pod on that node that needs to connect to, e.g., the API server and observe {"level":"error","ts":"2023-08-31T12:25:29Z","logger":"setup","msg":"unable to start manager","error":"Get \"https://10.96.0.1:443/api?timeout=32s\": dial tcp 10.96.0.1:443: i/o timeout"} (a quicker check is sketched right below this list)
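For a quicker check than waiting for a workload to log the timeout, a throwaway pod can be pinned to the restarted node and asked to resolve a Service name. This is only a sketch; <node-name> and the busybox image are placeholders:

# start a temporary pod directly on the restarted node and try a Service lookup;
# on an affected node this times out, on a healthy node it resolves
kubectl run svc-check --rm -it --restart=Never --image=busybox:1.36 \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"<node-name>"}}' \
  -- nslookup kubernetes.default.svc.cluster.local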

One can also observe the following traffic flow when running an nslookup inside a pod on this node:

root@fedora:/home/cilium# tcpdump -n -i any host 10.96.0.10 or host 10.244.0.24
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
12:33:05.057842 lxcf96ef40da1c4 In  IP 10.244.0.24.33911 > 10.96.0.10.53: 4076+ A? kubernetes.default.default.svc.cluster.local. (62)
12:33:05.057919 ens5  Out IP 192.168.179.24.33911 > 10.96.0.10.53: 4076+ A? kubernetes.default.default.svc.cluster.local. (62)
12:33:10.057791 lxcf96ef40da1c4 In  IP 10.244.0.24.33911 > 10.96.0.10.53: 4076+ A? kubernetes.default.default.svc.cluster.local. (62)
12:33:10.057817 ens5  Out IP 192.168.179.24.33911 > 10.96.0.10.53: 4076+ A? kubernetes.default.default.svc.cluster.local. (62)
12:33:15.057884 lxcf96ef40da1c4 In  IP 10.244.0.24.33911 > 10.96.0.10.53: 4076+ A? kubernetes.default.default.svc.cluster.local. (62)
12:33:15.057915 ens5  Out IP 192.168.179.24.33911 > 10.96.0.10.53: 4076+ A? kubernetes.default.default.svc.cluster.local. (62)

For comparison, this is how the flow looks on a node that has not yet been restarted. Note that there the ClusterIP 10.96.0.10 is already translated to a CoreDNS pod IP before the packet leaves the pod’s lxc interface, whereas on the restarted node the query goes out via ens5 still addressed to the ClusterIP and never receives a reply:

root@fedora:/home/cilium# tcpdump -n -i any host 10.96.0.10 or host 10.244.3.160
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
12:36:12.381250 lxc15c1e0437997 In  IP 10.244.3.160.35892 > 10.244.1.159.53: 60216+ A? kubernetes.default.default.svc.cluster.local. (62)
12:36:12.381262 cilium_vxlan Out IP 10.244.3.160.35892 > 10.244.1.159.53: 60216+ A? kubernetes.default.default.svc.cluster.local. (62)
12:36:12.381741 cilium_vxlan P   IP 10.244.1.159.53 > 10.244.3.160.35892: 60216 NXDomain*- 0/1/0 (155)
12:36:12.381763 lxc15c1e0437997 Out IP 10.244.1.159.53 > 10.244.3.160.35892: 60216 NXDomain*- 0/1/0 (155)
12:36:12.381998 lxc15c1e0437997 In  IP 10.244.3.160.39099 > 10.244.2.24.53: 43204+ A? kubernetes.default.svc.cluster.local. (54)
12:36:12.382009 cilium_vxlan Out IP 10.244.3.160.39099 > 10.244.2.24.53: 43204+ A? kubernetes.default.svc.cluster.local. (54)
12:36:12.382417 cilium_vxlan P   IP 10.244.2.24.53 > 10.244.3.160.39099: 43204*- 1/0/0 A 10.96.0.1 (106)
12:36:12.382450 lxc15c1e0437997 Out IP 10.244.2.24.53 > 10.244.3.160.39099: 43204*- 1/0/0 A 10.96.0.1 (106)
12:36:12.382728 lxc15c1e0437997 In  IP 10.244.3.160.33187 > 10.244.1.159.53: 28871+ AAAA? kubernetes.default.svc.cluster.local. (54)
12:36:12.382735 cilium_vxlan Out IP 10.244.3.160.33187 > 10.244.1.159.53: 28871+ AAAA? kubernetes.default.svc.cluster.local. (54)
12:36:12.383005 cilium_vxlan P   IP 10.244.1.159.53 > 10.244.3.160.33187: 28871*- 0/1/0 (147)
12:36:12.383026 lxc15c1e0437997 Out IP 10.244.1.159.53 > 10.244.3.160.33187: 28871*- 0/1/0 (147)
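Since the ClusterIP is never translated on the restarted node, the socket-LB cgroup programs are a natural suspect. A quick way to check their attachment, assuming Cilium’s default cgroup2 mount point /run/cilium/cgroupv2 (it is configurable), would be:

# on the affected node: list BPF programs attached anywhere under Cilium's cgroup2 mount;
# a healthy node should show the socket-LB programs (connect/sendmsg/recvmsg attach types),
# while an empty listing would explain the missing ClusterIP translation above
bpftool cgroup tree /run/cilium/cgroupv2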

Cilium Version

1.14.0, 1.14.1, main (90a9402d2342a12774b8c3eebd67de5bad572472)

Kernel Version

Linux fedora 6.1.45-100.constellation.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Aug 14 17:39:05 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"28", GitVersion:"v1.28.0", GitCommit:"855e7c48de7388eb330da0f8d9d2394ee818fb8d", GitTreeState:"clean", BuildDate:"2023-08-15T10:15:54Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

cilium-sysdump-20230831-142049.zip

Relevant log output

No response

Anything else?

The bug does not happen on 1.13.6, and I bisected the problem to the following commit: 68fd9eeec16e015221b52f789f62447e8b1e16eb (i.e. this commit is the first in which the error occurs). When using Kubernetes 1.27 or 1.26 the problem disappears.
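For reference, a rough sketch of how the bisection can be reproduced; reproduce.sh is a hypothetical script (not part of the cilium repo) that builds and deploys the agent image, reboots a node, and exits non-zero when pods on it cannot reach a Service:

# in a cilium checkout, bisect between the versions mentioned above
git bisect start
git bisect bad v1.14.0
git bisect good v1.13.6
git bisect run ./reproduce.sh
# expected result, per the report above: 68fd9eeec16e015221b52f789f62447e8b1e16eb is the first bad commit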

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created 10 months ago
  • Comments: 24 (24 by maintainers)

Most upvoted comments

I’m currently unable to repro the issue on a 1.28 kind cluster. I tried restarting a kind node as well as restarting the cilium agents to trigger link deletion/re-creation and updating an existing link.
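The restart attempts were roughly of the following form; kind-worker is the default kind node/container name, and cilium is assumed to be deployed as the usual DaemonSet in kube-system:

# restart a kind "node" (kind nodes are just docker containers)
docker restart kind-worker
# restart the cilium agents, which triggers link deletion/re-creation and program updates
kubectl -n kube-system rollout restart daemonset/cilium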

@rgo3 https://github.com/cilium/cilium/issues/27900#issuecomment-1709590527

Sorry for the delay. I can confirm that using K8s v1.28.2 fixes the reported problem. Thanks for the help and investigation.

@3u13r would you be able to test it again with Kubernetes 1.28.2? Thank you

Okay, just for clarity: I got my k3s cluster into a bad state. I’m going to try downgrading cilium to 1.13.6 on it just to confirm I’m seeing the same issue now. I’ll update this comment with more info. Sorry for the noise with the spurious other-issue reproducer.

So I rolled back to cilium 1.13.6 on my two-node Intel NUC k3s cluster running k8s 1.28, confirmed all pods were up and running and cilium status was green, rebooted both k3s nodes, and the CoreDNS pods didn’t come back up… so it’s the same symptoms as my kind cluster reproducer…

Do I need to document this further? What information should I provide?

There are a lot of moving parts here; the k3s images are release candidates themselves, so ruling out k3s-RC-specific “type 2 fun” (tm) is not possible.

kubectl get nodes
NAME     STATUS   ROLES                  AGE   VERSION
nuc-02   Ready    control-plane,master   42h   v1.28.1-rc2+k3s1
nuc-01   Ready    <none>                 39h   v1.28.1-rc2+k3s1
cilium version
cilium-cli: v0.15.7 compiled with go1.21.0 on linux/amd64
cilium image (default): v1.14.1
cilium image (stable): v1.14.1
cilium image (running): 1.13.6
kubectl logs -n kube-system coredns-77ccd57875-p4lbq | grep WARNING
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout

Sysdump attached: cilium-sysdump-20230123-123225.zip

What’s the bisected commit ID you were referring to? Is it the same as https://github.com/cilium/cilium/commit/68fd9eeec16e015221b52f789f62447e8b1e16eb?

Yes, this is the commit that breaks the behavior on K8s 1.28 for me.

Looks like there could be multiple issues at play here. 😕 Issue (1): a k8s v1.28 regression, see https://github.com/cilium/cilium/issues/27900#issuecomment-1711958829. This might be exposing other issues in the loader logic, as @rgo3 pointed out, due to which BPF cgroup programs are not getting attached at all after a node restart?

I’m 99% sure that the K8s upstream fix mentioned in the other issue will also solve this issue. Note that the order of dependency changed in the commit I bisected, i.e. before that commit the pinning took place before the cilium-agent container and after all the other init containers.
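A quick way to see the current init container ordering on a running cluster, assuming the default DaemonSet name cilium in kube-system (the container names depend on the Helm values in use):

# print the init containers of the cilium DaemonSet in their declared order
kubectl -n kube-system get ds cilium \
  -o jsonpath='{range .spec.template.spec.initContainers[*]}{.name}{"\n"}{end}'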

👍 Could you confirm the bisected commit, please? Thanks!

Just waiting for a new K8s release now. If you want I’m fine with closing this issue since it is a K8s upstream bug.

I would suggest keeping this issue open for the potential issue in the loader logic.

(Edit: Sorry, I accidentally deleted my previous comment while editing.)

Okay @rgo3, so this one is probably not reproducible just with kind. Maybe I need to kick my k3s home lab worker node in just the right way…

@jspaleta I think your reproducers trigger a different issue than what @3u13r is seeing. In your case, after a restart the cgroup BPF progs are attached to the wrong cgroup, as @brb has shown in https://github.com/cilium/cilium/issues/27900#issuecomment-1709590527. From what I could see when debugging, this happens because /run/cilium/cgroupv2 points to different “locations” (not sure if that is the correct terminology here). This also explains why you can see the issue happening across cilium versions (1.14 and 1.13.6, respectively): even though https://github.com/cilium/cilium/commit/68fd9eeec16e015221b52f789f62447e8b1e16eb changes how we attach cgroup programs, in both versions we try to attach to /run/cilium/cgroupv2. To understand the issue in your reproducer we’ll need to understand why this doesn’t point to the same cgroup root across restarts and whether it only happens in kind.
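One way to see what /run/cilium/cgroupv2 actually points at across restarts is to compare the cgroup2 mount entries as seen by the agent and by the host; <cilium-agent-pod> is a placeholder for the agent pod on the affected node:

# cgroup2 mounts as seen by the cilium agent
kubectl -n kube-system exec <cilium-agent-pod> -- grep cgroup2 /proc/self/mountinfo
# cgroup2 mounts as seen on the host (e.g. from a node shell or kubectl debug node/<node-name>)
grep cgroup2 /proc/self/mountinfo
# the fourth field (the mount's root) shows whether both refer to the same cgroup root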

However, in @3u13r’s case it looks like we have a bug when updating the bpf links after a restart. My current suspicion is that retrieving the link from bpffs and just updating the program after a restart doesn’t work for some reason, because no cgroup progs are attached anymore. But as mentioned before I don’t have a good reproducer for this exact behavior yet 😞
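If it helps, the post-restart state can be inspected by listing the kernel’s BPF links and the pins on bpffs; the pin location below is only where I’d look first, not a documented layout:

# list all BPF links; on a healthy node this should include cgroup-type links
# for the socket-LB attach points
bpftool link show
# see what is actually pinned under bpffs (exact layout varies between cilium versions)
find /sys/fs/bpf -type f 2>/dev/null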