cilium: Cilium Identity not getting deleted sometimes in large scale cluster

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Created an AKS cluster (Azure CNI powered by Cilium) with 1k nodes and 1000 deployments, each with 100 pods. Then scaled the deployments down to 0 and expected the Cilium identities to be removed within a few hours. I noticed that a few identities were not removed even after days. To debug, I added a few more logs in cilium-operator to find where the problem is. I found that the identity delete is triggered and is then followed by the identity being marked alive. There are no pods/CEPs running in that namespace, so I'm not sure how the identity is being marked alive. I added a log to find where the identity is being marked alive and found it happens in the runHeartbeatUpdater function: https://github.com/cilium/cilium/blob/d1162cc2b596d55a8235fd67e034bb5cfe98b173/operator/identitygc/crd_gc.go#L49

Pulled latest cilium master and added logs on top of it and then created an image out of it.

k get ciliumidentity
NAME    NAMESPACE     AGE
17682   kube-system   13d
19740   kube-system   13d
1978    kube-system   13d
20776   kube-system   54d
30723   kube-system   13d
3647    kube-system   54d
40097   new-testing   2m49s
42123   kube-system   54d

k get pods -n new-testing
No resources found in new-testing namespace.

k get cep -n new-testing
No resources found in new-testing namespace.

Operator Logs:

2023-06-27T16:08:11.157678430Z level=debug msg="Deleting unused identity; marked for deletion at 2023-06-27T16:07:11.142122029Z" identity="&{{ } {40097 f0c65750-fa9e-418a-9c22-05ffb16040e9 526408139 1 2023-06-27 16:05:41 +0000 UTC <nil> <nil> map[app:new-testing io.cilium.k8s.policy.cluster:default io.cilium.k8s.policy.serviceaccount:default io.kubernetes.pod.namespace:new-testing is-real:true name:real-dep-00541 real-dep-lab-00541-00001:val real-dep-lab-00541-00002:val real-dep-lab-00541-00003:val real-dep-lab-00541-00004:val real-dep-lab-00541-00005:val] map[io.cilium.heartbeat:2023-06-27T16:07:11.142122029Z] [] [] [{cilium-agent Update cilium.io/v2 2023-06-27 16:05:41 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:app":{},"f:io.cilium.k8s.policy.cluster":{},"f:io.cilium.k8s.policy.serviceaccount":{},"f:io.kubernetes.pod.namespace":{},"f:is-real":{},"f:name":{},"f:real-dep-lab-00541-00001":{},"f:real-dep-lab-00541-00002":{},"f:real-dep-lab-00541-00003":{},"f:real-dep-lab-00541-00004":{},"f:real-dep-lab-00541-00005":{}}},"f:security-labels":{".":{},"f:k8s:app":{},"f:k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name":{},"f:k8s:io.cilium.k8s.policy.cluster":{},"f:k8s:io.cilium.k8s.policy.serviceaccount":{},"f:k8s:io.kubernetes.pod.namespace":{},"f:k8s:is-real":{},"f:k8s:name":{},"f:k8s:real-dep-lab-00541-00001":{},"f:k8s:real-dep-lab-00541-00002":{},"f:k8s:real-dep-lab-00541-00003":{},"f:k8s:real-dep-lab-00541-00004":{},"f:k8s:real-dep-lab-00541-00005":{}}} } {cilium-operator-generic Update cilium.io/v2 2023-06-27 16:07:11 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:io.cilium.heartbeat":{}}}} }]} map[k8s:app:new-testing k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name:new-testing k8s:io.cilium.k8s.policy.cluster:default k8s:io.cilium.k8s.policy.serviceaccount:default k8s:io.kubernetes.pod.namespace:new-testing k8s:is-real:true k8s:name:real-dep-00541 k8s:real-dep-lab-00541-00001:val k8s:real-dep-lab-00541-00002:val k8s:real-dep-lab-00541-00003:val k8s:real-dep-lab-00541-00004:val k8s:real-dep-lab-00541-00005:val]}" subsys=identity-heartbeat
2023-06-27T16:08:11.168553712Z level=debug msg="Garbage collected identity" identity=40097 subsys=identity-heartbeat
2023-06-27T16:08:11.168582313Z level=info msg="CE associated with identity. so mark true" identity=42123 subsys=identity-heartbeat
2023-06-27T16:08:11.168588113Z level=debug msg="Marking identity alive" identity=42123 subsys=identity-heartbeat
2023-06-27T16:08:11.168593213Z level=info msg="CE associated with identity. so mark true" identity=17682 subsys=identity-heartbeat
2023-06-27T16:08:11.168623414Z level=debug msg="Marking identity alive" identity=17682 subsys=identity-heartbeat
2023-06-27T16:08:11.168628814Z level=info msg="CE associated with identity. so mark true" identity=19740 subsys=identity-heartbeat
2023-06-27T16:08:11.168633314Z level=debug msg="Marking identity alive" identity=19740 subsys=identity-heartbeat
2023-06-27T16:08:11.168640214Z level=debug msg="Controller func execution time: 11.514098ms" name=crd-identity-gc subsys=controller uuid=42621a4e-9dc8-4da9-9622-d0c3dac5eef3
2023-06-27T16:08:11.168804718Z level=debug msg="Deleting identity in heartbeat lifesign table" identity=40097 subsys=identity-heartbeat (added by me)
2023-06-27T16:08:11.177053432Z level=info msg="heartbeat updater triggered" identity=40097 subsys=identity-heartbeat
2023-06-27T16:08:11.177079633Z level=debug msg="Marking identity alive" identity=40097 subsys=identity-heartbeat
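To make the sequence above easier to follow, here is a minimal, self-contained Go sketch of what I think is happening (the lifesignTable type and its methods are toy stand-ins I made up, not the actual operator code): the GC run collects identity 40097 and drops its lifesign entry, but the heartbeat updater then processes an event for the same identity and marks it alive again, so the next GC cycle treats it as in use.

package main

import (
	"fmt"
	"sync"
	"time"
)

// lifesignTable is a toy stand-in for the operator's heartbeat store:
// identity name -> the last time it was marked alive.
type lifesignTable struct {
	mu      sync.Mutex
	entries map[string]time.Time
	timeout time.Duration
}

func newLifesignTable(timeout time.Duration) *lifesignTable {
	return &lifesignTable{entries: map[string]time.Time{}, timeout: timeout}
}

// markAlive models what the heartbeat updater does when it processes an
// identity event.
func (t *lifesignTable) markAlive(id string, now time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.entries[id] = now
}

// deleteEntry models the cleanup done right after an identity has been
// garbage collected.
func (t *lifesignTable) deleteEntry(id string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.entries, id)
}

// isAlive reports whether the identity was marked alive within the heartbeat
// timeout, i.e. whether the next GC run will skip it.
func (t *lifesignTable) isAlive(id string, now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	last, ok := t.entries[id]
	return ok && now.Sub(last) < t.timeout
}

func main() {
	table := newLifesignTable(time.Minute)
	now := time.Now()

	// GC run: identity 40097 timed out, its CRD is deleted and the lifesign
	// entry is dropped (the "Garbage collected identity" / "Deleting identity
	// in heartbeat lifesign table" lines above).
	table.deleteEntry("40097")

	// The heartbeat updater then handles an event for the same identity
	// ("heartbeat updater triggered" / "Marking identity alive" above) and
	// resurrects the lifesign entry even though no pod/CEP uses it anymore.
	table.markAlive("40097", now)

	fmt.Println("alive after GC:", table.isAlive("40097", now)) // prints true
}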

Cilium Version

latest cilium (master)

Kernel Version

5.15

Kubernetes Version

1.25.6

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: open
  • Created a year ago
  • Comments: 21 (12 by maintainers)

Most upvoted comments

Hi @tamilmani1989, unfortunately I didn’t have cycles to investigate further. If you could do another try with that latest patch, it would be great. Let us know the results and we will try to get back to this as soon as possible. Thank you!

This could potentially be involved here: https://github.com/cilium/cilium/pull/27340. If you have a reliable repro, could you perhaps try a build with this patch and see if it still happens?

@pippolo84 sure I will try to repro this with debug build and share logs with you.

I tried to reproduce this in a local kind cluster by applying deployments with many replicas and scaling them down and up, but unfortunately no luck yet.

@tamilmani1989 would you please be able to run a debug version of Cilium in your cluster? I’ve already prepared this PR where I added some logs to better understand how identities are managed by both the agent and the operator. We would need to try this version in your large scale cluster and then collect:

  • the list of commands used to repro the issue
  • the identities leaked and the namespace

Then, focusing on one of the leaked identities:

  • output of kubectl get -o yaml for that identity
  • the full operator logs
  • the full logs of all the agents where you can find an occurrence of that identity
  • if possible, a sysdump for the node where the agent is recreating/updating that identity

That would be incredibly helpful. Thank you in advance for the help 🙏

Quick update: after looking at the code together with @aanm and @marseel (thanks to both 🙏 ), we think there might be an issue with the endpoint release codepath in case of high concurrency. I’m going to set up a local cluster to prove that hypothesis.

@tamilmani1989 thank you very much for the additional details. 🙏 I should have enough cycles to take a deeper look at this early next week. Skimming through the info you reported, I think it should be enough; no need to do anything else for now.

I don’t quite see how we could get a delete followed by an upsert from Resource[T], but looking at crd_gc.go we might have a potential race between gc() and runHeartbeatUpdater(), which both manipulate the heartbeatStore: e.g. we might have gc() doing an action based on the latest data and then runHeartbeatUpdater() following with an action based on stale data, or the other way around.

I’d approach this by trying to create a test case in gc_test.go that churns through a lot of identities, see if this race can be triggered, and then go from there. I would not rule out a bug in Resource[T], but I would start from the assumption that crd_gc.go has a race that causes this. It’s likely worth refactoring it to work on top of a single consumer of the event stream to make sure actions are always consistent (e.g. don’t use Store at all).
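For illustration, a rough sketch of the kind of churn test I have in mind (the event and store types below are made-up stand-ins, not the real crd_gc.go ones): run a fake gc and a fake heartbeat updater concurrently against a shared store, push a lot of identity upsert/delete events through, and assert at the end that nothing whose delete event was processed is still marked alive. The real test would wire in the actual heartbeatStore and Resource[T] event stream, and running the churn under -race could also surface unsynchronized access.

package identitygc_test

// Rough sketch of a churn test. The event and store types are stand-ins for
// the real crd_gc.go ones; the point is only to show the shape: many
// identities cycled through concurrent "gc" and "heartbeat updater"
// goroutines, with a final check that no deleted identity is left alive.

import (
	"fmt"
	"sync"
	"testing"
)

type event struct {
	id     string
	upsert bool // false means a delete event
}

type store struct {
	mu    sync.Mutex
	alive map[string]bool
}

func (s *store) markAlive(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.alive[id] = true
}

func (s *store) remove(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.alive, id)
}

func (s *store) isAlive(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.alive[id]
}

func TestIdentityChurn(t *testing.T) {
	s := &store{alive: map[string]bool{}}
	events := make(chan event, 1024)

	var wg sync.WaitGroup
	wg.Add(2)

	// Fake "runHeartbeatUpdater": consumes the event stream and updates the store.
	go func() {
		defer wg.Done()
		for ev := range events {
			if ev.upsert {
				s.markAlive(ev.id)
			} else {
				s.remove(ev.id)
			}
		}
	}()

	// Fake "gc": concurrently removes entries it considers expired, touching
	// the same keys as the updater above.
	go func() {
		defer wg.Done()
		for i := 0; i < 10000; i++ {
			s.remove(fmt.Sprintf("id-%d", i%100))
		}
	}()

	// Churn: for each identity, emit an upsert followed by a delete event.
	for i := 0; i < 10000; i++ {
		id := fmt.Sprintf("id-%d", i%100)
		events <- event{id: id, upsert: true}
		events <- event{id: id, upsert: false}
	}
	close(events)
	wg.Wait()

	// Every identity's last event was a delete, so nothing should be alive.
	for i := 0; i < 100; i++ {
		if id := fmt.Sprintf("id-%d", i); s.isAlive(id) {
			t.Errorf("identity %s leaked: still alive after its delete event", id)
		}
	}
}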

@pippolo84 would you have time to look into this?