cilium: Cilium Identity not getting deleted sometimes in large scale cluster
Is there an existing issue for this?
- I have searched the existing issues
What happened?
Created an AKS cluster (Azure CNI powered by Cilium) with 1k nodes and 1000 deployments, each with 100 pods. Then scaled the deployments down to 0 and expected the Cilium identities to be removed within a few hours. I noticed that some identities were not removed even after days. To debug, I added a few more logs to the cilium-operator to find where the problem is. I found that the identity deletion is triggered, but it is followed by the identity being marked alive again. There are no pods/CEPs running in that namespace, so I am not sure how the identity is being marked alive. I added a log to find where the identity is being marked alive and found it happens in the runHeartbeatUpdater function: https://github.com/cilium/cilium/blob/d1162cc2b596d55a8235fd67e034bb5cfe98b173/operator/identitygc/crd_gc.go#L49
I pulled the latest Cilium master, added these logs on top of it, and built an image from it.
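Below is a minimal, self-contained sketch of the interleaving I suspect between the GC pass and the heartbeat updater. It uses hypothetical, simplified types and names (heartbeatStore, markAlive) and is not Cilium's actual code: a stale upsert event for an identity is still queued when the GC pass deletes the identity from the heartbeat lifesign table, so processing the event afterwards marks the identity alive again.

```go
// Minimal sketch of the suspected interleaving (hypothetical, simplified
// types; not Cilium's actual code). A stale upsert event for an identity is
// still queued while the GC pass garbage-collects that identity and removes
// it from the heartbeat lifesign table; processing the stale event afterwards
// marks the identity alive again, so it is never reconsidered for deletion.
package main

import (
	"fmt"
	"sync"
	"time"
)

// heartbeatStore records the last time each identity was seen "alive".
type heartbeatStore struct {
	mu        sync.Mutex
	lifesigns map[string]time.Time
}

func (s *heartbeatStore) markAlive(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lifesigns[id] = time.Now()
}

func (s *heartbeatStore) delete(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.lifesigns, id)
}

func (s *heartbeatStore) isAlive(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.lifesigns[id]
	return ok
}

func main() {
	store := &heartbeatStore{lifesigns: map[string]time.Time{}}
	events := make(chan string, 1)

	// Heartbeat updater: marks an identity alive for every upsert event it
	// receives, even if the event is already stale when it is processed.
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for id := range events {
			store.markAlive(id) // "Marking identity alive"
		}
	}()

	// A stale upsert event for identity 40097 is already queued...
	events <- "40097"

	// ...while the GC pass concurrently garbage-collects the identity and
	// removes it from the heartbeat lifesign table.
	store.delete("40097") // "Deleting identity in heartbeat lifesign table"

	close(events)
	wg.Wait()

	// Depending on the interleaving, the identity ends up marked alive again
	// right after it was garbage collected.
	fmt.Println("identity 40097 marked alive after GC:", store.isAlive("40097"))
}
```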
k get ciliumidentity
NAME    NAMESPACE     AGE
17682   kube-system   13d
19740   kube-system   13d
1978    kube-system   13d
20776   kube-system   54d
30723   kube-system   13d
3647    kube-system   54d
40097   new-testing   2m49s
42123   kube-system   54d
k get pods -n new-testing
No resources found in new-testing namespace.
k get cep -n new-testing
No resources found in new-testing namespace.
Operator Logs:
2023-06-27T16:08:11.157678430Z level=debug msg="Deleting unused identity; marked for deletion at 2023-06-27T16:07:11.142122029Z" identity="&{{ } {40097 f0c65750-fa9e-418a-9c22-05ffb16040e9 526408139 1 2023-06-27 16:05:41 +0000 UTC <nil> <nil> map[app:new-testing io.cilium.k8s.policy.cluster:default io.cilium.k8s.policy.serviceaccount:default io.kubernetes.pod.namespace:new-testing is-real:true name:real-dep-00541 real-dep-lab-00541-00001:val real-dep-lab-00541-00002:val real-dep-lab-00541-00003:val real-dep-lab-00541-00004:val real-dep-lab-00541-00005:val] map[io.cilium.heartbeat:2023-06-27T16:07:11.142122029Z] [] [] [{cilium-agent Update cilium.io/v2 2023-06-27 16:05:41 +0000 UTC FieldsV1 {"f:metadata":{"f:labels":{".":{},"f:app":{},"f:io.cilium.k8s.policy.cluster":{},"f:io.cilium.k8s.policy.serviceaccount":{},"f:io.kubernetes.pod.namespace":{},"f:is-real":{},"f:name":{},"f:real-dep-lab-00541-00001":{},"f:real-dep-lab-00541-00002":{},"f:real-dep-lab-00541-00003":{},"f:real-dep-lab-00541-00004":{},"f:real-dep-lab-00541-00005":{}}},"f:security-labels":{".":{},"f:k8s:app":{},"f:k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name":{},"f:k8s:io.cilium.k8s.policy.cluster":{},"f:k8s:io.cilium.k8s.policy.serviceaccount":{},"f:k8s:io.kubernetes.pod.namespace":{},"f:k8s:is-real":{},"f:k8s:name":{},"f:k8s:real-dep-lab-00541-00001":{},"f:k8s:real-dep-lab-00541-00002":{},"f:k8s:real-dep-lab-00541-00003":{},"f:k8s:real-dep-lab-00541-00004":{},"f:k8s:real-dep-lab-00541-00005":{}}} } {cilium-operator-generic Update cilium.io/v2 2023-06-27 16:07:11 +0000 UTC FieldsV1 {"f:metadata":{"f:annotations":{".":{},"f:io.cilium.heartbeat":{}}}} }]} map[k8s:app:new-testing k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name:new-testing k8s:io.cilium.k8s.policy.cluster:default k8s:io.cilium.k8s.policy.serviceaccount:default k8s:io.kubernetes.pod.namespace:new-testing k8s:is-real:true k8s:name:real-dep-00541 k8s:real-dep-lab-00541-00001:val k8s:real-dep-lab-00541-00002:val k8s:real-dep-lab-00541-00003:val k8s:real-dep-lab-00541-00004:val k8s:real-dep-lab-00541-00005:val]}" subsys=identity-heartbeat
2023-06-27T16:08:11.168553712Z level=debug msg="Garbage collected identity" identity=40097 subsys=identity-heartbeat
2023-06-27T16:08:11.168582313Z level=info msg="CE associated with identity. so mark true" identity=42123 subsys=identity-heartbeat
2023-06-27T16:08:11.168588113Z level=debug msg="Marking identity alive" identity=42123 subsys=identity-heartbeat
2023-06-27T16:08:11.168593213Z level=info msg="CE associated with identity. so mark true" identity=17682 subsys=identity-heartbeat
2023-06-27T16:08:11.168623414Z level=debug msg="Marking identity alive" identity=17682 subsys=identity-heartbeat
2023-06-27T16:08:11.168628814Z level=info msg="CE associated with identity. so mark true" identity=19740 subsys=identity-heartbeat
2023-06-27T16:08:11.168633314Z level=debug msg="Marking identity alive" identity=19740 subsys=identity-heartbeat
2023-06-27T16:08:11.168640214Z level=debug msg="Controller func execution time: 11.514098ms" name=crd-identity-gc subsys=controller uuid=42621a4e-9dc8-4da9-9622-d0c3dac5eef3
2023-06-27T16:08:11.168804718Z level=debug msg="Deleting identity in heartbeat lifesign table" identity=40097 subsys=identity-heartbeat (added by me)
2023-06-27T16:08:11.177053432Z level=info msg="heartbeat updater triggered" identity=40097 subsys=identity-heartbeat
2023-06-27T16:08:11.177079633Z level=debug msg="Marking identity alive" identity=40097 subsys=identity-heartbeat
Cilium Version
latest cilium (master)
Kernel Version
5.15
Kubernetes Version
1.25.6
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: open
- Created a year ago
- Comments: 21 (12 by maintainers)
Hi @tamilmani1989, unfortunately I didn't have cycles to investigate further. If you could give it another try with that latest patch, it would be great. Let us know the results and we will try to get back to this as soon as possible. Thank you!
This PR could potentially be involved here: https://github.com/cilium/cilium/pull/27340. If you have a reliable repro, could you perhaps try a build with this patch and see if the issue still happens?
@pippolo84 sure I will try to repro this with debug build and share logs with you.
I tried to reproduce this in a local kind cluster by applying deployments with many replicas and scaling them down and up, but unfortunately no luck yet.
@tamilmani1989 would you please be able to run a debug version of Cilium in your cluster? I've already prepared this PR where I added some logs to better understand how identities are managed by both the agent and the operator. We would need to try this version in your large scale cluster and then collect:
Then, focusing on one of the leaked identities:
the kubectl get -o yaml output for that identity. That would be incredibly helpful. Thank you in advance for the help 🙏
Quick update: after looking at the code together with @aanm and @marseel (thanks to both 🙏 ), we think there might be an issue with the endpoint release codepath in case of high concurrency. I'm going to set up a local cluster to prove that hypothesis.
@tamilmani1989 thank you very much for the additional details. 🙏 I should have enough cycles to take a deeper look at this early next week. Skimming through the info you reported, I think it should be enough; no need to do anything else for now.
I don't quite see how we could get a delete followed by an upsert from Resource[T], but looking at crd_gc.go we might have a potential race between gc() and runHeartbeatUpdater(), which both manipulate the heartbeatStore: e.g. we might have gc() doing an action based on the latest data and then runHeartbeatUpdater() following with an action based on stale data, or the other way around.
I'd approach this by trying to create a test case in gc_test.go that churns through a lot of identities, see if this race can be triggered, and then go from there. I would not rule out a bug in Resource[T], but I would start from the assumption that crd_gc.go has a race that causes this. It's likely worth refactoring it to work on top of a single consumer of the event stream to make sure actions are always consistent (e.g. don't use Store at all).
@pippolo84 would you have time to look into this?
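As a rough sketch of that single-consumer direction (hypothetical, simplified types and function names such as identityEvent and run; not a patch against crd_gc.go), handling both the event stream and the periodic GC in one goroutine serializes the "mark alive" and "delete" decisions so they always act on the same, current state:

```go
// Rough sketch of the single-consumer direction suggested above (hypothetical,
// simplified types; not a patch against Cilium). Both the identity event
// stream and the periodic GC are handled by one goroutine, so heartbeat
// updates and deletions can never race on a shared store.
package main

import (
	"fmt"
	"time"
)

type identityEvent struct {
	kind string // "upsert" or "delete"
	id   string
}

func run(events <-chan identityEvent, gcInterval, timeout time.Duration, deleteIdentity func(id string)) {
	lifesigns := map[string]time.Time{} // owned exclusively by this goroutine
	ticker := time.NewTicker(gcInterval)
	defer ticker.Stop()

	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return
			}
			switch ev.kind {
			case "upsert":
				lifesigns[ev.id] = time.Now() // mark identity alive
			case "delete":
				delete(lifesigns, ev.id) // identity is gone, drop its lifesign
			}
		case now := <-ticker.C:
			// The GC pass runs in the same goroutine as event handling, so it
			// can never be undone by a stale event processed concurrently.
			for id, last := range lifesigns {
				if now.Sub(last) > timeout {
					deleteIdentity(id)
					delete(lifesigns, id)
				}
			}
		}
	}
}

func main() {
	events := make(chan identityEvent)
	go run(events, time.Second, 500*time.Millisecond, func(id string) {
		fmt.Println("garbage collecting unused identity", id)
	})

	// An identity shows up once and then sees no further activity; the next GC
	// pass after the timeout deletes it, with no competing heartbeat updater.
	events <- identityEvent{kind: "upsert", id: "40097"}
	time.Sleep(2 * time.Second)
	close(events)
}
```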