cilium: Deadlock in Cilium Agent
Is there an existing issue for this?
- I have searched the existing issues
What happened?
After running for a while (typically several days for us), a Cilium agent becomes unresponsive and starts failing its liveness checks, which causes Kubernetes to restart it. Since we rely heavily on FQDN rules, this ends up causing network disruption on the affected host.
Cilium Version
1.12.5 820a3086ad 2022-12-22T16:16:56+00:00 go version go1.18.9 linux/amd64
This is essentially the released 1.12.5 + this PR backported: https://github.com/cilium/cilium/pull/22252
Kernel Version
5.4.228-131.415.amzn2.x86_64
Kubernetes Version
v1.23.14-eks-ffeb93d
Sysdump
No response
Relevant log output
No response
Anything else?
The entire thread dump can be found here: cilium-wh7xn.log
This was captured by setting a preStop lifecycle hook to run `kill -s ABRT 1`.
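For anyone wanting to capture a similar dump, here is a minimal, self-contained Go sketch (not the mechanism Cilium itself uses, just an illustration) that writes all goroutine stacks to stderr when the process receives SIGABRT or SIGQUIT, which is roughly the effect the preStop hook above relies on:

```go
// Sketch only, not Cilium code: dump every goroutine's stack to stderr
// when the process receives SIGABRT or SIGQUIT, then exit.
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGABRT, syscall.SIGQUIT)

	go func() {
		<-sigs
		// debug=2 prints full stack traces for all goroutines, in the same
		// format the runtime uses for an unrecovered panic.
		pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		os.Exit(1)
	}()

	select {} // stand-in for the agent's main loop
}
```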
This issue started occurring with v1.12.2 for us.
cc @joaoubaldo @carloscastrojumo
From our dashboards, this is what we can observe:
The agent going unhealthy and being restarted happens shortly after a large number of endpoints become not-ready.
We have other metrics and logs if needed so let us know if you need more information for troubleshooting.
Would you recommend running an image compiled with the deadlock detection flag in production? We unfortunately cannot replicate this issue in a testing cluster.
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- Original URL
- State: closed
- Created a year ago
- Comments: 15 (11 by maintainers)
Commits related to this issue
- ipam/crd: Fix panic due to concurrent map read and map write This fixes a panic in the `totalPoolSize` function. Previously, `totalPoolSize` required that the `crdAllocator` mutex was held. This howe... — committed to gandro/cilium by gandro a year ago
- ipam/crd: Fix panic due to concurrent map read and map write This fixes a panic in the `totalPoolSize` function. Previously, `totalPoolSize` required that the `crdAllocator` mutex was held. This howe... — committed to cilium/cilium by gandro a year ago
- ipam/crd: Fix panic due to concurrent map read and map write [ upstream commit e3a78b0d8e692ba375b35e74d2ca699f8a9e79bb ] This fixes a panic in the `totalPoolSize` function. Previously, `totalPoolSi... — committed to pchaigno/cilium by gandro a year ago
- ipam/crd: Fix panic due to concurrent map read and map write [ upstream commit e3a78b0d8e692ba375b35e74d2ca699f8a9e79bb ] This fixes a panic in the `totalPoolSize` function. Previously, `totalPoolSi... — committed to cilium/cilium by gandro a year ago
- ipam/crd: Fix panic due to concurrent map read and map write [ upstream commit e3a78b0d8e692ba375b35e74d2ca699f8a9e79bb ] This fixes a panic in the `totalPoolSize` function. Previously, `totalPoolSi... — committed to pchaigno/cilium by gandro a year ago
- ipam/crd: Fix panic due to concurrent map read and map write [ upstream commit e3a78b0d8e692ba375b35e74d2ca699f8a9e79bb ] This fixes a panic in the `totalPoolSize` function. Previously, `totalPoolSi... — committed to cilium/cilium by gandro a year ago
- ipam/crd: Fix panic due to concurrent map read and map write [ upstream commit e3a78b0d8e692ba375b35e74d2ca699f8a9e79bb ] This fixes a panic in the `totalPoolSize` function. Previously, `totalPoolSi... — committed to sayboras/cilium by gandro a year ago
- ipam/crd: Fix panic due to concurrent map read and map write [ upstream commit e3a78b0d8e692ba375b35e74d2ca699f8a9e79bb ] This fixes a panic in the `totalPoolSize` function. Previously, `totalPoolSi... — committed to cilium/cilium by gandro a year ago
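For context on what those commits address: Go reports unsynchronized map access as an unrecoverable `fatal error: concurrent map read and map write`, so a single call path that reads a map while another goroutine writes to it takes down the whole agent. One common way to fix this class of bug is to take the lock inside the accessor itself rather than relying on every caller to hold it. A minimal sketch of that pattern (illustrative only, not Cilium's actual code, with names loosely echoing the commit message):

```go
// Illustrative only: a simplified allocator whose pool map is shared
// between goroutines.
package main

import (
	"fmt"
	"sync"
)

type allocator struct {
	mutex sync.RWMutex
	pool  map[string]struct{}
}

// totalPoolSize reads the shared map. Taking the read lock here means no
// caller can accidentally race with a concurrent writer.
func (a *allocator) totalPoolSize() int {
	a.mutex.RLock()
	defer a.mutex.RUnlock()
	return len(a.pool)
}

func (a *allocator) allocate(ip string) {
	a.mutex.Lock()
	defer a.mutex.Unlock()
	a.pool[ip] = struct{}{}
}

func main() {
	a := &allocator{pool: make(map[string]struct{})}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			a.allocate(fmt.Sprintf("10.0.0.%d", i))
			_ = a.totalPoolSize() // safe: reader and writers share the mutex
		}(i)
	}
	wg.Wait()
	fmt.Println("pool size:", a.totalPoolSize())
}
```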
Hello @gandro
Just confirming that in the last ~74 days, we have not had a single occurrence of the crash using our custom image based on v1.12.9.
#26242 also looks similar. We are in the process of upgrading to v1.13 so I will look out for it!
Thanks for your work on this!
@lukaselmer I would therefore recommend you try out 1.12.9 if you’re experiencing similar issues. Like I mentioned, we haven’t had an issue in > 74 days, which is pretty good going.
Note that we fixed another related issue, https://github.com/cilium/cilium/pull/26242, which however is still in the process of being backported to v1.13.
We’ve merged a deadlock fix around IPCache and FQDN (#24672), which sounds like it was potentially the issue here. Please test again with v1.12.9, which should be released next week.
Hey @gandro
Thank you so much for looking into this issue. We indeed saw the `fatal error: concurrent map read and map write` but were debating whether it might actually have occurred because of the `kill -s ABRT`. I’ve checked https://github.com/cilium/cilium/pull/23377 and we have indeed seen some occurrences of this log message followed by agent restarts (3 in Dev, 1 in Prod, in the last ~15 days). So we will definitely get this into our image.
We haven’t gotten any other thread dumps yet, but we will add them here when we get them.
Thanks again!
I’ve opened a PR to fix the `fatal error: concurrent map read and map write` issue here: https://github.com/cilium/cilium/pull/23713. It’s still possible, however, that you also observed a deadlock. So if you have additional thread dumps from other runs, those would be helpful to confirm that there are indeed no other unknown problems.
I checked back with my colleagues and unfortunately, we do not recommend running `lockdebug` in production. Its overhead for each locking operation is too high. Fetching a thread dump as you did is probably the better approach.
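To illustrate where that overhead comes from: a `lockdebug`-style build swaps plain mutexes for instrumented ones that record holders and acquisition order on every `Lock`/`Unlock`. A minimal sketch using the github.com/sasha-s/go-deadlock library (an assumption about the kind of instrumentation involved, not necessarily Cilium's exact implementation):

```go
// Sketch only: a deadlock-detecting drop-in for sync.Mutex. Every lock
// operation records caller information, which is where the extra cost
// per Lock/Unlock comes from.
package main

import (
	"time"

	"github.com/sasha-s/go-deadlock"
)

func main() {
	// Report a potential deadlock if a lock cannot be acquired within 10s.
	deadlock.Opts.DeadlockTimeout = 10 * time.Second

	var mu deadlock.Mutex // drop-in replacement for sync.Mutex
	mu.Lock()
	// This second Lock on the same goroutine can never succeed. Instead of
	// hanging silently, the detector prints the stacks of the holder and the
	// waiter once the timeout expires, then aborts the process.
	mu.Lock()
}
```

That kind of reporting is valuable when chasing a deadlock in a test environment, but the bookkeeping on every lock acquisition is what makes it unsuitable for a busy production agent.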