cilium: Deadlock in Cilium Agent

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

After running for a while (typically several days for us), a Cilium agent becomes unresponsive and starts failing its liveness checks, which causes Kubernetes to restart it. Since we rely heavily on FQDN rules, this ends up causing network disruption on the affected host.

Cilium Version

1.12.5 820a3086ad 2022-12-22T16:16:56+00:00 go version go1.18.9 linux/amd64

This is essentially the released 1.12.5 + this PR backported: https://github.com/cilium/cilium/pull/22252

Kernel Version

5.4.228-131.415.amzn2.x86_64

Kubernetes Version

v1.23.14-eks-ffeb93d

Sysdump

No response

Relevant log output

No response

Anything else?

The entire thread dump can be found here: cilium-wh7xn.log

This was captured by setting a preStop lifecycle hook to run kill -s ABRT 1.
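For reference, a minimal sketch of how we configured this on the agent container (the exact container spec is illustrative, not copied from our manifests). Sending SIGABRT to a Go process makes the runtime print the stacks of all goroutines before exiting, which is what produces the thread dump above:

```yaml
# Hypothetical sketch: a preStop hook on the cilium-agent container so
# that, before the kubelet stops the container, SIGABRT is sent to PID 1.
# The Go runtime reacts to an unhandled SIGABRT by dumping all goroutine
# stacks, which then appear in the container logs.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "kill -s ABRT 1"]
```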

This issue started occurring with v1.12.2 for us.

cc @joaoubaldo @carloscastrojumo

From our dashboards, this is what we can observe: [dashboard screenshot] The agent going unhealthy and restarting happens shortly after a large number of endpoints become not-ready.

We have other metrics and logs if needed so let us know if you need more information for troubleshooting.

Would you recommend running an image compiled with the deadlock detection flag in production? We unfortunately cannot replicate this issue in a testing cluster.

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (11 by maintainers)

Most upvoted comments

Hello @gandro

Just confirming that in the last ~74 days, we have not had a single occurrence of the crash using our custom image based on v1.12.9.

#26242 also looks similar. We are in the process of upgrading to v1.13 so I will look out for it!

Thanks for your work on this!

@lukaselmer I would therefore recommend you try out 1.12.9 if you’re experiencing similar issues. Like I mentioned, we haven’t had an issue in > 74 days, which is pretty good going.

Note that we fixed another related issue, https://github.com/cilium/cilium/pull/26242, which, however, is still in the process of being backported to v1.13.

We’ve merged a deadlock fix around IPCache and FQDN (#24672), which sounds like it may have been the issue here. Please test again with v1.12.9, which should be released next week.

Hey @gandro

Thank you so much for looking into this issue. We indeed saw the fatal error: concurrent map read and map write but were debating whether it might actually have occurred because of the kill -s ABRT.

I’ve checked https://github.com/cilium/cilium/pull/23377 and we have indeed seen some occurrences of this log message followed by agent restarts (3 in Dev, 1 in Prod, in the last ~15 days). So we will definitely get this into our image.

We haven’t gotten any other thread dumps yet, but we will add them here when we get them.

Thanks again!

I’ve opened a PR to fix the fatal error: concurrent map read and map write issue here: https://github.com/cilium/cilium/pull/23713

It’s still possible, however, that you also observed a deadlock. Thus, if you have additional thread dumps from other runs, that would be helpful to confirm that there are indeed no other unknown problems.

Would you recommend running an image compiled with the deadlock detection flag in production? We unfortunately cannot replicate this issue in a testing cluster.

I checked back with my colleagues and, unfortunately, we do not recommend running lockdebug in production. Its overhead for each locking operation is too high. Fetching a thread dump as you did is probably the better approach.