cilium: Deadlock in Cilium Agent

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

After running for a while (typically several days for us), a Cilium agent becomes unresponsive and starts failing its liveness checks, which causes Kubernetes to restart it. Since we rely heavily on FQDN rules, this ends up causing network disruption on the affected host.

Cilium Version

1.12.5 820a3086ad 2022-12-22T16:16:56+00:00 go version go1.18.9 linux/amd64

This is essentially the released 1.12.5 + this PR backported: https://github.com/cilium/cilium/pull/22252

Kernel Version

5.4.228-131.415.amzn2.x86_64

Kubernetes Version

v1.23.14-eks-ffeb93d

Sysdump

No response

Relevant log output

No response

Anything else?

The entire thread dump can be found here: cilium-wh7xn.log

This was captured by setting a preStop lifecycle hook to run kill -s ABRT 1.
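For reference, a minimal sketch of how we configured this on the agent container (the exact container spec is illustrative, not copied from our manifests). Sending SIGABRT to a Go process makes the runtime print the stacks of all goroutines before exiting, which is what produces the thread dump above:

```yaml
# Hypothetical sketch: a preStop hook on the cilium-agent container so
# that, before the kubelet stops the container, SIGABRT is sent to PID 1.
# The Go runtime reacts to an unhandled SIGABRT by dumping all goroutine
# stacks, which then appear in the container logs.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "kill -s ABRT 1"]
```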

This issue started occurring with v1.12.2 for us.

cc @joaoubaldo @carloscastrojumo

From our dashboards, this is what we can observe: [dashboard screenshot] The agent going unhealthy and restarting happens shortly after a large number of endpoints become not-ready.

We have other metrics and logs if needed so let us know if you need more information for troubleshooting.

Would you recommend running an image compiled with the deadlock detection flag in production? We unfortunately cannot replicate this issue in a testing cluster.

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created a year ago
  • Comments: 15 (11 by maintainers)

Most upvoted comments

Hello @gandro

Just confirming that in the last ~74 days, we have not had a single occurrence of the crash using our custom image based on v1.12.9.

#26242 also looks similar. We are in the process of upgrading to v1.13 so I will look out for it!

Thanks for your work on this!

@lukaselmer I would therefore recommend you try out 1.12.9 if you’re experiencing similar issues. Like I mentioned, we haven’t had an issue in > 74 days, which is pretty good going.

Note that we fixed another related issue, https://github.com/cilium/cilium/pull/26242, which, however, is still in the process of being backported to v1.13.

We’ve merged a deadlock fix around IPCache and FQDN (#24672), which sounds like it may have been the issue here. Please test again with v1.12.9, which should be released next week.

Hey @gandro

Thank you so much for looking into this issue. We indeed saw the fatal error: concurrent map read and map write but were debating whether it might actually have occurred because of the kill -s ABRT.

I’ve checked https://github.com/cilium/cilium/pull/23377 and we have indeed seen some occurrences of this log message followed by agent restarts (3 in Dev, 1 in Prod, in the last ~15 days). So we will definitely get this into our image.

We haven’t gotten any other thread dumps yet, but we will add them here when we get them.

Thanks again!

I’ve opened a PR to fix the fatal error: concurrent map read and map write issue here: https://github.com/cilium/cilium/pull/23713

It’s still possible, however, that you also observed a deadlock. Thus, if you have additional thread dumps from other runs, that would be helpful to confirm that there are indeed no other unknown problems.

Would you recommend running an image compiled with the deadlock detection flag in production? We unfortunately cannot replicate this issue in a testing cluster.

I checked back with my colleagues and, unfortunately, we do not recommend running lockdebug in production. Its overhead for each locking operation is too high. Fetching a thread dump as you did is probably the better approach.