cilium: endpoint regeneration stuck on key allocation

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Endpoints get stuck in waiting-to-regenerate because they cannot allocate a key.

Restarting the Cilium agent fixes the issue.

Cilium Version

1.12.3 1c466d2 2022-10-12T11:33:37+01:00 go version go1.18.6 linux/amd64

Kernel Version

5.15.0-1017-aws

Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.13-eks-15b7512", GitCommit:"94138dfbea757d7aaf3b205419578ef186dd5efb", GitTreeState:"clean", BuildDate:"2022-08-31T19:15:48Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

No response

Relevant log output

level=warning msg="Key allocation attempt failed" attempt=0 error="unable to allocate ID 70517 for key [..labelSet...]: ciliumidentities.cilium.io \"70517\" already exists" key="[...labelSet...]" subsys=allocator
level=warning msg="Key allocation attempt failed" attempt=1 error="slave key creation failed '...labelSet...': identity (id:\"96965\",key:\"[...labelSet...]\") does not exist" key="[...labelSet...]" subsys=allocator
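
The two warnings above fail in different ways: the first reports that the CiliumIdentity ID already exists, the second that the identity it refers to does not exist. A minimal Python sketch for telling them apart when scanning agent logs might look like the following. The regular expressions are derived only from the two lines quoted here, and the category names (`id-already-exists`, `identity-missing`) are made up for this illustration, not Cilium terminology.

```python
import re

# Patterns derived from the two warnings quoted in this issue (an assumption:
# other Cilium versions may word these errors differently).
ALREADY_EXISTS = re.compile(r'unable to allocate ID (\d+) for key .*already exists')
MISSING_IDENTITY = re.compile(
    r'slave key creation failed .*identity \(id:\\?"(\d+)\\?".*does not exist'
)


def classify(line):
    """Return (category, identity_id) for a key-allocation warning, else None.

    The category strings are hypothetical labels used only in this sketch.
    """
    if 'msg="Key allocation attempt failed"' not in line:
        return None
    match = ALREADY_EXISTS.search(line)
    if match:
        return ("id-already-exists", int(match.group(1)))
    match = MISSING_IDENTITY.search(line)
    if match:
        return ("identity-missing", int(match.group(1)))
    return ("other", None)


# The two log lines from this report, kept verbatim (including the elisions).
SAMPLE_LINES = [
    r"""level=warning msg="Key allocation attempt failed" attempt=0 error="unable to allocate ID 70517 for key [..labelSet...]: ciliumidentities.cilium.io \"70517\" already exists" key="[...labelSet...]" subsys=allocator""",
    r"""level=warning msg="Key allocation attempt failed" attempt=1 error="slave key creation failed '...labelSet...': identity (id:\"96965\",key:\"[...labelSet...]\") does not exist" key="[...labelSet...]" subsys=allocator""",
]

if __name__ == "__main__":
    for sample in SAMPLE_LINES:
        print(classify(sample))
```

Counting the two categories separately over time could help show whether the stuck endpoints line up with one failure mode in particular.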


Anything else?

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 15 (11 by maintainers)

Most upvoted comments

I wonder whether the deadlocks that were fixed in recent Cilium v1.12.x releases (1.12.8 and 1.12.9) are what surfaced in this issue, just manifesting in a different way. Either way, it’s worth trying a later version to see whether the issue still exists. A gops stack dump would be useful if it occurs again.

Though checking now, we do get a steady stream of: level=warning msg="Key allocation attempt failed" attempt=0 error="slave key creation failed

It’s only the stuck endpoints that don’t happen very often.

Is it possible for you to capture the gops output of the Agent when this occurs again?

I can try. It may take a bit because it doesn’t happen often and eventually fixes itself, so I have to catch it.
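
Because the condition is rare and clears on its own, one way to catch it is to poll and trigger a capture only once the stuck state has persisted for several consecutive checks. The following is a minimal Python sketch of that debounce logic only. `count_stuck` and `capture` are hypothetical callables supplied by the caller: the first might count endpoints reported as waiting-to-regenerate, the second might take the gops stack dump requested above. Neither is implemented here, and the threshold and interval defaults are arbitrary.

```python
import time
from typing import Callable, Optional


def watch(
    count_stuck: Callable[[], int],
    capture: Callable[[], None],
    threshold: int = 3,
    interval_s: float = 10.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call capture() once count_stuck() > 0 for `threshold` polls in a row.

    Returns True if a capture was triggered, or False if max_polls was
    reached first. With max_polls=None the loop runs until a capture happens.
    """
    consecutive = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        if count_stuck() > 0:
            consecutive += 1
            if consecutive >= threshold:
                capture()
                return True
        else:
            # The condition cleared by itself; start counting again.
            consecutive = 0
        sleep(interval_s)
    return False
```

Requiring several consecutive positive polls avoids capturing on endpoints that are only briefly regenerating, at the cost of missing episodes shorter than `threshold * interval_s`.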