cilium: RE: Cilium v1.12.2 API issues [Errors encountered while deleting endpoint], [unable to create endpoint, putEndpointIdTooManyRequests]
Is there an existing issue for this?
- I have searched the existing issues
What happened?
From time to time we see API issues with Cilium v1.12.2 on EKS v1.23: the Cilium agent becomes unresponsive on a particular node, which then disrupts all networking on that node.
Seems to be related to a closed issue - https://github.com/cilium/cilium/issues/19440
Our only way to fix it at the moment is to cordon off and delete such a node.
NOTE: We were already experiencing this issue on v1.12.1 and upgraded to v1.12.2 to see whether that would fix it, but it did not help.
Cilium Version
v1.12.1 and v1.12.2
Kernel Version
Linux ip-10-32-117-219.eu-central-1.compute.internal 5.10.130 #1 SMP Tue Aug 30 01:05:31 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
v1.23.7-eks-7709a84
Server Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.10-eks-15b7512", GitCommit:"cd6399691d9b1fed9ec20c9c5e82f5993c3f42cb", GitTreeState:"clean", BuildDate:"2022-08-31T19:17:01Z", GoVersion:"go1.17.13", Compiler:"gc", Platform:"linux/amd64"}
Sysdump
No response
Relevant log output
2022-10-06 05:03:29.445,"level=warning msg=""Errors encountered while deleting endpoint"" error=""Cilium API client timeout exceeded"" eventUUID=5034197b-71e3-47c1-9d1b-66441852b46f subsys=cilium-cni"
2022-10-06 05:01:59.327,"level=warning msg=""Errors encountered while deleting endpoint"" error=""Cilium API client timeout exceeded"" eventUUID=892050ad-1e3f-4660-9304-1363d3d9e2e9 subsys=cilium-cni"
2022-10-06 04:59:54.405,"level=warning msg=""Errors encountered while deleting endpoint"" error=""Cilium API client timeout exceeded"" eventUUID=cb08d65f-233b-4ea5-a2ea-ea4ef8f7bc6c subsys=cilium-cni"
2022-10-05 19:37:26.702,"level=warning msg=""Errors encountered while deleting endpoint"" error=""[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound "" eventUUID=c1e23610-7e55-4142-92f2-bac6802a8333 subsys=cilium-cni"
2022-10-05 19:37:26.621,"level=warning msg=""Errors encountered while deleting endpoint"" error=""[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound "" eventUUID=377ddd9f-5ac3-4caa-ae13-b7bc52237f68 subsys=cilium-cni"
2022-10-05 19:30:26.494,"level=warning msg=""Errors encountered while deleting endpoint"" error=""[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound "" eventUUID=fface7af-13ed-44f1-9b78-12da9709e878 subsys=cilium-cni"
2022-10-05 19:30:26.414,"level=warning msg=""Errors encountered while deleting endpoint"" error=""[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound "" eventUUID=0395ae19-4fdd-4dd9-8240-52884251e208 subsys=cilium-cni"
2022-10-05 19:07:26.202,"level=warning msg=""Errors encountered while deleting endpoint"" error=""[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound "" eventUUID=220ceb10-b62b-4d47-bdba-627775b252a6 subsys=cilium-cni"
2022-10-05 19:07:26.138,"level=warning msg=""Errors encountered while deleting endpoint"" error=""[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound "" eventUUID=70b8453b-fd74-4f9f-a9ef-1d04cb2c312c subsys=cilium-cni"
2022-10-05 18:56:25.982,"level=warning msg=""Errors encountered while deleting endpoint"" error=""[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound "" eventUUID=046818cb-220e-4795-802a-13e03dcc7069 subsys=cilium-cni"
2022-10-05 18:56:25.916,"level=warning msg=""Errors encountered while deleting endpoint"" error=""[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound "" eventUUID=641a8561-f066-4e99-9bee-9002dcdb6f28 subsys=cilium-cni"
2022-10-05 18:54:25.833,"level=warning msg=""Errors encountered while deleting endpoint"" error=""[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound "" eventUUID=f33b93fe-3373-4f70-89a2-655f66979ab0 subsys=cilium-cni"
Warning FailedCreatePodSandBox 83s (x147 over 3h40m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "24467e4fb58266c69f42cb4a9bc0cb8232a25c463b319dbd2a70202ac1b6dc6f": plugin type="cilium-cni" name="cilium" failed (add): unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests
Anything else?
Our EKS nodes run on Bottlerocket AMIs.
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: closed
- Created 2 years ago
- Reactions: 15
- Comments: 22 (5 by maintainers)
Follow-up: this could probably save some time for people who need it in YAML format:
Do you mean 1.13.x?

Same issue here with Cilium v1.12.2. Increasing the rate-limit numbers stopped the putEndpointIdTooManyRequests events, but now the issue has turned into something else:

When checking the Cilium logs on the nodes:
@ajaykumarmandapati @ikarlashov It’s worth mentioning that increasing limits on endpoint-delete probably works, however for endpoint-create, I’d very much doubt that the underlying kernel can handle this many endpoint creations at once.
This is because loading BPF programs into the kernel is serialized, i.e. it cannot be done in parallel, so if there is a burst of pods, loading their BPF programs into the kernel will immediately serialize their creation from the Cilium point of view. To be clear, Cilium can handle parallel pod creations, but once a creation reaches the BPF program loading stage, it is serialized.
This rate-limiting was put in place to actually reduce the pressure of Cilium on the kernel and return that pressure back to the API calls to create pods. Without this, the system can get overwhelmed and cause undesirable behavior as it’s overloaded. Just an FYI.
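To illustrate the rate-limit and rate-burst terms used in this thread, here is a minimal token-bucket sketch in Python. It is a generic illustration, not Cilium's implementation: the class name `TokenBucket` and its `allow` method are hypothetical, and it rejects a request immediately when no token is available, without modeling the parallel-requests or auto-adjust options.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Generic token bucket (hypothetical helper, not Cilium code)."""
    rate: float   # tokens added per second (compare "rate-limit")
    burst: int    # maximum tokens that can accumulate (compare "rate-burst")
    tokens: float = field(init=False)
    last: float = field(init=False)

    def __post_init__(self) -> None:
        self.tokens = float(self.burst)  # start with a full bucket
        self.last = 0.0                  # timestamp of the previous call

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then try to take one token."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller would answer with something like HTTP 429


bucket = TokenBucket(rate=2.0, burst=3)
# Five requests arriving at the same instant: the burst of 3 is admitted,
# the remaining two are rejected until the bucket refills.
print([bucket.allow(0.0) for _ in range(5)])
```

Under this simplified model, a burst of pod creations larger than the bucket is pushed back to the caller, which retries later, instead of piling up behind the serialized BPF loading step.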
We got around this by increasing the API rate limits for Cilium via extraConfig, e.g.:
extraConfig = { "api-rate-limit" : "{\"endpoint-create\": \"rate-limit:200/s,rate-burst:200,parallel-requests:200,auto-adjust:true\", \"endpoint-delete\": \"rate-limit:200/s,rate-burst:200,parallel-requests:200,auto-adjust:true\", \"endpoint-get\": \"rate-limit:200/s,rate-burst:200,parallel-requests:200,auto-adjust:true\", \"endpoint-list\": \"rate-limit:200/s,rate-burst:200,parallel-requests:200,auto-adjust:true\", \"endpoint-patch\": \"rate-limit:200/s,rate-burst:200,parallel-requests:200,auto-adjust:true\"}" }
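Since the escaped JSON string above is easy to mistype, a small script can generate it. This is a sketch only: the helper names `limit_spec` and `build_api_rate_limit` are hypothetical, and the numbers are simply the ones quoted above, not a recommendation (see the caveat about endpoint-create earlier in the thread).

```python
import json

# API call names taken from the extraConfig example above.
ENDPOINT_CALLS = [
    "endpoint-create",
    "endpoint-delete",
    "endpoint-get",
    "endpoint-list",
    "endpoint-patch",
]


def limit_spec(rate_per_sec: int, burst: int, parallel: int,
               auto_adjust: bool = True) -> str:
    """Format one per-call limit in the comma-separated form shown above."""
    return (
        f"rate-limit:{rate_per_sec}/s,"
        f"rate-burst:{burst},"
        f"parallel-requests:{parallel},"
        f"auto-adjust:{str(auto_adjust).lower()}"
    )


def build_api_rate_limit(limits: dict) -> str:
    """Serialize {call name: limit keyword arguments} to one JSON string."""
    return json.dumps({name: limit_spec(**kw) for name, kw in limits.items()})


# Same value for every call, as in the example above.
limits = {
    name: {"rate_per_sec": 200, "burst": 200, "parallel": 200}
    for name in ENDPOINT_CALLS
}
value = build_api_rate_limit(limits)

print(value)              # plain JSON, readable
print(json.dumps(value))  # quoted and escaped, ready to paste as a string literal
```

Keeping the limits in a dictionary also makes it straightforward to give endpoint-create a different setting from endpoint-delete, rather than raising every call to the same value.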