cilium: Cilium API Connectivity issue - unrecoverable, unable to create/delete ciliumendpoint
Is there an existing issue for this?
- I have searched the existing issues
What happened?
The cilium-agent enters an unrecoverable state in which it can no longer create or delete endpoints. It does not recover on its own; the cilium-agent pod had to be manually restarted.
Cilium Version
Cilium 1.11.2
Kernel Version
Linux ip-10-0-43-16.ec2.internal 5.4.181-99.354.amzn2.x86_64 #1 SMP Wed Mar 2 18:50:46 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes Version
Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.16-eks-25803e", GitCommit:"25803e8d008d5fa99b8a35a77d99b705722c0c8c", GitTreeState:"clean", BuildDate:"2022-02-16T23:37:16Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
Sysdump
cilium-sysdump-20220420-155943.zip
Relevant log output
Apr 13 23:33:13 ip-10-0-43-16 kubelet: E0413 23:33:13.310406 3249 cni.go:366] Error adding app-system_app-client--api-7b74cd4bc-x7jx7/e1243d32dfda4fe6b66a75a24605b3f50e066b93232f13a1e044cc9cbd430629 to network cilium-cni/cilium: Unable to create endpoint: Cilium API client timeout exceeded
Apr 13 23:33:13 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Apr 13 23:33:13 ip-10-0-43-16 kubelet: E0413 23:33:13.874516 3249 cni.go:366] Error adding app-system_app-controller-manager-deployment-5b6cbb66fc-792rn/6eea2294a5d3491df8983622eb0fc6713c0afe54b0016829a2dee6506044f731 to network cilium-cni/cilium: Unable to create endpoint: Cilium API client timeout exceeded
Apr 13 23:33:13 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Apr 13 23:33:14 ip-10-0-43-16 kubelet: E0413 23:33:14.193342 3249 cni.go:366] Error adding app-system_app-docker-registry-6f4cd68477-7z2xj/344e44cf3ffbe4d67a6c2875274b31a55300785b9603aa145b21a6173b146e08 to network cilium-cni/cilium: Unable to create endpoint: Cilium API client timeout exceeded
Apr 13 23:33:14 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Apr 13 23:33:15 ip-10-0-43-16 kubelet: E0413 23:33:15.177225 3249 cni.go:366] Error adding app-system_app-console-869b9f7ff5-pdmb8/a96ec3d2c9c1be88164749cf57efbc52629e0578fbd1d3b8cd28a3954fbc53ef to network cilium-cni/cilium: Unable to create endpoint: Cilium API client timeout exceeded
Apr 13 23:33:15 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Apr 13 23:33:44 ip-10-0-43-16 kubelet: E0413 23:33:44.269198 3249 cni.go:366] Error adding app-system_app-console-869b9f7ff5-pdmb8/81a7f831f0e5e4a4a8b85e8ec45d409ed3273d4939df887d09a2f387091d503b to network cilium-cni/cilium: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests
Apr 13 23:33:44 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Apr 13 23:33:48 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="Cilium API client timeout exceeded" subsys=cilium-cni
Apr 13 23:33:49 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Apr 13 23:33:49 ip-10-0-43-16 kubelet: level=warning msg="Unable to enter namespace \"\", will not delete interface" error="failed to Statfs \"\": no such file or directory" subsys=cilium-cni
Apr 13 23:34:07 ip-10-0-43-16 kubelet: E0413 23:34:07.041523 3249 cni.go:366] Error adding app-system_app-console-869b9f7ff5-pdmb8/9992a11a9379e7736a326a2a9730dfc07873eded1555186b943fa8ea986165fd to network cilium-cni/cilium: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests
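For anyone trying to confirm they are hitting the same pattern, here is a minimal sketch that tallies the three failure kinds visible in the excerpt above (API client timeout, 429 on endpoint create, 404 on endpoint delete). The match strings are copied from the log lines above; the category names and the function names `classify` and `summarize` are made up for this sketch and are not part of Cilium.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Match strings copied from the log excerpt in this issue.
# The category names (dictionary keys) are hypothetical labels.
PATTERNS = {
    "api_timeout": re.compile(r"Cilium API client timeout exceeded"),
    "rate_limited": re.compile(r"\[429\] putEndpointIdTooManyRequests"),
    "delete_not_found": re.compile(r"\[404\] deleteEndpointIdNotFound"),
}


def classify(line: str) -> Optional[str]:
    """Return the first category whose pattern matches the line, else None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines: Iterable[str]) -> Counter:
    """Count how many log lines fall into each category."""
    counts: Counter = Counter()
    for line in lines:
        category = classify(line)
        if category is not None:
            counts[category] += 1
    return counts
```

Feeding it the kubelet log from the node (for example, the lines of a saved journal file) gives a quick view of whether timeouts and 429 responses dominate during a rollout.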
Anything else?
Note that these logs came from the kubelet output on the host, so they mix messages from the kubelet process itself with Cilium log output (the lines tagged subsys=cilium-cni).
The logs above were captured during a deployment on the cluster, when many pods were being deleted and recreated at the same time. The cilium pod on this node got into an unrecoverable state, and no pods were able to start successfully on the node. We recovered by deleting the cilium-agent pod on that node; the replacement pod was then able to process the requests to create endpoints for the new containers and delete endpoints for the old ones.
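The recovery step described here (deleting the agent pod on the affected node so the DaemonSet recreates it) can be sketched roughly as follows. This is only an illustration under assumptions: it assumes the agent runs in the `kube-system` namespace with the label `k8s-app=cilium`, which matches a default install but may differ in yours, and the helper names `build_delete_command` and `restart_agent` are hypothetical.

```python
import subprocess
from typing import List


def build_delete_command(
    node_name: str,
    namespace: str = "kube-system",   # assumption: default install namespace
    selector: str = "k8s-app=cilium",  # assumption: default agent pod label
) -> List[str]:
    """Build the kubectl argv that deletes the agent pod on one node."""
    return [
        "kubectl",
        "--namespace", namespace,
        "delete", "pod",
        "--selector", selector,
        "--field-selector", f"spec.nodeName={node_name}",
    ]


def restart_agent(node_name: str) -> None:
    """Run the delete; the DaemonSet controller then starts a new pod."""
    subprocess.run(build_delete_command(node_name), check=True)
```

For the node in this report, that would be `restart_agent("ip-10-0-43-16.ec2.internal")`, assuming the Kubernetes node name matches the hostname shown in the kernel version line.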
The behavior looks similar to #6947, but that issue appears to have been addressed many releases ago.
I will try to get a sysdump the next time this issue occurs. We deleted the cilium agent pod to restore service, so we cannot provide a sysdump from this particular occurrence. This has happened a few times, so I’ll try to capture one the next time we are impacted.
Code of Conduct
- I agree to follow this project’s Code of Conduct
About this issue
- State: closed
- Created 2 years ago
- Reactions: 8
- Comments: 15 (11 by maintainers)
Hello, I would like to report that today I ran into similar issues with the combination of Cilium 1.12.1 and EKS 1.23.7. I would kindly ask that this issue be reopened, @jmcshane.
Hello @ajaykumarmandapati!
I made a dirty fix: I increased the limits to 200.
@joaoubaldo well, I guess two weeks. I’m going to close this issue now as it has been two weeks and we have not yet encountered this problem in a cluster with 1.11.4 installed.
@joaoubaldo we’re rolling out 1.11.4 across our environment as we speak. Because the issue is intermittent and we have not identified the cause, it will take some time before we can tell whether it is resolved. So far, though, we have not experienced it on 1.11.4.
I will report back on this ticket in a week.