cilium: Cilium API Connectivity issue - unrecoverable, unable to create/delete ciliumendpoint

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

The cilium-agent becomes unrecoverable: it is permanently unable to create or delete endpoints, and the cilium-agent pod has to be restarted manually.

Cilium Version

Cilium 1.11.2

Kernel Version

Linux ip-10-0-43-16.ec2.internal 5.4.181-99.354.amzn2.x86_64 #1 SMP Wed Mar 2 18:50:46 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.16-eks-25803e", GitCommit:"25803e8d008d5fa99b8a35a77d99b705722c0c8c", GitTreeState:"clean", BuildDate:"2022-02-16T23:37:16Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

cilium-sysdump-20220420-155943.zip

Relevant log output

Apr 13 23:33:13 ip-10-0-43-16 kubelet: E0413 23:33:13.310406    3249 cni.go:366] Error adding app-system_app-client--api-7b74cd4bc-x7jx7/e1243d32dfda4fe6b66a75a24605b3f50e066b93232f13a1e044cc9cbd430629 to network cilium-cni/cilium: Unable to create endpoint: Cilium API client timeout exceeded
Apr 13 23:33:13 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Apr 13 23:33:13 ip-10-0-43-16 kubelet: E0413 23:33:13.874516    3249 cni.go:366] Error adding app-system_app-controller-manager-deployment-5b6cbb66fc-792rn/6eea2294a5d3491df8983622eb0fc6713c0afe54b0016829a2dee6506044f731 to network cilium-cni/cilium: Unable to create endpoint: Cilium API client timeout exceeded
Apr 13 23:33:13 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Apr 13 23:33:14 ip-10-0-43-16 kubelet: E0413 23:33:14.193342    3249 cni.go:366] Error adding app-system_app-docker-registry-6f4cd68477-7z2xj/344e44cf3ffbe4d67a6c2875274b31a55300785b9603aa145b21a6173b146e08 to network cilium-cni/cilium: Unable to create endpoint: Cilium API client timeout exceeded
Apr 13 23:33:14 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Apr 13 23:33:15 ip-10-0-43-16 kubelet: E0413 23:33:15.177225    3249 cni.go:366] Error adding app-system_app-console-869b9f7ff5-pdmb8/a96ec3d2c9c1be88164749cf57efbc52629e0578fbd1d3b8cd28a3954fbc53ef to network cilium-cni/cilium: Unable to create endpoint: Cilium API client timeout exceeded
Apr 13 23:33:15 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni

Apr 13 23:33:44 ip-10-0-43-16 kubelet: E0413 23:33:44.269198    3249 cni.go:366] Error adding app-system_app-console-869b9f7ff5-pdmb8/81a7f831f0e5e4a4a8b85e8ec45d409ed3273d4939df887d09a2f387091d503b to network cilium-cni/cilium: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests
Apr 13 23:33:44 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Apr 13 23:33:48 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="Cilium API client timeout exceeded" subsys=cilium-cni
Apr 13 23:33:49 ip-10-0-43-16 kubelet: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
Apr 13 23:33:49 ip-10-0-43-16 kubelet: level=warning msg="Unable to enter namespace \"\", will not delete interface" error="failed to Statfs \"\": no such file or directory" subsys=cilium-cni
Apr 13 23:34:07 ip-10-0-43-16 kubelet: E0413 23:34:07.041523    3249 cni.go:366] Error adding app-system_app-console-869b9f7ff5-pdmb8/9992a11a9379e7736a326a2a9730dfc07873eded1555186b943fa8ea986165fd to network cilium-cni/cilium: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests

Anything else?

Note: these logs were taken from the kubelet output on the host, so they mix messages from the kubelet process itself with Cilium log output (the lines tagged subsys=cilium-cni).
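To see how the failure modes in an excerpt like the one above are distributed, the lines can be tallied by error string. The following is a minimal Python sketch, not part of any Cilium or kubelet tooling. The category names and the functions `classify` and `tally` are hypothetical; the patterns are simply the error strings visible in the logs.

```python
import re
from collections import Counter

# Patterns are copied from error strings visible in the kubelet excerpt;
# the category names on the left are labels invented for this sketch.
PATTERNS = {
    "create_timeout": re.compile(
        r"Unable to create endpoint: Cilium API client timeout exceeded"
    ),
    "create_rate_limited": re.compile(r"putEndpointIdTooManyRequests"),
    "delete_not_found": re.compile(r"deleteEndpointIdNotFound"),
    "delete_timeout": re.compile(
        r'deleting endpoint" error="Cilium API client timeout exceeded'
    ),
}


def classify(line):
    """Return the first matching category for a log line, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def tally(lines):
    """Count how many log lines fall into each category."""
    counts = Counter()
    for line in lines:
        category = classify(line)
        if category is not None:
            counts[category] += 1
    return counts
```

`tally` accepts any iterable of lines, such as an open log file. The counts are per log line, not per failed pod: kubelet can report one failed sandbox creation on several consecutive lines.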

The logs above were captured during a deployment on the cluster, when many pods were being deleted and recreated at the same time. The Cilium pod on this node got into an unrecoverable state, and no pods could start successfully on the node. We recovered by deleting the cilium-agent pod on that node; the replacement pod was able to process the requests to create endpoints for the new containers and delete endpoints for the old ones.
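The [429] putEndpointIdTooManyRequests responses suggest that requests were being rejected by rate limiting on the agent API while many pods were created at once. As a rough illustration of that interaction, and not a reproduction of Cilium's actual limiter, here is a minimal token-bucket sketch in Python. The class `TokenBucket`, the function `simulate`, and all numeric values are assumptions made for this example. In particular, the sketch rejects immediately, whereas a real limiter may queue requests for a while before rejecting them.

```python
class TokenBucket:
    """Minimal token bucket: refills `rate` tokens per second, up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = now

    def allow(self, now):
        """Consume one token at time `now`; return False if none is available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(rate, burst, arrivals):
    """Return (accepted, rejected) for requests arriving at the given times."""
    bucket = TokenBucket(rate, burst)
    accepted = sum(1 for t in arrivals if bucket.allow(t))
    return accepted, len(arrivals) - accepted


# 50 endpoint creations arriving at the same instant (illustrative numbers).
arrivals = [0.0] * 50
low = simulate(rate=0.5, burst=4, arrivals=arrivals)
high = simulate(rate=200, burst=200, arrivals=arrivals)
print(low, high)
```

With these illustrative numbers the low limit accepts 4 requests and rejects 46, while the raised limit accepts all 50. This shows why a rollout that recreates many pods at once can trip a limit that is adequate in steady state.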

The behavior here looks similar to #6947, but that issue appears to have been addressed many releases ago.

I will try to get a sysdump the next time this issue occurs. We deleted the cilium-agent pod to restore service, so we cannot provide a sysdump from this particular occurrence. This has happened a few times, so I will try to capture one the next time we see this impact.

Code of Conduct

  • I agree to follow this project’s Code of Conduct

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 8
  • Comments: 15 (11 by maintainers)

Most upvoted comments

Hello, I would like to report that I faced similar issues today with the combination of Cilium 1.12.1 and EKS 1.23.7. I would kindly ask that this issue be reopened, @jmcshane.

kubelet[3766]: E0906 14:52:48.770755    3766 cni.go:362] "Error adding pod to network" err="Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests " 
kubelet[3766]: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
kubelet[3766]: E0906 14:52:48.883699    3766 remote_runtime.go:209] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to set up sandbox container \"<REDACTED>\" network for pod \"<REDACTED>\": networkPlugin cni failed to set up pod \"<REDACTED>\" network: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests "
kubelet[3766]: E0906 14:52:48.883770    3766 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to set up sandbox container \"<REDACTED>\" network for pod \"<REDACTED>\": networkPlugin cni failed to set up pod \"<REDACTED>\" network: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests " 
kubelet[3766]: E0906 14:52:48.883805    3766 kuberuntime_manager.go:833] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to set up sandbox container \"<REDACTED>\" network for pod \"<REDACTED>\": networkPlugin cni failed to set up pod \"<REDACTED>\" network: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests " 
kubelet[3766]: E0906 14:52:48.883882    3766 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"<REDACTED>\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"<REDACTED>\\\": rpc error: code = Unknown desc = failed to set up sandbox container \\\"<REDACTED>\\\" network for pod \\\"<REDACTED>\\\": networkPlugin cni failed to set up pod \\\"<REDACTED>\\\" network: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests \"" 
kubelet[3766]: I0906 14:52:49.379691    3766 docker_sandbox.go:402] "Failed to read pod IP from plugin/docker" err="networkPlugin cni failed on the status hook for pod \"<REDACTED>\": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container \"<REDACTED>\""
kubelet[3766]: I0906 14:52:49.383077    3766 kubelet.go:2143] "SyncLoop (PLEG): event for pod" 
kubelet[3766]: I0906 14:52:49.383120    3766 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="<REDACTED>"
kubelet[3766]: I0906 14:52:49.383448    3766 kuberuntime_manager.go:506] "No ready sandbox for pod can be found. Need to start a new one" 
kubelet[3766]: I0906 14:52:49.385819    3766 cni.go:334] "CNI failed to retrieve network namespace path" err="cannot find network namespace for the terminated container \"<REDACTED>\""
kubelet[3766]: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
kubelet[3766]: level=warning msg="Unable to enter namespace \"\", will not delete interface" error="failed to Statfs \"\": no such file or directory" subsys=cilium-cni
kubelet[3766]: I0906 14:52:50.420806    3766 kubelet.go:2143] "SyncLoop (PLEG): event for pod" 
kubelet[3766]: E0906 14:52:52.858765    3766 cni.go:362] "Error adding pod to network" err="Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests " 
kubelet[3766]: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
kubelet[3766]: E0906 14:52:52.992497    3766 remote_runtime.go:209] "RunPodSandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to set up sandbox container \"<REDACTED>\" network for pod \"<REDACTED>\": networkPlugin cni failed to set up pod \"<REDACTED>\" network: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests "
kubelet[3766]: E0906 14:52:52.992571    3766 kuberuntime_sandbox.go:70] "Failed to create sandbox for pod" err="rpc error: code = Unknown desc = failed to set up sandbox container \"<REDACTED>\" network for pod \"<REDACTED>\": networkPlugin cni failed to set up pod \"<REDACTED>\" network: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests " 
kubelet[3766]: E0906 14:52:52.992609    3766 kuberuntime_manager.go:833] "CreatePodSandbox for pod failed" err="rpc error: code = Unknown desc = failed to set up sandbox container \"<REDACTED>\" network for pod \"<REDACTED>\": networkPlugin cni failed to set up pod \"<REDACTED>\" network: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests " 
kubelet[3766]: E0906 14:52:52.992695    3766 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \"<REDACTED>\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"<REDACTED>\\\": rpc error: code = Unknown desc = failed to set up sandbox container \\\"<REDACTED>\\\" network for pod \\\"<REDACTED>\\\": networkPlugin cni failed to set up pod \\\"<REDACTED>\\\" network: Unable to create endpoint: [PUT /endpoint/{id}][429] putEndpointIdTooManyRequests \"" 
kubelet[3766]: I0906 14:52:53.488938    3766 docker_sandbox.go:402] "Failed to read pod IP from plugin/docker" err="networkPlugin cni failed on the status hook for pod \"<REDACTED>\": CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container \"<REDACTED>\""
kubelet[3766]: I0906 14:52:53.492276    3766 kubelet.go:2143] "SyncLoop (PLEG): event for pod" 
kubelet[3766]: I0906 14:52:53.492328    3766 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="<REDACTED>"
kubelet[3766]: I0906 14:52:53.492589    3766 kuberuntime_manager.go:506] "No ready sandbox for pod can be found. Need to start a new one" 
kubelet[3766]: I0906 14:52:53.494904    3766 cni.go:334] "CNI failed to retrieve network namespace path" err="cannot find network namespace for the terminated container \"<REDACTED>\""
kubelet[3766]: level=warning msg="Errors encountered while deleting endpoint" error="[DELETE /endpoint/{id}][404] deleteEndpointIdNotFound " subsys=cilium-cni
kubelet[3766]: level=warning msg="Unable to enter namespace \"\", will not delete interface" error="failed to Statfs \"\": no such file or directory" subsys=cilium-cni
kubelet[3766]: I0906 14:52:54.518687    3766 kubelet.go:2143] "SyncLoop (PLEG): event for pod" 

Hello @ajaykumarmandapati!

I applied a quick-and-dirty fix: I raised the API rate limits to 200.

  set {
    name  = "extraArgs[0]"
    value = "--api-rate-limit=endpoint-create=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[1]"
    value = "--api-rate-limit=endpoint-delete=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[2]"
    value = "--api-rate-limit=endpoint-get=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[3]"
    value = "--api-rate-limit=endpoint-patch=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
  set {
    name  = "extraArgs[4]"
    value = "--api-rate-limit=endpoint-list=rate-limit:200/s\\,rate-burst:200\\,parallel-requests:200"
    type  = "string"
  }
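
The five `set` blocks above differ only in the API call name. As a sketch of the value format they use, the following Python helper builds the same flag strings. `rate_limit_flag`, `helm_set_value`, and `API_CALLS` are hypothetical names introduced here. The backslash before each comma is assumed to be needed because Helm's `--set` syntax treats an unescaped comma as a separator.

```python
# API call names as they appear in the set blocks above.
API_CALLS = [
    "endpoint-create",
    "endpoint-delete",
    "endpoint-get",
    "endpoint-patch",
    "endpoint-list",
]


def rate_limit_flag(api_call, rate_per_second, burst, parallel):
    """Build one --api-rate-limit argument in the format used above."""
    return (
        f"--api-rate-limit={api_call}="
        f"rate-limit:{rate_per_second}/s,"
        f"rate-burst:{burst},"
        f"parallel-requests:{parallel}"
    )


def helm_set_value(flag):
    """Escape commas with a single backslash so Helm does not split the value.

    In HCL source that backslash is itself written as two backslashes.
    """
    return flag.replace(",", "\\,")


# One extraArgs entry per API call, all raised to the same limit of 200.
extra_args = [helm_set_value(rate_limit_flag(c, 200, 200, 200)) for c in API_CALLS]
for index, value in enumerate(extra_args):
    print(f"extraArgs[{index}] = {value}")
```

Generating the entries from one list keeps the five limits consistent when the numbers change. Whether 200 is an appropriate value for a given cluster is not something this sketch addresses.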

@joaoubaldo well, I guess two weeks. I’m going to close this issue now as it has been two weeks and we have not yet encountered this problem in a cluster with 1.11.4 installed.

@joaoubaldo we are rolling out 1.11.4 across our environment as we speak. Because the issue is intermittent and we have not been able to identify the cause, it will take some time before we can tell whether it is resolved. So far, though, we have not experienced it on 1.11.4.

I will report back on this ticket in a week.