amazon-vpc-cni-k8s: Race condition between CNI plugin install and aws-k8s-agent startup

My employer runs GitLab CI using the Kubernetes executor, which runs each CI job as a separate pod. Since the workload is bursty, cluster-autoscaler is also enabled. When certain CI stages begin, 100+ pods are created rapidly and a large number of them are unschedulable until the autoscaler has spun up enough new nodes. GitLab CI will wait up to 10m for a pod to start before assuming the job has failed.

Occasionally, one of these pods will hit the 10m timeout with the following events according to kubectl describe po:

Events:
  Type     Reason                  Age                From                                                 Message
  ----     ------                  ----               ----                                                 -------
  Normal   Scheduled               14m                default-scheduler                                    Successfully assigned gitlab-ci-runner/runner-cad974a7-project-1-concurrent-0cvwc7 to ip-172-30-13-60.us-west-2.compute.internal

  Warning  FailedCreatePodSandBox  14m                kubelet, ip-172-30-13-60.us-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "c96fe97cd035ce410f537254a20102485c83eb275f3fc317f0e2b5ea5c4b7720" network for pod "runner-cad974a7-project-1-concurrent-0cvwc7": NetworkPlugin cni failed to set up pod "runner-cad974a7-project-1-concurrent-0cvwc7_gitlab-ci-runner" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "c96fe97cd035ce410f537254a20102485c83eb275f3fc317f0e2b5ea5c4b7720" network for pod "runner-cad974a7-project-1-concurrent-0cvwc7": NetworkPlugin cni failed to teardown pod "runner-cad974a7-project-1-concurrent-0cvwc7_gitlab-ci-runner" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]
  Normal   SandboxChanged          4m (x46 over 14m)  kubelet, ip-172-30-13-60.us-west-2.compute.internal  Pod sandbox changed, it will be killed and re-created.

So, the pod failed during setup because the CNI plugin could not connect to aws-k8s-agent. Looking at install-aws.sh, the CNI plugin is installed before aws-k8s-agent starts, and kubelet assumes CNI is ready as soon as the plugin is installed. If there are pods waiting to be scheduled, there is a narrow window in which a pod can be scheduled and CNI setup will fail because aws-k8s-agent hasn’t started responding yet.
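
For illustration, here is a minimal sketch (in Go, and not the project’s actual install-aws.sh logic; the file paths and timeout are made up) of an install order that would avoid the race: wait until aws-k8s-agent accepts connections on 127.0.0.1:50051 before writing the CNI config that makes kubelet consider the node network-ready.

// Installer-side readiness gate (illustrative only): kubelet treats the node's
// network as ready once a CNI config file appears in /etc/cni/net.d, so the
// config is only written after aws-k8s-agent starts listening.
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

const ipamdAddr = "127.0.0.1:50051" // aws-k8s-agent gRPC endpoint seen in the errors above

// waitForIpamd polls the agent's port until it accepts a TCP connection or the
// timeout expires.
func waitForIpamd(timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		conn, err := net.DialTimeout("tcp", ipamdAddr, 2*time.Second)
		if err == nil {
			conn.Close()
			return nil
		}
		time.Sleep(time.Second)
	}
	return fmt.Errorf("aws-k8s-agent not reachable at %s within %s", ipamdAddr, timeout)
}

// installCNIConfig copies the CNI config into place; the file names here are
// illustrative, not necessarily what install-aws.sh uses.
func installCNIConfig(src, dst string) error {
	data, err := os.ReadFile(src)
	if err != nil {
		return err
	}
	return os.WriteFile(dst, data, 0644)
}

func main() {
	if err := waitForIpamd(2 * time.Minute); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Only now make the node appear CNI-ready to kubelet.
	if err := installCNIConfig("/app/10-aws.conflist", "/etc/cni/net.d/10-aws.conflist"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}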

Unfortunately, the CNI plugin also reports a failure during sandbox cleanup (again a connection refused when attempting the CNI teardown), which prevents the pod from being rescheduled or retried until the container is removed manually.

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 24
  • Comments: 26 (8 by maintainers)

Most upvoted comments

I believe more people are reporting this issue now than before because:

  • Kubernetes 1.13+ has a regression where the kubelet won’t retry PodSandbox creation for pods with a restartPolicy of Never (kubernetes/kubernetes/issues/79398), meaning that pods get stuck in ContainerCreating after hitting the CNI plugin race condition and are not able to self-recover
  • The above regression has been fixed in Kubernetes 1.13.8, however EKS is still running 1.13.7
  • More people are finally upgrading to EKS 1.13 now that (a) it’s generally available on AWS, and (b) version 1.10 has been EOLed

@mogren since the fix for the Kubernetes regression is in kubelet, would it be possible to upgrade the version in the AMI to 1.13.8 as a stop-gap until the control plane can also be upgraded so that it’s in sync? (See aws/containers-roadmap/issues/411)

This is happening to us as well.

CNI version: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.5.0
EKS version: v1.13.7-eks-c57ff8

Is there a fix coming down the pipeline for this? It’s legitimately affecting production workloads right now.

I’m in the same situation as @kc8apf. We are using EKS + cluster-autoscaler to run our GitLab CI runners. The cluster version is 1.13 eks.2 and the CNI version is 1.5.0.

Sometimes a pod gets stuck in the ContainerCreating stage and its events look like this:

Events:
  Type     Reason                  Age              From                                                 Message
  ----     ------                  ----             ----                                                 -------
  Warning  FailedScheduling        4m (x2 over 4m)  default-scheduler                                    0/3 nodes are available: 1 node(s) didn't match node selector, 3 Insufficient cpu.
  Normal   TriggeredScaleUp        4m               cluster-autoscaler                                   pod triggered scale-up: [{gitlab-x86-64-asg 3->4 (max: 16)}]
  Warning  FailedScheduling        4m (x2 over 4m)  default-scheduler                                    0/4 nodes are available: 1 node(s) didn't match node selector, 1 node(s) had taints that the pod didn't tolerate, 3 Insufficient cpu.
  Normal   Scheduled               3m               default-scheduler                                    Successfully assigned default/runner-ex6ucuta-project-136-concurrent-48nnx7 to ip-172-31-2-101.us-west-2.compute.internal
  Warning  FailedCreatePodSandBox  3m               kubelet, ip-172-31-2-101.us-west-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "d6f10d64bd0133c14aa4e95cf50565c459198f652656b254dd09745c713e24e5" network for pod "runner-ex6ucuta-project-136-concurrent-48nnx7": NetworkPlugin cni failed to set up pod "runner-ex6ucuta-project-136-concurrent-48nnx7_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", failed to clean up sandbox container "d6f10d64bd0133c14aa4e95cf50565c459198f652656b254dd09745c713e24e5" network for pod "runner-ex6ucuta-project-136-concurrent-48nnx7": NetworkPlugin cni failed to teardown pod "runner-ex6ucuta-project-136-concurrent-48nnx7_default" network: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]

This only happens when a node scale-up is triggered by new jobs. After that, the new node isn’t able to take any pods because of this CNI failure.

I compared the ipamd.log files between a successful node and a failed node:

successful_ipamd.log.2019-07-02-18.log
failed_ipamd.log.2019-07-02-18.log

An interesting finding is that, on the successful node, pod IP assignment happened only after all 30 IPs had been added. Note that the log line Add/Update for Pod runner-xxx first shows up on line 156, after all of the Added ENI(... lines.

On the failed node, however, the pod tried to get an IP (description may not be accurate) before the ENI was even added! Starting from line 58, it tried 12 times to get an IP but failed, apparently because the CNI was not ready. After 12 attempts it gave up, and only then did the CNI get the chance to add the 30 IPs to the IP pool.

Any thoughts on how to coordinate the startup sequence? Thanks.
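
One possible direction, sketched below, is a plugin-side mitigation (this is not the actual aws-cni plugin code, just an illustration using standard grpc-go calls): block on the gRPC dial with a bounded deadline instead of failing immediately, so ADD requests that arrive during the startup race wait for ipamd rather than erroring out with “connection refused”.

// Sketch of a blocking dial to ipamd with a deadline (illustrative only).
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

const ipamdAddr = "127.0.0.1:50051"

func dialIpamd() (*grpc.ClientConn, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	// grpc.WithBlock makes the dial wait (retrying the transport internally)
	// until the connection is ready or the context deadline expires, rather
	// than returning a connection that immediately fails.
	return grpc.DialContext(ctx, ipamdAddr, grpc.WithInsecure(), grpc.WithBlock())
}

func main() {
	conn, err := dialIpamd()
	if err != nil {
		log.Fatalf("ipamd not ready after waiting: %v", err)
	}
	defer conn.Close()
	// ...issue the AddNetwork gRPC call against conn here...
}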

We created a custom AMI with kubelet version 1.13.8 and we can confirm it fixes this issue. The control plane is also now on 1.13.8; is the new AMI coming anytime soon?

@Mewzyk I just made a release candidate with a potential fix for this issue. It’s not yet a final release and we are still testing, but if you would like to try the build, the instructions are here:

https://github.com/aws/amazon-vpc-cni-k8s/releases/tag/v1.5.1-rc1

@mogren I am facing a similar issue when a new pod is scheduled to a node.

Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "9f3dcceb6da3b53d10af353d83b374cf4359d38d85da5cc813925f58f02a232c" network for pod "nginx-deployment-56c774d974-kqgcm":
 networkPlugin cni failed to set up pod "nginx-deployment-56c774d974-kqgcm_nginx-example" network: add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused", 
failed to clean up sandbox container "9f3dcceb6da3b53d10af353d83b374cf4359d38d85da5cc813925f58f02a232c" network for pod "nginx-deployment-56c774d974-kqgcm": networkPlugin cni failed to teardown pod "nginx-deployment-56c774d974-kqgcm_nginx-example"
 network: del cmd: error received from DelNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused"]

EKS: 1.17, AWS CNI: 1.7.5

I can confirm another configuration that works as of today (2019-08-13).

  • EKS K8S control plane is v1.13.8-eks-a977ba (from kubectl version)
  • EKS kubelets on the worker nodes are v1.13.7-eks-c57ff8 (from kubectl get node)
  • Manually updated the aws-cni daemonset to v1.5.3 (kubectl edit ds -n kube-system aws-cni)
  • Cluster autoscaler is v1.15.1

I posted a job with a specific affinity and toleration to target an ASG that had 0 nodes in it. Cluster autoscaler added a node to the ASG. When the node became ready, there was a little delay, but no FailedCreatePodSandBox errors in the pod events.

Fixed for new nodes in v1.5.3, and merged back to master.

Thanks for the update.

A note: the recent v1.5.1 release does not have the startup fix! I will make a v1.5.2 soon with the quick config fix plus the changes currently in v1.5.1-rc1.

A solution could be to run two containers in the daemonset: one that installs the CNI plugin and one that runs /app/aws-k8s-agent. The container that installs the CNI could do a gRPC health check or call an endpoint to verify that /app/aws-k8s-agent is fully up and running before installing the CNI (a sketch of such a check is below). Calico, for example, runs two containers in its daemonset: https://docs.projectcalico.org/v3.1/getting-started/kubernetes/installation/hosted/calico.yaml

However, they do not have any coordination between the two containers in their install-cni script: https://github.com/projectcalico/cni-plugin/blob/master/k8s-install/scripts/install-cni.sh
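
For reference, a minimal sketch of the kind of health check such an install container could run before dropping the CNI config in place. This assumes aws-k8s-agent registers the standard grpc.health.v1 health service, which is an assumption rather than something confirmed in this thread; otherwise a plain TCP probe or a trial AddNetwork call would be needed instead.

// Health-check gate for a hypothetical "install CNI" container (illustrative only).
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	// Block until the agent's port is reachable or the overall deadline passes.
	conn, err := grpc.DialContext(ctx, "127.0.0.1:50051", grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		log.Fatalf("aws-k8s-agent never became reachable: %v", err)
	}
	defer conn.Close()

	// Poll the standard gRPC health service until it reports SERVING
	// (again, assuming the agent registers this service).
	client := healthpb.NewHealthClient(conn)
	for {
		resp, err := client.Check(ctx, &healthpb.HealthCheckRequest{})
		if err == nil && resp.Status == healthpb.HealthCheckResponse_SERVING {
			break // agent is up; safe to install the CNI config now
		}
		select {
		case <-ctx.Done():
			log.Fatalf("health check did not pass in time: %v", ctx.Err())
		case <-time.After(time.Second):
		}
	}
	// ...copy the CNI binary and config into /opt/cni/bin and /etc/cni/net.d...
}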