amazon-vpc-cni-k8s: Containers stuck in ContainerCreating after configuring CNI Custom Networking on extended CIDR

Hi, we have an issue with CNI custom networking and an extended CIDR: after a node's first boot, if there are Pods pending for scheduling, some of them get stuck.
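For context, the custom networking setup follows the standard procedure from the AWS docs; the ENIConfig name, subnet and security group IDs below are placeholders for our extended-CIDR values:

# Enable custom networking on the aws-node daemonset
kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true

# One ENIConfig per AZ, pointing at a subnet carved from the extended CIDR
cat <<EOF | kubectl apply -f -
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: eu-west-3a
spec:
  subnet: subnet-0123456789abcdef0
  securityGroups:
    - sg-0123456789abcdef0
EOF

# Each node is annotated so ipamd knows which ENIConfig to use
kubectl annotate node ip-10-156-7-10.eu-west-3.compute.internal \
  k8s.amazonaws.com/eniConfig=eu-west-3a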

For example, with a simple nginx workload with 10 replicas, after the node's first boot we see:

> kubectl get pods
NAME                     READY   STATUS              RESTARTS   AGE
nginx-64f497f8fd-2sdl2   1/1     Running             0          5m
nginx-64f497f8fd-7m868   1/1     Running             0          5m
nginx-64f497f8fd-87xjc   1/1     Running             0          5m
nginx-64f497f8fd-8tc2g   1/1     Running             0          5m
nginx-64f497f8fd-8xfgz   1/1     Running             0          5m
nginx-64f497f8fd-gszkq   1/1     Running             0          5m
nginx-64f497f8fd-lz426   1/1     Running             0          5m
nginx-64f497f8fd-rzspt   0/1     ContainerCreating   0          5m
nginx-64f497f8fd-wh6sz   1/1     Running             0          5m
nginx-64f497f8fd-wtx5n   0/1     ContainerCreating   0          5m

For Pods stuck in ContainerCreating, the event shown is FailedCreatePodSandBox:

> kubectl describe pod nginx-64f497f8fd-wtx5n
    Warning  FailedCreatePodSandBox  2m55s (x4 over 3m5s)    kubelet, ip-10-156-7-10.eu-west-3.compute.internal  (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e864a95102ede98274f377a3df4a694be814be3c9c5c3cf5b2b66b9eb8bcaa1f" network for pod "nginx-64f497f8fd-wtx5n": NetworkPlugin cni failed to set up pod "nginx-64f497f8fd-wtx5n_default" network: add cmd: failed to assign an IP address to container

The only way we’ve found to work around the issue is to delete the stuck Pods.
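In case it helps anyone debugging the same thing, these are rough checks on the affected node (the port and log path come from the CNI troubleshooting docs, so treat them as assumptions for your setup):

# Run on the node that reports FailedCreatePodSandBox.

# ipamd's local introspection endpoint lists the ENIs it knows about
# and how many secondary IPs are assigned vs. still available.
curl -s http://localhost:61679/v1/enis

# ipamd's log usually says explicitly when the datastore has no free IPs.
tail -n 50 /var/log/aws-routed-eni/ipamd.log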

Kubernetes version: 1.11
Amazon CNI version: 1.5.0

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (6 by maintainers)

Most upvoted comments

We’ve run into this as well. The way I’ve solved it for the moment is by generating a new eni-max-pods.txt with the correct[1] values and (using Terraform) passing it into the user-data to overwrite the existing file before bootstrap.sh runs (rough sketch below the footnote).

  1. Not exactly correct, because it depends on how many host-network daemonsets you’re running in the cluster, so I’ve erred on the lower side. Better to have a pod or two of spare capacity on the node than two pods stuck in ContainerCreating.
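Roughly what that user-data override looks like; the instance type, path and value are assumptions based on the EKS-optimized AMI, and the usual rule of thumb with custom networking is max pods = (ENIs − 1) × (IPs per ENI − 1) + 2, since the primary ENI no longer hosts pods:

#!/bin/bash
# Sketch only -- values are examples, adjust per instance type.
# t3.medium: 3 ENIs x 6 IPv4/ENI -> default file says 17,
# with custom networking: (3 - 1) * (6 - 1) + 2 = 12.

# Overwrite the value bootstrap.sh will read for this instance type
sed -i 's/^t3\.medium .*/t3.medium 12/' /etc/eks/eni-max-pods.txt

# "my-cluster" is a placeholder for the real cluster name
/etc/eks/bootstrap.sh my-cluster

If I remember correctly, newer versions of bootstrap.sh also accept --use-max-pods false together with --kubelet-extra-args '--max-pods=12', which avoids editing the file at all.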

I’m using t3.medium instances on that cluster right now.

I confirm that there are other pods in other namespaces using IPs, but I can assure you there are sufficient ENIs for all the pods in all namespaces on the cluster, for two reasons (quick checks below the list):

  1. Before I apply the custom CNI configuration, all the Pods are in Running state. It’s when the nodes are terminated and recreated to apply the custom CNI that Pods get stuck in ContainerCreating. I usually have only 2 nodes in that cluster, but I’ve reproduced the same issue with 3 nodes.
  2. If I delete a stuck ContainerCreating pod, the recreated one gets assigned an IP in the desired subnet and goes Running.
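For anyone who wants to verify the same thing, a couple of quick checks (sketch; the node name is the one from the event above):

# Node name from the FailedCreatePodSandBox event
NODE=ip-10-156-7-10.eu-west-3.compute.internal

# kubelet's pod capacity for the node (derived from eni-max-pods.txt / --max-pods)
kubectl get node "$NODE" -o jsonpath='{.status.allocatable.pods}{"\n"}'

# pods from all namespaces actually scheduled on that node
kubectl get pods --all-namespaces --field-selector spec.nodeName="$NODE" --no-headers | wc -l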

But I guess the error message you highlighted is indeed why the CNI plugin does not manage to assign an IP to that pod. What I don’t understand is how the CNI custom configuration can affect the scheduler’s handling of an IP shortage, and why it isn’t able to unstick those pods.

I’ll do some rollout-specific tests tomorrow with and without the CNI custom configuration to gather more data.