amazon-vpc-cni-k8s: Containers stuck in ContainerCreating after configuring CNI Custom Networking on extended CIDR
Hi, we have an issue with CNI custom networking and an extended CIDR: after a node's first boot, Pods that were pending for scheduling can get stuck.
For example, on a simple nginx workload with 10 replicas, after a node's first boot we have:
> kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-64f497f8fd-2sdl2 1/1 Running 0 5m
nginx-64f497f8fd-7m868 1/1 Running 0 5m
nginx-64f497f8fd-87xjc 1/1 Running 0 5m
nginx-64f497f8fd-8tc2g 1/1 Running 0 5m
nginx-64f497f8fd-8xfgz 1/1 Running 0 5m
nginx-64f497f8fd-gszkq 1/1 Running 0 5m
nginx-64f497f8fd-lz426 1/1 Running 0 5m
nginx-64f497f8fd-rzspt 0/1 ContainerCreating 0 5m
nginx-64f497f8fd-wh6sz 1/1 Running 0 5m
nginx-64f497f8fd-wtx5n 0/1 ContainerCreating 0 5m
For the Pods stuck in ContainerCreating, the event shown is FailedCreatePodSandBox:
> kubectl describe pod nginx-64f497f8fd-wtx5n
Warning FailedCreatePodSandBox 2m55s (x4 over 3m5s) kubelet, ip-10-156-7-10.eu-west-3.compute.internal (combined from similar events): Failed create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e864a95102ede98274f377a3df4a694be814be3c9c5c3cf5b2b66b9eb8bcaa1f" network for pod "nginx-64f497f8fd-wtx5n": NetworkPlugin cni failed to set up pod "nginx-64f497f8fd-wtx5n_default" network: add cmd: failed to assign an IP address to container
The only way we’ve found to solve that issue is to delete stuck Pods.
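Deleting them can be scripted as a stop-gap; a minimal sketch, assuming the default namespace and the default kubectl output columns (STATUS is the third column):
> kubectl get pods --no-headers | awk '$3 == "ContainerCreating" {print $1}' | xargs -r kubectl delete pod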
Kubernetes version: 1.11
Amazon CNI version: 1.5.0
We’ve run into this as well. The way I’ve solved it for the moment is by generating a new eni-max-pods.txt with the correct[1] values and (using Terraform) passing that into the user-data to overwrite the existing file before bootstrap.sh runs. I’m using t3.medium instances on that cluster right now.
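For anyone searching for the same workaround: with custom networking the primary ENI is no longer used for pod IPs, so the usable limit per instance drops to (ENIs - 1) * (IPs per ENI - 1) + 2, i.e. (3 - 1) * (6 - 1) + 2 = 12 for a t3.medium instead of the default 17. A minimal user-data sketch of the overwrite step, assuming the standard EKS AMI paths; the cluster name and the sed pattern are placeholders:
#!/bin/bash
# Patch the t3.medium entry before bootstrap.sh reads it
# (12 is the custom-networking value computed above; adjust for your instance type).
sed -i 's/^t3.medium .*/t3.medium 12/' /etc/eks/eni-max-pods.txt
/etc/eks/bootstrap.sh my-cluster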
I confirm that there are other pods in other namespaces using IPs, but I can assure you there are sufficient ENIs for all the pods in all namespaces on the cluster, for 2 reasons:
- Those pods are all in the Running state. It's when the nodes are terminated & recreated to apply the custom CNI that the stuck ContainerCreating happens.
- I usually have only 2 nodes in that cluster, but I've reproduced the same issue with 3 nodes.
If I delete a ContainerCreating stuck pod, then the recreated one gets attributed an IP in the desired subnet and comes up Running. But I guess the error message you highlighted is in fact why the CNI plug-in does not manage to give an IP to that pod. What I don't understand is how the CNI custom configuration can affect the scheduler's handling of an IP shortage, and why it's not able to unstick them.
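A quick way to see whether the scheduler is simply placing more pods than ipamd can serve is to compare the pod capacity each node advertises with the IPs ipamd actually holds; a sketch, assuming the default aws-vpc-cni introspection port:
> kubectl get nodes -o custom-columns=NAME:.metadata.name,MAXPODS:.status.allocatable.pods
and on the node itself:
> curl -s http://localhost:61679/v1/enis
If MAXPODS is higher than the number of secondary IPs ipamd reports on the ENIs in the custom subnet, the scheduler will keep placing pods that the CNI cannot give an address to; the ipamd logs under /var/log/aws-routed-eni/ are also worth checking.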
I'll do some rollout-specific tests tomorrow, with and without the CNI custom configuration, to gather some more data.