amazon-vpc-cni-k8s: Pods stuck in ContainerCreating due to CNI Failing to Assign IP to Container Until aws-node is deleted
On a node that is only 3 days old, all containers scheduled on this node get stuck in ContainerCreating. This is on an m4.large node. The AWS console shows that it has the maximum number of private IPs reserved, so there isn't a problem getting resources. There are no pods running on the node other than daemon sets. All new nodes that came up after cordoning this node came up fine as well. This is a big problem because the node is considered Ready and is accepting pods despite the fact that it can't launch any.
The resolution: once I deleted aws-node on the host from the kube-system namespace, all the stuck containers came up. The version used is amazon-k8s-cni:0.1.4.
In addition to trying to fix it, is there any mechanism for the aws-node process to have a health check and either get killed and restarted, or drain and cordon a node if failures are detected? Even as an option?
I have left the machine running in case any more logs are needed. The logs on the host show:
skipping: failed to "CreatePodSandbox" for <Pod Name>
error. The main reason was: failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod \"<PodName>-67484c8f8-xj96c_default\" network: add cmd: failed to assign an IP address to container"
Other choice error logs are:
kernel: IPVS: Creating netns size=2104 id=32780
kernel: IPVS: Creating netns size=2104 id=32781
kernel: IPVS: Creating netns size=2104 id=32782
kubelet: E0410 17:59:24.393151 4070 cni.go:259] Error adding network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.393185 4070 cni.go:227] Error while adding to cni network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.427733 4070 cni.go:259] Error adding network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.428095 4070 cni.go:227] Error while adding to cni network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.506935 4070 cni.go:259] Error adding network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.506962 4070 cni.go:227] Error while adding to cni network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509609 4070 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "<PodName>-69dfc9984b-v8dw9_default" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509661 4070 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "<PodName>-69dfc9984b-v8dw9_default" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509699 4070 kuberuntime_manager.go:647] createPodSandbox for pod "<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "<PodName>-69dfc9984b-v8dw9_default" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509771 4070 pod_workers.go:186] Error syncing pod 8808ea13-3ce6-11e8-815e-02a9ad89df3c ("<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)"), skipping: failed to "CreatePodSandbox" for "<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)" with CreatePodSandboxError: "CreatePodSandbox for pod \"<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)\" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod \"<PodName>-69dfc9984b-v8dw9_default\" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable"
About this issue
- State: closed
- Created 6 years ago
- Reactions: 52
- Comments: 177 (56 by maintainers)
Having similar failures on AWS EKS.

Hello @druidsbane, this issue is probably active again - a lot of people are reporting it.
This is still happening. Why is this issue being closed without a proper resolution or fix?
I had a very similar problem; I managed to solve it by updating Weave Net.
If anyone comes here because they have an issue with the new t3a instances, do this: check the current CNI plugin version:

kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2

If it's less than 1.5, upgrade:

kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/v1.5/aws-k8s-cni.yaml

This solved it for me.
I also have this on EKS. Very simple situation (1 cluster node, 1 simple deployment). Just scale the deployment from 1 replica to 20 and I get pods that are stuck as ContainerCreating. Events from the pod:
issue still happening in amazon-k8s-cni:v1.7.5
Tried many ways; nothing seems to work. The issue is only solved by manually creating new nodes and terminating old ones.
This is a major bug; a proper resolution/fix is needed.
I’ve just encountered this problem on eks 1.13.7 with cni 1.5.0 on a 3 node test cluster. It may or may not be related to the fact I am terminating the nodes for testing whenever I make user-data changes.
3x m5.xlarge workers reside in 3 small subnets in 3 AZs, all /26, so that might pose unique ip assignment challenges.
Interestingly, deleting the aws-node pod that was failing to assign ips did not help. In the console, I saw a few Available ENIs and 4 active ones. Note: Only 1 of them had “aws-K8S” in description.
I deleted ALL of the Available ENIs. A moment later there were:
I’ll be testing this further, so I would be able to provide logs, but this was an unpleasant surprise as I thought by 1.5.0 such issues have been wiped out.
Note: beside external SNAT, I’m not touching any variables of the CNI.
As described in the design proposal, amazon-vpc-cni-k8s uses secondary IPs for Pods, and there is a limit based on the instance type. In your example, a t2.medium instance can have a maximum of 3 ENIs and each ENI can have up to 6 IP addresses. The amazon-vpc-cni-k8s CNI reserves 1 IP address for each ENI, so it can give out a maximum of 15 IP addresses. Since each node always runs aws-node and kube-proxy, MAX_POD is set to 17 for t2.medium.
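The arithmetic in that reply can be sketched as a quick calculation (an illustrative snippet, not part of the CNI codebase; the per-instance ENI/IP figures come from the AWS ENI documentation):

```python
# Illustrative max-pods calculation for the secondary-IP scheme described above.
def max_pods(num_enis: int, ips_per_eni: int) -> int:
    # Each ENI's primary IP is reserved by the CNI, so only
    # (ips_per_eni - 1) secondary IPs per ENI are handed out to pods.
    # +2 accounts for aws-node and kube-proxy, which use host
    # networking but still count against the kubelet's max-pods.
    return num_enis * (ips_per_eni - 1) + 2

# t2.medium: 3 ENIs, 6 IPs per ENI -> 3 * 5 + 2
print(max_pods(3, 6))  # 17
```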
@liwenwu-amazon can we reopen this issue?
For those still facing issues with the c5n instance family who are using AWS EKS, you can upgrade to the latest by applying the current release of the CNI
see documentation here
For CNI version v1.3.3 to work on c5n, or any instance family, the eni-max-pods.txt file needs updating with the instance type and mapping number. The mapping numbers for the c5n family are listed below:
to work out your own instance family type mapping numbers see here
and the calculation is:
example of c5n.2xlarge mapping number workout will be
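As a hedged sketch of that workout (the c5n figures below are taken from the AWS ENI limits documentation and are worth verifying against the upstream eni-max-pods.txt):

```python
# Sketch of the eni-max-pods mapping-number workout described above.
# Formula: max ENIs * (IPv4 addresses per ENI - 1) + 2
# c5n.2xlarge (per AWS ENI docs): 4 ENIs, 15 IPv4 addresses per ENI.
enis, ips_per_eni = 4, 15
mapping_number = enis * (ips_per_eni - 1) + 2
# Line format used in eni-max-pods.txt: "<instance-type> <mapping-number>"
print(f"c5n.2xlarge {mapping_number}")
```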
To automate the updating of the eni-max-pods.txt file, an echo command can be passed to user_data.

Same issue on t3.medium.
Can we reopen this issue?
@tomfotherby thank you for collecting /var/log/aws-routed-eni/aws-cni-support.tar.gz. From it, I found one CNI bug: whenever the CNI fails to set up the network stack for a Pod, it does NOT release the Pod's IP back to the ipamD datastore. Here are the error msgs in plug-xxx.log:

@jayanthvn Hey! In which version of the CNI has the issue been resolved? I cannot find anything related in the changelog. This issue should be re-opened.
kubernetes: v1.15.3
cni version: amazon-k8s-cni:v1.5.5
cni ENVs:

We came across this while using t2.medium instances - coredns pods are stuck in ContainerCreating:
Notice that the ENI has a maximum of 6 IP addresses; on the 7th allocation (coredns), it does not automatically use private IPs associated with other attached ENIs.
The logs are littered with:
The ENIs are properly attached to the instance:
After deleting the aws-node pod manually, we see that it's now correctly assigning IP addresses from another ENI:
I've just encountered this problem after upgrading the cluster to 1.13 in EKS. Deleting the stuck pod fixes the problem (as in, the "new" pod will be created successfully). I'm using m5.large, and the cluster is provisioned with https://github.com/terraform-aws-modules/terraform-aws-eks.

EDIT: I believe my issue is related to #318. I've patched the daemonset to use a newer version of the CNI; will see if that helps.
I hit this problem with t3.medium on kubernetes 1.11 eks.1. Recreated the t3 nodes manually. Solved. Thanks!
We’re having the same issue. We’re running the latest version of the aws-cni 1.3.
FIX - Had the same problem with EKS K8s 1.11.5, t2.medium worker nodes, amazon-k8s-cni:1.2.1, found by:

kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2

FIXED it by upgrading to the latest amazon-k8s-cni - instructions found here: https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html
@max-rocket-internet: There is no new worker AMI. A new EKS cluster automatically uses CNI 1.1. An existing cluster can be upgraded to CNI 1.1 by following https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html
Restart the daemonset's pod aws-node on the specific node which is stuck and not providing IPs. If any IP is available, it will get released.

@liwenwu-amazon we should be able to run more than 17 pods on a t2.medium instance, right? I noticed that the other TF module has a hard-coded list of pod limits per instance type: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/local.tf#L9

I have the same issue right now. Cluster version 1.18, amazon-k8s-cni-init:v1.7.5, amazon-k8s-cni:v1.7.5
@jayanthvn
I am not able to reproduce it right now. Also, since it's production I can't do much testing there. Will update when the issue happens again.
Thanks
@Erokos @adnankobir I’m hesitant to re-open an issue that was closed >18 months ago, but I’ll do so anyway. I acknowledge that you and others are experiencing unacceptable levels of annoyance dealing with this issue. I’m wondering whether it’s worth opening a new issue versus re-opening this old one (which contains links to docs that don’t exist and comments about things that no longer exist in the modern CNI plugin).
Questions for you to narrow down this bug:
@adnankobir I notice you are using custom networking. Are you able by any chance to tell if this bug occurs when you do not use custom networking? @Erokos: are you using custom networking?
@adnankobir and @Erokos: have you noticed that there are specific instance types that this bug occurs with but not with other instance types? I’m wondering if perhaps there is an error in the calculated number of ENIs and/or IPs per ENI for some instance types?
Thanks very much for your help folks! -jay
I have the exact situation as @krzysztof-bronk. Using CNI 1.5.0 does not fix.
It seems to me that when nodes terminate, the ENIs they were using do not get cleaned up. We’re using spot instances and yesterday there was a ton of churn with instances getting reclaimed and new ones coming up getting added to the cluster. At any given few minute period, we’d be gaining or losing between 10 and 100 nodes most of the day. Throughout the day, our unused ENIs went through the roof. I had to write a script to delete all “available” and run it constantly throughout the day. As long as we had ENI headroom, things went pretty well, nodes came up and pods got scheduled, but those ENIs were killer.
So basically my solution has been to write a small script to loop through and delete all available ENIs. It seems they’ll never get used and will just hang around forever. The aws_node pods will create more.
Here’s my script if anyone wants it. requires python3.6+, click, and awscli/boto3/botocore. The exception handling isn’t perfect, obviously, but it was just to get around the api rate limiting. Other accounts may have other needs and you may need to sleep every iteration of the loop
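The commenter's actual script isn't shown above; the following is a minimal sketch of the same idea, assuming boto3 is installed and AWS credentials are configured. The filtering helper is pure (no AWS calls) so it can be sanity-checked offline; deleting ENIs is destructive, so treat this as illustrative only:

```python
# Sketch of a cleanup pass for dangling "available" ENIs, as described above.
# Assumes boto3 is installed and AWS credentials/region are configured.
import time


def select_available_enis(enis):
    """Pick ENI IDs that are unattached ('available') and thus deletable."""
    return [e["NetworkInterfaceId"] for e in enis if e.get("Status") == "available"]


def delete_available_enis(region="us-east-1"):
    # boto3 imported here so the pure helper above is usable without it.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name=region)
    enis = ec2.describe_network_interfaces(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["NetworkInterfaces"]
    for eni_id in select_available_enis(enis):
        try:
            ec2.delete_network_interface(NetworkInterfaceId=eni_id)
        except ClientError:
            time.sleep(1)  # crude backoff for API rate limiting
```

As the commenter notes, this has to be run repeatedly while nodes churn, since aws-node keeps creating new ENIs.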
I'm seeing this same issue on a c5n.4xlarge instance. CoreDNS is stuck as ContainerCreating because cmd: failed to assign an IP address to container. I am using kubernetes version 1.11 on EKS with amazon-k8s-cni:v1.3.2. Everything is fine on t3.xlarge but we need c5n.4xlarge for the networking.

Looks like the issue is not resolved, guys.
I'm just trying to make use of EKS. Getting the same problem with both the 1.2.1 and 1.3 CNI versions. To reinstall the nodes, I scale the Auto Scaling Group down to 0 and back to 8. As a result, less than half of the nodes come up broken, so I need to shut those down.
Is there a workaround to release the addresses from IPAMD we can use until AWS release a new node AMI with this fix included?
@druidsbane from the log, it looks like you hit issue #18. Here is a snippet of the logs from ipamd.log.2018-04-10-17:

We have a PR (#31) and proposal #26 to address this.
@jayanthvn Thank you for looking into this, the kubelet hosts are configured to accept a maximum of 8 pods which I now see will not work for this instance type due to the ENI limitations you described. It looks like the reason that sometimes this works and other times it didn’t was due to a race condition w/ the other pods being scheduled (which for my testing cluster includes 2 coredns pods and a metrics-server pod). I will update the max pods configuration to 5 pods for this instance type and try again.
Hi @dallasmarlow
You are using t3a.small instance with custom networking. T3a.small (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html) has 2 ENIs and 4 IPs per ENI. With custom networking primary ENI will not be used. So you will have only one ENI and 3 IPs since the first IP out of 4 is reserved. Hence you will have only 3 IPs and can schedule only 3 pods.
Also, have you set the max pods value [Section 7 here - https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html]? Can you please share the kubelet max pods value configured on the instance?
This should be the math to set the max pods -
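Following the reply above, that math can be sketched as (illustrative snippet; with custom networking the primary ENI is excluded, and each remaining ENI's first IP is reserved):

```python
# Max-pods math under custom networking, per the explanation above.
def max_pods_custom_networking(num_enis: int, ips_per_eni: int) -> int:
    # The primary ENI is unusable with custom networking; on each
    # remaining ENI the first IP is reserved, leaving (ips_per_eni - 1)
    # for pods. +2 for aws-node and kube-proxy (host networking, but
    # still counted against the kubelet's max-pods).
    return (num_enis - 1) * (ips_per_eni - 1) + 2

# t3a.small: 2 ENIs, 4 IPs per ENI -> 1 usable ENI with 3 pod IPs
print(max_pods_custom_networking(2, 4))  # 5
```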
Hi @siddharthshah3030
Can you please kindly share the below information -
sudo bash /opt/cni/bin/aws-cni-support.sh
You can either email it to me varavaj@amazon.com or open a support ticket and let me know the support ticket ID. We will look into the issue asap.
Reopening the issue to investigate further.
Thanks.
@adnankobir In case my previous reply wasn’t clear enough, the issue is that in order to have different security groups for the nodes and the pods, we can’t use any secondary IPs from the first ENI, because that’s where the node IP is. That’s the reason you get less available IPs.
@adnankobir Yes, the formula is +2 because aws-node (the CNI pod) and kube-proxy use host networking, but are still counted against the max-pods limit.

The CNI will only use the secondary IPs for your pods, so 10 would be available for your pods in this case. The reason for this might not be valid any more, but there used to be problems using the primary IP of the ENI for pods on some distros.
For anyone running EKS, you need to update the CNI plugin to the latest version (currently 1.5.5). As per AWS docs:
Amazon EKS does not automatically upgrade the CNI plugin on your cluster when new versions are released. So this issue will reappear periodically.

To find out your current version:
kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
To update, follow the instructions at https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html
Credit goes to @pkelleratwork and their Jan 7 answer in this issue: https://github.com/aws/amazon-vpc-cni-k8s/issues/59#issuecomment-451989505
I'm getting the same issue with m4.large. Why are more than 20 pods getting scheduled on my m4.large? Then it runs out of IP addresses and I see the same issue.
Yes, the m5ad family has been added to the master branch and will be in the v1.4.0 release, planned to be out soon.
Hi @liwenwu-amazon
I think I’ve made some mistake somewhere in the Terraform code. Our TF code is largely based on this but I’ve built a second cluster based on this module and it does not have this problem.
Both clusters are using the same subnets, AMI and t2.medium node instance type, with a single node.

I see this in the /var/log/aws-routed-eni/ipamd.log.* log files:

To reproduce it, I just create the most basic deployment with 1 replica, then scale the deployment to 30 replicas, and then it happens.
Even if it is a configuration error on my part, it might be nice to know what the problem is since it seems a few people hit this error and can’t work out where things went wrong.
Yes.
Maximum 60 seconds.
I’ve emailed you the logs.
Here is the root cause of this problem: it occurs when the following 2 events happen almost simultaneously:

L-IPAM can immediately reassign an IP address that was just released to a new Pod, instead of obeying the pod cooling-period design. This can cause the CNI to fail to set up the routing table for the newly added Pod. When this failure happens, kubelet will NOT invoke a delete Pod CNI call. Since the CNI never releases this IP back in the case of failure, the IP is leaked.
Here are the snippets from plug.log.xxx:
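The leak described above can be illustrated with a toy model of the ipamD datastore (purely illustrative, not the actual ipamd code):

```python
# Toy model of the IP leak described above: when pod network setup
# fails, kubelet issues no CNI DEL, so release() is never called and
# the datastore slot for that pod's IP leaks.
class ToyIpamDatastore:
    def __init__(self, pool):
        self.free = list(pool)
        self.assigned = {}

    def assign(self, pod):
        ip = self.free.pop(0)
        self.assigned[pod] = ip
        return ip

    def release(self, pod):
        self.free.append(self.assigned.pop(pod))


store = ToyIpamDatastore(["10.0.0.10", "10.0.0.11"])
store.assign("pod-a")
setup_failed = True  # routing-table setup fails, per the race above
if setup_failed:
    # No CNI DEL arrives for the failed sandbox, so
    # store.release("pod-a") never runs -> the IP is leaked.
    pass
print(len(store.free))  # 1 of 2 IPs left; the leaked one is gone for good
```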
Hey,
issue still happening on aws eks.
cluster ARN = arn:aws:eks:us-east-1:531074996011:cluster/Cuttle
kubernetes version = 1.21 Platform version = eks.2 coredns version = v1.8.4-eksbuild.1 kube-proxy version = v1.21.2-eksbuild.2 vpc-cni version = v1.9.1-eksbuild.1
I have an AWS EKS cluster and while deploying pods I’m getting this error. “failed to assign an IP address to container”
My current container setup:
Error: (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container “0707507c6ea3d0a38b1222ebc9acfb22bf9fc8ac5f5fa73ebd272fb107324f60” network for pod “automation-archival-hub-integration-service-6dfc4b4999-v94nq”: networkPlugin cni failed to set up pod “automation-archival-hub-integration-service-6dfc4b4999-v94nq_cuttle-automation” network: add cmd: failed to assign an IP address to container.
Extra details:
Thanks, Nagaraj
Thanks for confirming @dallasmarlow 😃
Why is this issue closed when a lot of people keep reporting problems? For the past few days I’ve been experimenting with EKS cluster creation. I’m using terraform, actually a terraform module similar to the popular community module. What I’ve observed:
When I look at the nodes (kubectl describe node <node_name>) I see an error that the CNI is uninitialized; when I look at the running pods (kubectl get pods -n kube-system) I can see the core-dns pods in a "Pending" state and the aws-node pods crashing every few seconds. I've then taken some steps to see if I could fix it:
a) Downgraded the CNI version to 1.5.3 - this resulted in nodes getting to the "Ready" state but didn't fix the problem; the core-dns pods were now constantly in "ContainerCreating" status and the aws-node pods had the same behaviour. Upgrading the CNI back to 1.5.5 didn't change anything.
b) Next, I tried creating a 1.13 cluster with nodes using a 1.14 kubernetes AMI. The nodes didn't have any problems joining the cluster and were ready. I then upgraded the cluster version, and this resulted in a working 1.14 cluster with the nodes joined and ready. HOWEVER, if I increased the number of nodes in an auto scaling group, the new nodes had the same old problem of not being ready no matter what I tried.
To sum up, I've decided to use a 1.13 cluster, in which I see no problems with nodes using a 1.14 AMI, in hopes of this problem being fixed in the near future.
I've solved it for the moment by upgrading the CNI to v1.5.5 and modifying the daemonset, adding the environment variable 'WARM_IP_TARGET' set to a value based on this table (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html) to force IP address / network interface reservation.
CNI configuration doc: https://docs.aws.amazon.com/eks/latest/userguide/cni-env-vars.html
Edit: The configuration just has delayed the problem, still happening when scaling up/down both pods and nodes
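For reference, the env var tweak described in that comment looks roughly like this in the aws-node daemonset container spec (the value 5 is illustrative; pick WARM_IP_TARGET based on your instance type's IPs-per-ENI limit from the table linked above):

```yaml
# Illustrative fragment of the aws-node daemonset container spec.
env:
  - name: WARM_IP_TARGET
    value: "5"  # example value; tune per instance type's ENI/IP limits
```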
@pdostal No, still failing in 1.5.4.
My problem happened after I changed instance type to t3.medium.
I changed instance type back to t3.large, no error.
(I use branch release-1.5.4 to upgrade to 1.5.4)

Following up on my post earlier, I can easily replicate this issue even with aws-cni 1.5.1
I have a 1 node cluster, and every time I terminate the instance from the UI so that ASG can spin up a replacement, the ENI is left dangling as Available, with IPs attached, and never deleted as far as I can tell. So sooner or later you will exhaust IPs and then pods will fail to be started.
Using EKS 1.11 with cni 1.2.1, can confirm that deleting the pod and recreating is a workaround.
Please reopen - hitting this too on EKS 1.11.9 with AMI amazon-eks-node-1.11-v20190329 (ami-0e82e73403dd69fa3) and cni 1.3.2 on instance type m4.2xlarge
Hi,
I think there are a mix of problems mentioned in this issue that are now tracked in newer open issues. For example, all c5n family issues are because 1.3.2 does not include that instance type. The initial bugs in this issue were fixed by Liwen, but there are still related issues with IP and ENI allocation and deletion tracked in #69, #123, #330 and #294.
Thanks for reporting issues!
Withdrawing this. In our case it was due to IP pool exhaustion. I have provisioned worker nodes in a new autoscaling group and increased the /26 ranges to /24 and we are up and running.
I'm running into the same issue on my t3.medium instances… I've triple-checked my cluster's CNI version is amazon-k8s-cni:v1.3.2.

Describing one of the pods, I get this
and if I run /opt/cni/bin/aws-cni-support.sh on a node, I get

any ideas?
This just started happening for me. I haven’t changed anything in cluster (such as node type) lately.
UPDATE: Disregard. Updated to amazon-k8s-cni:1.3.0 and that fixed my issue.
I am hitting this problem every week now as we have quite a volatile environment.
I see v1.1.0 is released just now. How long until we have a new worker AMI? Is this the repo to watch for a new release? https://github.com/awslabs/amazon-eks-ami

Thanks.
Good job @liwenwu-amazon. I'm glad to have helped.
@stafot few more questions:
@max-rocket-internet can you collect node level debugging info by running
you can send it to me (liwenwu@amazon.com) or attach it to this issue.
In addition, I have a few more questions
thanks,
I also got stuck in ContainerCreating: Warning FailedCreatePodSandBox 16m (x58161 over 17h) kubelet, ip-172-x.x.x.us-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod “wordpress-xxxx-vmtkg_default” network: add cmd: failed to assign an IP address to container
And the fix was to reload the "aws-node" pod on the node which stopped issuing IPs.