amazon-vpc-cni-k8s: Pods stuck in ContainerCreating due to CNI Failing to Assign IP to Container Until aws-node is deleted
On a node that is only 3 days old, all containers scheduled on this node get stuck in ContainerCreating. This is on an m4.large node. The AWS console shows that it has the maximum number of private IPs reserved, so there isn't a problem getting resources. There are no pods running on the node other than daemon sets. All new nodes that came up after cordoning this node came up fine as well. This is a big problem because the node is considered Ready and is accepting pods despite the fact that it can't launch any.
The resolution: once I deleted aws-node on the host from the kube-system namespace, all the stuck containers came up. The version used is amazon-k8s-cni:0.1.4.
In addition to trying to fix it, is there any mechanism for the aws-node process to have a health check and either get killed and restarted, or drain and cordon a node if failures are detected? Even as an option?
I have left the machine running in case any more logs are needed. The logs on the host show:
skipping: failed to "CreatePodSandbox" for <Pod Name>
error. The main reason was: failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod \"<PodName>-67484c8f8-xj96c_default\" network: add cmd: failed to assign an IP address to container"
Other choice error logs are:
kernel: IPVS: Creating netns size=2104 id=32780
kernel: IPVS: Creating netns size=2104 id=32781
kernel: IPVS: Creating netns size=2104 id=32782
kubelet: E0410 17:59:24.393151 4070 cni.go:259] Error adding network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.393185 4070 cni.go:227] Error while adding to cni network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.427733 4070 cni.go:259] Error adding network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.428095 4070 cni.go:227] Error while adding to cni network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.506935 4070 cni.go:259] Error adding network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.506962 4070 cni.go:227] Error while adding to cni network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509609 4070 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "<PodName>-69dfc9984b-v8dw9_default" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509661 4070 kuberuntime_sandbox.go:54] CreatePodSandbox for pod "<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "<PodName>-69dfc9984b-v8dw9_default" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509699 4070 kuberuntime_manager.go:647] createPodSandbox for pod "<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod "<PodName>-69dfc9984b-v8dw9_default" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable
kubelet: E0410 17:59:24.509771 4070 pod_workers.go:186] Error syncing pod 8808ea13-3ce6-11e8-815e-02a9ad89df3c ("<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)"), skipping: failed to "CreatePodSandbox" for "<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)" with CreatePodSandboxError: "CreatePodSandbox for pod \"<PodName>-69dfc9984b-v8dw9_default(8808ea13-3ce6-11e8-815e-02a9ad89df3c)\" failed: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod \"<PodName>-69dfc9984b-v8dw9_default\" network: rpc error: code = Unavailable desc = grpc: the connection is unavailable"
About this issue
- State: closed
- Created 6 years ago
- Reactions: 52
- Comments: 177 (56 by maintainers)
Having similar failures on AWS EKS.

Hello @druidsbane, this issue is probably active again - a lot of people are reporting it.
This is still happening. Why is this issue being closed without a proper resolution or fix?
I had a very similar problem; I managed to solve it by updating Weave Net.
If anyone comes here because they have an issue with the new t3a instances, do this: check the current CNI plugin version:

kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2

If it's less than 1.5, upgrade:

kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/master/config/v1.5/aws-k8s-cni.yaml

This solved it for me.
I also have this on EKS. Very simple situation (1 cluster node, 1 simple deployment). Just scale the deployment from 1 replica to 20 and I get pods that are stuck as ContainerCreating. Events from the pod:
issue still happening in amazon-k8s-cni:v1.7.5
Tried many ways; nothing seems to work. The issue is only solved by manually creating new nodes and terminating old ones.
This is a major bug; a proper resolution/fix is needed.
I’ve just encountered this problem on eks 1.13.7 with cni 1.5.0 on a 3 node test cluster. It may or may not be related to the fact I am terminating the nodes for testing whenever I make user-data changes.
3x m5.xlarge workers reside in 3 small subnets in 3 AZs, all /26, so that might pose unique ip assignment challenges.
Interestingly, deleting the aws-node pod that was failing to assign ips did not help. In the console, I saw a few Available ENIs and 4 active ones. Note: Only 1 of them had “aws-K8S” in description.
I deleted ALL of the Available ENIs. A moment later there were:
I’ll be testing this further, so I would be able to provide logs, but this was an unpleasant surprise as I thought by 1.5.0 such issues have been wiped out.
Note: beside external SNAT, I’m not touching any variables of the CNI.
As described in the design proposal, amazon-vpc-cni-k8s uses secondary IPs for Pods, and there is a limit based on the instance type. In your example, a t2.medium instance can have a maximum of 3 ENIs and each ENI can have up to 6 IP addresses. The amazon-vpc-cni-k8s CNI reserves 1 IP address for each ENI, so it can give out a maximum of 15 IP addresses. Since each node always runs aws-node and kube-proxy, MAX_POD is set to 17 for t2.medium.
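The arithmetic in that reply can be sketched as a quick calculation (an illustrative snippet, not part of the CNI codebase; the per-instance ENI/IP figures come from the AWS ENI documentation):

```python
# Illustrative max-pods calculation for the secondary-IP scheme described above.
def max_pods(num_enis: int, ips_per_eni: int) -> int:
    # Each ENI's primary IP is reserved by the CNI, so only
    # (ips_per_eni - 1) secondary IPs per ENI are handed out to pods.
    # +2 accounts for aws-node and kube-proxy, which use host
    # networking but still count against the kubelet's max-pods.
    return num_enis * (ips_per_eni - 1) + 2

# t2.medium: 3 ENIs, 6 IPs per ENI -> 3 * 5 + 2
print(max_pods(3, 6))  # 17
```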
@liwenwu-amazon can we reopen this issue?
For those still facing issues with the c5n instance family who are using AWS EKS, you can upgrade to the latest by applying the current release of the CNI
see documentation here
For CNI version v1.3.3 to work on c5n, or any instance family, the eni-max-pods.txt file needs updating with the instance type and mapping number. The mapping numbers for the c5n family are listed below:
to work out your own instance family type mapping numbers see here
and the calculation is:
example of c5n.2xlarge mapping number workout will be
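As a hedged sketch of that workout (the c5n figures below are taken from the AWS ENI limits documentation and are worth verifying against the upstream eni-max-pods.txt):

```python
# Sketch of the eni-max-pods mapping-number workout described above.
# Formula: max ENIs * (IPv4 addresses per ENI - 1) + 2
# c5n.2xlarge (per AWS ENI docs): 4 ENIs, 15 IPv4 addresses per ENI.
enis, ips_per_eni = 4, 15
mapping_number = enis * (ips_per_eni - 1) + 2
# Line format used in eni-max-pods.txt: "<instance-type> <mapping-number>"
print(f"c5n.2xlarge {mapping_number}")
```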
To automate the updating of the eni-max-pods.txt file, an echo command can be passed to user_data.

Same issue on t3.medium.
Can we reopen this issue?
@tomfotherby thank you for collecting /var/log/aws-routed-eni/aws-cni-support.tar.gz. From it, I found one CNI bug: whenever the CNI fails to set up the network stack for a Pod, it does NOT release the Pod's IP back to the ipamD datastore. Here are the error msgs in plug-xxx.log:

@jayanthvn Hey! In which version of the CNI has the issue been resolved? I cannot find anything related in the changelog. This issue should be re-opened.
kubernetes: v1.15.3
cni version: amazon-k8s-cni:v1.5.5
cni ENVs:

We came across this while using t2.medium instances - coredns pods are stuck in ContainerCreating:
Notice that the ENI has a maximum of 6 IP addresses; on the 7th allocation (coredns), it does not automatically use private IPs associated with other attached ENIs.
The logs are littered with:
The ENIs are properly attached to the instance:
After deleting the aws-node pod manually, we see that it's now correctly assigning IP addresses from another ENI:
I've just encountered this problem after upgrading the cluster to 1.13 in EKS. Deleting the stuck pod fixes the problem (as in, the "new" pod will be created successfully). I'm using m5.large, and the cluster is provisioned with https://github.com/terraform-aws-modules/terraform-aws-eks.

EDIT: I believe my issue is related to #318. I've patched the daemonset to use a newer version of the CNI; will see if that helps.
I hit this problem with t3.medium on kubernetes 1.11 eks.1. Recreated the t3 nodes manually. Solved. Thanks!
We’re having the same issue. We’re running the latest version of the aws-cni 1.3.
FIX - Had the same problem with EKS K8s 1.11.5, t2.medium worker nodes, amazon-k8s-cni:1.2.1, found by:

kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2

FIXED it by upgrading to the latest amazon-k8s-cni - instructions found here: https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html
@max-rocket-internet: There is no new worker AMI. A new EKS cluster automatically uses CNI 1.1. An existing cluster can be upgraded to CNI 1.1 by following https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html
Restart the daemonset's pod aws-node on the specific node which is stuck and not providing IPs. If any IP is available, it will get released.

@liwenwu-amazon we should be able to run more than 17 pods on a t2.medium instance, right? I noticed that the other TF module has a hard-coded list of pod limits per instance type: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/local.tf#L9

I have the same issue right now. Cluster version 1.18, amazon-k8s-cni-init:v1.7.5, amazon-k8s-cni:v1.7.5
@jayanthvn
I am not able to reproduce it right now. Also, since it's production I can't do much testing there. Will update when the issue happens again.
Thanks
@Erokos @adnankobir I’m hesitant to re-open an issue that was closed >18 months ago, but I’ll do so anyway. I acknowledge that you and others are experiencing unacceptable levels of annoyance dealing with this issue. I’m wondering whether it’s worth opening a new issue versus re-opening this old one (which contains links to docs that don’t exist and comments about things that no longer exist in the modern CNI plugin).
Questions for you to narrow down this bug:
@adnankobir I notice you are using custom networking. Are you able by any chance to tell if this bug occurs when you do not use custom networking? @Erokos: are you using custom networking?
@adnankobir and @Erokos: have you noticed that there are specific instance types that this bug occurs with but not with other instance types? I’m wondering if perhaps there is an error in the calculated number of ENIs and/or IPs per ENI for some instance types?
Thanks very much for your help folks! -jay
I have the exact situation as @krzysztof-bronk. Using CNI 1.5.0 does not fix.
It seems to me that when nodes terminate, the ENIs they were using do not get cleaned up. We’re using spot instances and yesterday there was a ton of churn with instances getting reclaimed and new ones coming up getting added to the cluster. At any given few minute period, we’d be gaining or losing between 10 and 100 nodes most of the day. Throughout the day, our unused ENIs went through the roof. I had to write a script to delete all “available” and run it constantly throughout the day. As long as we had ENI headroom, things went pretty well, nodes came up and pods got scheduled, but those ENIs were killer.
So basically my solution has been to write a small script to loop through and delete all available ENIs. It seems they’ll never get used and will just hang around forever. The aws_node pods will create more.
Here’s my script if anyone wants it. requires python3.6+, click, and awscli/boto3/botocore. The exception handling isn’t perfect, obviously, but it was just to get around the api rate limiting. Other accounts may have other needs and you may need to sleep every iteration of the loop
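The commenter's actual script isn't shown above; the following is a minimal sketch of the same idea, assuming boto3 is installed and AWS credentials are configured. The filtering helper is pure (no AWS calls) so it can be sanity-checked offline; deleting ENIs is destructive, so treat this as illustrative only:

```python
# Sketch of a cleanup pass for dangling "available" ENIs, as described above.
# Assumes boto3 is installed and AWS credentials/region are configured.
import time


def select_available_enis(enis):
    """Pick ENI IDs that are unattached ('available') and thus deletable."""
    return [e["NetworkInterfaceId"] for e in enis if e.get("Status") == "available"]


def delete_available_enis(region="us-east-1"):
    # boto3 imported here so the pure helper above is usable without it.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name=region)
    enis = ec2.describe_network_interfaces(
        Filters=[{"Name": "status", "Values": ["available"]}]
    )["NetworkInterfaces"]
    for eni_id in select_available_enis(enis):
        try:
            ec2.delete_network_interface(NetworkInterfaceId=eni_id)
        except ClientError:
            time.sleep(1)  # crude backoff for API rate limiting
```

As the commenter notes, this has to be run repeatedly while nodes churn, since aws-node keeps creating new ENIs.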
I'm seeing this same issue on a c5n.4xlarge instance. CoreDNS is stuck as ContainerCreating because cmd: failed to assign an IP address to container. I am using kubernetes version 1.11 on EKS with amazon-k8s-cni:v1.3.2. Everything is fine on t3.xlarge but we need c5n.4xlarge for the networking.

Looks like the issue is not resolved, guys.
I'm just trying to make use of EKS. Getting the same problem with both the 1.2.1 and 1.3 CNI versions. To reinstall the nodes, I scale the Auto Scaling Group down to 0 and back to 8. As a result, less than half of the nodes come up broken, so I need to shut those down.
Is there a workaround to release the addresses from IPAMD we can use until AWS release a new node AMI with this fix included?
@druidsbane from the log, it looks like you hit issue #18. Here is a snippet of the logs from ipamd.log.2018-04-10-17:

We have a PR (#31) and proposal #26 to address this.
@jayanthvn Thank you for looking into this, the kubelet hosts are configured to accept a maximum of 8 pods which I now see will not work for this instance type due to the ENI limitations you described. It looks like the reason that sometimes this works and other times it didn’t was due to a race condition w/ the other pods being scheduled (which for my testing cluster includes 2 coredns pods and a metrics-server pod). I will update the max pods configuration to 5 pods for this instance type and try again.
Hi @dallasmarlow
You are using t3a.small instance with custom networking. T3a.small (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html) has 2 ENIs and 4 IPs per ENI. With custom networking primary ENI will not be used. So you will have only one ENI and 3 IPs since the first IP out of 4 is reserved. Hence you will have only 3 IPs and can schedule only 3 pods.
Also, have you set the max pods value [Section 7 here - https://docs.aws.amazon.com/eks/latest/userguide/cni-custom-network.html]? Can you please share the kubelet max pods value configured on the instance?
This should be the math to set the max pods -
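Following the reply above, that math can be sketched as (illustrative snippet; with custom networking the primary ENI is excluded, and each remaining ENI's first IP is reserved):

```python
# Max-pods math under custom networking, per the explanation above.
def max_pods_custom_networking(num_enis: int, ips_per_eni: int) -> int:
    # The primary ENI is unusable with custom networking; on each
    # remaining ENI the first IP is reserved, leaving (ips_per_eni - 1)
    # for pods. +2 for aws-node and kube-proxy (host networking, but
    # still counted against the kubelet's max-pods).
    return (num_enis - 1) * (ips_per_eni - 1) + 2

# t3a.small: 2 ENIs, 4 IPs per ENI -> 1 usable ENI with 3 pod IPs
print(max_pods_custom_networking(2, 4))  # 5
```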
Hi @siddharthshah3030
Can you please kindly share the below information -
sudo bash /opt/cni/bin/aws-cni-support.sh
You can either email it to me varavaj@amazon.com or open a support ticket and let me know the support ticket ID. We will look into the issue asap.
Reopening the issue to investigate further.
Thanks.
@adnankobir In case my previous reply wasn’t clear enough, the issue is that in order to have different security groups for the nodes and the pods, we can’t use any secondary IPs from the first ENI, because that’s where the node IP is. That’s the reason you get less available IPs.
@adnankobir Yes, the formula is +2 because aws-node (the CNI pod) and kube-proxy use host networking, but are still counted against the max-pods limit.

The CNI will only use the secondary IPs for your pods, so 10 would be available for your pods in this case. The reason for this might not be valid any more, but there used to be problems using the primary IP of the ENI for pods on some distros.
For anyone running EKS, you need to update the CNI plugin to the latest version (currently 1.5.5). As per AWS docs:
Amazon EKS does not automatically upgrade the CNI plugin on your cluster when new versions are released. So this issue will reappear periodically.

To find out your current version:
kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
To update, follow the instructions at https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html
Credit goes to @pkelleratwork and their Jan 7 answer in this issue: https://github.com/aws/amazon-vpc-cni-k8s/issues/59#issuecomment-451989505
I'm getting the same issue with m4.large. Why are more than 20 pods getting scheduled on my m4.large? Then it runs out of IP addresses and I see the same issue.
Yes, the m5ad family has been added to the master branch and will be in the v1.4.0 release, planned to be out soon.
Hi @liwenwu-amazon
I think I’ve made some mistake somewhere in the Terraform code. Our TF code is largely based on this but I’ve built a second cluster based on this module and it does not have this problem.
Both clusters are using the same subnets, AMI and t2.medium node instance type, with a single node.

I see this in the /var/log/aws-routed-eni/ipamd.log.* log files:

To reproduce it, I just create the most basic deployment with 1 replica, then scale the deployment to 30 replicas, and then it happens.
Even if it is a configuration error on my part, it might be nice to know what the problem is since it seems a few people hit this error and can’t work out where things went wrong.
Yes.
Maximum 60 seconds.
I’ve emailed you the logs.
Here is the root cause of this problem: it occurs when the following 2 events happen almost simultaneously:

L-IPAM can immediately reassign an IP address that was just released to a new Pod, instead of obeying the pod cooling-period design. This can cause the CNI to fail to set up the routing table for the newly added Pod. When this failure happens, kubelet will NOT invoke a delete Pod CNI call. Since the CNI never releases this IP back in the case of failure, the IP is leaked.
Here are the snippets from plug.log.xxx:
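The leak described above can be illustrated with a toy model of the ipamD datastore (purely illustrative, not the actual ipamd code):

```python
# Toy model of the IP leak described above: when pod network setup
# fails, kubelet issues no CNI DEL, so release() is never called and
# the datastore slot for that pod's IP leaks.
class ToyIpamDatastore:
    def __init__(self, pool):
        self.free = list(pool)
        self.assigned = {}

    def assign(self, pod):
        ip = self.free.pop(0)
        self.assigned[pod] = ip
        return ip

    def release(self, pod):
        self.free.append(self.assigned.pop(pod))


store = ToyIpamDatastore(["10.0.0.10", "10.0.0.11"])
store.assign("pod-a")
setup_failed = True  # routing-table setup fails, per the race above
if setup_failed:
    # No CNI DEL arrives for the failed sandbox, so
    # store.release("pod-a") never runs -> the IP is leaked.
    pass
print(len(store.free))  # 1 of 2 IPs left; the leaked one is gone for good
```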
Hey,
issue still happening on aws eks.
cluster ARN = arn:aws:eks:us-east-1:531074996011:cluster/Cuttle
kubernetes version = 1.21 Platform version = eks.2 coredns version = v1.8.4-eksbuild.1 kube-proxy version = v1.21.2-eksbuild.2 vpc-cni version = v1.9.1-eksbuild.1
I have an AWS EKS cluster and while deploying pods I’m getting this error. “failed to assign an IP address to container”
My current container setup:
Error: (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container “0707507c6ea3d0a38b1222ebc9acfb22bf9fc8ac5f5fa73ebd272fb107324f60” network for pod “automation-archival-hub-integration-service-6dfc4b4999-v94nq”: networkPlugin cni failed to set up pod “automation-archival-hub-integration-service-6dfc4b4999-v94nq_cuttle-automation” network: add cmd: failed to assign an IP address to container.
Extra details:
Thanks, Nagaraj
Thanks for confirming @dallasmarlow 😃
Why is this issue closed when a lot of people keep reporting problems? For the past few days I’ve been experimenting with EKS cluster creation. I’m using terraform, actually a terraform module similar to the popular community module. What I’ve observed:
When I look at the nodes (kubectl describe node <node_name>) I see an error that the CNI is uninitialized; when I look at the running pods (kubectl get pods -n kube-system) I can see the core-dns pods in a "Pending" state and the aws-node pods crashing every few seconds. I've then taken some steps to see if I could fix it:
a) Downgraded the CNI version to 1.5.3 - this resulted in nodes getting to the "Ready" state but didn't fix the problem; the core-dns pods were now constantly in "ContainerCreating" status and the aws-node pods had the same behaviour. Upgrading the CNI back to 1.5.5 didn't change anything.
b) Next, I tried creating a 1.13 cluster with nodes using a 1.14 kubernetes AMI. The nodes didn't have any problems joining the cluster and were ready. I then upgraded the cluster version, and this resulted in a working 1.14 cluster with the nodes joined and ready. HOWEVER, if I increased the number of nodes in an auto scaling group, the new nodes had the same old problem of not being ready no matter what I tried.
To sum up, I've decided to use a 1.13 cluster, in which I see no problems with nodes using a 1.14 AMI, in hopes of this problem being fixed in the near future.
I've solved it for the moment by upgrading the CNI to v1.5.5 and modifying the daemonset, adding the environment variable 'WARM_IP_TARGET' set to a value based on this table (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html) to force IP address / network interface reservation.
CNI configuration doc: https://docs.aws.amazon.com/eks/latest/userguide/cni-env-vars.html
Edit: The configuration just has delayed the problem, still happening when scaling up/down both pods and nodes
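For reference, the env var tweak described in that comment looks roughly like this in the aws-node daemonset container spec (the value 5 is illustrative; pick WARM_IP_TARGET based on your instance type's IPs-per-ENI limit from the table linked above):

```yaml
# Illustrative fragment of the aws-node daemonset container spec.
env:
  - name: WARM_IP_TARGET
    value: "5"  # example value; tune per instance type's ENI/IP limits
```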
@pdostal No, still failing in 1.5.4.
My problem happened after I changed instance type to t3.medium.
I changed instance type back to t3.large, no error.
(I use branch release-1.5.4 to upgrade to 1.5.4)

Following up on my post earlier, I can easily replicate this issue even with aws-cni 1.5.1
I have a 1 node cluster, and every time I terminate the instance from the UI so that ASG can spin up a replacement, the ENI is left dangling as Available, with IPs attached, and never deleted as far as I can tell. So sooner or later you will exhaust IPs and then pods will fail to be started.
Using EKS 1.11 with cni 1.2.1, can confirm that deleting the pod and recreating is a workaround.
Please reopen - hitting this too on EKS 1.11.9 with AMI amazon-eks-node-1.11-v20190329 (ami-0e82e73403dd69fa3) and cni 1.3.2 on instance type m4.2xlarge
Hi,
I think there are a mix of problems mentioned in this issue that are now tracked in newer open issues. For example, all c5n family issues are because 1.3.2 does not include that instance type. The initial bugs in this issue were fixed by Liwen, but there are still related issues with IP and ENI allocation and deletion tracked in #69, #123, #330 and #294.
Thanks for reporting issues!
Withdrawing this. In our case it was due to IP pool exhaustion. I have provisioned worker nodes in a new autoscaling group and increased the /26 ranges to /24 and we are up and running.
I'm running into the same issue on my t3.medium instances… I've triple-checked my cluster's CNI version is amazon-k8s-cni:v1.3.2.

Describing one of the pods, I get this
and if I run /opt/cni/bin/aws-cni-support.sh on a node, I get

any ideas?
This just started happening for me. I haven’t changed anything in cluster (such as node type) lately.
UPDATE: Disregard. Updated to amazon-k8s-cni:1.3.0 and that fixed my issue.
I am hitting this problem every week now as we have quite a volatile environment.
I see v1.1.0 is released just now. How long until we have a new worker AMI? Is this the repo to watch for a new release? https://github.com/awslabs/amazon-eks-ami

Thanks.
Good job @liwenwu-amazon. I'm glad to have helped.
@stafot few more questions:
@max-rocket-internet can you collect node level debugging info by running
you can send it to me (liwenwu@amazon.com) or attach it to this issue.
In addition, I have a few more questions
thanks,
I also got stuck in ContainerCreating: Warning FailedCreatePodSandBox 16m (x58161 over 17h) kubelet, ip-172-x.x.x.us-west-2.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod “wordpress-xxxx-vmtkg_default” network: add cmd: failed to assign an IP address to container
And the fix was to reload the "aws-node" pod on the node which stopped issuing IPs.