amazon-vpc-cni-k8s: Some pods are getting stuck without external network

Our pods use vault-agent, so it is very easy to see when a pod is stuck without network by grepping for pods like:

t5                              webapp-8bcdc6bc-wn54v                                     0/1     Init:0/2                0          21m     10.64.135.103   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
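For reference, spotting them is just a wide pod listing piped through grep, something along these lines (matching on the init state shown above):

# list pods in all namespaces and keep the ones stuck in the vault-agent init stage
kubectl get pods --all-namespaces -o wide | grep 'Init:0/2'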

In the logs I see connection issues:

Jun 29 07:01:34 ote001spot05-i-0cad12ccb1577476a.ad.dice.fm docker_k8s_vault-agent_partners-77c5988c88-clgh8_edge_cd84d918-b76a-11ea-b9ec-062a67e343cc_0_091bad43079c[9276]: 2020-06-29T07:01:34.839Z [ERROR] auth.handler: error authenticating: error="Put https://vault.bootstrap:8200/v1/auth/kubernetes/login: dial tcp: i/o timeout" backoff=1.101356452
Jun 29 07:01:42 ote001spot05-i-0cad12ccb1577476a.ad.dice.fm docker_k8s_vault-agent_backend-7b6bd7984-9l4vg_edge_38c19fd8-b869-11ea-b9ec-062a67e343cc_0_640b1c1ca1af[9276]: 2020-0

Sometimes pods get stuck like that: when deploying 1500 pods, on average 17 get stuck. Often half or more of the pods on a node are fine, while the others get stuck.

Single node:

t25                             partners-744ddcccbf-vftsx                                 1/1     Running                 0          17m     10.64.153.78    ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t25                             seatselection-77d4ddf89b-6nbw6                            1/1     Running                 0          15m     10.64.166.135   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t26                             seatselection-77d4ddf89b-mh9zk                            1/1     Running                 0          15m     10.64.134.196   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t26                             ticketing-687d94c666-t4k4w                                1/1     Running                 0          17m     10.64.150.152   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t28                             bowie-5f8855bcff-67mrq                                    1/1     Running                 0          15m     10.64.173.8     ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t28                             partners-744ddcccbf-dd2vh                                 0/1     Init:0/2                0          16m     10.64.152.173   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t3                              seatselection-77d4ddf89b-pjtbv                            0/1     Init:0/2                0          21m     10.64.169.91    ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t5                              webapp-8bcdc6bc-wn54v                                     0/1     Init:0/2                0          21m     10.64.135.103   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t7                              ticketing-687d94c666-slhnc                                0/1     Init:0/2                0          21m     10.64.156.107   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t8                              hedwig-5596c54f5c-tqq7v                                   0/1     Init:0/2                0          20m     10.64.176.49    ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>
t8                              kim-58656b9d6b-8qvmd                                      1/1     Running                 0          20m     10.64.180.121   ip-10-64-168-217.eu-west-1.compute.internal   <none>           <none>

When I delete the aws-node pod from the corresponding node, they all recover instantly as soon as the new one starts.
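For reference, "deleting the aws-node pod" is nothing more than this (pod name taken from the listing further down; the DaemonSet recreates it immediately):

# delete the aws-node copy running on the affected node
kubectl -n kube-system delete pod aws-node-s5jzc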

In the hope that it might improve things, I tried adding MINIMUM_IP_TARGET=30 and WARM_IP_TARGET=10 to aws-node. Later I also added a “sleep” to aws-node; maybe it is a placebo, but it seems that when a new node is created by the autoscaler, things go better if the pods are Scheduled before aws-node starts up.
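For reference, I set those variables on the aws-node DaemonSet roughly like this (any way of editing the DaemonSet env works; this is just one option):

# bump the warm IP pool so IPs are pre-attached to the instance
kubectl -n kube-system set env daemonset/aws-node MINIMUM_IP_TARGET=30 WARM_IP_TARGET=10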

It “seems” much better, but even now pods sometimes get stuck:

default                         debian-xcgv6                                              1/1     Running       0          107m    10.64.13.160    ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>
kube-system                     aws-node-s5jzc                                            1/1    Running       0          33s     10.64.62.184    ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>
kube-system                     kube-proxy-m8m74                                          1/1     Running       0          108m    10.64.62.184    ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>
prometheus                      node-cadvisor-bigspot-j6ffx                               1/1     Running       0          107m    10.64.62.184    ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>
prometheus                      node-exporter-z77v6                                       1/1     Running       0          108m    10.64.62.184    ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>
search-fixes                    background-76ff6b85cb-4l5z5                               1/1     Running       0          6m37s   10.64.18.71     ip-10-64-62-184.eu-west-1.compute.internal    <none>           <none>

In the listing above, background-76ff6b85cb-4l5z5 was stuck until I killed the previous aws-node pod; about 30s after the new aws-node-s5jzc started, it worked.

https://drive.google.com/drive/folders/1dfjrcBOaUc3cE7gEfcLZHlpBh_Lv2oGZ?usp=sharing

More logs:

[root@ote001spot08-127921709917439 ~]# grep background-85c5ddbf9b-jdsll /var/log/aws-routed-eni/ipamd.log
{"level":"debug","ts":"2020-07-03T10:01:34.904Z","caller":"k8sapi/discovery.go:216","msg":"Update for pod background-85c5ddbf9b-jdsll: Pending"}
{"level":"info","ts":"2020-07-03T10:01:34.904Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for Pod background-85c5ddbf9b-jdsll on my node, namespace = a1, IP = "}
{"level":"debug","ts":"2020-07-03T10:01:34.926Z","caller":"k8sapi/discovery.go:216","msg":"Update for pod background-85c5ddbf9b-jdsll: Pending"}
{"level":"info","ts":"2020-07-03T10:01:34.926Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for Pod background-85c5ddbf9b-jdsll on my node, namespace = a1, IP = "}
{"level":"debug","ts":"2020-07-03T10:06:22.397Z","caller":"k8sapi/discovery.go:216","msg":"Update for pod background-85c5ddbf9b-jdsll: Pending"}
{"level":"info","ts":"2020-07-03T10:06:22.397Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for Pod background-85c5ddbf9b-jdsll on my node, namespace = a1, IP = "}

Describe output while the pod was stuck:

  Type     Reason                  Age                      From                                                  Message
  ----     ------                  ----                     ----                                                  -------
  Normal   Scheduled               9m45s                    default-scheduler                                     Successfully assigned a1/background-85c5ddbf9b-jdsll to ip-10-64-154-188.eu-west-1.compute.internal
  Warning  FailedCreatePodSandBox  9m16s (x13 over 9m43s)   kubelet, ip-10-64-154-188.eu-west-1.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "background-85c5ddbf9b-jdsll": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown
  Normal   SandboxChanged          4m43s (x144 over 9m42s)  kubelet, ip-10-64-154-188.eu-west-1.compute.internal  Pod sandbox changed, it will be killed and re-created.

Plugin logs:

{"level":"info","ts":"2020-07-02T09:40:06.799Z","caller":"driver/driver.go:316","msg":"Delete toContainer rule for 10.64.154.149/32 "}
{"level":"info","ts":"2020-07-02T09:40:06.800Z","caller":"driver/driver.go:316","msg":"Delete fromContainer rule for 10.64.154.149/32 in table 2"}
{"level":"debug","ts":"2020-07-02T09:40:06.800Z","caller":"driver/driver.go:316","msg":"Tear down of NS complete"}
{"level":"info","ts":"2020-07-02T11:26:41.320Z","caller":"routed-eni-cni-plugin/cni.go:252","msg":"Received CNI del request: ContainerID(4f69bc0a130e34d97457635a8fe235f54a6b278563e627ce5ab5b7659ae2f4e3) Netns(/proc/5754/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=prod-7529;K8S_POD_NAME=background-84fd9c4d49-qdqqt;K8S_POD_INFRA_CONTAINER_ID=4f69bc0a130e34d97457635a8fe235f54a6b278563e627ce5ab5b7659ae2f4e3) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"Debug\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2020-07-02T11:26:41.327Z","caller":"routed-eni-cni-plugin/cni.go:322","msg":"TeardownNS: addr 10.64.131.128/32, table 2"}
{"level":"info","ts":"2020-07-02T11:26:41.327Z","caller":"driver/driver.go:316","msg":"Delete toContainer rule for 10.64.131.128/32 "}
{"level":"info","ts":"2020-07-02T11:26:41.328Z","caller":"driver/driver.go:316","msg":"Delete fromContainer rule for 10.64.131.128/32 in table 2"}
{"level":"debug","ts":"2020-07-02T11:26:41.328Z","caller":"driver/driver.go:316","msg":"Tear down of NS complete"}
{"level":"info","ts":"2020-07-02T12:25:37.856Z","caller":"routed-eni-cni-plugin/cni.go:128","msg":"Received CNI add request: ContainerID(a43c5abc04202bab9a0769834bc61e18b64ddbd0b7e52594aecd92877fa56f54) Netns(/proc/22410/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=discovery-badges;K8S_POD_NAME=background-7c4dbf5b96-zjs9r;K8S_POD_INFRA_CONTAINER_ID=a43c5abc04202bab9a0769834bc61e18b64ddbd0b7e52594aecd92877fa56f54) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"Debug\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2020-07-02T12:25:37.857Z","caller":"routed-eni-cni-plugin/cni.go:128","msg":"MTU value set is 9001:"}
{"level":"info","ts":"2020-07-02T12:25:37.860Z","caller":"routed-eni-cni-plugin/cni.go:128","msg":"Received add network response for pod background-7c4dbf5b96-zjs9r namespace discovery-badges sandbox a43c5abc04202bab9a0769834bc61e18b64ddbd0b7e52594aecd92877fa56f54: 10.64.147.197, table 0, external-SNAT: false, vpcCIDR: [10.64.0.0/16]"}
{"level":"debug","ts":"2020-07-02T12:25:37.860Z","caller":"routed-eni-cni-plugin/cni.go:203","msg":"SetupNS: hostVethName=eni9d2f9c4fffa, contVethName=eth0, netnsPath=/proc/22410/ns/net, table=0, mtu=9001"}
{"level":"debug","ts":"2020-07-02T12:25:37.887Z","caller":"driver/driver.go:173","msg":"SetupNS: disabled IPv6 RA and ICMP redirects on eni9d2f9c4fffa"}
{"level":"debug","ts":"2020-07-02T12:25:37.895Z","caller":"driver/driver.go:173","msg":"Setup host route outgoing hostVeth, LinkIndex 262"}
{"level":"debug","ts":"2020-07-02T12:25:37.896Z","caller":"driver/driver.go:173","msg":"Successfully set host route to be 10.64.147.197/0"}
{"level":"info","ts":"2020-07-02T12:25:37.923Z","caller":"driver/driver.go:173","msg":"Added toContainer rule for 10.64.147.197/32"}

This time it actually got the IP, but only after 11 minutes; most of the time it is not as lucky (there were only 6 pods on this node, so with MINIMUM_IP_TARGET=30 all the IPs were already attached to the VM, waiting to be used).

~❯ kubectl --namespace a1 get pod -o wide|grep background
background-85c5ddbf9b-jdsll       1/1     Running     0          14m   10.64.166.246   ip-10-64-154-188.eu-west-1.compute.internal   <none>           <none>

The problem started last week. We had previously been running aws-node 1.5.X for months; after the issues started we upgraded to the current version last Friday, which didn’t help.

We have over 100 nodes. This is a cluster where developers create their personal environments, so as a result of such heavy rotation (sometimes thousands of pods per day) the issue happens every couple of days, and sometimes even twice a day when there is a lot of activity, on random node(s). It looks like an edge case: roughly 98% of pods are fine, but at high volume it becomes a real issue. When creating 20 environments (around 1500 pods in total), it is almost certain that 5-20 pods will get stuck.

Just to make it clear, it doesn’t only happen during heavy activity; in some rare cases it happened when deploying a single pod to a node, but deploying tons of pods is an effective way to check whether the issue is still there or fixed. Yesterday at 5PM I deployed 1500 pods 4 times without any issue, but today I got a single stuck pod again. Maybe it is an issue on the AWS side, some endpoint that aws-node depends on?


Most upvoted comments

@rochacon We are planning to get a v1.7.2 out before the end of next week, if all testing continues to go well. We don’t plan to backport this to the v1.6.x branch; too much has changed in the code base, so an upgrade will be required.

This is what I am running every minute in cron:

#!/bin/bash
# Check the per-ENI routing tables that the CNI should have populated and
# recreate the default/gateway routes if a table is missing or empty.

echo "Start Date: $(date)"
INTERFACES="$(ip addr | grep -Eo "eth[0-9]+" | sort | uniq)"
echo -e "Interfaces:\n${INTERFACES}"
# the CNI's per-ENI route tables start at table 2
i=2
for int in $INTERFACES; do
        # default gateway taken from the main routing table
        GW="$(ip route | grep default | head -n 1 | grep -Eo '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+')"
        echo "Checking routing table ${i}"
        if [ -z "$(ip route show table ${i} | grep default -A 1 -B 1 | grep scope)" ]; then
                echo "Routing table ${i} does not exist or is empty:"
                ip route show table ${i}
                echo "---"
                echo "Adding routes"
                echo "ip route add default via ${GW} dev ${int} table ${i}:"
                ip route add default via ${GW} dev ${int} table ${i}
                echo "---"
                echo "ip route add ${GW} dev ${int} scope link table ${i}:"
                ip route add ${GW} dev ${int} scope link table ${i}
                echo "---"
        else
                echo "Routing table ${i} exists and looks OK:"
                ip route show table ${i}
                echo "---"
        fi
        i=$((i + 1))
done
echo -e "-----------------------------------\n"

Seems to be doing the trick.

@jayanthvn

[root@ote001spot03-i-0a4976e528bc159aa ~]# ip rule list
0:	from all lookup local 
512:	from all to 10.64.143.178 lookup main 
512:	from all to 10.64.147.163 lookup main 
512:	from all to 10.64.174.118 lookup main 
512:	from all to 10.64.191.29 lookup main 
512:	from all to 10.64.167.92 lookup main 
512:	from all to 10.64.190.125 lookup main 
512:	from all to 10.64.183.176 lookup main 
512:	from all to 10.64.189.108 lookup main 
512:	from all to 10.64.136.198 lookup main 
512:	from all to 10.64.158.86 lookup main 
512:	from all to 10.64.180.224 lookup main 
512:	from all to 10.64.178.20 lookup main 
512:	from all to 10.64.136.247 lookup main 
512:	from all to 10.64.191.232 lookup main 
512:	from all to 10.64.140.156 lookup main 
512:	from all to 10.64.163.78 lookup main 
512:	from all to 10.64.151.63 lookup main 
512:	from all to 10.64.150.203 lookup main 
512:	from all to 10.64.180.84 lookup main 
1024:	from all fwmark 0x80/0x80 lookup main 
1536:	from 10.64.158.86 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.180.224 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.178.20 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.136.247 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.191.232 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.140.156 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.163.78 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.151.63 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.150.203 to 10.64.0.0/16 lookup 2 
1536:	from 10.64.180.84 to 10.64.0.0/16 lookup 3 
32766:	from all lookup main 
32767:	from all lookup default

The stuck pod corresponds to this rule:

1536:	from 10.64.180.84 to 10.64.0.0/16 lookup 3 

When I check routing tables 2 and 3, the third one is empty:

[root@ote001spot03-i-0a4976e528bc159aa ~]# ip route show table 3
[root@ote001spot03-i-0a4976e528bc159aa ~]# ip route show table 2
default via 10.64.128.1 dev eth0 
10.64.128.1 dev eth0 scope link 
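For reference, repairing an empty table like this is roughly what the script above ends up running (the device has to be whichever ENI table 3 belongs to; eth1 here is only an assumption, and the gateway is taken from the working table 2 above):

# recreate the default and gateway routes for the empty table
ip route add default via 10.64.128.1 dev eth1 table 3
ip route add 10.64.128.1 dev eth1 scope link table 3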

The changes in #1177 fix this issue.

@michalzxc Thank you for reporting these issues. We are actively working on the fix. Sorry for the delayed response.

@mogren Hey, could you take a look at this one as well?

I’ve been having the same issue with v1.6.x; I tried all versions of this series and still get new pods without a working network from time to time. Currently running v1.6.3. I was not able to reproduce this consistently; just as OP mentioned, the issue seems to be random. It seems to be related to IP reuse: I noticed that some of the stuck pods had the same IP as some old job pod (in completed status).
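A quick way to check for that kind of reuse is to grep the suspect IP across all pods, completed ones included, e.g.:

# substitute the stuck pod's IP; completed Job pods may still list their old IP
kubectl get pods --all-namespaces -o wide | grep 10.64.147.197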

In my case, the stuck pods can’t access CoreDNS or other pods either; restarting the container has no effect, and the only way I found to fix it is to delete the affected pod.

I wonder if this could be related to the SNAT changes in 1.6? Or maybe the IP cooldown?