amazon-vpc-cni-k8s: Some pods are getting stuck without external network
Our pods use vault-agent, so it is very easy to see when one gets stuck without network by grepping for pods like:
t5 webapp-8bcdc6bc-wn54v 0/1 Init:0/2 0 21m 10.64.135.103 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
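A command along these lines is enough to spot them (the Init:0/2 pattern matches our init containers; adjust it to your own pod spec):
kubectl get pods --all-namespaces -o wide | grep 'Init:0/2'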
In the logs I see connection issues:
Jun 29 07:01:34 ote001spot05-i-0cad12ccb1577476a.ad.dice.fm docker_k8s_vault-agent_partners-77c5988c88-clgh8_edge_cd84d918-b76a-11ea-b9ec-062a67e343cc_0_091bad43079c[9276]: 2020-06-29T07:01:34.839Z [ERROR] auth.handler: error authenticating: error="Put https://vault.bootstrap:8200/v1/auth/kubernetes/login: dial tcp: i/o timeout" backoff=1.101356452
Jun 29 07:01:42 ote001spot05-i-0cad12ccb1577476a.ad.dice.fm docker_k8s_vault-agent_backend-7b6bd7984-9l4vg_edge_38c19fd8-b869-11ea-b9ec-062a67e343cc_0_640b1c1ca1af[9276]: 2020-0
Some pods occasionally get stuck like that: when deploying 1500 pods, on average 17 get stuck. Often half or more of the pods on a node are fine, while the others get stuck.
Single node:
t25 partners-744ddcccbf-vftsx 1/1 Running 0 17m 10.64.153.78 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
t25 seatselection-77d4ddf89b-6nbw6 1/1 Running 0 15m 10.64.166.135 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
t26 seatselection-77d4ddf89b-mh9zk 1/1 Running 0 15m 10.64.134.196 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
t26 ticketing-687d94c666-t4k4w 1/1 Running 0 17m 10.64.150.152 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
t28 bowie-5f8855bcff-67mrq 1/1 Running 0 15m 10.64.173.8 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
t28 partners-744ddcccbf-dd2vh 0/1 Init:0/2 0 16m 10.64.152.173 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
t3 seatselection-77d4ddf89b-pjtbv 0/1 Init:0/2 0 21m 10.64.169.91 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
t5 webapp-8bcdc6bc-wn54v 0/1 Init:0/2 0 21m 10.64.135.103 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
t7 ticketing-687d94c666-slhnc 0/1 Init:0/2 0 21m 10.64.156.107 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
t8 hedwig-5596c54f5c-tqq7v 0/1 Init:0/2 0 20m 10.64.176.49 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
t8 kim-58656b9d6b-8qvmd 1/1 Running 0 20m 10.64.180.121 ip-10-64-168-217.eu-west-1.compute.internal <none> <none>
When I delete the aws-node pod from the corresponding node, they all recover instantly once the new one starts.
Hoping it might help, I tried adding MINIMUM_IP_TARGET=30 and WARM_IP_TARGET=10 to aws-node. Later I also added a “sleep” to aws-node; maybe it is a placebo, but it seems that when the autoscaler creates a new node, things go better if pods are Scheduled before aws-node starts up. Roughly how I set the targets is sketched below.
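A minimal sketch of setting those variables, assuming the stock aws-node DaemonSet in kube-system:
kubectl -n kube-system set env daemonset/aws-node MINIMUM_IP_TARGET=30 WARM_IP_TARGET=10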
It “seems” much better, but even now pods sometimes get stuck:
default debian-xcgv6 1/1 Running 0 107m 10.64.13.160 ip-10-64-62-184.eu-west-1.compute.internal <none> <none>
kube-system aws-node-s5jzc 1/1 Running 0 33s 10.64.62.184 ip-10-64-62-184.eu-west-1.compute.internal <none> <none>
kube-system kube-proxy-m8m74 1/1 Running 0 108m 10.64.62.184 ip-10-64-62-184.eu-west-1.compute.internal <none> <none>
prometheus node-cadvisor-bigspot-j6ffx 1/1 Running 0 107m 10.64.62.184 ip-10-64-62-184.eu-west-1.compute.internal <none> <none>
prometheus node-exporter-z77v6 1/1 Running 0 108m 10.64.62.184 ip-10-64-62-184.eu-west-1.compute.internal <none> <none>
search-fixes background-76ff6b85cb-4l5z5 1/1 Running 0 6m37s 10.64.18.71 ip-10-64-62-184.eu-west-1.compute.internal <none> <none>
In the output above, background-76ff6b85cb-4l5z5 was stuck until I killed the previous aws-node pod; about 30s after the new one (aws-node-s5jzc) started, it worked.
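The manual fix each time is just to bounce the CNI pod on the affected node, for example (assuming the stock k8s-app=aws-node label on the DaemonSet pods):
kubectl -n kube-system delete pod -l k8s-app=aws-node --field-selector spec.nodeName=ip-10-64-62-184.eu-west-1.compute.internal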
https://drive.google.com/drive/folders/1dfjrcBOaUc3cE7gEfcLZHlpBh_Lv2oGZ?usp=sharing
More logs:
[root@ote001spot08-127921709917439 ~]# grep background-85c5ddbf9b-jdsll /var/log/aws-routed-eni/ipamd.log
{"level":"debug","ts":"2020-07-03T10:01:34.904Z","caller":"k8sapi/discovery.go:216","msg":"Update for pod background-85c5ddbf9b-jdsll: Pending"}
{"level":"info","ts":"2020-07-03T10:01:34.904Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for Pod background-85c5ddbf9b-jdsll on my node, namespace = a1, IP = "}
{"level":"debug","ts":"2020-07-03T10:01:34.926Z","caller":"k8sapi/discovery.go:216","msg":"Update for pod background-85c5ddbf9b-jdsll: Pending"}
{"level":"info","ts":"2020-07-03T10:01:34.926Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for Pod background-85c5ddbf9b-jdsll on my node, namespace = a1, IP = "}
{"level":"debug","ts":"2020-07-03T10:06:22.397Z","caller":"k8sapi/discovery.go:216","msg":"Update for pod background-85c5ddbf9b-jdsll: Pending"}
{"level":"info","ts":"2020-07-03T10:06:22.397Z","caller":"k8sapi/discovery.go:216","msg":"Add/Update for Pod background-85c5ddbf9b-jdsll on my node, namespace = a1, IP = "}
kubectl describe output when stuck:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 9m45s default-scheduler Successfully assigned a1/background-85c5ddbf9b-jdsll to ip-10-64-154-188.eu-west-1.compute.internal
Warning FailedCreatePodSandBox 9m16s (x13 over 9m43s) kubelet, ip-10-64-154-188.eu-west-1.compute.internal Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "background-85c5ddbf9b-jdsll": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"EOF\"": unknown
Normal SandboxChanged 4m43s (x144 over 9m42s) kubelet, ip-10-64-154-188.eu-west-1.compute.internal Pod sandbox changed, it will be killed and re-created.
Plugin logs:
{"level":"info","ts":"2020-07-02T09:40:06.799Z","caller":"driver/driver.go:316","msg":"Delete toContainer rule for 10.64.154.149/32 "}
{"level":"info","ts":"2020-07-02T09:40:06.800Z","caller":"driver/driver.go:316","msg":"Delete fromContainer rule for 10.64.154.149/32 in table 2"}
{"level":"debug","ts":"2020-07-02T09:40:06.800Z","caller":"driver/driver.go:316","msg":"Tear down of NS complete"}
{"level":"info","ts":"2020-07-02T11:26:41.320Z","caller":"routed-eni-cni-plugin/cni.go:252","msg":"Received CNI del request: ContainerID(4f69bc0a130e34d97457635a8fe235f54a6b278563e627ce5ab5b7659ae2f4e3) Netns(/proc/5754/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=prod-7529;K8S_POD_NAME=background-84fd9c4d49-qdqqt;K8S_POD_INFRA_CONTAINER_ID=4f69bc0a130e34d97457635a8fe235f54a6b278563e627ce5ab5b7659ae2f4e3) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"Debug\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2020-07-02T11:26:41.327Z","caller":"routed-eni-cni-plugin/cni.go:322","msg":"TeardownNS: addr 10.64.131.128/32, table 2"}
{"level":"info","ts":"2020-07-02T11:26:41.327Z","caller":"driver/driver.go:316","msg":"Delete toContainer rule for 10.64.131.128/32 "}
{"level":"info","ts":"2020-07-02T11:26:41.328Z","caller":"driver/driver.go:316","msg":"Delete fromContainer rule for 10.64.131.128/32 in table 2"}
{"level":"debug","ts":"2020-07-02T11:26:41.328Z","caller":"driver/driver.go:316","msg":"Tear down of NS complete"}
{"level":"info","ts":"2020-07-02T12:25:37.856Z","caller":"routed-eni-cni-plugin/cni.go:128","msg":"Received CNI add request: ContainerID(a43c5abc04202bab9a0769834bc61e18b64ddbd0b7e52594aecd92877fa56f54) Netns(/proc/22410/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=discovery-badges;K8S_POD_NAME=background-7c4dbf5b96-zjs9r;K8S_POD_INFRA_CONTAINER_ID=a43c5abc04202bab9a0769834bc61e18b64ddbd0b7e52594aecd92877fa56f54) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"Debug\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2020-07-02T12:25:37.857Z","caller":"routed-eni-cni-plugin/cni.go:128","msg":"MTU value set is 9001:"}
{"level":"info","ts":"2020-07-02T12:25:37.860Z","caller":"routed-eni-cni-plugin/cni.go:128","msg":"Received add network response for pod background-7c4dbf5b96-zjs9r namespace discovery-badges sandbox a43c5abc04202bab9a0769834bc61e18b64ddbd0b7e52594aecd92877fa56f54: 10.64.147.197, table 0, external-SNAT: false, vpcCIDR: [10.64.0.0/16]"}
{"level":"debug","ts":"2020-07-02T12:25:37.860Z","caller":"routed-eni-cni-plugin/cni.go:203","msg":"SetupNS: hostVethName=eni9d2f9c4fffa, contVethName=eth0, netnsPath=/proc/22410/ns/net, table=0, mtu=9001"}
{"level":"debug","ts":"2020-07-02T12:25:37.887Z","caller":"driver/driver.go:173","msg":"SetupNS: disabled IPv6 RA and ICMP redirects on eni9d2f9c4fffa"}
{"level":"debug","ts":"2020-07-02T12:25:37.895Z","caller":"driver/driver.go:173","msg":"Setup host route outgoing hostVeth, LinkIndex 262"}
{"level":"debug","ts":"2020-07-02T12:25:37.896Z","caller":"driver/driver.go:173","msg":"Successfully set host route to be 10.64.147.197/0"}
{"level":"info","ts":"2020-07-02T12:25:37.923Z","caller":"driver/driver.go:173","msg":"Added toContainer rule for 10.64.147.197/32"}
This time it actually got an IP, but only after 11 minutes; most of the time it is not as lucky (there were only 6 pods on this node, so plenty of IPs were already attached to the VM waiting to be used, given MINIMUM_IP_TARGET=30).
~❯ kubectl --namespace a1 get pod -o wide|grep background
background-85c5ddbf9b-jdsll 1/1 Running 0 14m 10.64.166.246 ip-10-64-154-188.eu-west-1.compute.internal <none> <none>
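For what it is worth, the warm pool state can also be inspected from the node itself via ipamd's local introspection endpoint (assuming it is enabled and listening on its default port):
# on the affected node: dump ENIs and the IPs ipamd has attached/assigned
curl -s http://localhost:61679/v1/enis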
The problem started last week. We had been using aws-node 1.5.X for months; after the issues appeared, we upgraded to the current version last Friday, which didn't help.
We have over 100 nodes. This is a cluster where developers create their personal environments, so with that much churn (sometimes a thousand pods per day) the issue happens every couple of days, and sometimes even twice per day during heavy activity, on random node(s). It looks like an edge case: roughly 98% of pods are fine, but under heavy activity it becomes a real issue. When creating 20 environments (around 1500 pods in total), it is almost certain that 5-20 pods will get stuck.
Just to be clear, it doesn't only happen during heavy activity; in rare cases it happened when deploying a single pod to a node, but deploying tons of pods seems to be an effective way to check whether the issue is still there or fixed. Yesterday at 5PM I deployed 1500 pods 4 times without any issue, but today I got a single stuck pod again. Maybe it is an issue on the AWS side, some endpoint used by aws-node?
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 24 (7 by maintainers)
@rochacon We are planning to get a v1.7.2 out before the end of next week, if all testing continues to go well. We don't plan to backport this to the v1.6.x branch; too much has changed in the code base, so an upgrade will be required.
This is what I am running every minute in cron:
It seems to be doing the trick.
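A minimal sketch of that kind of workaround (not necessarily the exact script), assuming stuck pods sit in an Init: state for several minutes and that the DaemonSet pods carry the stock k8s-app=aws-node label:
#!/bin/bash
# Hypothetical sketch of the cron job described above.
# Find nodes that have a pod stuck in an Init: state for >= 5 minutes and
# bounce the aws-node pod on each of them. Crude, but it matches the
# observation that restarting aws-node unsticks the pods.
# (The age check only handles ages printed in minutes, which is good enough
# for a job that runs every minute.)
kubectl get pods --all-namespaces -o wide --no-headers \
  | awk '$4 ~ /^Init:/ && $6 ~ /^[0-9]+m/ && $6+0 >= 5 {print $8}' \
  | sort -u \
  | while read -r node; do
      kubectl -n kube-system delete pod -l k8s-app=aws-node \
        --field-selector "spec.nodeName=$node"
    done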
@jayanthvn
The stuck pod is
When I check routing tables 2 and 3, the third one is empty:
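For reference, these tables can be checked on the node with standard iproute2 commands:
ip rule show
ip route show table 2
ip route show table 3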
The changes in #1177 fix this issue.
@michalzxc Thank you for reporting these issues. We are actively working on the fix. Sorry for the delayed response.
@mogren Hey, could you take a look at this one as well?
I've been having the same issue with v1.6.x; I tried all versions of this series and still get new pods without a working network from time to time. Currently running v1.6.3. I was not able to reproduce this consistently; just as the OP mentioned, the issue seems to be random. It seems to be related to IP reuse: I noticed that some of the stuck pods had the same IP as an old job pod (in Completed status).
In my case, the stuck pods can't access CoreDNS or other pods either; a container restart has no effect, and the only fix I found is to delete the affected pod.
I wonder if this could be related to SNAT changes on 1.6? Or maybe IP cooldown?