amazon-vpc-cni-k8s: aws-node pod restarts without any obvious errors

EKS: v1.11.5
CNI: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
AMI: amazon-eks-node-1.11-v20181210 (ami-0a9006fb385703b54)

For example:

$ kubectl -n kube-system get pods -l k8s-app=aws-node
NAME             READY   STATUS    RESTARTS   AGE
aws-node-2t4n4   1/1     Running   1          20h
aws-node-45j6l   1/1     Running   1          3h
aws-node-4mstw   1/1     Running   0          22h
aws-node-95smx   1/1     Running   0          1h
aws-node-9cz4c   1/1     Running   1          1h
aws-node-9nkzt   1/1     Running   0          3h
aws-node-9pfgq   1/1     Running   0          2h
aws-node-cr5ds   1/1     Running   1          1h
aws-node-hhtrt   1/1     Running   2          1h
aws-node-j8brm   1/1     Running   0          6d
aws-node-jvvgc   1/1     Running   1          1h
aws-node-kd7ld   1/1     Running   1          22h
aws-node-mr7dh   1/1     Running   1          1h
aws-node-ntn57   1/1     Running   1          1h
aws-node-tntxp   1/1     Running   1          2h
aws-node-vk6cz   1/1     Running   0          2h
aws-node-vtpz7   1/1     Running   1          4h
aws-node-xm9wz   1/1     Running   1          1h

Even when I describe pod aws-node-hhtrt there are no events, and there are no interesting logs from the current pod or the previous one. I looked in our logging system to get all logs from this pod and there is nothing beyond the normal startup messages. But I did see this message from pod aws-node-9cz4c:

Failed to communicate with K8S Server. Please check instance security groups or http proxy setting	
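
For reference, the checks above were along these lines (nothing useful in any of them):

$ kubectl -n kube-system describe pod aws-node-hhtrt              # no events
$ kubectl -n kube-system logs aws-node-hhtrt                      # only the normal startup messages
$ kubectl -n kube-system logs aws-node-hhtrt --previous           # logs from the container before the restart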

I tried to run /opt/cni/bin/aws-cni-support.sh on the node hosting pod aws-node-hhtrt, but I got this error:

[root@ip-10-0-25-4 ~]# /opt/cni/bin/aws-cni-support.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1223  100  1223    0     0   1223      0  0:00:01 --:--:--  0:00:01 1194k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   912  100   912    0     0    912      0  0:00:01 --:--:--  0:00:01  890k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   106  100   106    0     0    106      0  0:00:01 --:--:--  0:00:01  103k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    83  100    83    0     0     83      0  0:00:01 --:--:--  0:00:01 83000
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    28  100    28    0     0     28      0  0:00:01 --:--:--  0:00:01 28000
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  6268  100  6268    0     0   6268      0  0:00:01 --:--:--  0:00:01 6121k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to connect to localhost port 10255: Connection refused
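
The failing call is the script trying to reach the kubelet read-only port (10255) on localhost. As a quick sanity check (my assumption about the failure mode, not something the script output confirms), you can see whether anything is listening on that port at all, or only on the authenticated port:

$ sudo ss -lntp | grep 10255   # kubelet read-only port; may be disabled by kubelet config on some AMIs
$ sudo ss -lntp | grep 10250   # kubelet authenticated port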

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 3
  • Comments: 33 (19 by maintainers)

Most upvoted comments

Any ETA for a fix?

@kivagant-ba Also, this PR that got merged into 1.14 seems relevant https://github.com/kubernetes/kubernetes/pull/70994.

JFYI: I ran into the issue, and after upgrading aws-node (amazon-k8s-cni) to 1.4 I got the following error from the pod on the same node.

➜ kubectl logs -n kube-system aws-node-xm4dd
=====Starting installing AWS-CNI =========
=====Starting amazon-k8s-agent ===========
ERROR: logging before flag.Parse: E0418 16:22:29.501108      11 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: connect: connection timed out)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting%

Pod restart didn’t help.

All other nodes were ok and still are ok. I terminated the broken node to replace with another one.
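
In case it helps anyone debugging a similar node: the timeout above is to 172.20.0.1, the kubernetes service ClusterIP, so (as a guess at the failure mode) it is worth checking from the node whether kube-proxy has actually programmed the NAT rules for that VIP and whether it is reachable at all:

$ sudo iptables -t nat -L KUBE-SERVICES -n | grep 172.20.0.1   # kube-proxy should have a rule for the kubernetes ClusterIP
$ curl -sk https://172.20.0.1:443/version                      # even a 401/403 here would prove the VIP is reachable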

Yeah, this looks like a race condition where aws-node (with its kubernetes client code) comes up before kube-proxy (and possibly kube-dns/coredns) has set up the kubernetes.default.svc.cluster.local service address. If you run the following command on your EKS cluster, you’ll see that kube-proxy uses the public endpoint.

$ kubectl get po \
  -n kube-system \
  -l k8s-app=kube-proxy \
  -o jsonpath='{range .items[0].spec.containers[0].command[2]}{@}{"\n"}{end}'

We need to update CNI to have a similar flag so it isn’t dependent on DNS or kube-proxy being up.
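
Until such a flag exists, one possible workaround (an untested sketch on my part, assuming the agent uses client-go’s standard in-cluster config) is to point the pod directly at the EKS API endpoint instead of the in-cluster ClusterIP, by overriding the env vars client-go reads:

$ kubectl -n kube-system set env daemonset/aws-node \
    KUBERNETES_SERVICE_HOST=<your-eks-cluster-api-endpoint-hostname> \
    KUBERNETES_SERVICE_PORT=443

Explicit container env takes precedence over the values the kubelet injects, so this should remove the dependency on kube-proxy having programmed the ClusterIP before aws-node starts.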

Hey @jaypipes!

I’ve tried a couple of different versions, from 1.5.0 up to the one I’m currently using, 1.5.5, but the issue is still persistent. The K8S version is 1.14.10. The full log I’m seeing is:

====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0108 17:39:47.947983 9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://100.66.122.254:443/api?timeout=32s: x509: certificate is valid for 100.64.0.1, 127.0.0.1, not 100.66.122.254)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting

Notably, 100.64.0.1 seems to be the correct IP for the kube-apiserver (as determined from the ClusterIP of the kubernetes service in the default namespace).

Okay, while typing this I managed to find the root cause of the issue. In my case there was a kubernetes SVC in the kube-system namespace as well as the usual one in the default namespace. Apparently the CNI pod first looks for a kubernetes SVC in the kube-system namespace (which is not maintained or updated by the controller). Additionally, even though the kubernetes SVC in the kube-system namespace was load-balancing to the correct IPs, the cert was only valid for the IP of the kubernetes SVC in the default namespace. Simply deleting the SVC in the kube-system namespace fixed the issue.
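
For anyone hitting the same symptom, the check and the fix boil down to something like this (the delete obviously assumes the service in kube-system really is a stray copy and nothing else depends on it):

$ kubectl -n default get svc kubernetes          # the real apiserver service (ClusterIP 100.64.0.1 in my case)
$ kubectl -n kube-system get svc kubernetes      # should normally not exist at all
$ kubectl -n kube-system delete svc kubernetes   # remove the stray service the CNI was picking up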

I’m not sure if this is a bug or intended behavior, but it should probably be noted somewhere that an SVC named kubernetes in the kube-system namespace will cause the CNI to fail if that SVC isn’t properly set up to talk to the apiserver (even if there’s a properly configured SVC in the default namespace). I think it might make sense to reverse the lookup order for the kubernetes SVC and prioritize the default namespace, since that’s the one designed to talk to the apiserver per the K8S docs: https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/#directly-accessing-the-rest-api-1
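
If you want to confirm which IPs the apiserver cert actually covers (the SAN mismatch in the x509 error above), one way is to pull the certificate from the service IP, e.g.:

$ echo | openssl s_client -connect 100.66.122.254:443 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'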