amazon-vpc-cni-k8s: aws-node pod restarts without any obvious errors
EKS: v1.11.5
CNI: 602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:1.3.0
AMI: amazon-eks-node-1.11-v20181210 (ami-0a9006fb385703b54)
For example:
$ kubectl -n kube-system get pods -l k8s-app=aws-node
NAME READY STATUS RESTARTS AGE
aws-node-2t4n4 1/1 Running 1 20h
aws-node-45j6l 1/1 Running 1 3h
aws-node-4mstw 1/1 Running 0 22h
aws-node-95smx 1/1 Running 0 1h
aws-node-9cz4c 1/1 Running 1 1h
aws-node-9nkzt 1/1 Running 0 3h
aws-node-9pfgq 1/1 Running 0 2h
aws-node-cr5ds 1/1 Running 1 1h
aws-node-hhtrt 1/1 Running 2 1h
aws-node-j8brm 1/1 Running 0 6d
aws-node-jvvgc 1/1 Running 1 1h
aws-node-kd7ld 1/1 Running 1 22h
aws-node-mr7dh 1/1 Running 1 1h
aws-node-ntn57 1/1 Running 1 1h
aws-node-tntxp 1/1 Running 1 2h
aws-node-vk6cz 1/1 Running 0 2h
aws-node-vtpz7 1/1 Running 1 4h
aws-node-xm9wz 1/1 Running 1 1h
Even when I describe pod aws-node-hhtrt, there are no events. There are no interesting logs from the pod either, nor from the previous container instance. I looked in our logging system for all logs from this pod and there is nothing beyond the normal startup messages. But from pod aws-node-9cz4c I did see this message:
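For reference, this is roughly how I collected logs from one of the restarted pods (a minimal sketch; the pod name is taken from the listing above, and the ipamd log path is the one documented for this CNI):

```shell
# Pull logs from the current and the previous container instance of a
# restarted aws-node pod (the --previous flag shows the crashed container).
POD=aws-node-hhtrt
kubectl -n kube-system logs "$POD"
kubectl -n kube-system logs "$POD" --previous
# ipamd also logs to the node's filesystem:
#   /var/log/aws-routed-eni/ipamd.log
```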
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
I tried to run /opt/cni/bin/aws-cni-support.sh on the node running pod aws-node-hhtrt, but I get this error:
[root@ip-10-0-25-4 ~]# /opt/cni/bin/aws-cni-support.sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1223 100 1223 0 0 1223 0 0:00:01 --:--:-- 0:00:01 1194k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 912 100 912 0 0 912 0 0:00:01 --:--:-- 0:00:01 890k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 106 100 106 0 0 106 0 0:00:01 --:--:-- 0:00:01 103k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 83 100 83 0 0 83 0 0:00:01 --:--:-- 0:00:01 83000
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 28 100 28 0 0 28 0 0:00:01 --:--:-- 0:00:01 28000
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 6268 100 6268 0 0 6268 0 0:00:01 --:--:-- 0:00:01 6121k
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
curl: (7) Failed to connect to localhost port 10255: Connection refused
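The final curl in the collection script hits the kubelet's unauthenticated read-only port (10255). A hedged local check of whether anything is listening there (a bash-only sketch using /dev/tcp, not part of the script itself):

```shell
#!/usr/bin/env bash
# "Connection refused" from the script means nothing is listening on 10255.
# The read-only port is disabled on some node configurations, in which case
# the collection script's curl to localhost:10255 always fails this way.
if (exec 3<>/dev/tcp/127.0.0.1/10255) 2>/dev/null; then
  echo "kubelet read-only port 10255 is open"
else
  echo "connection refused on 10255 (read-only port disabled?)"
fi
# The authenticated kubelet port may still be serving:
#   curl -sk https://localhost:10250/healthz
```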
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 33 (19 by maintainers)
Any ETA for a fix?
@kivagant-ba Also, this PR that got merged into 1.14 seems relevant https://github.com/kubernetes/kubernetes/pull/70994.
FYI: I ran into the issue, and after upgrading aws-node (amazon-k8s-cni) to 1.4 I got the same error from the pod on that node.
Restarting the pod didn't help.
All other nodes were (and still are) fine. I terminated the broken node and replaced it with another one.
Yeah, this looks like a race condition where aws-node (with its Kubernetes client code) comes up before kube-proxy (and possibly kube-dns/coredns) has set up the
kubernetes.default.svc.cluster.local
DNS address. If you run the following command on your EKS cluster, you'll see that kube-proxy uses the public endpoint. We need to update the CNI to take a similar flag so it isn't dependent on DNS or kube-proxy being up.
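One way to see which endpoint kube-proxy talks to (a hedged sketch; the exact ConfigMap layout can vary by EKS version, but kube-proxy's kubeconfig is typically stored in a kube-proxy ConfigMap in kube-system):

```shell
# Show the API server address kube-proxy is configured with. On EKS this
# points at the cluster endpoint directly, so kube-proxy does not depend
# on cluster DNS being up.
kubectl -n kube-system get configmap kube-proxy -o yaml | grep server
```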
Hey @jaypipes!
I’ve tried a couple of different versions, from 1.5.0 to the one I’m currently using, 1.5.5, but the issue still persists. The K8S version is 1.14.10. The full log I’m seeing is:

====== Installing AWS-CNI ======
====== Starting amazon-k8s-agent ======
ERROR: logging before flag.Parse: E0108 17:39:47.947983 9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://100.66.122.254:443/api?timeout=32s: x509: certificate is valid for 100.64.0.1, 127.0.0.1, not 100.66.122.254)
Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
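The x509 mismatch can be inspected directly with openssl (a hedged sketch; assumes OpenSSL 1.1.1+ for the -ext option, and uses the addresses from the log above):

```shell
# Print the subjectAltName entries of the apiserver certificate. Per the
# error above, the cert lists 100.64.0.1 and 127.0.0.1 as valid IPs, but
# not 100.66.122.254, the address the CNI actually dialed.
echo | openssl s_client -connect 100.66.122.254:443 2>/dev/null \
  | openssl x509 -noout -ext subjectAltName
```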
Notably, 100.64.0.1 seems to be the correct IP for the kube-apiserver (as determined from the ClusterIP of the kubernetes Service in the default namespace).

Okay, while typing this I managed to find the root cause of the issue. In my case there was a kubernetes Service in the kube-system namespace in addition to the one in the default namespace. Apparently the CNI pod first looks for a kubernetes Service in the kube-system namespace (which is not maintained or updated by the controller). Additionally, even though the kubernetes Service in kube-system was load-balancing to the correct IPs, the certificate was only valid for the IP of the kubernetes Service in the default namespace. Simply deleting the Service in the kube-system namespace fixed the issue.

I’m not sure whether this is a bug or intended behavior, but it should probably be documented that a Service named kubernetes in the kube-system namespace will cause the CNI to fail if it’s not properly set up to talk to the apiserver (even when there is a properly configured Service in the default namespace). I think it might make sense to reverse the lookup order for the kubernetes Service and prioritize the default namespace, since that is the one designed to talk to the apiserver per the K8S docs: https://kubernetes.io/docs/tasks/administer-cluster/access-cluster-api/#directly-accessing-the-rest-api-1
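A hedged sketch of the diagnosis and fix described above: compare the ClusterIP of the canonical kubernetes Service in default against any stray copy in kube-system before deleting the stray one.

```shell
# The canonical, controller-managed Service (this is the IP the apiserver
# certificate is issued for):
kubectl -n default get svc kubernetes -o jsonpath='{.spec.clusterIP}'; echo
# A stray, unmanaged copy in kube-system (if present, the CNI finds this
# one first):
kubectl -n kube-system get svc kubernetes -o jsonpath='{.spec.clusterIP}'; echo
# Only if the kube-system copy is confirmed to be the stray one:
#   kubectl -n kube-system delete svc kubernetes
```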