amazon-vpc-cni-k8s: Cannot scale new nodes with CNI version 1.5.3
When scaling up new nodes after creating an EKS cluster, the new nodes never leave the NotReady state, and the aws-node pod sits in CrashLoopBackOff.
This seems to be caused by the file 10-aws.conflist not being placed on the node. It looks like the line that copied the file into place was removed from the install script and the logic was moved to main.go, but the process never reaches the code that copies the file.
bash-4.2# cat /host/var/log/aws-routed-eni/ipamd.log.2019-09-25-20
2019-09-25T20:01:32.800Z [INFO] Starting L-IPAMD v1.5.3 ...
2019-09-25T20:02:02.803Z [INFO] Testing communication with server
2019-09-25T20:02:32.803Z [INFO] Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
2019-09-25T20:02:32.803Z [ERROR] Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout
2019-09-25T20:02:33.901Z [INFO] Starting L-IPAMD v1.5.3 ...
2019-09-25T20:03:03.903Z [INFO] Testing communication with server
I apologize that I wasn't able to dig in far enough to know why aws-node needs the 10-aws.conflist in /etc/cni/net.d first, but that does seem to be the issue.
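For anyone hitting the same thing, a quick way to confirm the symptom (a sketch; the `k8s-app=aws-node` label is assumed from the stock DaemonSet manifest, and how you reach the node will vary):

```bash
# On the affected node: the kubelet's CNI config dir should contain
# 10-aws.conflist, but on broken nodes it is missing.
ls -l /etc/cni/net.d/

# From a workstation: the aws-node pod on the new node is crash-looping.
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
```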
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Reactions: 7
- Comments: 17 (7 by maintainers)
Commits related to this issue
- add configurable timeout for ipamd startup Adds a configurable timeout to the aws-k8s-agent (ipamd) startup in the entrypoint.sh script. Increases the default timeout from ~30 seconds to 60 seconds. ... — committed to jaypipes/amazon-vpc-cni-k8s by jaypipes 4 years ago
- Remove timeout for ipamd startup (#874) * add configurable timeout for ipamd startup Adds a configurable timeout to the aws-k8s-agent (ipamd) startup in the entrypoint.sh script. Increases the de... — committed to aws/amazon-vpc-cni-k8s by jaypipes 4 years ago
- Squashed commit of the following: commit d938e5e7590915a5126b2ee71fcc71b4ad7666f6 Author: Jayanth Varavani <1111446+jayanthvn@users.noreply.github.com> Date: Wed Jul 1 01:19:14 2020 +0000 Json... — committed to bnapolitan/amazon-vpc-cni-k8s by bnapolitan 4 years ago
To expand on what @cgkades is saying…
We've found that `k8sapi.CreateKubeClient()` fails with `Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout` (https://github.com/aws/amazon-vpc-cni-k8s/blob/release-1.5.3/main.go#L47-L51). As such, it never gets to copy over the /etc/cni/10-aws.conflist. If we manually create the 10-aws.conflist, all is well and aws-node starts up.
This behavior isn't in 1.5.1 and appears to have been introduced in commit 7fd7c93b5b3be2495b2b864d3a992840d4942a52.
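For what it's worth, a sketch of that manual workaround, assuming the default conflist ships inside the CNI image at /app/10-aws.conflist (the in-image path and the use of docker on the node are assumptions; adjust for your setup):

```bash
# Run as root on the affected node: extract the default CNI config from the
# image (in-image path is an assumption) and place it where the kubelet
# looks for CNI configs; aws-node then comes up on its own.
docker run --rm --entrypoint cat \
  602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.5.3 \
  /app/10-aws.conflist > /etc/cni/net.d/10-aws.conflist
```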
I believe that adding tolerations to our kube-proxy daemonset that allow it to start on nodes with NotReady status fixed this error for us.
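For reference, a sketch of what that change might look like as a JSON patch (the specific taint key and effect are assumptions about which NotReady taint was blocking scheduling, and it assumes the DaemonSet already has a tolerations list):

```bash
# Append a toleration so kube-proxy can be scheduled onto nodes that still
# carry the not-ready taint; existing tolerations are preserved.
kubectl -n kube-system patch daemonset kube-proxy --type=json -p '[
  {"op": "add", "path": "/spec/template/spec/tolerations/-",
   "value": {"key": "node.kubernetes.io/not-ready", "operator": "Exists", "effect": "NoSchedule"}}
]'
```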
AFAIK, the above statement is not correct. kube-proxy does not have anything to do with the aws-node Pod's ability to contact the Kubernetes API server; kube-proxy enables routing for a Kubernetes Service object's backend Endpoints (other Pods).
The timeout waiting for IPAMd (aws-k8s-agent) to come up is 30 seconds. My suspicion is that there is a problem with DNS resolution, not kube-proxy, that is causing the IPAMd startup to take longer than 30 seconds. On startup, IPAMd attempts to communicate with the Kubernetes API server using the `k8sclient.GetKubeClient()` function from the operator-framework@0.7 release (as @mogren notes above).
Is there some custom DNS setup on these nodes? What is the value of your KUBERNETES_SERVICE_HOST env var? What does your /etc/resolv.conf look like?
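To help answer those questions, something along these lines should surface the values (a sketch; the pod name placeholder is hypothetical, and exec may not work while the pod is crash-looping):

```bash
# The env vars as the aws-node pod sees them (substitute a real pod name).
kubectl -n kube-system exec <aws-node-pod> -- printenv KUBERNETES_SERVICE_HOST KUBERNETES_SERVICE_PORT

# DNS configuration as seen on the node itself.
cat /etc/resolv.conf
```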
I seem to be hit by this as well. Using kops I spin up a cluster, e.g. with kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2019-09-26 on kubernetesVersion: 1.16.4. The masters come up fine, but with nodes it's hit and miss. I also tried imageName: "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0-rc5".
Thanks for the update, @marcel4303. I’m glad that we’re not the only ones experiencing this, and thanks for opening up a ticket with AWS support, too.
We’ve also got an internal high-priority ticket open with AWS. In the meantime, we’ve been using AWS CNI 1.5.1. We’ll be sure to update this issue with relevant details as we hear back internally.