amazon-vpc-cni-k8s: Cannot scale new nodes with CNI version 1.5.3

When scaling up new nodes after creating an EKS cluster, the nodes never leave the NotReady state. The aws-node pod is stuck in CrashLoopBackOff.

This seems to be caused by the file 10-aws.conflist not being placed on the node. It looks like the line in the install script that copied the file was removed and that logic was moved to main.go. The problem is that it never reaches the code that copies the file.
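
A quick way to confirm this on an affected node is a sketch like the following (SSH access to the node and the standard EKS paths are assumptions):

# On the broken node: the CNI config directory should contain 10-aws.conflist,
# but here it is empty or the file is missing.
ls -l /etc/cni/net.d/

# From a workstation with kubectl: the aws-node pod on that node is crash-looping.
kubectl -n kube-system get pods -o wide | grep aws-node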

bash-4.2# cat /host/var/log/aws-routed-eni/ipamd.log.2019-09-25-20
2019-09-25T20:01:32.800Z [INFO] Starting L-IPAMD v1.5.3  ...
2019-09-25T20:02:02.803Z [INFO] Testing communication with server
2019-09-25T20:02:32.803Z [INFO] Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
2019-09-25T20:02:32.803Z [ERROR]        Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout
2019-09-25T20:02:33.901Z [INFO] Starting L-IPAMD v1.5.3  ...
2019-09-25T20:03:03.903Z [INFO] Testing communication with server

I apologize, but I wasn't able to dig in far enough to know why it needs 10-aws.conflist in /etc/cni/net.d first; that does seem to be the issue, though.
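
The i/o timeout in the log can also be reproduced directly from an affected node; a minimal sketch, assuming SSH access to the node and the Service address shown in the log:

# On the broken node: this should reproduce the same timeout seen in ipamd.log.
curl -sk --max-time 10 https://172.20.0.1:443/version || echo "timed out"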

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 7
  • Comments: 17 (7 by maintainers)

Most upvoted comments

To expand on what @cgkades is saying…

What we’ve found is that k8sapi.CreateKubeClient() fails with Failed to create client: error communicating with apiserver: Get https://172.20.0.1:443/version?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout (https://github.com/aws/amazon-vpc-cni-k8s/blob/release-1.5.3/main.go#L47-L51).

As such, it never gets to copy over /etc/cni/net.d/10-aws.conflist. If we manually create the 10-aws.conflist, everything works and aws-node starts up.
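
For reference, a hedged sketch of that manual workaround, copying the file from a node where aws-node came up successfully (node names, the SSH user, and SSH access are assumptions):

# Grab 10-aws.conflist from a healthy node...
scp ec2-user@healthy-node:/etc/cni/net.d/10-aws.conflist /tmp/10-aws.conflist
# ...and place it on the broken node so kubelet sees a CNI config and the node can become Ready.
scp /tmp/10-aws.conflist ec2-user@broken-node:/tmp/10-aws.conflist
ssh ec2-user@broken-node 'sudo mv /tmp/10-aws.conflist /etc/cni/net.d/10-aws.conflist'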

This behavior isn’t present in 1.5.1 and appears to have been introduced in commit 7fd7c93b5b3be2495b2b864d3a992840d4942a52.

The kubeClient endpoint is https://172.20.0.1:443, which is not yet reachable from the node, because no kube-proxy is running at that point. kube-proxy won’t start because no CNI files are in place yet, and those files aren’t in place because the aws-k8s-agent hasn’t bootstrapped.

I believe that adding tolerations to our kube-proxy DaemonSet that allow it to start on nodes in NotReady status fixed this error for us:

tolerations:
  - operator: "Exists"
    effect: "NoExecute"
  - operator: "Exists"
    effect: "NoSchedule"
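
A hedged sketch of applying those tolerations to the kube-proxy DaemonSet with kubectl patch (note: this replaces the DaemonSet's existing tolerations list, so merge in any tolerations you already rely on):

kubectl -n kube-system patch daemonset kube-proxy -p '{
  "spec": {
    "template": {
      "spec": {
        "tolerations": [
          {"operator": "Exists", "effect": "NoExecute"},
          {"operator": "Exists", "effect": "NoSchedule"}
        ]
      }
    }
  }
}'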

The kubeClient endpoint is https://172.20.0.1:443, which is not yet reachable from the node, because no kube-proxy is running at that point. kube-proxy won’t start because no CNI files are in place yet, and those files aren’t in place because the aws-k8s-agent hasn’t bootstrapped. If the aws-k8s-agent doesn’t bootstrap in a timely fashion, the container’s entrypoint.sh fails out and doesn’t copy any CNI files.

AFAIK, the above statement is not correct. kube-proxy does not have anything to do with the aws-node Pod’s ability to contact the Kubernetes API server. kube-proxy enables routing for a Kubernetes Service object’s backend Endpoints (other Pods).

The timeout waiting for IPAMd (aws-k8s-agent) to come up is 30 seconds. My suspicion is that a problem with DNS resolution, not kube-proxy, is causing IPAMd startup to take longer than 30 seconds. On startup, IPAMd attempts to communicate with the Kubernetes API server using the k8sclient.GetKubeClient() function from the operator-framework@0.7 release (as @mogren notes above).

Is there some custom DNS setup on these nodes? What is the value of your KUBERNETES_SERVICE_HOST env var? What does your /etc/resolv.conf look like?
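
A sketch for gathering those answers (the pod name is a placeholder, exec may fail while the pod is crash-looping, and SSH access to the node is assumed):

# Check the env vars the aws-node pod sees (KUBERNETES_SERVICE_HOST / _PORT).
kubectl -n kube-system exec aws-node-xxxxx -- env | grep KUBERNETES_SERVICE

# Check the node's resolver configuration for any custom DNS setup.
ssh ec2-user@node-ip 'cat /etc/resolv.conf'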

I seem to be hit by this as well. Using kops I bring up a cluster, e.g. with kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2019-09-26 on kubernetesVersion: 1.16.4. The masters come up fine, but with nodes it’s hit and miss. I also tried imageName: "602401143452.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0-rc5".

Thanks for the update, @marcel4303. I’m glad that we’re not the only ones experiencing this, and thanks for opening up a ticket with AWS support, too.

We’ve also got a high-priority internal ticket open with AWS. In the meantime, we’ve been using AWS CNI 1.5.1. We’ll be sure to update this issue with relevant details as we hear back internally.