amazon-eks-ami: CoreDNS version breaks networking in pods

What happened:

I created two node groups. The one that had --register-with-taints passed into --kubelet-extra-args does not have working Docker networking inside its pods.

What you expected to happen:

Networking inside a pod on a tainted node should work.

How to reproduce it (as minimally and precisely as possible):

Create a nodegroup with taints and labels.

I am creating two node groups and passing --register-with-taints to the kubelet of one of them. The group with the extra kubelet argument registers the taints properly, but networking inside the containers on those nodes stops working. Nothing shows up in the logs for aws-node, the kubelet, or the CNI. Everything is fine when the same node group starts up without the --register-with-taints kubelet extra argument.

The taint is passed to bootstrap.sh as follows:

--kubelet-extra-args '--register-with-taints=\"dedicated=jobs:NoSchedule\" --node-labels=testing/role=shared'

Using the latest AMI for us-east-1 and the amazon-k8s-cni:v1.3.2 image for the aws-node DaemonSet.
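For reference, a minimal sketch of how this looks in the node's user data, assuming a placeholder cluster name and the /etc/eks/bootstrap.sh entry point from this AMI (the escaped quotes from the original invocation are dropped here for readability):

    #!/bin/bash
    # Illustrative user data; "my-cluster" is a placeholder.
    set -o xtrace
    /etc/eks/bootstrap.sh my-cluster \
      --kubelet-extra-args '--register-with-taints=dedicated=jobs:NoSchedule --node-labels=testing/role=shared'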

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 31 (2 by maintainers)

Most upvoted comments

I was facing DNS issues when running more than one node group on an EKS cluster. My problem had nothing to do with taint flags; I was creating a new group using my own version of the AWS-provided CFN template (https://github.com/awslabs/amazon-eks-ami/blob/master/amazon-eks-nodegroup.yaml). The issue was that CoreDNS was running only on the first node I had launched. When I launched the second node group, a new security group was created:

NodeSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for all nodes in the cluster
      VpcId: !Ref VpcId

Thus only pods from the same node group could communicate with each other, so it was impossible to reach CoreDNS on port 53 in another node group behind a different security group.

Workaround: Use the same “NodeSecurityGroup” for all node groups. What I did was create this SG in the EKS cluster template and export it to the node group template.
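A minimal sketch of that workaround, assuming a cluster stack that owns the shared security group and node group stacks that import it (resource, parameter, and export names below are illustrative, not the actual template contents):

    # Cluster stack (illustrative names): define the shared SG once and export it.
    Resources:
      NodeSecurityGroup:
        Type: AWS::EC2::SecurityGroup
        Properties:
          GroupDescription: Security group shared by all worker node groups
          VpcId: !Ref VpcId
    Outputs:
      NodeSecurityGroup:
        Value: !Ref NodeSecurityGroup
        Export:
          Name: !Sub "${AWS::StackName}-NodeSecurityGroup"

    # Node group stack: import the shared SG instead of declaring its own
    # NodeSecurityGroup resource, e.g. on the worker launch configuration:
    #   SecurityGroups:
    #     - Fn::ImportValue: !Sub "${ClusterStackName}-NodeSecurityGroup"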

I can confirm this is an issue. It looks more like a CNI issue than an AMI issue.

Closing in favor of aws/amazon-vpc-cni-k8s#328

Alright, the CNI is actually not the problem. It turns out to be a CoreDNS version issue on EKS v1.11.

Final Solution: I followed some issues related to CoreDNS, which led me to https://github.com/aws/containers-roadmap/issues/129. I upgraded the CoreDNS image to version 1.3.1; everything is working and DNS resolution is fast again.

Please think about making the CNI version configurable and about upgrading the CoreDNS version:

        image: coredns/coredns:1.3.1
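If you want to apply the same upgrade with kubectl, a sketch (assuming the container in the EKS coredns deployment is named coredns):

    kubectl -n kube-system set image deployment/coredns coredns=coredns/coredns:1.3.1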

I also needed to add security group rules that were clearly not outlined in the security group configuration document.

Allowing port 53 (UDP and TCP) between all the node groups.
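A sketch of what such rules can look like in CloudFormation for two node group security groups (resource and reference names are assumptions; the reverse direction needs matching rules):

    NodeGroupADNSUdpFromB:
      Type: AWS::EC2::SecurityGroupIngress
      Properties:
        Description: Allow DNS (UDP) from node group B to node group A
        GroupId: !Ref NodeGroupASecurityGroup
        SourceSecurityGroupId: !Ref NodeGroupBSecurityGroup
        IpProtocol: udp
        FromPort: 53
        ToPort: 53
    NodeGroupADNSTcpFromB:
      Type: AWS::EC2::SecurityGroupIngress
      Properties:
        Description: Allow DNS (TCP) from node group B to node group A
        GroupId: !Ref NodeGroupASecurityGroup
        SourceSecurityGroupId: !Ref NodeGroupBSecurityGroup
        IpProtocol: tcp
        FromPort: 53
        ToPort: 53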

That would be alright with me; I just haven’t found a definitive answer on that. (Actually, I would even prefer if they didn’t install it at all; that would make ownership clear.)

I am experiencing this issue with EKS 1.20:

  • CNI image: amazon-k8s-cni:v1.7.5-eksbuild.1
  • CoreDNS image: coredns:v1.8.3-eksbuild.1

I’ve launched two managed node groups with Terraform, one with a taint and one without. Launching a dnsutils pod on each node shows that nslookup errors out on the node with the taint.
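A sketch of that kind of test, reusing the taint and label from the original report for illustration (the pod name is arbitrary and the image is the standard dnsutils debugging image):

    apiVersion: v1
    kind: Pod
    metadata:
      name: dnsutils
    spec:
      # Pin the pod to the tainted node group and tolerate its taint.
      nodeSelector:
        testing/role: shared
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "jobs"
        effect: "NoSchedule"
      containers:
      - name: dnsutils
        image: registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3
        command: ["sleep", "infinity"]

Running kubectl exec dnsutils -- nslookup kubernetes.default then shows whether resolution works from that node group.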

I launched a 1.11 EKS cluster with the taints you specified, and noticed that CoreDNS was unable to launch.

The default tolerations on CoreDNS (v1.1.3) were:

      tolerations:
      - key: "CriticalAddonsOnly"
        operator: "Exists"

I updated the toleration on the CoreDNS deployment to

      tolerations:
      - operator: "Exists"

by running

kubectl patch -n kube-system deployment/coredns --patch \
    '{"spec":{"template":{"spec":{"tolerations": [{"operator": "Exists"} ]}}}}'

After updating the coredns tolerations, I was able to get pod networking to work with CNI v1.3.2. Can you see if updating the tolerations on the CoreDNS deployment for v1.1.3 resolves the issue for you?

@0verc1ocker I just deployed a single nginx pod onto a node running version 1.3.0 of the CNI container and it works for me. Did you delete the aws-node pod on the instance on which you were doing your tests after downgrading the image? By default, the aws-node daemonset doesn’t replace pods on updates.
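For example, a sketch of recycling the aws-node pod on a specific instance so it picks up the changed image (the node name is a placeholder; this assumes the daemonset’s pods carry the usual k8s-app=aws-node label):

    # Find the aws-node pod running on the node under test, then delete it so the
    # DaemonSet recreates it with the currently configured image.
    kubectl -n kube-system get pods -l k8s-app=aws-node -o wide --field-selector spec.nodeName=<node-name>
    kubectl -n kube-system delete pod <aws-node-pod-name>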

We are using multiple node groups with different taints and labels, and so far we haven’t experienced the issue you describe (at least not up to 1.3.0; we haven’t tried any newer version yet).

You mention that you use different security groups. Is the instance where the problem occurs allowed to reach the instance where CoreDNS / kube-dns is running? If not, that would explain your resolution errors.