amazon-vpc-cni-k8s: IPAM-D fails to start on cluster with Custom Networking & Prefix Delegation enabled

What happened: New EKS worker nodes from a managed node group never become Ready, and the aws-node daemonset pod enters CrashLoopBackOff (CLBO), stuck at "Retrying waiting for IPAM-D".
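
For reference, this is roughly how we observe the failure (a sketch; it assumes the default k8s-app=aws-node label and the kube-system namespace used by the managed add-on):

# Pods never become Ready and keep restarting
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide

# The entrypoint keeps waiting for ipamd until kubelet restarts the container
kubectl -n kube-system logs -l k8s-app=aws-node --tail=20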

We're currently getting help via the k8s-awscni-triage@amazon.com distribution group, so we're opening this issue here as a formality.

We also have an AWS Support ticket (Case 10736510071) where it seems we've been ghosted: no response in 5+ days.

The output of sudo bash /opt/cni/bin/aws-cni-support.sh on a problematic node was emailed to k8s-awscni-triage@amazon.com yesterday, with the case number above in the subject line.

Our deployment is roughly based on this EKS Blueprints example, with the following exceptions:

  • We've removed the custom AMI bits in favor of using the default AMI provided by EKS. Here is most of what we removed; I recall reading in one of the AWS blogs that max pod calculations no longer have to be performed manually when using the latest AMIs.
  • We're deploying to existing VPCs and subnets, whereas this EKS Blueprints example creates its own VPC-level dependencies and works without any issues in our testing. Something seems to break when we attempt to use Custom Networking in our VPC/subnets specifically, and we can't quite figure out what (our configuration is roughly the sketch below). Another thing to note is that in our existing VPC we use custom DNS settings in our DHCP option sets instead of the default AmazonProvidedDNS. The Systems Manager EKS worker node troubleshooting tool complains that we're not using AmazonProvidedDNS, but I'm not entirely sure how much that matters.
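
For context, this is roughly how custom networking and prefix delegation are enabled on our side (a sketch only; the AZ name, subnet ID, and security group ID below are placeholders, not our actual values):

# Enable custom networking and prefix delegation on the aws-node daemonset
kubectl -n kube-system set env daemonset/aws-node \
  AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true \
  ENABLE_PREFIX_DELEGATION=true \
  ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone

# One ENIConfig per availability zone, named after the zone, pointing at the pod subnet
cat <<EOF | kubectl apply -f -
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-west-2a                    # placeholder AZ name
spec:
  subnet: subnet-0123456789abcdef0    # placeholder pod subnet
  securityGroups:
    - sg-0123456789abcdef0            # placeholder security group
EOF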

Environment:

  • Kubernetes version (use kubectl version): GitVersion: "v1.23.7-eks-4721010"
  • CNI Version: amazon-k8s-cni:v1.11.3-eksbuild.1
  • OS (e.g. cat /etc/os-release): "Amazon Linux"
  • Kernel (e.g. uname -a): 5.4.209-116.363.amzn2.x86_64

Additional logs: ipamd.log, aws-node.log
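
In case it helps anyone reproducing this, the logs above were pulled from the usual VPC CNI locations (a sketch; the paths and container name assume the default install):

# ipamd log on the affected worker node
sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log

# aws-node container (entrypoint) log via the API server
kubectl -n kube-system logs daemonset/aws-node -c aws-node --tail=50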

Most upvoted comments

@jayanthvn - Thanks for the debugging call with the rest of the team today; please feel free to close this issue per the following summary.

  • Calls to IMDS and EC2 from the EKS node worked as expected
  • The aws-node daemonset liveness probe and initialDelaySeconds config were increased to prevent restarts/CLBO (a sketch of that change is included after this list)
  • The aws-node daemonset pod would continue to log:
{"level":"info","ts":"2022-09-14T21:21:38.325Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
  • However, the ipamd log on the EC2 node then started to show the following error (note that with the original config, when kubelet restarted the failed aws-node container due to the failed liveness probe, we never actually got to the point of seeing this error):
{"level":"error","ts":"2022-09-14T20:49:25.135Z","caller":"ipamd/ipamd.go:463","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-032f2110640c9769b]: WebIdentityErr: failed to retrieve credentials\ncaused by: RequestError: send request failed\ncaused by: Post \"https://sts.us-west-2.amazonaws.com"
  • Netcat to the regional STS endpoint fails; the name resolves to private IPs instead of the expected public STS endpoint:
[root@<redacted>aws-routed-eni]# nc -vz sts.us-west-2.amazonaws.com 443
Ncat: Connection to <redacted> failed: Connection timed out.
Ncat: Trying next address...
Ncat: Connection to <redacted> failed: Connection timed out.
Ncat: Trying next address...
Ncat: Connection timed out.
  • Checked VPC --> Endpoints and found that another team had recently created a VPC endpoint with private DNS enabled for https://sts.us-west-2.amazonaws.com, which effectively hijacked our DNS calls to STS in the impacted VPC
  • The current fix was to update the security group on the new private-DNS endpoint for STS to allow access from our EKS node groups (see the sketch below). The long-term fix may be to remove the private endpoint for STS unless it is absolutely required.
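
For reference, the liveness-probe change mentioned above was roughly the following (a sketch only; the values were picked somewhat arbitrarily and it assumes the container in the daemonset is named aws-node):

# Give ipamd more time to come up before kubelet restarts the container
kubectl -n kube-system patch daemonset aws-node -p '
spec:
  template:
    spec:
      containers:
        - name: aws-node
          livenessProbe:
            initialDelaySeconds: 120
            timeoutSeconds: 30
'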
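
And this is roughly how one could spot and fix the hijacking endpoint (a sketch only; the VPC ID and security group IDs are placeholders):

# Find STS interface endpoints in the VPC and check whether private DNS is enabled
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
            Name=service-name,Values=com.amazonaws.us-west-2.sts \
  --query 'VpcEndpoints[].{Id:VpcEndpointId,PrivateDns:PrivateDnsEnabled,SGs:Groups[].GroupId}'

# Allow the EKS node security group (source) to reach the endpoint's
# security group (target) on 443; both IDs below are placeholders
aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaaaaaaaaaaaaaaa \
  --protocol tcp --port 443 \
  --source-group sg-0bbbbbbbbbbbbbbbb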

I don't think my above assumption about the logger is correct; let me try locally and get back.