amazon-eks-ami: 1.26 nodes fail to join the cluster with custom VPC domain-name

What happened: 1.26 AMI nodes fail to join 1.26 clusters, both when upgrading from 1.25 to 1.26 and when creating fresh 1.26 clusters.

What you expected to happen: The nodes to join the cluster

Anything else we need to know?:

  • Using managed node groups.
  • The exact same Terraform configuration works on 1.25; the only change is the cluster/AMI version, which triggers the failure on both upgrades and new clusters.
  • VPC DHCP domain name is in the format: ec2.internal acmedev.com (see the check below)
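
For reference, the domain-name can be confirmed from the VPC's DHCP option set with the AWS CLI. This is only a sketch; the VPC ID is a placeholder:

  VPC_ID="vpc-0123456789abcdef0"  # placeholder, substitute the real VPC ID

  # Look up the DHCP option set attached to the VPC
  DHCP_OPTIONS_ID=$(aws ec2 describe-vpcs --vpc-ids "$VPC_ID" \
    --query 'Vpcs[0].DhcpOptionsId' --output text)

  # Prints the configured domain-name value(s), e.g. "ec2.internal acmedev.com"
  aws ec2 describe-dhcp-options --dhcp-options-ids "$DHCP_OPTIONS_ID" \
    --query 'DhcpOptions[0].DhcpConfigurations[?Key==`domain-name`].Values[].Value' \
    --output text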

Environment:

  • AWS Region: us-east-1
  • Instance Type(s): m6a
  • EKS Platform version: "eks.1"
  • Kubernetes version: "1.26"
  • AMI Version: amazon-eks-node-1.26-v20230406
  • Kernel (e.g. uname -a): Linux ip-10-100-13-0.ec2.internalacmedev.com 5.10.173-154.642.amzn2.x86_64 #1 SMP Wed Mar 15 00:26:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-099e00fe4091e48af"
BUILD_TIME="Thu Apr  6 01:36:39 UTC 2023"
BUILD_KERNEL="5.10.173-154.642.amzn2.x86_64"
ARCH="x86_64"

I believe the change in cloud-provider from aws to external has created an issue where the kubelet's node name differs between 1.25 and 1.26. This causes the aws-iam-authenticator node bootstrap logic to fail to register the node with the cluster, because the hostnames in the requests are not the same.
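
As a quick sanity check (my own sketch, assuming IMDSv2 is reachable from the node), the two candidate names can be compared directly:

  # Name the kubelet falls back to with --cloud-provider=external and no --hostname-override
  hostname

  # EC2 private DNS name via IMDSv2, roughly what the in-tree aws provider resolved on 1.25
  TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
  curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
    http://169.254.169.254/latest/meta-data/local-hostname

On our nodes the first prints ip-10-100-13-0.ec2.internalacmedev.com, while the second should print ip-10-100-13-0.ec2.internal.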

The hostnamed logs are identical on 1.25 and 1.26 nodes, including the “warning”:

Apr 13 13:55:48 ip-10-100-13-0 systemd-hostnamed: Changed pretty host name to 'ip-10-100-13-0.ec2.internal acmedev.com'
Apr 13 13:55:48 ip-10-100-13-0 systemd-hostnamed: Changed static host name to 'ip-10-100-13-0.ec2.internalacmedev.com'
Apr 13 13:55:48 ip-10-100-13-0 systemd-hostnamed: Changed host name to 'ip-10-100-13-0.ec2.internalacmedev.com'
Apr 13 13:55:48 ip-10-100-13-0 cloud-init: Apr 13 13:55:48 cloud-init[2209]: util.py[WARNING]: Failed to non-persistently adjust the system hostname to ip-10-100-13-0.ec2.internal acmedev.com

We are not changing any of the kubelet arguments from their AMI defaults. The only thing we are doing is adding some labels/taints to the nodes via the managed node group Terraform resources. There are no hostname overrides.
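
(A quick way to confirm the effective flags on a node, assuming the kubelet runs under systemd:)

  journalctl -u kubelet --no-pager | grep -E 'FLAG: --(cloud-provider|hostname-override)'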

Apr 13 13:55:53 ip-10-100-13-0 kubelet: I0413 13:55:53.946396    2944 flags.go:64] FLAG: --cloud-provider="external"
Apr 13 13:55:53 ip-10-100-13-0 kubelet: I0413 13:55:53.946638    2944 flags.go:64] FLAG: --hostname-override=""

Pertinent messages indicating the node-join failures:

Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.192348    2944 kubelet_node_status.go:669] "Recording event message for node" node="ip-10-100-13-0.ec2.internalacmedev.com" event="NodeHasNoDiskPressure"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.192745    2944 kubelet_node_status.go:669] "Recording event message for node" node="ip-10-100-13-0.ec2.internalacmedev.com" event="NodeHasSufficientPID"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.193204    2944 kubelet_node_status.go:70] "Attempting to register node" node="ip-10-100-13-0.ec2.internalacmedev.com"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: E0413 13:55:54.765164    2944 controller.go:146] failed to ensure lease exists, will retry in 200ms, error: leases.coordination.k8s.io "ip-10-100-13-0.ec2.internalacmedev.com" is forbidden: User "system:node:ip-10-100-13-0.ec2.internal" cannot get resource "leases" in API group "coordination.k8s.io" in the namespace "kube-node-lease": can only access node lease with the same name as the requesting node
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.765885    2944 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: csinodes.storage.k8s.io "ip-10-100-13-0.ec2.internalacmedev.com" is forbidden: User "system:node:ip-10-100-13-0.ec2.internal" cannot get resource "csinodes" in API group "storage.k8s.io" at the cluster scope: can only access CSINode with the same name as the requesting node
Apr 13 13:55:54 ip-10-100-13-0 kubelet: E0413 13:55:54.766850    2944 kubelet_node_status.go:92] "Unable to register node with API server" err="nodes \"ip-10-100-13-0.ec2.internalacmedev.com\" is forbidden: node \"ip-10-100-13-0.ec2.internal\" is not allowed to modify node \"ip-10-100-13-0.ec2.internalacmedev.com\"" node="ip-10-100-13-0.ec2.internalacmedev.com"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: I0413 13:55:54.969984    2944 kubelet_node_status.go:70] "Attempting to register node" node="ip-10-100-13-0.ec2.internalacmedev.com"
Apr 13 13:55:54 ip-10-100-13-0 kubelet: E0413 13:55:54.972246    2944 kubelet_node_status.go:92] "Unable to register node with API server" err="nodes \"ip-10-100-13-0.ec2.internalacmedev.com\" is forbidden: node \"ip-10-100-13-0.ec2.internal\" is not allowed to modify node \"ip-10-100-13-0.ec2.internalacmedev.com\"" node="ip-10-100-13-0.ec2.internalacmedev.com"

On the 1.25 nodes using cloud-provider=aws we see log lines like:

Apr 12 15:14:31 ip-10-100-12-210 kubelet: I0412 15:14:31.176819    2906 server.go:993] "Cloud provider determined current node" nodeName="ip-10-100-12-210.ec2.internal"

That message comes from https://github.com/kubernetes/kubernetes/blob/v1.26.2/cmd/kubelet/app/server.go#L989, and the node name there does not have acmedev.com appended to it.

The node name returned in 1.25 aligns with the templated private DNS name returned by https://github.com/kubernetes-sigs/aws-iam-authenticator/tree/master, which is what allows nodes to bootstrap. Since we are not using the aws cloud provider in 1.26, we get back a different node name that does not align.

Since the change to cloud-provider=external, I believe the kubelet is using the hostname we would get from hostname or uname -n, e.g. ip-10-100-13-0.ec2.internalacmedev.com, which does not align with the private DNS name returned by the EC2 API for auth. Our node config in the aws-auth ConfigMap is standard (see the comparison sketch after it):

  mapRoles: |
    - "groups":
      - "system:bootstrappers"
      - "system:nodes"
      "rolearn": "arn:aws:iam::1234567890:role/role-name"
      "username": "system:node:{{EC2PrivateDNSName}}"

About this issue

  • State: closed
  • Created a year ago
  • Reactions: 1
  • Comments: 21 (11 by maintainers)

Most upvoted comments

@cartermckinnon We’re also running into this issue, similar to @pnpn: the same amazon-eks-node-1.26-v20230411 AMI, no custom domain names on the nodes, upgrading from 1.25 to 1.26.

The v20230501 release has started now, and it includes changes in #1264 that should fix this issue. New AMIs should be available in all regions late tonight (PDT).

Is it intentional that the DHCP domain-name is ec2.internal acmedev.com (with a space character between ec2.internal and acmedev.com)?

It was intentional when it was created many years ago, but I can only guess at the reason now. Space-separated domain names do seem to be supported by AWS for generating /etc/resolv.conf, but I’m not entirely sure about the pattern of combining the internal domain with the published domain.
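
For illustration only (not captured from the affected node), a DHCP domain-name with two space-separated values would typically surface as a multi-entry search line:

  # /etc/resolv.conf (illustrative; the nameserver is the usual VPC .2 resolver)
  search ec2.internal acmedev.com
  nameserver 10.100.0.2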

Something does seem awry in the authenticator, regardless. I’m looking into it. If you’d like to email me your nodegroup ARN, that would help: [removed]

Done!