karpenter-provider-aws: Nodes never get provisioned for my workload

Hello, I am noticing the following errors in the Karpenter logs:

2023-11-28T04:21:33.262Z DEBUG controller.machine.lifecycle terminating due to registration ttl {"commit": "322822a", "machine": "first-karpenter-provisioner-cn-northwest-1-xxxx", "provisioner": "first-karpenter-provisioner-cn-northwest-1", "ttl": "15m0s"}
2023-11-28T04:21:33.664Z INFO controller.machine.termination deleted machine {"commit": "322822a", "machine": "first-karpenter-provisioner-cn-northwest-1-xxxx", "provisioner": "first-karpenter-provisioner-cn-northwest-1", "node": "", "provider-id": "aws:///cn-northwest-1a/i-0030xxxxx"}

It never launches an EC2 instance for my workloads.

If I run kubectl describe machine first-karpenter-provisioner-cn-northwest-1-xxxx I see:

Conditions:
  Last Transition Time:  2023-11-28T04:52:02Z
  Message:               Node not registered with cluster
  Reason:                NodeNotFound
  Status:                False
  Type:                  MachineInitialized
  Last Transition Time:  2023-11-28T04:52:02Z
  Status:                True
  Type:                  MachineLaunched
  Last Transition Time:  2023-11-28T04:52:02Z
  Message:               Node not registered with cluster
  Reason:                NodeNotFound
  Status:                False
  Type:                  Ready
  Last Transition Time:  2023-11-28T04:52:02Z
  Message:               Node not registered with cluster
  Reason:                NodeNotFound
  Status:                False
  Type:                  MachineRegistered

About this issue

  • State: closed
  • Created 7 months ago
  • Reactions: 2
  • Comments: 19 (8 by maintainers)

Most upvoted comments

Hi @jmdeal, I think you may be right that my issue is the one described in https://karpenter.sh/docs/troubleshooting/#node-terminates-before-ready-on-failed-encrypted-ebs-volume.

As a test, I assigned the AdminAccess policy to Karpenter's IRSA role. The nodes are now being provisioned successfully, without any issues.

We have EBS encryption enabled at the region level.

I don’t quite understand the fix recommended in the troubleshooting link: the example policy given there is already applied to the AWS managed key used for EBS encryption. Your guidance would be appreciated.

UPDATE: Please ignore the above comment. I discovered that the KMS key we use for EBS encryption is not the default key but a different, customer managed key, and Karpenter’s IRSA role does not have access to it. Here is the error in CloudTrail (for event name GenerateDataKeyWithoutPlaintext):

"errorMessage": "User: arn:aws:sts::xxxxxxxx:assumed-role/karpenter-20231204131600471600000002/1701700196698666540 is not authorized to perform: kms:GenerateDataKeyWithoutPlaintext on resource: arn:aws:kms:eu-west-2:xxxxxxxxxxx:key/7dac2551-1e4d-4c5d-b02e-296ef7126f4e because no identity-based policy allows the kms:GenerateDataKeyWithoutPlaintext action",

UPDATE-2: Adding the policy statement below to Karpenter’s IRSA role has fixed the issue for me:

statement {
  # Allow access to the KMS keys that will be used for EBS encryption
  effect = "Allow"
  actions = [
    "kms:GenerateDataKeyWithoutPlaintext",
    "kms:Decrypt",
    "kms:CreateGrant",
  ]
  resources = ["arn:aws:kms:${var.aws_region}:${var.aws_account}:key/*"]
}
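
Note that key/* grants access to every KMS key in the account and region; if the EBS encryption key’s ARN is known, scoping resources to that single ARN would be tighter.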

Thank you.

Yep, you can specify AMIs using AMI Selector Terms. Here’s an example with the last working 1.28 AMI:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: Ubuntu
  role: KarpenterNodeRole-${CLUSTER_NAME}
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  amiSelectorTerms:
    - name: ubuntu-eks/k8s_1.28/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20231128
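
For the node class to take effect it has to be referenced from a NodePool via nodeClassRef; a minimal sketch, assuming the v1beta1 API and the resource names above:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]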

Karpenter will delete and recreate machines every 15 minutes if the node fails to join the cluster, so that behavior is expected. There are quite a few reasons why a node may be unable to join. @marcustut @rubel-ahammad, could the two of you share your EC2NodeClass (and other karpenter resources)?
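
While gathering those, the machine’s status conditions and the controller logs usually show why registration failed; a quick sketch (namespace and deployment name assumed to match a default Helm install):

# List Karpenter machines and check whether they launched and registered
kubectl get machines

# Inspect the status conditions on a failing machine
kubectl describe machine <machine-name>

# Follow the controller logs for registration/TTL errors
kubectl logs -n karpenter deployment/karpenter -f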

Please note, the Karpenter configuration had been working for more than a month or two for me. We didn’t change anything on the AWS EKS side or the Karpenter side. This morning, after deploying a build, we noticed that it cannot register any new node. The same thing happened in two different EKS clusters in two different AWS regions.

I have a similar scenario: it was working for the past two weeks and only started failing on Nov 30.
