karpenter-provider-aws: Karpenter nodes terminate immediately due to controller.interruption initiating delete from interruption message

Description

Observed Behavior: Karpenter nodes are launched and deleted immediately, and pods remain stuck in Pending

Expected Behavior: Nodes stay up and pods are scheduled onto them, rather than the nodes being killed immediately

Reproduction Steps (Please include YAML):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  labels:
    intent: apps
  limits:
    resources:
      cpu: 1k
  providerRef:
    name: default
  requirements:
  - key: karpenter.k8s.aws/instance-category
    operator: In
    values:
    - m
  - key: karpenter.k8s.aws/instance-cpu
    operator: In
    values:
    - "2"
  - key: karpenter.k8s.aws/instance-generation
    operator: In
    values:
    - "5"
    - "6"
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  ttlSecondsAfterEmpty: 30

Versions:

  • Chart Version: v0.31.0
  • Kubernetes Version (kubectl version): v1.27
2023-10-26T09:41:06.888Z    INFO    controller.machine.lifecycle    launched machine    {"commit": "322822a", "machine": "default-n2tl5", "provisioner": "default", "provider-id": "aws:///ap-southeast-1a/i-06381788068af94ad", "instance-type": "m5a.large", "zone": "ap-southeast-1a", "capacity-type": "on-demand", "allocatable": {"cpu":"1930m","ephemeral-storage":"44Gi","memory":"6903Mi","pods":"29","vpc.amazonaws.com/pod-eni":"9"}}
2023-10-26T09:41:07.098Z    INFO    controller.machine.lifecycle    launched machine    {"commit": "322822a", "machine": "default-hd9ld", "provisioner": "default", "provider-id": "aws:///ap-southeast-1a/i-0587faa3ca1b921b2", "instance-type": "m5a.large", "zone": "ap-southeast-1a", "capacity-type": "on-demand", "allocatable": {"cpu":"1930m","ephemeral-storage":"44Gi","memory":"6903Mi","pods":"29","vpc.amazonaws.com/pod-eni":"9"}}
2023-10-26T09:41:07.992Z    INFO    controller.interruption initiating delete from interruption message {"commit": "322822a", "queue": "Karpenter-eks-bh-drupal-cms-dev", "messageKind": "StateChangeKind", "machine": "default-c2vwb", "action": "CordonAndDrain"}
2023-10-26T09:41:08.077Z    INFO    controller.interruption initiating delete from interruption message {"commit": "322822a", "queue": "Karpenter-eks-bh-drupal-cms-dev", "messageKind": "StateChangeKind", "machine": "default-n2tl5", "action": "CordonAndDrain"}
2023-10-26T09:41:08.189Z    INFO    controller.interruption initiating delete from interruption message {"commit": "322822a", "queue": "Karpenter-eks-bh-drupal-cms-dev", "messageKind": "StateChangeKind", "machine": "default-hd9ld", "action": "CordonAndDrain"}
2023-10-26T09:41:08.506Z    INFO    controller.machine.termination  deleted machine {"commit": "322822a", "machine": "default-n2tl5", "provisioner": "default", "node": "", "provider-id": "aws:///ap-southeast-1a/i-06381788068af94ad"}
2023-10-26T09:41:08.513Z    INFO    controller.machine.termination  deleted machine {"commit": "322822a", "machine": "default-c2vwb", "provisioner": "default", "node": "", "provider-id": "aws:///ap-southeast-1a/i-02fbccda0dc610720"}
2023-10-26T09:41:08.537Z    INFO    controller.machine.termination  deleted machine {"commit": "322822a", "machine": "default-hd9ld", "provisioner": "default", "node": "", "provider-id": "aws:///ap-southeast-1a/i-0587faa3ca1b921b2"}

Why is Karpenter cordoning and draining these nodes? There are no other errors in the logs. We have turned off consolidation and use ttlSecondsAfterEmpty instead.

About this issue

  • Original URL
  • State: open
  • Created 8 months ago
  • Reactions: 5
  • Comments: 21 (11 by maintainers)

Most upvoted comments

@jonathan-innis you absolute legend. That was the issue! It turns out I had added the node role instead of the IRSA role to the KMS key Principal. This explains why the instances were terminating immediately. Thank you so much!

Do you know if the instance is trying to encrypt any of its volumes? When we see instances fail to stay up as soon as they are launched, it typically means some default encryption key is in use in the account and the NodeRole used to launch instances lacks a permission like kms:GenerateDataKeyWithoutPlaintext against that encryption key.
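For anyone hitting the same symptom: granting the relevant role access to the EBS default encryption key is done in the key policy. A statement along these lines is a minimal sketch; the account ID and role name are placeholders, kms:GenerateDataKeyWithoutPlaintext is the permission named above, and the other actions are assumptions that commonly accompany it for encrypted-volume launches, so adjust to your own setup:

```json
{
  "Sid": "AllowRoleToUseEbsEncryptionKey",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::111122223333:role/KarpenterIRSARole"
  },
  "Action": [
    "kms:Decrypt",
    "kms:GenerateDataKeyWithoutPlaintext",
    "kms:CreateGrant"
  ],
  "Resource": "*"
}
```

Note that, per the resolution above, the Principal should be the role Karpenter actually uses (the IRSA role in this case), not the node role.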

Consolidation was mentioned in the initial description, but it isn’t clear whether it was turned off before or after the issue was encountered.

p.s. As someone responsible for maintaining a solution that depends on Karpenter within a company, I’m interested in enabling consolidation option. However, my preference lies in ensuring a peaceful night’s rest rather than dealing with the consequences of production-ready features…

From the logs it seems the instance is removed a second after creation… I haven’t seen such behaviour with consolidation disabled.

I also see an inconsistency between the definition and the logged info: the logs contain "capacity-type": "on-demand",

yet the provisioner states:

- key: karpenter.sh/capacity-type
  operator: In
  values:
  - spot

I’m not sure here, as we don’t use spot instances, but it might indicate a bug or an out-of-sync configuration.

p.s. I’m not a maintainer, so please take my words as assumptions rather than statements.