karpenter-provider-aws: Karpenter nodes terminate immediately due to controller.interruption initiating delete from interruption message
Description
Observed Behavior: Karpenter nodes are launched and then deleted immediately, and pods stay stuck in Pending.
Expected Behavior: Pods are scheduled onto the new nodes, and the nodes are not terminated.
Reproduction Steps (Please include YAML):
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  labels:
    intent: apps
  limits:
    resources:
      cpu: 1k
  providerRef:
    name: default
  requirements:
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values:
        - m
    - key: karpenter.k8s.aws/instance-cpu
      operator: In
      values:
        - "2"
    - key: karpenter.k8s.aws/instance-generation
      operator: In
      values:
        - "5"
        - "6"
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - spot
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
    - key: kubernetes.io/os
      operator: In
      values:
        - linux
  ttlSecondsAfterEmpty: 30
Versions:
- Chart Version: v0.31.0
- Kubernetes Version (kubectl version): v1.27
2023-10-26T09:41:06.888Z INFO controller.machine.lifecycle launched machine {"commit": "322822a", "machine": "default-n2tl5", "provisioner": "default", "provider-id": "aws:///ap-southeast-1a/i-06381788068af94ad", "instance-type": "m5a.large", "zone": "ap-southeast-1a", "capacity-type": "on-demand", "allocatable": {"cpu":"1930m","ephemeral-storage":"44Gi","memory":"6903Mi","pods":"29","vpc.amazonaws.com/pod-eni":"9"}}
2023-10-26T09:41:07.098Z INFO controller.machine.lifecycle launched machine {"commit": "322822a", "machine": "default-hd9ld", "provisioner": "default", "provider-id": "aws:///ap-southeast-1a/i-0587faa3ca1b921b2", "instance-type": "m5a.large", "zone": "ap-southeast-1a", "capacity-type": "on-demand", "allocatable": {"cpu":"1930m","ephemeral-storage":"44Gi","memory":"6903Mi","pods":"29","vpc.amazonaws.com/pod-eni":"9"}}
2023-10-26T09:41:07.992Z INFO controller.interruption initiating delete from interruption message {"commit": "322822a", "queue": "Karpenter-eks-bh-drupal-cms-dev", "messageKind": "StateChangeKind", "machine": "default-c2vwb", "action": "CordonAndDrain"}
2023-10-26T09:41:08.077Z INFO controller.interruption initiating delete from interruption message {"commit": "322822a", "queue": "Karpenter-eks-bh-drupal-cms-dev", "messageKind": "StateChangeKind", "machine": "default-n2tl5", "action": "CordonAndDrain"}
2023-10-26T09:41:08.189Z INFO controller.interruption initiating delete from interruption message {"commit": "322822a", "queue": "Karpenter-eks-bh-drupal-cms-dev", "messageKind": "StateChangeKind", "machine": "default-hd9ld", "action": "CordonAndDrain"}
2023-10-26T09:41:08.506Z INFO controller.machine.termination deleted machine {"commit": "322822a", "machine": "default-n2tl5", "provisioner": "default", "node": "", "provider-id": "aws:///ap-southeast-1a/i-06381788068af94ad"}
2023-10-26T09:41:08.513Z INFO controller.machine.termination deleted machine {"commit": "322822a", "machine": "default-c2vwb", "provisioner": "default", "node": "", "provider-id": "aws:///ap-southeast-1a/i-02fbccda0dc610720"}
2023-10-26T09:41:08.537Z INFO controller.machine.termination deleted machine {"commit": "322822a", "machine": "default-hd9ld", "provisioner": "default", "node": "", "provider-id": "aws:///ap-southeast-1a/i-0587faa3ca1b921b2"}
Why is Karpenter cordoning and draining these nodes? No other errors are shown in the logs. We turned off consolidation and use ttlSecondsAfterEmpty.
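To put a number on "immediately", the launch and delete timestamps in the logs above can be compared per machine. The following is only a rough sketch: the two sample lines are copied from the logs and trimmed to the fields used here, and the helper names (parse_line, machine_lifetimes) are made up for illustration and are not part of Karpenter.

```python
import json
from datetime import datetime

# Two lines copied from the logs above, trimmed to the fields this sketch reads.
LOG_LINES = [
    '2023-10-26T09:41:06.888Z INFO controller.machine.lifecycle launched machine '
    '{"commit": "322822a", "machine": "default-n2tl5", "capacity-type": "on-demand"}',
    '2023-10-26T09:41:08.506Z INFO controller.machine.termination deleted machine '
    '{"commit": "322822a", "machine": "default-n2tl5", "node": ""}',
]


def parse_line(line):
    """Split one log line into (timestamp, message, fields)."""
    timestamp, _level, rest = line.split(" ", 2)
    brace = rest.index("{")  # the JSON payload starts at the first brace
    message = rest[:brace].strip()
    fields = json.loads(rest[brace:])
    when = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return when, message, fields


def machine_lifetimes(lines):
    """Return seconds between 'launched machine' and 'deleted machine' per machine."""
    launched, lifetimes = {}, {}
    for line in lines:
        when, message, fields = parse_line(line)
        name = fields["machine"]
        if message.endswith("launched machine"):
            launched[name] = when
        elif message.endswith("deleted machine") and name in launched:
            lifetimes[name] = (when - launched[name]).total_seconds()
    return lifetimes


lifetimes = machine_lifetimes(LOG_LINES)
print(lifetimes)
```

For default-n2tl5 this gives a lifetime of under two seconds, and the "node" field is empty in the delete line, which suggests the instance never registered as a node before it was removed.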
About this issue
- State: open
- Created 8 months ago
- Reactions: 5
- Comments: 21 (11 by maintainers)
@jonathan-innis you absolute legend. That was the issue! Looks like I had added the node role instead of the IRSA role to the KMS Principal. This explains why the instances were immediately terminating. Thank you so much!
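For anyone landing here with the same symptom, this is roughly what the fix looks like as a KMS key policy statement, built here as a plain Python dict and printed as JSON. Treat it as a sketch under assumptions: the role ARN and the Sid are placeholders, and only kms:GenerateDataKeyWithoutPlaintext is named in this thread; the other actions are ones commonly granted for encrypted EBS volumes and should be checked against the AWS documentation for your setup.

```python
import json

# Hypothetical ARN: substitute the IAM role the Karpenter controller assumes via IRSA.
CONTROLLER_ROLE_ARN = "arn:aws:iam::111122223333:role/KarpenterControllerRole-example"


def kms_statement(principal_arn):
    """Build one key policy statement that lets the given role use the key."""
    return {
        "Sid": "AllowKarpenterControllerUseOfKey",  # placeholder Sid
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": [
            "kms:GenerateDataKeyWithoutPlaintext",  # the permission named in this thread
            # Assumption: often granted alongside it for encrypted EBS volumes.
            "kms:Decrypt",
            "kms:DescribeKey",
            "kms:CreateGrant",
        ],
        "Resource": "*",  # in a key policy, "*" refers to the key the policy is attached to
    }


statement = kms_statement(CONTROLLER_ROLE_ARN)
print(json.dumps(statement, indent=2))
```

The point of the sketch is the Principal: in the case above, the statement listed the node role, while the role that needed access to the key was the IRSA role.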
Do you know if the instance is trying to encrypt any of the volumes? When we see instances fail to stay up as soon as they are launched, it typically means that some type of default encryption key is in use in the account and the NodeRole used to launch the instances doesn’t have something like kms:GenerateDataKeyWithoutPlaintext permissions against the encryption key.
Consolidation was mentioned in the initial description, yet it wasn’t stated when consolidation was turned off: before or after the issue was encountered.
p.s. As someone responsible for maintaining a solution that depends on Karpenter within a company, I’m interested in enabling the consolidation option. However, my preference lies in ensuring a peaceful night’s rest rather than dealing with the consequences of production-ready features…
From the logs it can be seen that the instance is removed a second after creation… I haven’t seen such behaviour with the consolidation option disabled.
Yet I see an inconsistency between the definition and the logged info: the logs contain "capacity-type": "on-demand", yet the Provisioner only allows spot for karpenter.sh/capacity-type.
Not sure here, as we don’t use spot instances, but this might indicate a bug or a missing or out-of-sync configuration.
p.s. I’m not a maintainer, so please take my words as assumptions rather than statements.
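The capacity-type mismatch can also be checked mechanically. A minimal sketch, assuming the allowed values are copied by hand from the Provisioner requirement and the launched values from the "launched machine" log lines; the function name mismatched is made up for illustration.

```python
# Allowed values copied from the Provisioner requirement for karpenter.sh/capacity-type.
ALLOWED_CAPACITY_TYPES = {"spot"}

# (machine, capacity-type) pairs copied from the "launched machine" log lines.
LAUNCHED = [
    ("default-n2tl5", "on-demand"),
    ("default-hd9ld", "on-demand"),
]


def mismatched(launched, allowed):
    """Return the machines whose logged capacity type is not allowed by the Provisioner."""
    return [name for name, capacity_type in launched if capacity_type not in allowed]


violations = mismatched(LAUNCHED, ALLOWED_CAPACITY_TYPES)
print(violations)
```

Both launched machines show up as violations, which is consistent with the suspicion that the Provisioner in the issue and the one actually applied in the cluster are not the same.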