karpenter-provider-aws: How to debug empty nodes that don't get terminated

Version

Karpenter: 0.16.1
EKS: 1.23.9

Expected Behavior

If only daemonset pods are running on a node that was provisioned by Karpenter, the node should be terminated after ttlSecondsAfterEmpty expires.
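For reference, ttlSecondsAfterEmpty lives on the Karpenter Provisioner spec; a quick way to sanity-check the configured value (the provisioner name is taken from the node labels further down — adjust for your cluster):

kubectl get provisioner cicd-prod-karpenter-provisioner-linux-x86-gpu-medium \
  -o jsonpath='{.spec.ttlSecondsAfterEmpty}'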

Actual Behavior

We regularly end up with zombie nodes that Karpenter never terminates.

I am looking for advice on how to debug this:

  • Is there a way to show the current value of the relevant ttlSecondsAfterEmpty counter, or whether the emptiness conditions are fulfilled? The logs do not show any relevant info via kubectl -n cicd-infra-karpenter logs -l app.kubernetes.io/instance=karpenter --all-containers | grep -i ttl
  • Any other debug logs I am missing? (A couple of node-level checks are sketched below this list.)
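For reference, the markers Karpenter uses (discussed further down in this thread) can be inspected directly on the node; a sketch, with <node-name> as a placeholder:

# All Karpenter-related labels/annotations on the node, including
# karpenter.sh/initialized and karpenter.sh/emptiness-timestamp:
kubectl get node <node-name> -o yaml | grep -i karpenter

# Show the initialized label as a column; per the maintainer comment below,
# Karpenter does not start the emptiness TTL until this label is set:
kubectl get nodes --label-columns=karpenter.sh/provisioner-name,karpenter.sh/initialized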

Steps to Reproduce the Problem

  • Have a node that won’t shut down
  • Analyze it

Resource Specs and Logs

Daemonsets (not sure why, but could a daemonset whose nodeSelector targets a Karpenter class affect the termination behavior? See cicd-infra-dcgm-exporter):

fberchtold@W10-RIG:~/luminar/gitops-cicd$ kubectl get daemonsets.apps -A
NAMESPACE                           NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                AGE
cicd-infra-apm                      apm-k8s-infra-otel-agent              6         6         6       6            6           <none>                                                       20d
cicd-infra-aws-efs-csi-driver       efs-csi-node                          10        10        10      10           10          beta.kubernetes.io/os=linux                                  36d
cicd-infra-dcgm-exporter            dcgm-exporter                         4         4         4       4            4           class=cicd-prod-karpenter-provisioner-linux-x86-gpu-medium   11h
cicd-infra-loki                     loki-promtail                         6         6         6       6            6           <none>                                                       7d9h
cicd-infra-monitoring               monitoring-prometheus-node-exporter   10        10        10      10           10          <none>                                                       36d
cicd-infra-smarter-device-manager   cicd-infra-smarter-device-manager     6         6         6       6            6           <none>                                                       5d8h
kube-system                         aws-node                              10        10        10      10           10          <none>                                                       22d
kube-system                         ebs-csi-node                          10        10        10      10           10          kubernetes.io/os=linux                                       42d
kube-system                         ebs-csi-node-windows                  0         0         0       0            0           kubernetes.io/os=windows                                     42d
kube-system                         kube-proxy                            10        10        10      10           10          <none>                                                       45d

The node in question:

fberchtold@W10-RIG:~/luminar/gitops-cicd$ kubectl describe node  ip-10-3-11-122.us-west-2.compute.internal
Name:               ip-10-3-11-122.us-west-2.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=g4dn.4xlarge
                    beta.kubernetes.io/os=linux
                    class=cicd-prod-karpenter-provisioner-linux-x86-gpu-medium
                    failure-domain.beta.kubernetes.io/region=us-west-2
                    failure-domain.beta.kubernetes.io/zone=us-west-2b
                    k8s.io/cloud-provider-aws=ae50b0c1761b634585af5353701af259
                    karpenter.k8s.aws/instance-category=g
                    karpenter.k8s.aws/instance-cpu=16
                    karpenter.k8s.aws/instance-family=g4dn
                    karpenter.k8s.aws/instance-generation=4
                    karpenter.k8s.aws/instance-gpu-count=1
                    karpenter.k8s.aws/instance-gpu-manufacturer=nvidia
                    karpenter.k8s.aws/instance-gpu-memory=16384
                    karpenter.k8s.aws/instance-gpu-name=t4
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-local-nvme=225
                    karpenter.k8s.aws/instance-memory=65536
                    karpenter.k8s.aws/instance-pods=29
                    karpenter.k8s.aws/instance-size=4xlarge
                    karpenter.sh/capacity-type=on-demand
                    karpenter.sh/provisioner-name=cicd-prod-karpenter-provisioner-linux-x86-gpu-medium
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-3-11-122.us-west-2.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=g4dn.4xlarge
                    topology.ebs.csi.aws.com/zone=us-west-2b
                    topology.kubernetes.io/region=us-west-2
                    topology.kubernetes.io/zone=us-west-2b
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.3.11.122
                    csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-01e8f3a596e08564a","efs.csi.aws.com":"i-01e8f3a596e08564a"}
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 26 Oct 2022 11:03:11 -0700
Taints:             environment=cicd-prod:NoSchedule
                    type=linux-x86-gpu-medium:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-3-11-122.us-west-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 26 Oct 2022 17:48:04 -0700
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  Ready            True    Wed, 26 Oct 2022 17:43:36 -0700   Wed, 26 Oct 2022 11:05:06 -0700   KubeletReady                 kubelet is posting ready status
  MemoryPressure   False   Wed, 26 Oct 2022 17:43:36 -0700   Wed, 26 Oct 2022 11:04:36 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 26 Oct 2022 17:43:36 -0700   Wed, 26 Oct 2022 16:11:47 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 26 Oct 2022 17:43:36 -0700   Wed, 26 Oct 2022 11:04:36 -0700   KubeletHasSufficientPID      kubelet has sufficient PID available
Addresses:
  InternalIP:   10.3.11.122
  Hostname:     ip-10-3-11-122.us-west-2.compute.internal
  InternalDNS:  ip-10-3-11-122.us-west-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         16
  ephemeral-storage:           20959212Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      65043764Ki
  pods:                        29
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         15890m
  ephemeral-storage:           18242267924
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      64353588Ki
  pods:                        29
System Info:
  Machine ID:                    ec2291b1f1a2c11a2328d9b1a8911e6f
  System UUID:                   ec2291b1-f1a2-c11a-2328-d9b1a8911e6f
  Boot ID:                       b9c9f69f-f135-480e-afbe-69b6c8aaa45a
  Kernel Version:                5.4.209-116.367.amzn2.x86_64
  OS Image:                      Amazon Linux 2
  Operating System:              linux
  Architecture:                  amd64
  Container Runtime Version:     containerd://1.6.6
  Kubelet Version:               v1.23.9-eks-ba74326
  Kube-Proxy Version:            v1.23.9-eks-ba74326
ProviderID:                      aws:///us-west-2b/i-01e8f3a596e08564a
Non-terminated Pods:             (6 in total)
  Namespace                      Name                                         CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                      ----                                         ------------  ----------  ---------------  -------------  ---
  cicd-infra-aws-efs-csi-driver  efs-csi-node-h9wln                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         6h45m
  cicd-infra-dcgm-exporter       dcgm-exporter-vn8k6                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         6h45m
  cicd-infra-monitoring          monitoring-prometheus-node-exporter-s7lc6    0 (0%)        0 (0%)      0 (0%)           0 (0%)         6h45m
  kube-system                    aws-node-hwdms                               25m (0%)      0 (0%)      0 (0%)           0 (0%)         6h45m
  kube-system                    ebs-csi-node-bgb95                           30m (0%)      300m (1%)   120Mi (0%)       768Mi (1%)     6h45m
  kube-system                    kube-proxy-5sldj                             100m (0%)     0 (0%)      0 (0%)           0 (0%)         6h45m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests    Limits
  --------                    --------    ------
  cpu                         155m (0%)   300m (1%)
  memory                      120Mi (0%)  768Mi (1%)
  ephemeral-storage           0 (0%)      0 (0%)
  hugepages-1Gi               0 (0%)      0 (0%)
  hugepages-2Mi               0 (0%)      0 (0%)
  attachable-volumes-aws-ebs  0           0
Events:
  Type     Reason                 Age                    From     Message
  ----     ------                 ----                   ----     -------
  Normal   NodeHasDiskPressure    102m (x32 over 6h42m)  kubelet  Node ip-10-3-11-122.us-west-2.compute.internal status is now: NodeHasDiskPressure
  Warning  EvictionThresholdMet   101m (x60 over 6h42m)  kubelet  Attempting to reclaim ephemeral-storage
  Normal   NodeHasNoDiskPressure  96m (x35 over 6h43m)   kubelet  Node ip-10-3-11-122.us-west-2.compute.internal status is now: NodeHasNoDiskPressure

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 4
  • Comments: 38 (14 by maintainers)

Most upvoted comments

Confirmed, removing AWS_ENABLE_POD_ENI solves the issue for us; the node got evicted after 30s of idle time 👍

See the other issue, but either don't enable AWS_ENABLE_POD_ENI, or every provisioner needs the label vpc.amazonaws.com/has-trunk-attached: "false".
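A sketch of the second option — as I understand the suggestion, the label goes into the Provisioner's spec.labels; the provisioner name below is one of ours, and the patch is just one way to apply it:

kubectl patch provisioner cicd-prod-karpenter-provisioner-linux-x86-gpu-medium \
  --type merge \
  -p '{"spec":{"labels":{"vpc.amazonaws.com/has-trunk-attached":"false"}}}'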

kubectl get nodes --label-columns karpenter.k8s.aws/instance-family,karpenter.sh/provisioner-name,kubernetes.io/arch,karpenter.sh/initialized
NAME                                        STATUS   ROLES    AGE   VERSION                INSTANCE-FAMILY   PROVISIONER-NAME                                       ARCH    INITIALIZED
ip-10-3-0-40.us-west-2.compute.internal     Ready    <none>   56d   v1.23.9-eks-ba74326                                                                             amd64
ip-10-3-1-208.us-west-2.compute.internal    Ready    <none>   56d   v1.23.9-eks-ba74326                                                                             amd64
ip-10-3-10-98.us-west-2.compute.internal    Ready    <none>   56d   v1.23.9-eks-ba74326                                                                             amd64
ip-10-3-11-166.us-west-2.compute.internal   Ready    <none>   56d   v1.23.9-eks-ba74326                                                                             amd64
ip-10-3-12-154.us-west-2.compute.internal   Ready    <none>   56d   v1.23.9-eks-ba74326                                                                             amd64
ip-10-3-12-81.us-west-2.compute.internal    Ready    <none>   59m   v1.23.12-eks-a64d4ad   c5                cicd-prod-karpenter-provisioner-linux-x86-cpu-medium   amd64
ip-10-3-2-143.us-west-2.compute.internal    Ready    <none>   56d   v1.23.9-eks-ba74326                                                                             amd64
kubectl describe node ip-10-3-12-81.us-west-2.compute.internal
Name:               ip-10-3-12-81.us-west-2.compute.internal
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=c5.4xlarge
                    beta.kubernetes.io/os=linux
                    class=cicd-prod-karpenter-provisioner-linux-x86-cpu-medium
                    failure-domain.beta.kubernetes.io/region=us-west-2
                    failure-domain.beta.kubernetes.io/zone=us-west-2c
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.HLE=true
                    feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
                    feature.node.kubernetes.io/cpu-cpuid.MPX=true
                    feature.node.kubernetes.io/cpu-cpuid.RTM=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-selinux.enabled=true
                    feature.node.kubernetes.io/kernel-version.full=5.10.135
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=10
                    feature.node.kubernetes.io/kernel-version.revision=135
                    feature.node.kubernetes.io/pci-1d0f.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=bottlerocket
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=1.10.1
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=1
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=10
                    k8s.io/cloud-provider-aws=ae50b0c1761b634585af5353701af259
                    karpenter.k8s.aws/instance-category=c
                    karpenter.k8s.aws/instance-cpu=16
                    karpenter.k8s.aws/instance-family=c5
                    karpenter.k8s.aws/instance-generation=5
                    karpenter.k8s.aws/instance-hypervisor=nitro
                    karpenter.k8s.aws/instance-memory=32768
                    karpenter.k8s.aws/instance-pods=234
                    karpenter.k8s.aws/instance-size=4xlarge
                    karpenter.sh/capacity-type=on-demand
                    karpenter.sh/provisioner-name=cicd-prod-karpenter-provisioner-linux-x86-cpu-medium
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-3-12-81.us-west-2.compute.internal
                    kubernetes.io/os=linux
                    node.kubernetes.io/instance-type=c5.4xlarge
                    topology.ebs.csi.aws.com/zone=us-west-2c
                    topology.kubernetes.io/region=us-west-2
                    topology.kubernetes.io/zone=us-west-2c
Annotations:        alpha.kubernetes.io/provided-node-ip: 10.3.12.81
                    csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0f1d81180b1384cf1","efs.csi.aws.com":"i-0f1d81180b1384cf1"}
                    nfd.node.kubernetes.io/extended-resources:
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-...
                    nfd.node.kubernetes.io/worker.version: v0.10.1
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 09 Nov 2022 09:33:10 -0800
Taints:             environment=cicd-prod:NoSchedule
                    type=linux-x86-cpu-medium:NoSchedule
Unschedulable:      false
Lease:
  HolderIdentity:  ip-10-3-12-81.us-west-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 09 Nov 2022 10:32:54 -0800
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 09 Nov 2022 10:30:40 -0800   Wed, 09 Nov 2022 09:33:28 -0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 09 Nov 2022 10:30:40 -0800   Wed, 09 Nov 2022 09:33:28 -0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 09 Nov 2022 10:30:40 -0800   Wed, 09 Nov 2022 09:33:28 -0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 09 Nov 2022 10:30:40 -0800   Wed, 09 Nov 2022 09:33:48 -0800   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.3.12.81
  Hostname:     ip-10-3-12-81.us-west-2.compute.internal
  InternalDNS:  ip-10-3-12-81.us-west-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  25
  cpu:                         16
  ephemeral-storage:           516052280Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      31817100Ki
  pods:                        234
Allocatable:
  attachable-volumes-aws-ebs:  25
  cpu:                         15890m
  ephemeral-storage:           474520038637
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      28817804Ki
  pods:                        234
System Info:
  Machine ID:                    ec2168529f1538cec7d7e63edd350aac
  System UUID:                   ec216852-9f15-38ce-c7d7-e63edd350aac
  Boot ID:                       570359d4-37e0-4c6b-948f-ea9687e971f0
  Kernel Version:                5.10.135
  OS Image:                      Bottlerocket OS 1.10.1 (aws-k8s-1.23)
  Operating System:              linux
  Architecture:                  amd64
  Container Runtime Version:     containerd://1.6.8+bottlerocket
  Kubelet Version:               v1.23.12-eks-a64d4ad
  Kube-Proxy Version:            v1.23.12-eks-a64d4ad
ProviderID:                      aws:///us-west-2c/i-0f1d81180b1384cf1
Non-terminated Pods:             (6 in total)
  Namespace                      Name                                                CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                      ----                                                ------------  ----------  ---------------  -------------  ---
  cicd-infra-aws-efs-csi-driver  efs-csi-node-pvxqf                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         59m
  cicd-infra-gpu-operator        gpu-operator-node-feature-discovery-worker-768kl    0 (0%)        0 (0%)      0 (0%)           0 (0%)         59m
  cicd-infra-monitoring          monitoring-prometheus-node-exporter-bmcfb           0 (0%)        0 (0%)      0 (0%)           0 (0%)         59m
  kube-system                    aws-node-9dmzj                                      25m (0%)      0 (0%)      0 (0%)           0 (0%)         59m
  kube-system                    ebs-csi-node-sb6hn                                  30m (0%)      300m (1%)   120Mi (0%)       768Mi (2%)     59m
  kube-system                    kube-proxy-c4pp4                                    100m (0%)     0 (0%)      0 (0%)           0 (0%)         59m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests    Limits
  --------                    --------    ------
  cpu                         155m (0%)   300m (1%)
  memory                      120Mi (0%)  768Mi (2%)
  ephemeral-storage           0 (0%)      0 (0%)
  hugepages-1Gi               0 (0%)      0 (0%)
  hugepages-2Mi               0 (0%)      0 (0%)
  attachable-volumes-aws-ebs  0           0
Events:
  Type     Reason                   Age                From             Message
  ----     ------                   ----               ----             -------
  Normal   Starting                 59m                kube-proxy
  Normal   RegisteredNode           59m                node-controller  Node ip-10-3-12-81.us-west-2.compute.internal event: Registered Node ip-10-3-12-81.us-west-2.compute.internal in Controller
  Normal   Starting                 59m                kubelet          Starting kubelet.
  Warning  InvalidDiskCapacity      59m                kubelet          invalid capacity 0 on image filesystem
  Normal   NodeHasSufficientMemory  59m (x3 over 59m)  kubelet          Node ip-10-3-12-81.us-west-2.compute.internal status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    59m (x3 over 59m)  kubelet          Node ip-10-3-12-81.us-west-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     59m (x3 over 59m)  kubelet          Node ip-10-3-12-81.us-west-2.compute.internal status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  59m                kubelet          Updated Node Allocatable limit across pods
  Normal   NodeReady                59m                kubelet          Node ip-10-3-12-81.us-west-2.compute.internal status is now: NodeReady

So this might be a different issue (that Karpenter immediately removes a node after init=true), but it sounds like there could be a nasty race condition happening:

  • karpenter provisions a new node
  • ttlSecondsAfterEmpty is set to 30s
  • node is part of the cluster, no pods running, ttlSecondsAfterEmpty logic starts counting
  • something is preventing karpenter.sh/initialized from being set to true for more than 30s
  • after 30s, ttlSecondsAfterEmpty logic is fulfilled
  • some time later, karpenter.sh/initialized is set to true by Karpenter and the node is removed
  • rinse and repeat (a rough way to watch for this ordering is sketched after this list)
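A rough way to watch for the suspected ordering on a freshly provisioned node (sketch; the second command needs the actual node name):

# A row is reprinted whenever the node object changes; note when INITIALIZED flips to true.
kubectl get nodes -w --label-columns=karpenter.sh/provisioner-name,karpenter.sh/initialized

# In another shell, check whether an emptiness timestamp was already recorded at that point:
kubectl get node <node-name> -o yaml | grep emptiness-timestamp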

@jonathan-innis happy to set up a debug session in case it helps you

Hmm, you say this is happening on v0.16.1, or is this on latest (v0.18.1)? We should not be considering emptiness until after the node is initialized (see here). If you are able to repro this consistently, it’s probably better to track this in a separate issue.

If I manually add the label:

kubectl get nodes --label-columns karpenter.k8s.aws/instance-family,karpenter.sh/provisioner-name,karpenter.sh/initialized
NAME                                        STATUS   ROLES    AGE    VERSION               INSTANCE-FAMILY   PROVISIONER-NAME   INITIALIZED
ip-10-2-64-192.us-east-2.compute.internal   Ready    <none>   30d    v1.23.9-eks-ba74326
ip-10-2-64-217.us-east-2.compute.internal   Ready    <none>   4d3h   v1.23.9-eks-ba74326
ip-10-2-64-222.us-east-2.compute.internal   Ready    <none>   23d    v1.23.9-eks-ba74326
ip-10-2-64-76.us-east-2.compute.internal    Ready    <none>   21m    v1.23.9-eks-ba74326   c6id              cicd-scaling       true
ip-10-2-65-96.us-east-2.compute.internal    Ready    <none>   21h    v1.23.9-eks-ba74326   g4dn              cicd-gpu-scaling
ip-10-2-66-213.us-east-2.compute.internal   Ready    <none>   30d    v1.23.9-eks-ba74326
ip-10-2-66-50.us-east-2.compute.internal    Ready    <none>   21m    v1.23.9-eks-ba74326   c6id              cicd-scaling       true
ip-10-2-67-253.us-east-2.compute.internal   Ready    <none>   2d     v1.23.9-eks-ba74326
ip-10-2-67-60.us-east-2.compute.internal    Ready    <none>   30d    v1.23.9-eks-ba74326
ip-10-2-68-141.us-east-2.compute.internal   Ready    <none>   12h    v1.23.9-eks-ba74326   g4dn              cicd-gpu-scaling
kubectl label nodes ip-10-2-65-96.us-east-2.compute.internal karpenter.sh/initialized=true
node/ip-10-2-65-96.us-east-2.compute.internal labeled

Almost immediately in the logs I see:

2022-11-03T10:07:14.274Z	INFO	controller.node	Added TTL to empty node	{"commit": "5d4ae35-dirty", "node": "ip-10-2-65-96.us-east-2.compute.internal"}

Using describe node I can now see the karpenter.sh/emptiness-timestamp annotation on that node.

So I think this is a bug: GPU instances do not always get the karpenter.sh/initialized label, so they are never removed.
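A quick way to spot the stuck ones (sketch): list nodes that carry a provisioner-name label but never received the initialized label, i.e. the ones that will never get an emptiness TTL.

kubectl get nodes -l 'karpenter.sh/provisioner-name,!karpenter.sh/initialized'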

I am having the exact same issue. We have various karpenter scaling configurations, and interestingly it is only the GPU instances that get “stuck”. Not every GPU instance either, just the odd one now and again.

I notice that you also are using GPU instances… Perhaps this is significant to this issue?

I am also using EKS v1.23 and Karpenter v0.16.3 (via the helm chart of the same version).