karpenter-provider-aws: Expired nodes are stuck in `Ready,SchedulingDisabled` without any error

Version

Karpenter: v0.6.3
Kubernetes: v1.21.5

Expected Behavior

Nodes provisioned by Karpenter with expiry should be deleted after they are expired.

Actual Behavior

The nodes are stuck in the Ready,SchedulingDisabled state, and the Karpenter controller logs no error. All the workloads on the node were running and healthy (not a single pod was terminating). I couldn’t even delete the node manually, probably because of Karpenter’s finalizer.

Steps to Reproduce the Problem

  • deploy an istio-ingressgateway deployment with the following nodeAffinity (currently we are running only on amd64 nodes):

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
        weight: 2
      - preference:
          matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - arm64
        weight: 2
      - preference:
          matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - ppc64le
        weight: 2
      - preference:
          matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - s390x
        weight: 2
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
            - arm64
            - ppc64le
            - s390x

  • multiple nodes expire at the same time (or are deleted manually via kubectl)
  • new nodes are created and some of the old nodes are removed
  • but any node that has an istio-ingressgateway pod running on it stays stuck in Ready,SchedulingDisabled forever
  • once you manually delete the istio-ingressgateway pod, Karpenter deletes the node shortly afterwards

Resource Specs and Logs

This is the log for the Karpenter controller.

2022-02-24T18:29:51.573Z	INFO	Successfully created the logger.
2022-02-24T18:29:51.573Z	INFO	Logging level set to: debug
{"level":"info","ts":1645727391.6794684,"logger":"fallback","caller":"injection/injection.go:61","msg":"Starting informers..."}
2022-02-24T18:29:51.679Z	DEBUG	controller.aws	Using AWS region us-east-2	{"commit": "fd19ba2"}
2022-02-24T18:29:51.679Z	DEBUG	controller.aws.launchtemplate	Hydrating the launch template cache with names matching "Karpenter-eks-us-east-2-*"	{"commit": "fd19ba2"}
2022-02-24T18:29:51.818Z	DEBUG	controller.aws.launchtemplate	Finished hydrating the launch template cache with 0 items	{"commit": "fd19ba2"}
I0224 18:29:51.853041       1 leaderelection.go:243] attempting to acquire leader lease karpenter/karpenter-leader-election...
2022-02-24T18:29:51.853Z	INFO	controller	starting metrics server	{"commit": "fd19ba2", "path": "/metrics"}
I0224 19:19:20.394573       1 leaderelection.go:253] successfully acquired lease karpenter/karpenter-leader-election
2022-02-24T19:19:20.394Z	DEBUG	controller.events	Normal	{"commit": "fd19ba2", "object": {"kind":"ConfigMap","namespace":"karpenter","name":"karpenter-leader-election","uid":"4fdf1251-6e82-4570-af83-dc22b78d7597","apiVersion":"v1","resourceVersion":"377720835"}, "reason": "LeaderElection", "message": "karpenter-6d777cd8db-t27gz_54d7bbca-96a0-42ff-a9b8-c1daa60fa982 became leader"}
2022-02-24T19:19:20.394Z	DEBUG	controller.events	Normal	{"commit": "fd19ba2", "object": {"kind":"Lease","namespace":"karpenter","name":"karpenter-leader-election","uid":"1649a049-adcf-4d01-8556-8aa89466c6eb","apiVersion":"coordination.k8s.io/v1","resourceVersion":"377720836"}, "reason": "LeaderElection", "message": "karpenter-6d777cd8db-t27gz_54d7bbca-96a0-42ff-a9b8-c1daa60fa982 became leader"}
2022-02-24T19:19:20.394Z	INFO	controller.controller.counter	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.394Z	INFO	controller.controller.counter	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.394Z	INFO	controller.controller.counter	Starting Controller	{"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner"}
2022-02-24T19:19:20.395Z	INFO	controller.controller.provisioning	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.395Z	INFO	controller.controller.provisioning	Starting Controller	{"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner"}
2022-02-24T19:19:20.395Z	INFO	controller.controller.volume	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.395Z	INFO	controller.controller.volume	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.395Z	INFO	controller.controller.volume	Starting Controller	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim"}
2022-02-24T19:19:20.396Z	INFO	controller.controller.termination	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z	INFO	controller.controller.termination	Starting Controller	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node"}
2022-02-24T19:19:20.396Z	INFO	controller.controller.node	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z	INFO	controller.controller.node	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z	INFO	controller.controller.node	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z	INFO	controller.controller.node	Starting Controller	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node"}
2022-02-24T19:19:20.396Z	INFO	controller.controller.podmetrics	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Pod", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z	INFO	controller.controller.podmetrics	Starting Controller	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Pod"}
2022-02-24T19:19:20.396Z	INFO	controller.controller.nodemetrics	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z	INFO	controller.controller.nodemetrics	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z	INFO	controller.controller.nodemetrics	Starting EventSource	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z	INFO	controller.controller.nodemetrics	Starting Controller	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node"}
2022-02-24T19:19:20.496Z	INFO	controller.controller.termination	Starting workers	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "worker count": 10}
2022-02-24T19:19:20.519Z	INFO	controller.controller.podmetrics	Starting workers	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Pod", "worker count": 1}
2022-02-24T19:19:20.533Z	INFO	controller.controller.counter	Starting workers	{"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "worker count": 10}
2022-02-24T19:19:20.537Z	INFO	controller.controller.volume	Starting workers	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim", "worker count": 1}
2022-02-24T19:19:20.547Z	INFO	controller.controller.node	Starting workers	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "worker count": 10}
2022-02-24T19:19:20.560Z	INFO	controller.controller.nodemetrics	Starting workers	{"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "worker count": 1}
2022-02-24T19:19:20.612Z	INFO	controller.controller.provisioning	Starting workers	{"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "worker count": 10}
2022-02-24T19:19:21.320Z	DEBUG	controller.provisioning	Discovered 318 EC2 instance types	{"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:19:21.374Z	DEBUG	controller.provisioning	Discovered 318 EC2 instance types	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:21.449Z	DEBUG	controller.provisioning	Discovered 318 EC2 instance types	{"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:19:21.461Z	DEBUG	controller.provisioning	Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)]	{"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:19:21.472Z	DEBUG	controller.provisioning	Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)]	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:21.484Z	DEBUG	controller.provisioning	Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)]	{"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:19:21.566Z	DEBUG	controller.provisioning	Discovered EC2 instance types zonal offerings	{"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:19:21.569Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:19:21.572Z	DEBUG	controller.provisioning	Discovered EC2 instance types zonal offerings	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:21.574Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:21.596Z	DEBUG	controller.provisioning	Discovered EC2 instance types zonal offerings	{"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:19:21.599Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:19:51.721Z	INFO	controller.termination	Cordoned node	{"commit": "fd19ba2", "node": "ip-10-142-15-211.us-east-2.compute.internal"}
2022-02-24T19:19:51.746Z	DEBUG	controller.eviction	Evicted pod istio-system/weaver-egressgateway-6b79dcb7bf-g86mc	{"commit": "fd19ba2"}
2022-02-24T19:19:53.125Z	INFO	controller.termination	Cordoned node	{"commit": "fd19ba2", "node": "ip-10-142-15-25.us-east-2.compute.internal"}
2022-02-24T19:19:53.143Z	DEBUG	controller.eviction	Evicted pod istio-system/weaver-egressgateway-6b79dcb7bf-fsv62	{"commit": "fd19ba2"}
2022-02-24T19:19:53.171Z	DEBUG	controller.eviction	Evicted pod istio-system/istio-egressgateway-5b9b5bb74-qhc4l	{"commit": "fd19ba2"}
2022-02-24T19:19:54.281Z	INFO	controller.termination	Cordoned node	{"commit": "fd19ba2", "node": "ip-10-142-13-70.us-east-2.compute.internal"}
2022-02-24T19:19:54.303Z	DEBUG	controller.eviction	Evicted pod karpenter/inflate-5549549d89-b5q89	{"commit": "fd19ba2"}
2022-02-24T19:19:54.325Z	DEBUG	controller.eviction	Evicted pod karpenter/inflate-5549549d89-czrv5	{"commit": "fd19ba2"}
2022-02-24T19:19:54.347Z	DEBUG	controller.eviction	Evicted pod karpenter/inflate-5549549d89-dpdrj	{"commit": "fd19ba2"}
2022-02-24T19:19:54.372Z	DEBUG	controller.eviction	Evicted pod karpenter/inflate-5549549d89-wg744	{"commit": "fd19ba2"}
2022-02-24T19:19:54.395Z	DEBUG	controller.eviction	Evicted pod karpenter/inflate-5549549d89-gzcjh	{"commit": "fd19ba2"}
2022-02-24T19:19:55.226Z	INFO	controller.termination	Cordoned node	{"commit": "fd19ba2", "node": "ip-10-142-15-14.us-east-2.compute.internal"}
2022-02-24T19:19:55.245Z	DEBUG	controller.eviction	Evicted pod istio-system/istio-egressgateway-5b9b5bb74-pn2fc	{"commit": "fd19ba2"}
2022-02-24T19:19:55.269Z	DEBUG	controller.eviction	Evicted pod istio-system/istiod-6485d7d6f6-6p24z	{"commit": "fd19ba2"}
2022-02-24T19:19:55.302Z	DEBUG	controller.eviction	Evicted pod istio-system/weaver-egressgateway-6b79dcb7bf-r8vw7	{"commit": "fd19ba2"}
2022-02-24T19:19:55.357Z	DEBUG	controller.eviction	Evicted pod kube-system/calico-typha-horizontal-autoscaler-7cfc46f454-fvjc9	{"commit": "fd19ba2"}
2022-02-24T19:19:55.373Z	DEBUG	controller.eviction	Evicted pod kube-system/coredns-58c7b8dcf7-xjzx6	{"commit": "fd19ba2"}
2022-02-24T19:19:55.411Z	INFO	controller.provisioning	Batched 5 pods in 1.086154801s	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:55.421Z	DEBUG	controller.eviction	Evicted pod kube-system/ebs-csi-controller-6d8b4cd9f4-jnqk4	{"commit": "fd19ba2"}
2022-02-24T19:19:55.439Z	DEBUG	controller.eviction	Evicted pod kube-system/aws-load-balancer-controller-7bf6b99ddd-vnf5f	{"commit": "fd19ba2"}
2022-02-24T19:19:55.521Z	INFO	controller.provisioning	Computed packing of 1 node(s) for 5 pod(s) with instance type option(s) [m5.4xlarge]	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:56.180Z	INFO	controller.termination	Cordoned node	{"commit": "fd19ba2", "node": "ip-10-142-15-174.us-east-2.compute.internal"}
2022-02-24T19:19:56.195Z	DEBUG	controller.eviction	Evicted pod istio-system/weaver-egressgateway-6b79dcb7bf-knc7x	{"commit": "fd19ba2"}
2022-02-24T19:19:56.217Z	DEBUG	controller.eviction	Evicted pod karpenter/inflate-5549549d89-pbdcx	{"commit": "fd19ba2"}
2022-02-24T19:19:56.251Z	DEBUG	controller.eviction	Evicted pod karpenter/inflate-5549549d89-4gxt6	{"commit": "fd19ba2"}
2022-02-24T19:19:56.286Z	DEBUG	controller.eviction	Evicted pod karpenter/inflate-5549549d89-b8w4p	{"commit": "fd19ba2"}
2022-02-24T19:19:56.329Z	DEBUG	controller.eviction	Evicted pod karpenter/inflate-5549549d89-pj96f	{"commit": "fd19ba2"}
2022-02-24T19:19:56.357Z	DEBUG	controller.eviction	Evicted pod karpenter/inflate-5549549d89-mhdrw	{"commit": "fd19ba2"}
2022-02-24T19:19:57.404Z	INFO	controller.provisioning	Launched instance: i-09dd8f13622e427a0, hostname: ip-10-142-13-67.us-east-2.compute.internal, type: m5.4xlarge, zone: us-east-2a, capacityType: on-demand	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:57.423Z	INFO	controller.provisioning	Bound 5 pod(s) to node ip-10-142-13-67.us-east-2.compute.internal	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:57.424Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:58.424Z	INFO	controller.provisioning	Batched 4 pods in 1.000289324s	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:58.426Z	INFO	controller.provisioning	Waiting for unschedulable pods	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:20:07.562Z	INFO	controller.termination	Deleted node	{"commit": "fd19ba2", "node": "ip-10-142-15-14.us-east-2.compute.internal"}
2022-02-24T19:20:28.478Z	INFO	controller.termination	Deleted node	{"commit": "fd19ba2", "node": "ip-10-142-15-174.us-east-2.compute.internal"}
2022-02-24T19:20:28.557Z	DEBUG	controller.provisioning	Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)]	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:24:22.261Z	DEBUG	controller.provisioning	Discovered 318 EC2 instance types	{"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:24:22.329Z	DEBUG	controller.provisioning	Discovered 318 EC2 instance types	{"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:24:22.355Z	DEBUG	controller.provisioning	Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)]	{"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:24:22.366Z	DEBUG	controller.provisioning	Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)]	{"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:24:22.389Z	DEBUG	controller.provisioning	Discovered 318 EC2 instance types	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:24:22.475Z	DEBUG	controller.provisioning	Discovered EC2 instance types zonal offerings	{"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:24:22.481Z	DEBUG	controller.provisioning	Discovered EC2 instance types zonal offerings	{"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:24:22.494Z	DEBUG	controller.provisioning	Discovered EC2 instance types zonal offerings	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:29:23.200Z	DEBUG	controller.provisioning	Discovered 318 EC2 instance types	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:29:23.266Z	DEBUG	controller.provisioning	Discovered 318 EC2 instance types	{"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:29:23.276Z	DEBUG	controller.provisioning	Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)]	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:29:23.286Z	DEBUG	controller.provisioning	Discovered 318 EC2 instance types	{"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:29:23.301Z	DEBUG	controller.provisioning	Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)]	{"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:29:23.368Z	DEBUG	controller.provisioning	Discovered EC2 instance types zonal offerings	{"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:29:23.393Z	DEBUG	controller.provisioning	Discovered EC2 instance types zonal offerings	{"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:29:23.420Z	DEBUG	controller.provisioning	Discovered EC2 instance types zonal offerings	{"commit": "fd19ba2", "provisioner": "prometheus"}
### This happens after I manually delete `istio-ingressgateway-db9cf4489-tjfj2`
2022-02-24T19:30:05.305Z	DEBUG	controller.eviction	Evicted pod kube-system/coredns-58c7b8dcf7-msjts	{"commit": "fd19ba2"}
2022-02-24T19:30:06.538Z	DEBUG	controller.eviction	Evicted pod istio-system/istio-ingressgateway-db9cf4489-tjfj2	{"commit": "fd19ba2"}
2022-02-24T19:30:15.672Z	INFO	controller.termination	Deleted node	{"commit": "fd19ba2", "node": "ip-10-142-15-25.us-east-2.compute.internal"}
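
To spot stuck nodes in a log like the one above, it is enough to pair each "Cordoned node" line with a later "Deleted node" line for the same node. The following is a minimal sketch, assuming the log format shown above; the function name `stuckNodes` and the regular expression are illustrative and not part of Karpenter.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// nodeEvent matches termination controller lines such as:
//   ... INFO controller.termination Cordoned node {"commit": "...", "node": "ip-..."}
// The pattern is an assumption based on the log format in this issue.
var nodeEvent = regexp.MustCompile(`controller\.termination\s+(Cordoned|Deleted) node\s+\{.*"node": "([^"]+)"\}`)

// stuckNodes returns, sorted, the nodes that were cordoned but never
// reported as deleted in the given log text.
func stuckNodes(log string) []string {
	cordoned := map[string]bool{}
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := nodeEvent.FindStringSubmatch(sc.Text())
		if m == nil {
			continue // not a cordon/delete line
		}
		switch m[1] {
		case "Cordoned":
			cordoned[m[2]] = true
		case "Deleted":
			delete(cordoned, m[2])
		}
	}
	stuck := make([]string, 0, len(cordoned))
	for node := range cordoned {
		stuck = append(stuck, node)
	}
	sort.Strings(stuck)
	return stuck
}

func main() {
	// Three lines taken from the log above (tabs replaced by spaces).
	sample := `
2022-02-24T19:19:53.125Z INFO controller.termination Cordoned node {"commit": "fd19ba2", "node": "ip-10-142-15-25.us-east-2.compute.internal"}
2022-02-24T19:19:55.226Z INFO controller.termination Cordoned node {"commit": "fd19ba2", "node": "ip-10-142-15-14.us-east-2.compute.internal"}
2022-02-24T19:20:07.562Z INFO controller.termination Deleted node {"commit": "fd19ba2", "node": "ip-10-142-15-14.us-east-2.compute.internal"}
`
	fmt.Println(stuckNodes(sample)) // [ip-10-142-15-25.us-east-2.compute.internal]
}
```

Run against the full log, this kind of check would list every node that was cordoned without a matching deletion at that point in time.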

Not sure if this is related to https://github.com/aws/karpenter/issues/1166

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 21 (14 by maintainers)

Most upvoted comments

Closing this as the issue looks solved. Feel free to reopen if you still see this!

It seems there is a bug in the eviction API where errors returned due to duplicate PDBs exclude the “reason” field.
https://github.com/kubernetes/kubernetes/blob/v1.21.5/pkg/registry/core/pod/storage/eviction.go#L194-L198
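
To check whether a pod is in this situation, one can count how many PDBs select it. The sketch below is a simplified illustration only: the `pdb` type, the labels, and the PDB names are hypothetical, only `matchLabels` selectors are modelled (no `matchExpressions`), all objects are assumed to be in the same namespace, and an empty selector is treated as matching every pod.

```go
package main

import "fmt"

// pdb is a hypothetical, simplified stand-in for a PodDisruptionBudget:
// only its name and spec.selector.matchLabels are modelled.
type pdb struct {
	Name        string
	MatchLabels map[string]string
}

// matches reports whether every selector label is present on the pod
// with the same value.
func matches(selector, podLabels map[string]string) bool {
	for key, want := range selector {
		if got, ok := podLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// coveringPDBs returns the names of all PDBs whose selector matches the pod.
func coveringPDBs(pdbs []pdb, podLabels map[string]string) []string {
	var names []string
	for _, p := range pdbs {
		if matches(p.MatchLabels, podLabels) {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	// Hypothetical labels and PDB names, for illustration only.
	pod := map[string]string{"app": "istio-ingressgateway", "istio": "ingressgateway"}
	pdbs := []pdb{
		{Name: "ingressgateway-a", MatchLabels: map[string]string{"app": "istio-ingressgateway"}},
		{Name: "ingressgateway-b", MatchLabels: map[string]string{"istio": "ingressgateway"}},
		{Name: "egressgateway", MatchLabels: map[string]string{"app": "istio-egressgateway"}},
	}
	covering := coveringPDBs(pdbs, pod)
	if len(covering) > 1 {
		fmt.Printf("pod is covered by %d PDBs: %v\n", len(covering), covering)
	}
}
```

A pod for which this returns more than one name is a candidate for the eviction failure described here.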

This is important as the method we use to determine the type of error relies on the “reason” field being populated.
https://github.com/aws/karpenter/blob/main/pkg/controllers/termination/eviction.go#L94-L96
https://github.com/kubernetes/apimachinery/blob/master/pkg/api/errors/errors.go#L711

We are planning to implement a fix in Karpenter that works around this bug (via #1432), but will also pursue a fix upstream.
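
The effect of the missing “reason” field can be illustrated with a small sketch. It assumes that the duplicate-PDB response carries HTTP code 500 with an empty reason; the `status` type and both helper functions are hypothetical stand-ins, not the apimachinery or Karpenter code, and checking the code as a fallback is only one possible workaround, not necessarily what #1432 does.

```go
package main

import "fmt"

// status is a hypothetical, trimmed-down mirror of the API server's Status
// object; only the two fields relevant to this bug are kept.
type status struct {
	Reason string
	Code   int
}

// isInternalByReason classifies an error by its reason alone, which is the
// kind of check that misses a response with an empty reason.
func isInternalByReason(s status) bool {
	return s.Reason == "InternalError"
}

// isInternalByReasonOrCode also falls back to the HTTP status code, so a
// 500 response without a reason is still recognized.
func isInternalByReasonOrCode(s status) bool {
	return s.Reason == "InternalError" || s.Code == 500
}

func main() {
	// Assumed shape of the duplicate-PDB eviction response: code 500, no reason.
	duplicatePDB := status{Code: 500}

	fmt.Println(isInternalByReason(duplicatePDB))       // false: the error goes unrecognized
	fmt.Println(isInternalByReasonOrCode(duplicatePDB)) // true
}
```

With a reason-only check, such an error is not classified as an internal error, which is consistent with the absence of any eviction error in the controller log.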

Thanks for the info @nandiheath. It does indeed seem there is something going on with the PDB, but as you suggested, more logging around pod eviction would help in determining the root cause.

I’m working on a fix which will provide additional logging. Perhaps we can continue troubleshooting once the fix has been released.