karpenter-provider-aws: Expired nodes are stuck in `Ready,SchedulingDisabled` without any error
Version
Karpenter: v0.6.3
Kubernetes: v1.21.5
Expected Behavior
Nodes provisioned by Karpenter with expiry should be deleted after they are expired.
Actual Behavior
The nodes are stuck in the `Ready,SchedulingDisabled` state and there was no error from the Karpenter controller.
All the workloads on the node were running and healthy (no single pod was terminating).
I couldn’t even delete the node manually probably because of Karpenter’s finalizer.
Steps to Reproduce the Problem
- Deploy an istio-ingressgateway deployment with the following nodeAffinity (currently we are running only on amd64 nodes):
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - preference:
          matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
        weight: 2
      - preference:
          matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - arm64
        weight: 2
      - preference:
          matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - ppc64le
        weight: 2
      - preference:
          matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - s390x
        weight: 2
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/arch
            operator: In
            values:
            - amd64
            - arm64
            - ppc64le
            - s390x
- Multiple nodes expire at the same time (or are deleted manually via `kubectl`).
- New nodes are created and some of the old nodes are removed.
- The nodes that have an `istio-ingressgateway` pod running on them stay stuck in `Ready,SchedulingDisabled` forever.
- Once you manually delete the `istio-ingressgateway` pod, the node is deleted by Karpenter shortly afterwards.
Resource Specs and Logs
This is the log for the Karpenter controller.
2022-02-24T18:29:51.573Z INFO Successfully created the logger.
2022-02-24T18:29:51.573Z INFO Logging level set to: debug
{"level":"info","ts":1645727391.6794684,"logger":"fallback","caller":"injection/injection.go:61","msg":"Starting informers..."}
2022-02-24T18:29:51.679Z DEBUG controller.aws Using AWS region us-east-2 {"commit": "fd19ba2"}
2022-02-24T18:29:51.679Z DEBUG controller.aws.launchtemplate Hydrating the launch template cache with names matching "Karpenter-eks-us-east-2-*" {"commit": "fd19ba2"}
2022-02-24T18:29:51.818Z DEBUG controller.aws.launchtemplate Finished hydrating the launch template cache with 0 items {"commit": "fd19ba2"}
I0224 18:29:51.853041 1 leaderelection.go:243] attempting to acquire leader lease karpenter/karpenter-leader-election...
2022-02-24T18:29:51.853Z INFO controller starting metrics server {"commit": "fd19ba2", "path": "/metrics"}
I0224 19:19:20.394573 1 leaderelection.go:253] successfully acquired lease karpenter/karpenter-leader-election
2022-02-24T19:19:20.394Z DEBUG controller.events Normal {"commit": "fd19ba2", "object": {"kind":"ConfigMap","namespace":"karpenter","name":"karpenter-leader-election","uid":"4fdf1251-6e82-4570-af83-dc22b78d7597","apiVersion":"v1","resourceVersion":"377720835"}, "reason": "LeaderElection", "message": "karpenter-6d777cd8db-t27gz_54d7bbca-96a0-42ff-a9b8-c1daa60fa982 became leader"}
2022-02-24T19:19:20.394Z DEBUG controller.events Normal {"commit": "fd19ba2", "object": {"kind":"Lease","namespace":"karpenter","name":"karpenter-leader-election","uid":"1649a049-adcf-4d01-8556-8aa89466c6eb","apiVersion":"coordination.k8s.io/v1","resourceVersion":"377720836"}, "reason": "LeaderElection", "message": "karpenter-6d777cd8db-t27gz_54d7bbca-96a0-42ff-a9b8-c1daa60fa982 became leader"}
2022-02-24T19:19:20.394Z INFO controller.controller.counter Starting EventSource {"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.394Z INFO controller.controller.counter Starting EventSource {"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.394Z INFO controller.controller.counter Starting Controller {"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner"}
2022-02-24T19:19:20.395Z INFO controller.controller.provisioning Starting EventSource {"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.395Z INFO controller.controller.provisioning Starting Controller {"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner"}
2022-02-24T19:19:20.395Z INFO controller.controller.volume Starting EventSource {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.395Z INFO controller.controller.volume Starting EventSource {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.395Z INFO controller.controller.volume Starting Controller {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim"}
2022-02-24T19:19:20.396Z INFO controller.controller.termination Starting EventSource {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z INFO controller.controller.termination Starting Controller {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node"}
2022-02-24T19:19:20.396Z INFO controller.controller.node Starting EventSource {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z INFO controller.controller.node Starting EventSource {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z INFO controller.controller.node Starting EventSource {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z INFO controller.controller.node Starting Controller {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node"}
2022-02-24T19:19:20.396Z INFO controller.controller.podmetrics Starting EventSource {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Pod", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z INFO controller.controller.podmetrics Starting Controller {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Pod"}
2022-02-24T19:19:20.396Z INFO controller.controller.nodemetrics Starting EventSource {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z INFO controller.controller.nodemetrics Starting EventSource {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z INFO controller.controller.nodemetrics Starting EventSource {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "source": "kind source: /, Kind="}
2022-02-24T19:19:20.396Z INFO controller.controller.nodemetrics Starting Controller {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node"}
2022-02-24T19:19:20.496Z INFO controller.controller.termination Starting workers {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "worker count": 10}
2022-02-24T19:19:20.519Z INFO controller.controller.podmetrics Starting workers {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Pod", "worker count": 1}
2022-02-24T19:19:20.533Z INFO controller.controller.counter Starting workers {"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "worker count": 10}
2022-02-24T19:19:20.537Z INFO controller.controller.volume Starting workers {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "PersistentVolumeClaim", "worker count": 1}
2022-02-24T19:19:20.547Z INFO controller.controller.node Starting workers {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "worker count": 10}
2022-02-24T19:19:20.560Z INFO controller.controller.nodemetrics Starting workers {"commit": "fd19ba2", "reconciler group": "", "reconciler kind": "Node", "worker count": 1}
2022-02-24T19:19:20.612Z INFO controller.controller.provisioning Starting workers {"commit": "fd19ba2", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "worker count": 10}
2022-02-24T19:19:21.320Z DEBUG controller.provisioning Discovered 318 EC2 instance types {"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:19:21.374Z DEBUG controller.provisioning Discovered 318 EC2 instance types {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:21.449Z DEBUG controller.provisioning Discovered 318 EC2 instance types {"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:19:21.461Z DEBUG controller.provisioning Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)] {"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:19:21.472Z DEBUG controller.provisioning Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)] {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:21.484Z DEBUG controller.provisioning Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)] {"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:19:21.566Z DEBUG controller.provisioning Discovered EC2 instance types zonal offerings {"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:19:21.569Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:19:21.572Z DEBUG controller.provisioning Discovered EC2 instance types zonal offerings {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:21.574Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:21.596Z DEBUG controller.provisioning Discovered EC2 instance types zonal offerings {"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:19:21.599Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:19:51.721Z INFO controller.termination Cordoned node {"commit": "fd19ba2", "node": "ip-10-142-15-211.us-east-2.compute.internal"}
2022-02-24T19:19:51.746Z DEBUG controller.eviction Evicted pod istio-system/weaver-egressgateway-6b79dcb7bf-g86mc {"commit": "fd19ba2"}
2022-02-24T19:19:53.125Z INFO controller.termination Cordoned node {"commit": "fd19ba2", "node": "ip-10-142-15-25.us-east-2.compute.internal"}
2022-02-24T19:19:53.143Z DEBUG controller.eviction Evicted pod istio-system/weaver-egressgateway-6b79dcb7bf-fsv62 {"commit": "fd19ba2"}
2022-02-24T19:19:53.171Z DEBUG controller.eviction Evicted pod istio-system/istio-egressgateway-5b9b5bb74-qhc4l {"commit": "fd19ba2"}
2022-02-24T19:19:54.281Z INFO controller.termination Cordoned node {"commit": "fd19ba2", "node": "ip-10-142-13-70.us-east-2.compute.internal"}
2022-02-24T19:19:54.303Z DEBUG controller.eviction Evicted pod karpenter/inflate-5549549d89-b5q89 {"commit": "fd19ba2"}
2022-02-24T19:19:54.325Z DEBUG controller.eviction Evicted pod karpenter/inflate-5549549d89-czrv5 {"commit": "fd19ba2"}
2022-02-24T19:19:54.347Z DEBUG controller.eviction Evicted pod karpenter/inflate-5549549d89-dpdrj {"commit": "fd19ba2"}
2022-02-24T19:19:54.372Z DEBUG controller.eviction Evicted pod karpenter/inflate-5549549d89-wg744 {"commit": "fd19ba2"}
2022-02-24T19:19:54.395Z DEBUG controller.eviction Evicted pod karpenter/inflate-5549549d89-gzcjh {"commit": "fd19ba2"}
2022-02-24T19:19:55.226Z INFO controller.termination Cordoned node {"commit": "fd19ba2", "node": "ip-10-142-15-14.us-east-2.compute.internal"}
2022-02-24T19:19:55.245Z DEBUG controller.eviction Evicted pod istio-system/istio-egressgateway-5b9b5bb74-pn2fc {"commit": "fd19ba2"}
2022-02-24T19:19:55.269Z DEBUG controller.eviction Evicted pod istio-system/istiod-6485d7d6f6-6p24z {"commit": "fd19ba2"}
2022-02-24T19:19:55.302Z DEBUG controller.eviction Evicted pod istio-system/weaver-egressgateway-6b79dcb7bf-r8vw7 {"commit": "fd19ba2"}
2022-02-24T19:19:55.357Z DEBUG controller.eviction Evicted pod kube-system/calico-typha-horizontal-autoscaler-7cfc46f454-fvjc9 {"commit": "fd19ba2"}
2022-02-24T19:19:55.373Z DEBUG controller.eviction Evicted pod kube-system/coredns-58c7b8dcf7-xjzx6 {"commit": "fd19ba2"}
2022-02-24T19:19:55.411Z INFO controller.provisioning Batched 5 pods in 1.086154801s {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:55.421Z DEBUG controller.eviction Evicted pod kube-system/ebs-csi-controller-6d8b4cd9f4-jnqk4 {"commit": "fd19ba2"}
2022-02-24T19:19:55.439Z DEBUG controller.eviction Evicted pod kube-system/aws-load-balancer-controller-7bf6b99ddd-vnf5f {"commit": "fd19ba2"}
2022-02-24T19:19:55.521Z INFO controller.provisioning Computed packing of 1 node(s) for 5 pod(s) with instance type option(s) [m5.4xlarge] {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:56.180Z INFO controller.termination Cordoned node {"commit": "fd19ba2", "node": "ip-10-142-15-174.us-east-2.compute.internal"}
2022-02-24T19:19:56.195Z DEBUG controller.eviction Evicted pod istio-system/weaver-egressgateway-6b79dcb7bf-knc7x {"commit": "fd19ba2"}
2022-02-24T19:19:56.217Z DEBUG controller.eviction Evicted pod karpenter/inflate-5549549d89-pbdcx {"commit": "fd19ba2"}
2022-02-24T19:19:56.251Z DEBUG controller.eviction Evicted pod karpenter/inflate-5549549d89-4gxt6 {"commit": "fd19ba2"}
2022-02-24T19:19:56.286Z DEBUG controller.eviction Evicted pod karpenter/inflate-5549549d89-b8w4p {"commit": "fd19ba2"}
2022-02-24T19:19:56.329Z DEBUG controller.eviction Evicted pod karpenter/inflate-5549549d89-pj96f {"commit": "fd19ba2"}
2022-02-24T19:19:56.357Z DEBUG controller.eviction Evicted pod karpenter/inflate-5549549d89-mhdrw {"commit": "fd19ba2"}
2022-02-24T19:19:57.404Z INFO controller.provisioning Launched instance: i-09dd8f13622e427a0, hostname: ip-10-142-13-67.us-east-2.compute.internal, type: m5.4xlarge, zone: us-east-2a, capacityType: on-demand {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:57.423Z INFO controller.provisioning Bound 5 pod(s) to node ip-10-142-13-67.us-east-2.compute.internal {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:57.424Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:58.424Z INFO controller.provisioning Batched 4 pods in 1.000289324s {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:19:58.426Z INFO controller.provisioning Waiting for unschedulable pods {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:20:07.562Z INFO controller.termination Deleted node {"commit": "fd19ba2", "node": "ip-10-142-15-14.us-east-2.compute.internal"}
2022-02-24T19:20:28.478Z INFO controller.termination Deleted node {"commit": "fd19ba2", "node": "ip-10-142-15-174.us-east-2.compute.internal"}
2022-02-24T19:20:28.557Z DEBUG controller.provisioning Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)] {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:24:22.261Z DEBUG controller.provisioning Discovered 318 EC2 instance types {"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:24:22.329Z DEBUG controller.provisioning Discovered 318 EC2 instance types {"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:24:22.355Z DEBUG controller.provisioning Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)] {"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:24:22.366Z DEBUG controller.provisioning Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)] {"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:24:22.389Z DEBUG controller.provisioning Discovered 318 EC2 instance types {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:24:22.475Z DEBUG controller.provisioning Discovered EC2 instance types zonal offerings {"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:24:22.481Z DEBUG controller.provisioning Discovered EC2 instance types zonal offerings {"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:24:22.494Z DEBUG controller.provisioning Discovered EC2 instance types zonal offerings {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:29:23.200Z DEBUG controller.provisioning Discovered 318 EC2 instance types {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:29:23.266Z DEBUG controller.provisioning Discovered 318 EC2 instance types {"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:29:23.276Z DEBUG controller.provisioning Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)] {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:29:23.286Z DEBUG controller.provisioning Discovered 318 EC2 instance types {"commit": "fd19ba2", "provisioner": "prometheus"}
2022-02-24T19:29:23.301Z DEBUG controller.provisioning Discovered subnets: [subnet-0aaf36297918baef7 (us-east-2c) subnet-0f98d8cd6c06030c0 (us-east-2b) subnet-0ea1719ab3a6416c5 (us-east-2a)] {"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:29:23.368Z DEBUG controller.provisioning Discovered EC2 instance types zonal offerings {"commit": "fd19ba2", "provisioner": "default"}
2022-02-24T19:29:23.393Z DEBUG controller.provisioning Discovered EC2 instance types zonal offerings {"commit": "fd19ba2", "provisioner": "zk-regional"}
2022-02-24T19:29:23.420Z DEBUG controller.provisioning Discovered EC2 instance types zonal offerings {"commit": "fd19ba2", "provisioner": "prometheus"}
### This happens after I manually delete `istio-ingressgateway-db9cf4489-tjfj2`
2022-02-24T19:30:05.305Z DEBUG controller.eviction Evicted pod kube-system/coredns-58c7b8dcf7-msjts {"commit": "fd19ba2"}
2022-02-24T19:30:06.538Z DEBUG controller.eviction Evicted pod istio-system/istio-ingressgateway-db9cf4489-tjfj2 {"commit": "fd19ba2"}
2022-02-24T19:30:15.672Z INFO controller.termination Deleted node {"commit": "fd19ba2", "node": "ip-10-142-15-25.us-east-2.compute.internal"}
Not sure if this is related to https://github.com/aws/karpenter/issues/1166
About this issue
- State: closed
- Created 2 years ago
- Reactions: 2
- Comments: 21 (14 by maintainers)
Closing this as the issue looks solved. Feel free to reopen if you still see this!
It seems there is a bug in the eviction API where errors returned due to duplicate PDBs exclude the “reason” field.
https://github.com/kubernetes/kubernetes/blob/v1.21.5/pkg/registry/core/pod/storage/eviction.go#L194-L198
This is important as the method we use to determine the type of error relies on the “reason” field being populated.
https://github.com/aws/karpenter/blob/main/pkg/controllers/termination/eviction.go#L94-L96
https://github.com/kubernetes/apimachinery/blob/master/pkg/api/errors/errors.go#L711
We are planning to implement a fix in Karpenter which works around this bug (via #1432), but will also pursue a fix upstream.
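To make the failure mode described above concrete, here is a minimal Python sketch. It is a simplified model, not the real Kubernetes or Karpenter code: the `Status` class, both helper functions, and the message text are hypothetical stand-ins, and the fallback on the HTTP code is only one possible workaround, not necessarily what #1432 does. It shows how a check that looks only at the “reason” field misses an error that carries a 500 code and no reason.

```python
from dataclasses import dataclass

# Reason strings as used by the Kubernetes API for these error classes.
REASON_INTERNAL_ERROR = "InternalError"
REASON_TOO_MANY_REQUESTS = "TooManyRequests"


@dataclass(frozen=True)
class Status:
    """Hypothetical, simplified stand-in for an API error status."""
    code: int
    message: str
    reason: str = ""  # empty models the missing "reason" field


def is_internal_error_by_reason(status: Status) -> bool:
    """Classify using only the reason field (the fragile approach)."""
    return status.reason == REASON_INTERNAL_ERROR


def is_internal_error_with_fallback(status: Status) -> bool:
    """Classify by reason, falling back to the HTTP code when reason is empty."""
    if status.reason:
        return status.reason == REASON_INTERNAL_ERROR
    return status.code == 500


# Illustrative error for a pod matched by more than one PDB: code 500, no reason.
duplicate_pdb = Status(code=500, message="pod has more than one PodDisruptionBudget")

if __name__ == "__main__":
    print("by reason only:", is_internal_error_by_reason(duplicate_pdb))
    print("with fallback: ", is_internal_error_with_fallback(duplicate_pdb))
```

With the reason-only check the error goes unrecognized, which matches the symptom in this issue: the eviction is retried indefinitely and nothing is logged.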
Thanks for the info, @nandiheath. It does indeed seem there is something going on with the PDB, but as you suggested, more logging around pod eviction would help in determining the root cause.
I’m working on a fix which will provide additional logging. Perhaps we can continue troubleshooting once the fix has been released.
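While troubleshooting the PDB angle, it can help to check whether any pod is selected by more than one PodDisruptionBudget. The Python sketch below works on plain dictionaries rather than live cluster objects. The function names, PDB names, and labels are all hypothetical; only `matchLabels`-style selectors are handled, and empty selectors are skipped to keep the example simple.

```python
from typing import Dict, List

Labels = Dict[str, str]


def selector_matches(match_labels: Labels, pod_labels: Labels) -> bool:
    """True if every key/value in the selector is present on the pod."""
    return all(pod_labels.get(k) == v for k, v in match_labels.items())


def pods_with_multiple_pdbs(
    pdbs: Dict[str, Labels], pods: Dict[str, Labels]
) -> Dict[str, List[str]]:
    """Map each pod name to the PDB names selecting it, if more than one does."""
    result: Dict[str, List[str]] = {}
    for pod_name, pod_labels in pods.items():
        matching = sorted(
            pdb_name
            for pdb_name, match_labels in pdbs.items()
            # Empty selectors are skipped in this sketch.
            if match_labels and selector_matches(match_labels, pod_labels)
        )
        if len(matching) > 1:
            result[pod_name] = matching
    return result


if __name__ == "__main__":
    # Hypothetical PDB selectors and pod labels, for illustration only.
    pdbs = {
        "ingressgateway-pdb": {"app": "istio-ingressgateway"},
        "istio-all-pdb": {"release": "istio"},
    }
    pods = {
        "istio-ingressgateway-db9cf4489-tjfj2": {
            "app": "istio-ingressgateway",
            "release": "istio",
        },
        "istiod-6485d7d6f6-6p24z": {"app": "istiod", "release": "istio"},
    }
    for pod, names in pods_with_multiple_pdbs(pdbs, pods).items():
        print(f"{pod} is selected by multiple PDBs: {', '.join(names)}")
```

In a real cluster, the inputs would come from the PDB and pod objects in the same namespace, since a PDB only applies to pods in its own namespace.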