karpenter-provider-aws: very slow Karpenter consolidation
Version
Karpenter Version: v0.27.2
Kubernetes Version: v1.25.0
Expected Behavior
- We have peak hours where hundreds of nodes are being provisioned.
- In a matter of hours, most of the workload (deployments) is scaled down or deleted.
- We expect a rapid decrease in node count via consolidation.
Actual Behavior
Consolidation is very slow. When we restart the Karpenter pods, Karpenter immediately starts working as expected.
We first noticed this issue on April 17th. Not understanding what was going on, we restarted Karpenter at around 7am and then saw the node count decrease. While investigating, we came across https://github.com/aws/karpenter/issues/3576, which suggested upgrading to 0.27.2 to avoid this issue, so we did. April 18th looked fantastic: Karpenter started consolidating at 4am, and by the time we started working at 7am the node count was where we expected it, so we declared the issue resolved. On April 19th and again today (April 20th), we woke up to many excess nodes; once we restarted Karpenter, consolidation started working again.
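For clarity, "restarting Karpenter" here just means a rollout restart of the controller deployment (assuming the default Helm install into the karpenter namespace):

kubectl rollout restart deployment/karpenter -n karpenter
kubectl rollout status deployment/karpenter -n karpenter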
All of this can be seen in our node-count graphs. Compared to the overall number of pods in the cluster, the peak starts at 1:30am (UTC+3), with about 15k pods running. By 5am we are down to fewer than 5k pods, which means that by 5:30am we should have far fewer nodes in the cluster.
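The graphs come from our monitoring, but the same trend can be spot-checked with kubectl, for example (using the scan-pipeline-node label applied by the provisioner spec below):

kubectl get nodes -l scan-pipeline-node=true --no-headers | wc -l
kubectl get pods --all-namespaces --field-selector=status.phase=Running --no-headers | wc -l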
Steps to Reproduce the Problem
I have the full log of the Karpenter controller from before we restarted the deployment. While the controller.deprovisioning loop does appear to be doing some work, it is very slow to initiate the process, and we would expect many more nodes to already be gone by this point:
2023-04-20T04:42:37.773Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 8 machines ip-192-168-1-180.us-east-2.compute.internal/c5a.xlarge/spot, ip-192-168-1-48.us-east-2.compute.internal/c5a.xlarge/spot, ip-192-168-1-240.us-east-2.compute.internal/c5a.xlarge/spot, ip-192-168-2-74.us-east-2.compute.internal/c6a.2xlarge/spot, ip-192-168-1-74.us-east-2.compute.internal/c5a.xlarge/spot, ip-192-168-1-39.us-east-2.compute.internal/c5a.xlarge/spot, ip-192-168-1-173.us-east-2.compute.internal/c5a.xlarge/spot, ip-192-168-1-69.us-east-2.compute.internal/c5a.xlarge/spot {"commit": "d01ea11-dirty"}
2023-04-20T04:42:37.785Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-180.us-east-2.compute.internal"}
2023-04-20T04:42:37.795Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-48.us-east-2.compute.internal"}
2023-04-20T04:42:37.804Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-240.us-east-2.compute.internal"}
2023-04-20T04:42:37.834Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-192-168-2-74.us-east-2.compute.internal"}
2023-04-20T04:42:37.874Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-74.us-east-2.compute.internal"}
2023-04-20T04:42:37.914Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-39.us-east-2.compute.internal"}
2023-04-20T04:42:37.922Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-173.us-east-2.compute.internal"}
2023-04-20T04:42:37.968Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-69.us-east-2.compute.internal"}
2023-04-20T04:42:38.599Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-69.us-east-2.compute.internal"}
2023-04-20T04:42:38.602Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-173.us-east-2.compute.internal"}
2023-04-20T04:42:38.606Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-192-168-2-74.us-east-2.compute.internal"}
2023-04-20T04:42:38.608Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-180.us-east-2.compute.internal"}
2023-04-20T04:42:38.611Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-48.us-east-2.compute.internal"}
2023-04-20T04:42:44.040Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-39.us-east-2.compute.internal"}
2023-04-20T04:42:44.371Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-240.us-east-2.compute.internal"}
2023-04-20T04:42:50.385Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-74.us-east-2.compute.internal"}
2023-04-20T04:43:07.684Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 3 machines ip-192-168-1-132.us-east-2.compute.internal/c6a.2xlarge/spot, ip-192-168-1-19.us-east-2.compute.internal/c5a.xlarge/spot, ip-192-168-1-30.us-east-2.compute.internal/c5a.xlarge/spot {"commit": "d01ea11-dirty"}
2023-04-20T04:43:07.696Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-132.us-east-2.compute.internal"}
2023-04-20T04:43:07.707Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-19.us-east-2.compute.internal"}
2023-04-20T04:43:07.718Z INFO controller.termination cordoned node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-30.us-east-2.compute.internal"}
2023-04-20T04:43:08.099Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-19.us-east-2.compute.internal"}
2023-04-20T04:43:08.101Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-132.us-east-2.compute.internal"}
2023-04-20T04:43:20.260Z INFO controller.termination deleted node {"commit": "d01ea11-dirty", "node": "ip-192-168-1-30.us-east-2.compute.internal"}
2023-04-20T04:43:39.233Z INFO controller.deprovisioning deprovisioning via consolidation delete, terminating 1 machines ip-192-168-1-239.us-east-2.compute.internal/c5a.xlarge/spot {"commit": "d01ea11-dirty"}
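For reference, we collect these logs roughly like this (assuming the default Helm install, i.e. a karpenter deployment in the karpenter namespace with a container named controller):

kubectl logs -n karpenter deployment/karpenter -c controller --since=6h > karpenter-controller.log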
Resource Specs and Logs
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"karpenter.sh/v1alpha5","kind":"Provisioner","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"prod"},"name":"scan-pipeline-node-provisioner"},"spec":{"consolidation":{"enabled":true},"kubeletConfiguration":{"maxPods":110},"labels":{"scan-pipeline-node":"true"},"limits":{"resources":{"cpu":"2400","memory":"4800Gi"}},"providerRef":{"name":"deafult-nodes-template"},"requirements":[{"key":"karpenter.sh/capacity-type","operator":"In","values":["spot"]},{"key":"node.kubernetes.io/instance-type","operator":"In","values":["c6a.2xlarge","c5a.2xlarge","c5.2xlarge","c6a.xlarge","c5a.xlarge","c5.xlarge"]},{"key":"topology.kubernetes.io/zone","operator":"In","values":["us-east-2a","us-east-2b"]},{"key":"kubernetes.io/arch","operator":"In","values":["arm64","amd64"]},{"key":"kubernetes.io/os","operator":"In","values":["linux"]}],"taints":[{"effect":"NoSchedule","key":"scan-pipeline-node"}]}}
  creationTimestamp: "2023-01-08T08:57:34Z"
  generation: 17
  labels:
    app.kubernetes.io/instance: prod
  name: scan-pipeline-node-provisioner
  resourceVersion: "188490435"
  uid: fce0308f-5ef9-4e6a-8885-a547e7b3ce83
spec:
  consolidation:
    enabled: true
  kubeletConfiguration:
    maxPods: 110
  labels:
    scan-pipeline-node: "true"
  limits:
    resources:
      cpu: "2400"
      memory: 4800Gi
  providerRef:
    name: deafult-nodes-template
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - spot
  - key: node.kubernetes.io/instance-type
    operator: In
    values:
    - c6a.2xlarge
    - c5a.2xlarge
    - c5.2xlarge
    - c6a.xlarge
    - c5a.xlarge
    - c5.xlarge
  - key: topology.kubernetes.io/zone
    operator: In
    values:
    - us-east-2a
    - us-east-2b
  - key: kubernetes.io/arch
    operator: In
    values:
    - arm64
    - amd64
  - key: kubernetes.io/os
    operator: In
    values:
    - linux
  taints:
  - effect: NoSchedule
    key: scan-pipeline-node
status:
  resources:
    attachable-volumes-aws-ebs: "64"
    cpu: "12"
    ephemeral-storage: 123808920Ki
    memory: 24015712Ki
    pods: "220"
---
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"karpenter.k8s.aws/v1alpha1","kind":"AWSNodeTemplate","metadata":{"annotations":{},"labels":{"app.kubernetes.io/instance":"prod"},"name":"deafult-nodes-template"},"spec":{"amiFamily":"Bottlerocket","blockDeviceMappings":[{"deviceName":"/dev/xvdb","ebs":{"deleteOnTermination":true,"encrypted":true,"volumeSize":"60Gi","volumeType":"gp3"}}],"instanceProfile":"eks-58c2b634-62f4-55dc-8638-f888aff9c840","securityGroupSelector":{"karpenter.sh/discovery":"prod"},"subnetSelector":{"karpenter.sh/discovery":"prod"},"userData":"[settings]\n[settings.kubernetes]\nallowed-unsafe-sysctls = [\"net.core.somaxconn\"]\nregistry-qps = 20\n"}}
  creationTimestamp: "2023-04-03T04:29:38Z"
  generation: 5
  labels:
    app.kubernetes.io/instance: prod
  name: deafult-nodes-template
  resourceVersion: "188516647"
  uid: f67f682c-0663-4fa2-b04f-4785fe2df2a3
spec:
  amiFamily: Bottlerocket
  blockDeviceMappings:
  - deviceName: /dev/xvdb
    ebs:
      deleteOnTermination: true
      encrypted: true
      volumeSize: 60Gi
      volumeType: gp3
  instanceProfile: eks-58c2b634-62f4-55dc-8638-f888aff9c840
  securityGroupSelector:
    karpenter.sh/discovery: prod
  subnetSelector:
    karpenter.sh/discovery: prod
  userData: |
    [settings]
    [settings.kubernetes]
    allowed-unsafe-sysctls = ["net.core.somaxconn"]
    registry-qps = 20
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave “+1” or “me too” comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
About this issue
- State: closed
- Created a year ago
- Reactions: 3
- Comments: 20 (9 by maintainers)
Our fix for not ready nodes is here: https://github.com/aws/karpenter-core/issues/750
@jonathan-innis @billrayburn, let’s discuss priority on this next week.
@tzneal It looks like consolidation slows down when even a single node is NotReady. Could such nodes be “ignored” while simulating scheduling? I understand the reasoning behind not ignoring them, since they can become Ready at any moment, but maybe some sort of “Ready timeout” mechanism could avoid the consolidation delay? Thanks!
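In the meantime, a rough way to spot such nodes on our side (assuming Karpenter’s standard karpenter.sh/provisioner-name node label):

# nodes created by this provisioner that are not Ready
kubectl get nodes -l karpenter.sh/provisioner-name=scan-pipeline-node-provisioner | grep NotReady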
(other than that, I think we can close this for now. thanks)
When installing our Helm charts we apply a concurrency limit of sorts, which means that for some time there should be no node consolidation, since the cluster is at our configured max capacity. When the pod count starts to drop (which means more Helm charts are being uninstalled than installed), we expect the node count to start dropping as well.
Yes, and those nodes should not be consolidated until the blocking pods are deleted.
We see some delay this morning as well. Again, the pod count starts to drop at 3:50am (UTC+3), but the node count only starts to drop at 4:31am.
I’m uploading the full controller logs to the support ticket.
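To quantify the delay from the raw logs rather than the dashboards, grepping the controller log for the first consolidation decision of the morning is enough (karpenter-controller.log is the hypothetical file name from the collection command shown earlier):

grep "deprovisioning via consolidation" karpenter-controller.log | head -n 5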
@jonathan-innis I have installed the requested Karpenter version with requests but without limits, since I’m afraid it would be OOMKilled and restarted during peak hours. Logs will be attached when we hit this issue again, hopefully tomorrow morning. Thanks
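For completeness, the override looks roughly like this in our Helm values (the controller.resources path follows the karpenter chart; the request sizes here are placeholders, not our exact numbers):

controller:
  resources:
    requests:
      cpu: "1"       # placeholder value
      memory: 2Gi    # placeholder value
    # limits deliberately omitted so the controller cannot be OOMKilled
    # against its own memory limit during peak hours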
Thanks @tzneal, I have created a ticket (12553960191); the full log was attached there.