karpenter: DaemonSets not being correctly calculated when choosing a node

Version

Karpenter Version: v0.24.0

Kubernetes Version: v1.21.0

Context

Because certain DaemonSets consume significant resources, particularly when running on larger machines, we split these DaemonSets using affinity rules based on Karpenter's labels, such as karpenter.k8s.aws/instance-cpu or karpenter.k8s.aws/instance-size.

Expected Behavior

When selecting a node to provision, Karpenter should only count the DaemonSets that will actually schedule onto that node.

Actual Behavior

Karpenter appears to include all of the split DaemonSets instead of only the one that would actually run on the node, which can result in poor instance selection when provisioning new nodes and in inaccurate consolidation decisions.

Steps to Reproduce the Problem

  • Create a fresh cluster with Karpenter deployed and a default provisioner (the AWSNodeTemplate referenced by providerRef is sketched after these steps):
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:
    name: default
  requirements:
    - key: "node.kubernetes.io/instance-type"
      operator: In
      values: ["c6i"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]
  consolidation:
    enabled: true
  • Duplicate one of your DaemonSets and split the copies between small and large machines using the following settings:
# DaemonSet variant for small machines (instance-cpu < 31)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: karpenter.k8s.aws/instance-cpu
          operator: Lt
          values:
          - "31"

resources:
  requests:
    cpu: 1

# DaemonSet variant for large machines (instance-cpu > 30)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: karpenter.k8s.aws/instance-cpu
          operator: Gt
          values:
          - "30"

resources:
  requests:
    cpu: 10
  • Create a simple Pod with a 1 CPU request. Karpenter should provision a 2- or at most 4-vCPU instance, but instead provisions a large (>10 vCPU) machine because it wrongly includes the larger DaemonSet when evaluating the 2/4/8 vCPU instance types.

  • The same behavior occurs when splitting on karpenter.k8s.aws/instance-size, or even when using podAntiAffinity rules in the DaemonSet affinities.
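
For completeness, the providerRef in the Provisioner above points at an AWSNodeTemplate named default that was not included in the report. A minimal sketch, assuming the usual karpenter.sh/discovery tag-based subnet and security-group discovery with a placeholder cluster name:

apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: <cluster-name>   # placeholder; use your cluster's discovery tag value
  securityGroupSelector:
    karpenter.sh/discovery: <cluster-name>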

Thank you for your help in addressing this issue.

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave “+1” or “me too” comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Most upvoted comments

+1 here.

r5.4xlarge should have enough capacity; is there a way to solve this issue?

2023-11-17T07:24:17.807Z ERROR controller.provisioner Could not schedule pod, incompatible with provisioner "flink", daemonset overhead={"cpu":"455m","memory":"459383552","pods":"7"}, no instance type satisfied resources {"cpu":"11955m","memory":"126087176960","pods":"8"}

Karpenter Version: v0.29.2

+1 here. I have a set of cache nodes, created as a StatefulSet with a memory request of 62000m, a node selector purpose=cache, and a provisioner restricted to the x2gd.xlarge instance type. Here is a message from the log: incompatible with provisioner "cache", daemonset overhead={"cpu":"200m","memory":"128Mi","pods":"5"}, no instance type satisfied resources {"cpu":"200m","memory":"60128Mi","pods":"6"} and requirements karpenter.sh/capacity-type In [on-demand], karpenter.sh/provisioner-name In [cache], kubernetes.io/arch In [amd64 arm64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [x2gd.xlarge], purpose In [cache] (no instance type which had enough resources and the required offering met the scheduling requirements);

The x2gd.xlarge type has 64 GiB of memory, so it should fit. Moreover, cluster-autoscaler, which I migrated this cluster from, handles this case fine. Karpenter only created a node after I decreased the memory request to 50Gi.

If you set a required node affinity to not run on nodes with a label, and the provisioner is configured to apply that label to all nodes it launches, then we shouldn’t consider that daemonset for that provisioner.
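
For concreteness, a minimal sketch of that case, using the node_type: NODE_Y label and DS_X DaemonSet from the exchange below (the provisioner name and the exact manifest layout are placeholders, not from the original thread):

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: node-y
spec:
  providerRef:
    name: default
  labels:
    node_type: NODE_Y   # applied to every node this provisioner launches

# DS_X pod template fragment: required node affinity excluding those nodes
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: node_type
          operator: NotIn
          values:
          - NODE_Y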

Just to clarify:

Assuming we want to exclude a daemon set named DS_X from nodes with label node_type: NODE_Y

Setting the node_type: NODE_Y label on the provisioner and setting the .spec.template.spec.nodeSelector field on DaemonSet DS_X to match all node_type labels except node_type: NODE_Y would work as expected, because the label is known prior to adding the node?

But setting the node_type: NODE_Y label on the provisioner and setting .spec.template.spec.affinity with a match expression to not run on nodes with the node_type: NODE_Y label won’t work, because it’s not known prior to adding the node?

I am asking because we are experiencing this behaviour, but the two cases seem to me to be pretty much the same (and if I understand what you wrote correctly, it should be supported).

Yes, it works with taints/tolerations and labels on the provisioner. It doesn’t work for labels that need to be discovered from instance types that the provisioner might potentially launch.

Hey @kfirsch, I tracked down the issue. This is something that’s not currently supported by the scheduling code: the scheduling logic calculates the resource requirements of daemonset pods differently from non-daemonset pods.

Karpenter optimistically includes all daemonsets that are compatible with a Provisioner’s requirements during bin-packing. In this case, that means Karpenter assumes the daemonset overhead for every instance type allowed by this Provisioner will be at least 11 vCPU (1 vCPU from the small-machine DaemonSet plus 10 vCPU from the large-machine one), so a pod requesting 1 CPU looks like it needs roughly 12 vCPU in total. Karpenter thinks there is more overhead than there actually is on each instance type, which is why it tends to pick larger instance types.

Fixing this would require non-trivial changes to the scheduling logic, but it definitely is a bug.

In the meantime, if you’re able to use a separate provisioner for each of these daemonsets so that bin-packing only considers one of them at a time, that should solve your issue. One way that might look is sketched below.
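
A minimal sketch of that workaround, assuming the split stays CPU-based and that the provisioner accepts the same Gt/Lt operators used in the original affinities: two provisioners, each restricted with a karpenter.k8s.aws/instance-cpu requirement and each stamping a node-size label (the label key, values, and provisioner names are placeholders, not from the original thread), with the split DaemonSets switched from instance-cpu affinities to a nodeSelector on that label.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: small-nodes
spec:
  providerRef:
    name: default
  labels:
    node-size: small            # known before the node is launched
  requirements:
    - key: "karpenter.k8s.aws/instance-cpu"
      operator: Lt
      values: ["31"]
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: large-nodes
spec:
  providerRef:
    name: default
  labels:
    node-size: large
  requirements:
    - key: "karpenter.k8s.aws/instance-cpu"
      operator: Gt
      values: ["30"]

# In each split DaemonSet, replace the instance-cpu affinity with, for example:
nodeSelector:
  node-size: small              # or "large" for the big-machine variant

Because the node-size label comes from the provisioner rather than from the instance type, it is known before the node exists, matching the behaviour described above for taints/tolerations and provisioner labels.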