karpenter: DaemonSets not being correctly calculated when choosing a node
Version
Karpenter Version: v0.24.0
Kubernetes Version: v1.21.0
Context
Due to the significant resource usage of certain DaemonSets, particularly when running on larger machines, we have chosen to split these DaemonSets using affinity rules based on Karpenter's labels, such as `karpenter.k8s.aws/instance-cpu` or `karpenter.k8s.aws/instance-size`.
Expected Behavior
When selecting a node to provision, Karpenter should only count the DaemonSets that will actually schedule onto that node.
Actual Behavior
It appears that Karpenter wrongly includes all of the split DaemonSets instead of only the applicable one, which can result in poor instance selection when provisioning new nodes and in inaccurate consolidation actions.
Steps to Reproduce the Problem
- Create a fresh cluster with Karpenter deployed and a default provisioner:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  providerRef:
    name: default
  requirements:
    # "c6i" is an instance family, not a full instance type, so the
    # karpenter.k8s.aws/instance-family key is the one that matches it
    - key: "karpenter.k8s.aws/instance-family"
      operator: In
      values: ["c6i"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]
  consolidation:
    enabled: true
```
- Duplicate one of your DaemonSets and split the two copies between small and large machines using the following settings:

Small-machine copy:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.k8s.aws/instance-cpu
              operator: Lt
              values:
                - "31"
resources:
  requests:
    cpu: 1
```

Large-machine copy:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: karpenter.k8s.aws/instance-cpu
              operator: Gt
              values:
                - "30"
resources:
  requests:
    cpu: 10
```
- Create a simple Pod with a 1 CPU request (an example spec follows this list). Karpenter should provision an instance with 2 or at most 4 vCPUs, but it will instead provision a large (>10 vCPU) machine because it wrongly includes the bigger DaemonSet when evaluating the 2/4/8 vCPU instance types.
- The same behavior occurs when using `karpenter.k8s.aws/instance-size` or even `podAntiAffinity` rules in the DaemonSet affinities.
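A minimal sketch of the test Pod described in the step above; the name, image, and command are illustrative placeholders, and only the 1 CPU request comes from the report:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: repro-pod  # hypothetical name; any unschedulable pod triggers provisioning
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/busybox:latest  # placeholder image
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: 1  # the single 1-CPU request from the reproduction step
```

With the DaemonSets above, the expected pick for this Pod would be a c6i.large (2 vCPU) or c6i.xlarge (4 vCPU); instead Karpenter sizes the node as if the 10-CPU DaemonSet would also land on it.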
Thank you for your help in addressing this issue.
Community Note
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave “+1” or “me too” comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
+1 here. r5.4xlarge should have enough capacity; is there a way to solve this issue?

```
2023-11-17T07:24:17.807Z ERROR controller.provisioner Could not schedule pod, incompatible with provisioner "flink", daemonset overhead={"cpu":"455m","memory":"459383552","pods":"7"}, no instance type satisfied resources {"cpu":"11955m","memory":"126087176960","pods":"8"}
```

Karpenter Version: v0.29.2
+1 here. I have a set of cache nodes, created as a StatefulSet with a requirement of `62000m` of memory and a node selector `purpose=cache`, and a provisioner with instance type equal to `x2gd.xlarge`. Here is a message from the log:

```
incompatible with provisioner "cache", daemonset overhead={"cpu":"200m","memory":"128Mi","pods":"5"}, no instance type satisfied resources {"cpu":"200m","memory":"60128Mi","pods":"6"} and requirements karpenter.sh/capacity-type In [on-demand], karpenter.sh/provisioner-name In [cache], kubernetes.io/arch In [amd64 arm64], kubernetes.io/os In [linux], node.kubernetes.io/instance-type In [x2gd.xlarge], purpose In [cache] (no instance type which had enough resources and the required offering met the scheduling requirements)
```

The `x2gd.xlarge` type has 64 GiB of memory, so it should satisfy the request. Moreover, cluster-autoscaler, which I migrated the cluster from, works well in that case. Karpenter created a node only when I decreased the memory request to `50Gi`.

If you set a required node affinity to not run on nodes with a label, and the provisioner is configured to apply that label to all nodes it launches, then we shouldn't consider that daemonset for that provisioner.
Just to clarify, assuming we want to exclude a DaemonSet named `DS_X` from nodes with the label `node_type: NODE_Y`:

- Setting the `node_type: NODE_Y` label on the provisioner and setting the `.spec.template.spec.nodeSelector` field on DaemonSet `DS_X` to match all `node_type` values except `NODE_Y` would work as expected, because the label is known prior to adding the node?
- But setting the `node_type: NODE_Y` label on the provisioner and setting `.spec.template.spec.affinity` with a match expression to not run on nodes with the `node_type: NODE_Y` label won't work, because it's not known prior to adding the node?

I am asking because we are experiencing this behaviour, but the two cases seem to me to be pretty much the same (and if I understand what you wrote correctly, it should be supported).
Yes, it works with taints/tolerations and labels on the provisioner. It doesn’t work for labels that need to be discovered from instance types that the provisioner might potentially launch.
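A minimal sketch of the pattern that does work, reusing the hypothetical `DS_X`/`node_type` names from the question above (none of this YAML is from the thread itself): the provisioner applies the label statically, so its value is known before any node exists, and the DaemonSet's `nodeSelector` pins it to a different, equally static value.

```yaml
# Provisioner that statically applies a label to every node it launches.
# Because the label lives on the provisioner itself, Karpenter knows its
# value before the node is created.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: node-y  # hypothetical name
spec:
  providerRef:
    name: default
  labels:
    node_type: NODE_Y
---
# DaemonSet DS_X opts out by selecting a different value. A nodeSelector
# only matches exact values, so "all node_type values except NODE_Y" is
# expressed by pinning DS_X to the other value(s) explicitly.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ds-x  # hypothetical name
spec:
  selector:
    matchLabels:
      app: ds-x
  template:
    metadata:
      labels:
        app: ds-x
    spec:
      nodeSelector:
        node_type: NODE_Z  # any statically known value other than NODE_Y
      containers:
        - name: ds-x
          image: public.ecr.aws/docker/library/busybox:latest  # placeholder
          command: ["sleep", "infinity"]
```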
Hey @kfirsch, tracked down the issue. So this is something that's not currently supported by the scheduling code. The scheduling logic calculates the resource requirements of non-DaemonSet pods differently than it does for DaemonSets.
Karpenter optimistically includes all DaemonSets that are compatible with a Provisioner's requirements during bin-packing. This means Karpenter thinks the DaemonSet overhead for every instance type allowed by this Provisioner will be at least 11 vCPU; in this case, it assumes more overhead than actually exists for each instance type, which is why it tends to pick larger instance types.
To fix this would require a non-trivial amount of code changes to the scheduling logic, but it definitely is a bug.
In the meantime, if you're able to use multiple provisioners so that each of these DaemonSets is compatible with only one of them, ensuring that bin-packing considers only one at a time, that should solve your issue. A sketch of that setup follows.
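A minimal sketch of that workaround under the setup from the reproduction steps (the provisioner names are illustrative, and the original provisioner's other requirements are omitted for brevity): each provisioner's `instance-cpu` requirement is incompatible with one of the two DaemonSets' node affinities, so each bin-packing pass counts only the DaemonSet that can actually land on the nodes it launches.

```yaml
# Provisioner for small instances; only the small DaemonSet copy
# (instance-cpu < 31) is compatible with its requirements.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: small  # hypothetical name
spec:
  providerRef:
    name: default
  requirements:
    - key: karpenter.k8s.aws/instance-cpu
      operator: Lt
      values: ["31"]
---
# Provisioner for large instances; only the large DaemonSet copy
# (instance-cpu > 30) is compatible with its requirements.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: large  # hypothetical name
spec:
  providerRef:
    name: default
  requirements:
    - key: karpenter.k8s.aws/instance-cpu
      operator: Gt
      values: ["30"]
```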