karpenter-provider-aws: Karpenter is not respecting per-node Daemonsets
Version
Karpenter: v0.7.2
Kubernetes: v1.21.5
Context
We run several different daemonsets on a per-node basis: metrics, logging, EBS CSI, secrets-store CSI. These need to be present on every node, as they provide their functionality to every pod on that node.
(This could be a configuration / unset flag issue, looking for more information)
Expected Behavior
When choosing an instance type to provision for pending pods, Karpenter should take into account any Daemonsets that will be running on the node, not just the pending service pods that it will schedule there.
Actual Behavior
This is most noticeable in a brand new cluster, but has also been seen with mature clusters:
When Karpenter brings up a node, it will correctly calculate the resources required to support the new service pod / replica. The aws-node and kube-proxy pods will be started, and then the service pod.
When using larger metrics / logging / CSI pods with requests of e.g. 1Gb RAM / 0.5-1 CPU each, these pods will be perpetually stuck in a Pending state and will never start, as there isn't enough room on the node for them.
This was most noticeable when creating a new cluster where the aws-load-balancer-controller was deployed, which only requires 0.05 CPU. Therefore, even with 3 replicas, Karpenter spun up a t3a.small instance to support these.
Even when adding more replicas (tested with 25 replicas), it continued to spin up t3a.small instances, presumably because they were the cheapest option, leaving all the daemonset pods in a Pending state, apart from one node where there was only one aws-load-balancer-controller pod; in that case one of the daemonset pods started and the rest were stuck in Pending.
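To make the mismatch concrete, here is a rough back-of-the-envelope using the figures reported above (the daemonset requests are the approximate ones mentioned earlier, and a t3a.small offers 2 vCPU / 2 GiB of memory):

```
service pods:       3 x 0.05 CPU                     ≈ 0.15 CPU, minimal memory
daemonset overhead: ~4 pods x (0.5-1 CPU, ~1Gb RAM)  ≈ 2-4 CPU, ~4Gb RAM
t3a.small:          2 vCPU, 2 GiB (before kube-reserved / system-reserved)
```

An instance sized only for the service pods therefore cannot also hold the per-node daemonsets.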
I believe this is due to how Karpenter is scheduling the pods on the node (something about node-binding in the docs?):
- As aws-node and kube-proxy are in the system-node-critical priority class, they are always scheduled first
- Potentially Karpenter is then scheduling the service pod next
- The other daemonsets, some with a much higher priority class, are not scheduled until after the service pod and therefore get stuck in a Pending state if there is not enough room for them (see the fragment below for how priority classes come into play)
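For context on the ordering above, Kubernetes ships two built-in priority classes, and a workload opts into one via the pod spec. A minimal pod-spec fragment for illustration (the values shown are the standard Kubernetes defaults, not anything specific to this cluster):

```yaml
# Built-in priority classes and their default values:
#   system-node-critical:    2000001000  (used by aws-node, kube-proxy)
#   system-cluster-critical: 2000000000
# Pods with no priorityClassName get priority 0, unless a PriorityClass
# with globalDefault: true exists in the cluster.
spec:
  template:
    spec:
      priorityClassName: system-node-critical
```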
Steps to Reproduce the Problem
- Create a fresh cluster with Karpenter deployed and a default provisioner
- Create n daemonsets with a highish resource consumption that will run on every node
- Create a service deployment for a service with very low resource consumption, using the node selector for a Karpenter provisioner (example manifests for both are sketched below)
- Karpenter should select an instance type suitable for the service pods, but not able to support the daemonset(s)
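A minimal sketch of the daemonset and deployment from the steps above, assuming the default provisioner shown in the next section (all names, images, and request values are illustrative placeholders):

```yaml
# Hypothetical "highish" per-node daemonset.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: heavy-agent                     # placeholder name
spec:
  selector:
    matchLabels: {app: heavy-agent}
  template:
    metadata:
      labels: {app: heavy-agent}
    spec:
      containers:
        - name: agent
          image: registry.example.com/agent:latest   # placeholder image
          resources:
            requests: {cpu: 500m, memory: 1Gi}
---
# Hypothetical low-request service, targeting the karpenter-default
# provisioner via the env label it applies to its nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tiny-service                    # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels: {app: tiny-service}
  template:
    metadata:
      labels: {app: tiny-service}
    spec:
      nodeSelector:
        env: karpenter-default          # matches the provisioner's spec.labels
      containers:
        - name: app
          image: registry.example.com/tiny-service:latest   # placeholder image
          resources:
            requests: {cpu: 50m, memory: 64Mi}
```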
Resource Specs and Logs
Default Provisioner
```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  annotations:
    meta.helm.sh/release-name: karpenter-default-provisioner-chart
    meta.helm.sh/release-namespace: default
  labels:
    app.kubernetes.io/managed-by: Helm
  name: karpenter-default
spec:
  labels:
    env: karpenter-default
  provider:
    apiVersion: extensions.karpenter.sh/v1alpha1
    kind: AWS
    launchTemplate: <launch_template>
    subnetSelector:
      Service: Private
      kubernetes.io/cluster/<cluster_name>: '*'
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values:
        - on-demand
        - spot
    - key: kubernetes.io/arch
      operator: In
      values:
        - amd64
  ttlSecondsAfterEmpty: 30
```
Logs
Do not have access to these logs at this time, but it was correctly trying to schedule the pending pods and calculating the instance size based on the service pod requests.
About this issue
- State: closed
- Created 2 years ago
- Reactions: 9
- Comments: 43 (29 by maintainers)
This limitation presents a serious problem: it means that new DaemonSet workloads can never be reliably scheduled on existing clusters. If the workaround is that all DaemonSets need to have a PriorityClass assigned, that should at a minimum be documented.
Hey, what’s the current situation? Is it intended that new DaemonSet pods will be Pending forever instead of making room for them by migrating other pods to new nodes? Can’t this behavior already be justified with the current spec.consolidation.enabled field? I just tried creating a test DaemonSet and the pods were Pending. Then I re-applied with spec.template.spec.priorityClassName: system-node-critical, with the same result. If there is a clear workaround (not that it’s ever logical to make pods Pending by design, imo), please document it.
@jonathan-innis It would be great if, even when a DaemonSet is deployed after the nodes, Karpenter could detect the situation and re-combine pods and nodes to make sure the new DaemonSet runs as expected.
When do you plan to release this new logic?
I agree with this. It’s not something that Karpenter (or other autoscalers) support today, but I’d love to see some design work to make this happen.
I wonder if it’s worth exploring a KEP to implement support for this upstream.
You can easily resolve this by assigning your daemonsets a higher priorityClass. The k8s scheduler will then evict other pods to make room for the daemonset and move those other pods to new nodes.
Eventually Karpenter’s consolidation feature might even decide to merge 2 nodes.
It all works perfectly as long as you give daemonsets a higher priority.
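For anyone looking for the shape of that workaround, a sketch of a custom PriorityClass (the name and value are made up; pick a value above your regular workloads but below the system-* classes):

```yaml
# Hypothetical PriorityClass for per-node daemonsets. preemptionPolicy
# defaults to PreemptLowerPriority, which is what lets the scheduler evict
# lower-priority pods to make room.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-critical        # placeholder name
value: 1000000                    # placeholder value
globalDefault: false
description: "Per-node agents that must run on every node"
```

Each DaemonSet would then reference it via spec.template.spec.priorityClassName: daemonset-critical.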
@missourian55 If you think that Karpenter is not calculating daemonset resources correctly for daemonsets that existed prior to the node being launched, please file another issue and include Karpenter logs and daemonset specs. This particular issue is about daemonsets that were created after the nodes were already launched.
That makes perfect sense. As you identified, if a pod has a higher priority-class than a DaemonSet pod, then the DS pod could end up displaced, which would leave it in a Pending state.
I think you’re right that you probably were seeing multiple issues. Some were fixed by updating priority-class, but others likely were fixed through the v0.8.2 release.
Let me know if we can close this out. Thanks again for reporting the issue!
Thanks @dewjam @bwagner5 - the issue outlined in https://github.com/aws/karpenter/issues/1573 sounds like it touches upon some of the same areas, so I am glad there is a fix in 🙂 I will test it out when it gets released and close this issue out after.