karpenter-provider-aws: Karpenter is not respecting per-node Daemonsets

Version

Karpenter: v0.7.2

Kubernetes: v1.21.5

Context

We run several different daemonsets on a per-node basis: metrics, logging, EBS CSI, secrets-store CSI. These need to be present on every node, as they provide their functionality to every pod on that node.

(This could be a configuration / unset flag issue, looking for more information)

Expected Behavior

When choosing an instance type to provision for pending pods, Karpenter should take into account any Daemonsets that will be running on the node, not just the pending service pods that it will schedule there.

Actual Behavior

This is most noticeable in a brand new cluster, but has also been seen with mature clusters: When Karpenter brings up a node, it will correctly calculate the resources required to support the new service pod / replica. The aws-node and kube-proxy pods will be started and then the service pod.

When using a larger metrics / logging / CSI pod with requests of e.g. 1Gi RAM / 0.5-1 CPU each, these pods will be perpetually stuck in a Pending state and will never start, as there isn't enough room on the node for them.

This was most noticeable when creating a new cluster and deploying the aws-load-balancer-controller, which only requests 0.05 CPU. Even with 3 replicas, Karpenter spun up a t3a.small instance to support these. Even when adding more replicas (tested with 25), it continued to spin up t3a.small instances, presumably because they were the cheapest option, leaving all of the daemonset pods stuck in a Pending state. The one exception was a node that hosted only a single aws-load-balancer-controller pod: there, one of the daemonset pods started and the rest remained Pending.

I believe this is due to how Karpenter is scheduling the pods on the node (something about node-binding in the docs?):

  • As aws-node and kube-proxy are in the system-node-critical priority class, they are always scheduled first
  • Potentially Karpenter is then scheduling the service pod next
  • The other daemonsets, some with a much higher priority class, are not scheduled until after the service pod, and therefore get stuck in a Pending state if there is not enough room for them

Steps to Reproduce the Problem

  • Create a fresh cluster with Karpenter deployed and a default provisioner
  • Create n daemonsets with fairly high resource requests that will run on every node (see the sketch after this list)
  • Create a service deployment for a service with very low resource consumption, using the node selector for a karpenter provisioner
  • Karpenter will select an instance type suitable for the service pods, but not one able to also support the daemonset(s)
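
To make the reproduction concrete, below is a minimal sketch of the two manifests described above. All names, images, and request sizes are illustrative assumptions rather than the actual workloads from this report; the Deployment's nodeSelector matches the env: karpenter-default label set by the provisioner spec further down.

# Hypothetical reproduction manifests: names, images, and request sizes are
# illustrative only. The DaemonSet carries sizeable requests; the Deployment
# requests very little and targets nodes from the default provisioner.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: heavy-agent
spec:
  selector:
    matchLabels:
      app: heavy-agent
  template:
    metadata:
      labels:
        app: heavy-agent
    spec:
      containers:
      - name: agent
        image: public.ecr.aws/docker/library/busybox:1.36
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: 500m
            memory: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tiny-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: tiny-service
  template:
    metadata:
      labels:
        app: tiny-service
    spec:
      nodeSelector:
        env: karpenter-default
      containers:
      - name: app
        image: public.ecr.aws/docker/library/busybox:1.36
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: 50m
            memory: 64Mi

With these applied, the expectation is that Karpenter sizes the new node for the tiny 50m-CPU service pods only, leaving the heavy-agent daemonset pods Pending on that node.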

Resource Specs and Logs

Default Provisioner
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  annotations:
    meta.helm.sh/release-name: karpenter-default-provisioner-chart
    meta.helm.sh/release-namespace: default
  labels:
    app.kubernetes.io/managed-by: Helm
  name: karpenter-default
spec:
  labels:
    env: karpenter-default
  provider:
    apiVersion: extensions.karpenter.sh/v1alpha1
    kind: AWS
    launchTemplate: <launch_template>
    subnetSelector:
      Service: Private
      kubernetes.io/cluster/<cluster_name>: '*'
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values:
    - on-demand
    - spot
  - key: kubernetes.io/arch
    operator: In
    values:
    - amd64
  ttlSecondsAfterEmpty: 30

Logs

I do not have access to these logs at this time, but Karpenter was correctly trying to schedule the pending pods and calculating the instance size based on the service pods' requests.

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 9
  • Comments: 43 (29 by maintainers)

Most upvoted comments

In short, DaemonSets should be applied before Deployments, otherwise DaemonSet pods may remain in a Pending state.

This limitation presents a serious problem - it means that new DaemonSet workloads can never be reliably scheduled on existing clusters. If the workaround is that all DaemonSets need to have a PriorityClass assigned, that should at a minimum be documented.

Hey, what’s the current situation? Is it intended that new DaemonSet pods will be Pending forever, instead of making room for them by migrating other pods to new nodes? Couldn’t this behavior already be covered by the existing spec.consolidation.enabled field?

I just tried creating a test DaemonSet and its pods were Pending. I then re-applied it with spec.template.spec.priorityClassName: system-node-critical, with the same result. If there is a clear workaround (not that it’s ever logical to make pods Pending by design imo), please document it.

@jonathan-innis It would be great if, even when the DaemonSet is deployed afterwards, Karpenter could detect the situation and rearrange pods and nodes to make sure the new DaemonSet runs as expected.

When do you plan to release this new logic?

This seems like the core issue, no?

I agree with this. It’s not something that Karpenter (or other autoscalers) support today, but I’d love to see some design work to make this happen.

There’s nothing inherent to daemonsets that makes this statement true

I wonder if it’s worth exploring a KEP to implement support for this upstream.

You can easily resolve this by assigning your daemonsets a higher priorityClass. The k8s scheduler will then evict other pods to make room for the daemonset and move those other pods to new nodes.

Eventually Karpenter's consolidation feature might even decide to merge two nodes.

It all works perfectly as long as you give daemonsets a higher priority.
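
For reference, below is a minimal sketch of that workaround, assuming a custom PriorityClass; the class name and value are arbitrary examples, not values from this issue.

# Illustrative only: any value higher than the workloads the DaemonSet should
# displace will allow the scheduler to preempt them.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: daemonset-high
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Higher priority so DaemonSet pods can preempt ordinary workloads."
---
# Reference the class from the DaemonSet pod template.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: heavy-agent
spec:
  selector:
    matchLabels:
      app: heavy-agent
  template:
    metadata:
      labels:
        app: heavy-agent
    spec:
      priorityClassName: daemonset-high
      containers:
      - name: agent
        image: public.ecr.aws/docker/library/busybox:1.36
        resources:
          requests:
            cpu: 500m
            memory: 1Gi

With PreemptLowerPriority (the default), the kube-scheduler can evict lower-priority pods to make room for the DaemonSet pod; the evicted pods then become Pending again, which lets Karpenter provision new capacity for them.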

@missourian55 If you think that Karpenter is not calculating daemonset resources correctly for daemonsets that existed prior to the node being launched, please file another issue and include Karpenter logs and daemonset specs. This particular issue is about daemonsets that were created after the nodes were already launched.

That makes perfect sense. As you identified, if a pod has a higher priority class than a DaemonSet pod, then the DS pod could end up displaced, which would leave it in a Pending state.

I think you’re right that you probably were seeing multiple issues. Some were fixed by updating priority-class, but others likely were fixed through the v0.8.2 release.

Let me know if we can close this out. Thanks again for reporting the issue!

Thanks @dewjam @bwagner5 - the issue outlined in https://github.com/aws/karpenter/issues/1573 sounds like it touches upon some of the same areas, so I am glad there is a fix in 🙂 I will test it out when it gets released and close this issue out after.