karpenter-provider-aws: karpenter.sh/initialized: true does not get applied to any Node

Version

Karpenter Version: v0.18.1

Kubernetes Version: v1.23 (AWS EKS)

Expected Behavior

Karpenter should bring nodes up as requested and, once each node is initialized, label it with karpenter.sh/initialized: true. This label is what allows behaviours such as deletion and consolidation to work correctly.

Actual Behavior

On v0.18.1 the label karpenter.sh/initialized: true is not set on ANY new node. As a result, behaviours that depend on it, such as deletion and consolidation, do not work: no nodes are removed or consolidated, which has a significant cost impact.

Note that these nodes are otherwise used correctly by pods.

Steps to Reproduce the Problem

Bring up any node with Karpenter v0.18.1 and inspect its labels; you will see that karpenter.sh/initialized is not set.

To confirm that the nodes are not labeled, we ran kubectl get node -L karpenter.sh/initialized; see the command output below.

Resource Specs and Logs

kubectl get node -L karpenter.sh/initialized

NAME                           STATUS     ROLES    AGE     VERSION               INITIALIZED
ip-xxx.ec2.internal    Ready      <none>   36m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal    Ready      <none>   37m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal    Ready      <none>   37m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   14m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal    Ready      <none>   3h14m   v1.23.9-eks-ba74326
ip-xxx.ec2.internal    Ready      <none>   14m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal    Ready      <none>   37m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   14m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   37m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   14m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   3h14m   v1.23.9-eks-ba74326
ip-xxx.ec2.internal    Ready      <none>   37m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal    Ready      <none>   37m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   37m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   4m26s   v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   37m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   37m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   37m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   3h18m   v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   14m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal   Ready      <none>   19m     v1.23.9-eks-ba74326
ip-xxx.ec2.internal    Ready      <none>   4m27s   v1.23.9-eks-ba74326

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave “+1” or “me too” comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

About this issue

  • Original URL
  • State: closed
  • Created 2 years ago
  • Reactions: 7
  • Comments: 24 (15 by maintainers)

Most upvoted comments

This is normally caused by extended resources not registering or startup taints not being removed.

Are you using extended resources (e.g. GPUs, or do you have AWS_ENABLE_POD_ENI turned on) or startup taints?
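
A rough way to check both of those things from the cluster (the node and provisioner names below are placeholders):

kubectl get node <node-name> -o jsonpath='{.status.allocatable}'   # does the node actually register the extended resource?
kubectl get node <node-name> -o jsonpath='{.spec.taints}'          # are any startup taints still present?
kubectl get provisioner <provisioner-name> -o yaml                 # does the provisioner define startupTaints or allow instance types with extended resources?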

Closing since #3408 was merged. This should be fixed in the next minor version release (v0.28.0).

Got it. I think the only options at this point are to either exclude the inf types from your provisioner or run the DS that registers the neuron resource. As mentioned by @ellistarn, this model where we do initialization based on expected resources should change to requested resources in #3408 so when that PR is merged and released, you should be able to use inf types without the DS.
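
If you go the exclusion route, a minimal sketch of the Provisioner requirement could look like the following. It assumes the well-known karpenter.k8s.aws/instance-category label is available in your version; otherwise node.kubernetes.io/instance-type with an explicit NotIn list achieves the same thing.

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    # Keep Karpenter from launching Inferentia (inf) instance types,
    # which advertise the aws.amazon.com/neuron extended resource.
    - key: karpenter.k8s.aws/instance-category
      operator: NotIn
      values: ["inf"]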

Is there any resolution here?

@jonathan-innis was thinking about changing initialization to only require the resources requested by pods (e.g. if pods didn’t request the resources, we wouldn’t include it in initialization). Reopening for his comment.

The daemonset pod is consuming resources even though we don't actually care about the special resources it registers.

Is it possible for you to just scale down the resources of the daemonset to 0?
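
For example, something along these lines could zero out the requests on the device-plugin DaemonSet; the namespace, DaemonSet name, and container name here are assumptions, so adjust them to whatever your cluster actually runs:

kubectl -n kube-system patch daemonset neuron-device-plugin \
  --type strategic \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"neuron-device-plugin","resources":{"requests":{"cpu":"0","memory":"0"}}}]}}}}'

Because this is a strategic merge patch, only the requests shown are changed and the rest of the DaemonSet spec is left alone.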