autoscaler: Cluster Autoscaler does not start new nodes when Taints and NodeSelector are used in EKS
Hi,
we use EKS with Kubernetes 1.18 and the Cluster Autoscaler. Since Kubernetes 1.17 the “beta.kubernetes.io/instance-type” label is deprecated, so we now use the new “node.kubernetes.io/instance-type” label as nodeSelector. This works for autoscaling groups without taints. For autoscaling groups with taints, the new “node.kubernetes.io/instance-type” selector does not work and the Cluster Autoscaler doesn’t start new nodes. If we switch back to the old, deprecated “beta.kubernetes.io/instance-type” selector, the Cluster Autoscaler starts a new node. We see this behavior on all of our EKS clusters.
Events output for both test pods, one with the beta and one with the node.kubernetes.io nodeSelector. The pod with the node.kubernetes.io selector was started first.
% kubectl get pods
NAME READY STATUS RESTARTS AGE
test-4xlarge-beta 0/1 Pending 0 41s
test-4xlarge-node 0/1 Pending 0 72s
% kubectl describe pod test-4xlarge-node
Name: test-4xlarge-node
Namespace: default
Priority: 0
Node: <none>
Labels: <none>
Annotations: kubernetes.io/psp: eks.privileged
Status: Pending
IP:
IPs: <none>
Containers:
test-4xlarge-node:
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-lzknk:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-lzknk
Optional: false
QoS Class: BestEffort
Node-Selectors: node.kubernetes.io/instance-type=c5a.4xlarge
Tolerations: disk=true:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 88s cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 12 node(s) didn't match node selector
Warning FailedScheduling 9s (x8 over 92s) default-scheduler 0/35 nodes are available: 3 node(s) were unschedulable, 32 node(s) didn't match node selector.
% kubectl describe pod test-4xlarge-beta
Name: test-4xlarge-beta
Namespace: default
Priority: 0
Node: <none>
Labels: <none>
Annotations: kubernetes.io/psp: eks.privileged
Status: Pending
Containers:
test-4xlarge-beta:
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-lzknk:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-lzknk
Optional: false
QoS Class: BestEffort
Node-Selectors: beta.kubernetes.io/instance-type=c5a.4xlarge
Tolerations: disk=true:NoSchedule
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal TriggeredScaleUp 47s cluster-autoscaler pod triggered scale-up: [{eks-agileci-cattle-disk-asg20201117110440315400000002 0->1 (max: 100)}]
Warning FailedScheduling 7s (x5 over 51s) default-scheduler 0/35 nodes are available: 3 node(s) were unschedulable, 32 node(s) didn't match node selector.
Which component are you using?: cluster-autoscaler
What version of the component are you using?: cluster-autoscaler release v1.18.3
What k8s version are you using (kubectl version)?: 1.18.9
kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-12T01:09:16Z", GoVersion:"go1.15.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
What did you expect to happen?: The Cluster Autoscaler starts a new node.
What happened instead?: The Cluster Autoscaler doesn’t start a new node. See the following error.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NotTriggerScaleUp 88s cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 12 node(s) didn't match node selector
Warning FailedScheduling 9s (x8 over 92s) default-scheduler 0/35 nodes are available: 3 node(s) were unschedulable, 32 node(s) didn't match node selector.
How to reproduce it (as minimally and precisely as possible):
We use the following pod templates to test the cluster-autoscaler.
Is Working:
apiVersion: v1
kind: Pod
metadata:
  name: test-4xlarge-beta
spec:
  restartPolicy: OnFailure
  containers:
    - name: test-4xlarge-beta
      image: radial/busyboxplus
      args:
        - "sh"
  tolerations:
    - key: "disk"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  nodeSelector:
    beta.kubernetes.io/instance-type: c5a.4xlarge
Is not Working:
apiVersion: v1
kind: Pod
metadata:
  name: test-4xlarge-node
spec:
  restartPolicy: OnFailure
  containers:
    - name: test-4xlarge-node
      image: radial/busyboxplus
      args:
        - "sh"
  tolerations:
    - key: "disk"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  nodeSelector:
    node.kubernetes.io/instance-type: c5a.4xlarge
Taints and tags are configured on the ASG and also in the kubelet configuration. See the screenshot.
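The screenshot isn’t shown above. As a rough sketch of what such a setup typically looks like on self-managed EKS nodes (the cluster name and values below are illustrative, not the exact ones from the screenshot), the taint is registered by the kubelet at boot and mirrored as cluster-autoscaler tags on the ASG:

# Node user data: register the taint via the EKS bootstrap script
/etc/eks/bootstrap.sh my-cluster \
  --kubelet-extra-args '--register-with-taints=disk=true:NoSchedule'

# ASG tags read by the cluster autoscaler (needed for scale-up from 0)
k8s.io/cluster-autoscaler/enabled                  = true
k8s.io/cluster-autoscaler/my-cluster               = owned
k8s.io/cluster-autoscaler/node-template/taint/disk = true:NoSchedule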

About this issue
- Original URL
- State: closed
- Created 3 years ago
- Reactions: 13
- Comments: 37 (12 by maintainers)
I’m having the same issue: cluster-autoscaler fails to start a new node when requesting an instance type which is not yet online.
For example, when the cluster does not have a large instance type such as c5.24xlarge, cluster-autoscaler fails to start a new node for a pod launched with the node selector node.kubernetes.io/instance-type: c5.24xlarge, even though we have this exact instance type defined in the managed node group’s available instance types. The cluster-autoscaler logs don’t contain anything meaningful, and the pod has:
Hi, I think I found the root cause. When scaling from 0, the AWS cloud provider generates the node info from a template (not a real node). When generating it, it forgets to add “node.kubernetes.io/instance-type” to the labels. Check the code here: aws_manager.go
I think I have confirmed that my hypothesis in https://github.com/kubernetes/autoscaler/issues/3802#issuecomment-846615442 is correct.
I’ve deployed a patched version with a workaround (not a fix), which has prevented the issue from re-occurring.
https://github.com/kubernetes/autoscaler/compare/cluster-autoscaler-1.21.0...lsowen:autoscaler-failure-workaround?expand=1
Basically, wait 5 minutes after the node is “ready” before caching the info about the node, which includes the labels. This prevents instance groups from being cached with missing labels.
As for a fix, I’m not sure the best way. A few options:
- Change IsNodeReadyAndSchedulable (not sure what way other than the current implementation: https://github.com/kubernetes/autoscaler/blob/79a43dfe19545b5351db5dad28bbe27f6dea7574/cluster-autoscaler/utils/kubernetes/ready.go#L27).
- Delay setting the NodeReady condition until after all node labels are registered.
The last option seems the most robust, but is definitely the most complicated. I don’t even know where to begin. It might also be the root of my issue, because older versions of kubernetes (and thus older kubelet) didn’t seem to trigger this issue.
I have continued to experience this issue, and have tracked down part of it.
In the loop where it is checking the nodeGroups, it looks for a cached definition in the nodeInfoCache: https://github.com/kubernetes/autoscaler/blob/79a43dfe19545b5351db5dad28bbe27f6dea7574/cluster-autoscaler/core/utils/utils.go#L103-L110
For the groups which do have issues, the results are being returned from that cache, and nodeInfoCopy.node.ObjectMeta.Labels is missing the expected labels. So the node templates are not matching the required NodeAffinity.Filter() (https://github.com/kubernetes/kubernetes/blob/d8f9e4587ac1265efd723bce74ae6a39576f2d58/pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go#L115).
Labels from a “correct” group (which does autoscale up from 0):
Labels from an “incorrect” group (which does not autoscale up from 0, since it is missing the workersize and workergroup labels we use in our pod nodeSelector):
My guess is that the node is still “booting” when the info is cached, so not all labels have been added to the data which is permanently cached. Possibly IsNodeReadyAndSchedulable is triggering too early? https://github.com/kubernetes/autoscaler/blob/79a43dfe19545b5351db5dad28bbe27f6dea7574/cluster-autoscaler/core/utils/utils.go#L80-L94
Restarting the cluster-autoscaler pod allows it to refresh all data from AWS, at which point the correct node groups are scaled up for the existing pending pods. Then, at some point in the next 24 or so hours, one or more groups will stop scaling properly (which of our 10 or so groups starts failing seems to be random).
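For reference, the “restart” here is just a rollout restart of the autoscaler deployment. This assumes the deployment name and namespace from the upstream example manifests, so adjust for your install:

kubectl -n kube-system rollout restart deployment cluster-autoscaler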
I’m seeing something similar, but I’m not using any node.kubernetes.io labels. When cluster-autoscaler (v1.20.0) is first launched, it successfully scales up from zero when needed by creating template-node-for-... template nodes. For a while, it works without issue, scaling up and down (even to and from 0). However, within 24 hours it stops being able to find a match for any ASG which has been scaled down to zero. I see no more log entries for template-node-for-..., so I suspect the “actual definitions” of the ASGs expire from a cache, and the logic for using the template node definition does not start back up. After this occurs, I start to see log messages like:
Though this is the ASG which should scale up. Restarting the cluster-autoscaler “resolves” the issue (but is not a real solution, as this requires restarting the autoscaler every day at random times).
@olahouze - to get this working I needed to add this tag to my AWS Autoscaling Group
Make sure that “Tag new instances” is ticked as well. I then set the pod affinity to:
The autoscaler picked up the change on the next cycle and scaled up the ASG from 0.
Hope this helps
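The affinity snippet itself isn’t shown above. As a sketch, a required node affinity on the instance-type label (the instance type here is only an example) would look roughly like:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - c5a.4xlarge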
We are able to get around this issue by using ASG tags for the labels, as described here.
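The usual tag key pattern for this is k8s.io/cluster-autoscaler/node-template/label/<label-name>, with the label value as the tag value, for example (illustrative value):

k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type = c5a.4xlarge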
Has anyone been able to determine the root cause or a fix for this issue? We are currently having an issue where a customer using EKS does not see their nodes register correctly once they are scaled up from 0 (zero). Again, taints and labels are used.
I had an issue with zero-instance ASGs and nodeSelector not targeting the correct node labels (https://github.com/kubernetes/autoscaler/issues/4010), also on EKS.
Scaling with node.kubernetes.io/instance-type works without taints, including scale-up from 0. But if you add a taint on the ASG, the cluster autoscaler doesn’t scale up and reports an error. If we switch back to the old beta.kubernetes.io/instance-type label, it works and also scales up from 0. I don’t think it’s a problem with the tags on the ASG; we don’t set any tag like “beta.kubernetes.io” or “node.kubernetes.io”, as this is not needed.
Hi,
we fixed our issue as follows and the cluster autoscaler is now able to start new instances based on node selectors. In our use case, we use self-managed ASGs instead of EKS node groups. That gives us more flexibility to manage our nodes.
We set the following tags on the ASGs, in this case with a taint.
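The exact tags aren’t shown above; tags of this shape (the node-template convention from the cluster-autoscaler AWS docs, with illustrative values for our disk taint and instance type) are what make scale-up from 0 work with a taint:

k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type = c5a.4xlarge
k8s.io/cluster-autoscaler/node-template/taint/disk                             = true:NoSchedule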
@olahouze - I agree with your thinking, I was also going to update all my helm charts. One point that I missed is that I also have a label in my eksctl nodegroup that matches the tag I just added. I suspect that the cluster autoscaler will need the tag and the scheduler will need the label:
There is also an advanced eksctl cluster example here which uses the cluster-autoscaler tags on nodegroups.
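The linked eksctl example isn’t shown above. A minimal sketch of the idea, assuming the eksctl nodegroup schema of that era (names and values are illustrative), keeps the nodegroup label and the cluster-autoscaler tag in sync on the same group:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: eu-west-1
nodeGroups:
  - name: disk-4xlarge
    instanceType: c5a.4xlarge
    minSize: 0
    maxSize: 100
    desiredCapacity: 0
    labels:
      workergroup: disk      # label the scheduler matches on
    tags:
      # tag the cluster autoscaler reads while the group is at 0 nodes
      k8s.io/cluster-autoscaler/node-template/label/workergroup: disk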