autoscaler: Cluster Autoscaler does not start new nodes when Taints and NodeSelector are used in EKS

Hi,

we use EKS with Kubernetes 1.18 and the Cluster Autoscaler. Since Kubernetes 1.17 the “beta.kubernetes.io/instance-type” label is deprecated, so we use the new “node.kubernetes.io/instance-type” label as the NodeSelector instead. This works for autoscaling groups without taints. For autoscaling groups with taints, the new “node.kubernetes.io/instance-type” selector does not work and the Cluster Autoscaler doesn’t start new nodes. If we switch back to the old, deprecated “beta.kubernetes.io/instance-type” selector, the Cluster Autoscaler starts a new node. We see this behavior on all of our EKS clusters.

Below is the events output for both test pods, one with the beta and one with the node.kubernetes.io NodeSelector. The pod with the node.kubernetes.io selector was started first.

% kubectl get pods
NAME                READY   STATUS    RESTARTS   AGE
test-4xlarge-beta   0/1     Pending   0          41s
test-4xlarge-node   0/1     Pending   0          72s

% kubectl describe pod test-4xlarge-node
Name:         test-4xlarge-node
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           
IPs:          <none>
Containers:
  test-4xlarge-node:
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-lzknk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lzknk
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node.kubernetes.io/instance-type=c5a.4xlarge
Tolerations:     disk=true:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age               From                Message
  ----     ------             ----              ----                -------
  Normal   NotTriggerScaleUp  88s               cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 12 node(s) didn't match node selector
  Warning  FailedScheduling   9s (x8 over 92s)  default-scheduler   0/35 nodes are available: 3 node(s) were unschedulable, 32 node(s) didn't match node selector.


% kubectl describe pod test-4xlarge-beta
Name:         test-4xlarge-beta
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
Containers:
  test-4xlarge-beta:     
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-lzknk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lzknk
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  beta.kubernetes.io/instance-type=c5a.4xlarge
Tolerations:     disk=true:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age               From                Message
  ----     ------            ----              ----                -------
  Normal   TriggeredScaleUp  47s               cluster-autoscaler  pod triggered scale-up: [{eks-agileci-cattle-disk-asg20201117110440315400000002 0->1 (max: 100)}]
  Warning  FailedScheduling  7s (x5 over 51s)  default-scheduler   0/35 nodes are available: 3 node(s) were unschedulable, 32 node(s) didn't match node selector.

Which component are you using?: cluster-autoscaler
What version of the component are you using?: cluster-autoscaler release v1.18.3
What k8s version are you using (kubectl version)?: 1.18.9

kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-12T01:09:16Z", GoVersion:"go1.15.4", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

What did you expect to happen?: The Cluster Autoscaler starts a new node.
What happened instead?: The Cluster Autoscaler doesn’t start a new node. See the following error.

Events:
  Type     Reason             Age               From                Message
  ----     ------             ----              ----                -------
  Normal   NotTriggerScaleUp  88s               cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 12 node(s) didn't match node selector
  Warning  FailedScheduling   9s (x8 over 92s)  default-scheduler   0/35 nodes are available: 3 node(s) were unschedulable, 32 node(s) didn't match node selector.

How to reproduce it (as minimally and precisely as possible):

We use the following pod templates to test the cluster-autoscaler.

Working:

apiVersion: v1
kind: Pod
metadata:
  name: test-4xlarge-beta
spec:
  restartPolicy: OnFailure
  containers:
  - name: test-4xlarge-beta
    image: radial/busyboxplus
    args:
    - "sh"
  tolerations:
  - key: "disk"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  nodeSelector:
    beta.kubernetes.io/instance-type: c5a.4xlarge

Not working:

apiVersion: v1
kind: Pod
metadata:
  name: test-4xlarge-node
spec:
  restartPolicy: OnFailure
  containers:
  - name: test-4xlarge-node
    image: radial/busyboxplus
    args:
    - "sh"
  tolerations:
  - key: "disk"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  nodeSelector:
    node.kubernetes.io/instance-type: c5a.4xlarge

Taints and tags are configured on the ASG and also in the kubelet configuration. See the screenshot.

[Screenshot: Xnip2021-01-11_16-50-51]
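
The screenshot itself is not reproduced here. Based on the disk=true:NoSchedule taint used in the test pods above, the ASG and kubelet configuration it shows presumably looks roughly like the following sketch (the exact values are assumptions, not taken from the screenshot):

# ASG tags, so the cluster autoscaler knows about the taint when scaling from 0
k8s.io/cluster-autoscaler/enabled: "true"
k8s.io/cluster-autoscaler/node-template/taint/disk: "true:NoSchedule"

# kubelet on the nodes registers the same taint, e.g.
# --register-with-taints=disk=true:NoSchedule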

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 13
  • Comments: 37 (12 by maintainers)

Most upvoted comments

I’m having the same issue: cluster-autoscaler fails to start a new node when requesting an instance type which is not yet online.

For example, when the cluster does not have a large instance type such as c5.24xlarge, cluster-autoscaler fails to start a new node for a pod launched with the node selector node.kubernetes.io/instance-type: c5.24xlarge, even though we have this exact instance type defined in the managed node group’s available instance types.

The cluster-autoscaler logs don’t contain anything meaningful; the pod has:

pod didn't trigger scale-up: 2 node(s) didn't match Pod's node affinity

Hi, I think I found the root cause. When scaling from 0, aws_cloud_provider will generate the nodeinfo from a template (not from a real node). When generating it, it forgets to add “node.kubernetes.io/instance-type” to the labels. Check the code here: aws_manager.go
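
Until the generated template includes the newer label, one workaround (an assumption here, but consistent with the node-template tags shown in a later comment) is to set the label explicitly as an ASG tag so that the template node built for scale-from-0 carries it:

# ASG tag; the value is the instance type of that group (c5a.4xlarge in the repro above)
k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type: "c5a.4xlarge"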

I think I have confirmed that my hypothesis in https://github.com/kubernetes/autoscaler/issues/3802#issuecomment-846615442 is correct.

I’ve deployed a patched version with a workaround (not a fix), which has prevented the issue from re-occurring.

https://github.com/kubernetes/autoscaler/compare/cluster-autoscaler-1.21.0...lsowen:autoscaler-failure-workaround?expand=1

Basically, wait 5 minutes after the node is “ready” before caching the info about the node, which includes the labels. This prevents instance groups from being cached with missing labels.

As for a fix, I’m not sure of the best way. A few options:

  1. A configurable “timeout” similar to my workaround, to delay caching
  2. A different way of determining IsNodeReadyAndSchedulable (not sure what way other than the current implementation: https://github.com/kubernetes/autoscaler/blob/79a43dfe19545b5351db5dad28bbe27f6dea7574/cluster-autoscaler/utils/kubernetes/ready.go#L27)
  3. A change to kubelet to not set the NodeReady condition until after all node labels are registered.

Option 3 seems the most robust, but is definitely the most complicated. I don’t even know where to begin. It might also be the root of my issue, because older versions of kubernetes (and thus older kubelet) didn’t seem to trigger this issue.

I have continued to experience this issue, and have tracked down part of it.

In the loop where it is checking the nodeGroups, it looks for a cached definition in the nodeInfoCache: https://github.com/kubernetes/autoscaler/blob/79a43dfe19545b5351db5dad28bbe27f6dea7574/cluster-autoscaler/core/utils/utils.go#L103-L110

For the groups which do have issues, the results are being returned from that cache, and nodeInfoCopy.node.ObjectMeta.Labels is missing the expected labels. So the node templates do not match the required NodeAffinity.Filter() (https://github.com/kubernetes/kubernetes/blob/d8f9e4587ac1265efd723bce74ae6a39576f2d58/pkg/scheduler/framework/plugins/nodeaffinity/node_affinity.go#L115).

Labels from a “correct” group (which does autoscale up from 0):

                Labels: map[string]string [
                        "kubernetes.io/os": "linux",
                        "kops.k8s.io/instancegroup": "workers-devstage-large-spot",
                        "spotinstance": "yes",
                        "kubernetes.io/arch": "amd64",
                        "workergroup": "devstage",
                        "topology.kubernetes.io/zone": "us-east-1a",
                        "node-role.kubernetes.io/spot-worker": "true",
                        "kubernetes.io/hostname": "template-node-for-workers-devstage-large-spot.cluster-01....+22 more",
                        "node.kubernetes.io/instance-type": "r5.24xlarge",
                        "beta.kubernetes.io/os": "linux",
                        "beta.kubernetes.io/arch": "amd64",
                        "nodetype": "worker",
                        "failure-domain.beta.kubernetes.io/region": "us-east-1",
                        "topology.kubernetes.io/region": "us-east-1",
                        "beta.kubernetes.io/instance-type": "r5.24xlarge",
                        "node-role.kubernetes.io/node": "",
                        "failure-domain.beta.kubernetes.io/zone": "us-east-1a",
                        "kubernetes.io/role": "node",
                        "workersize": "large",
                ],

Labels from an “incorrect” group (which does not autoscale up from 0 since it is missing the workersize and workergroup labels we use in our pod nodeSelector):

                Labels: map[string]string [
                        "topology.kubernetes.io/region": "us-east-1",
                        "node.kubernetes.io/instance-type": "c5.12xlarge",
                        "topology.kubernetes.io/zone": "us-east-1a",
                        "beta.kubernetes.io/instance-type": "c5.12xlarge",
                        "kubernetes.io/os": "linux",
                        "beta.kubernetes.io/arch": "amd64",
                        "beta.kubernetes.io/os": "linux",
                        "kubernetes.io/arch": "amd64",
                        "failure-domain.beta.kubernetes.io/region": "us-east-1",
                        "kubernetes.io/hostname": "template-node-for-workers-dev-normal-spot.cluster-01.-2...+18 more",
                        "failure-domain.beta.kubernetes.io/zone": "us-east-1a",
                ],

My guess is that the node is still “booting” when the info is cached, so not all labels have been added to the data which is permanently cached. Possibly IsNodeReadyAndSchedulable is triggering too early?

https://github.com/kubernetes/autoscaler/blob/79a43dfe19545b5351db5dad28bbe27f6dea7574/cluster-autoscaler/core/utils/utils.go#L80-L94

Restarting the cluster-autoscaler pod allows it to refresh all data from AWS, at which point the correct node groups are scaled up for the existing pending pods. Then, at some point in the next 24 or so hours, one or more groups will stop scaling properly (which of our 10 or so groups fail seems to be random).

Hi, yes, we have the same feeling that the autoscaler forgets the “node.kubernetes.io” labels, but not immediately. For some minutes after the shutdown of the last node in the ASG it also works with “node.kubernetes.io”, but not after some hours. A fix might also solve the following issue: Scale up windows

I’m seeing something similar, but I’m not using any node.kubernetes.io labels. When cluster-autoscaler (v1.20.0) is first launched, it successfully scales up from zero when needed by creating template-node-for-... template nodes. For a while, it works without issue, scaling up and down (even to and from 0). However, within 24 hours it stops being able to find a match for any ASG which has been scaled down to zero. I see no more log entries for template-node-for-...., so I suspect the “actual definitions” of the ASG expire from a cache, and the logic for using the template node definition does not start back up. After this occurs, I start to see log messages like:

Pod <POD_NAME> can't be scheduled on <ASG_NAME>, predicate checking error: node(s) didn't match Pod's node affinity; predicateName=NodeAffinity; reasons: node(s) didn't match Pod's node affinity; debugInfo=

This is, however, the ASG which should scale up. Restarting the cluster-autoscaler “resolves” the issue (but is not a real solution, as it requires restarting the autoscaler every day at random times).

@olahouze - to get this working I needed to add this tag to my AWS Autoscaling Group

k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: stateless

Make sure that Tag new instances is ticked as well.

I then set the pod affinity to

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values:
                - SPOT
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nodegroup-type
              operator: In
              values:
                - stateless

The autoscaler picked up the change on the next cycle and scaled up the ASG from 0.

Hope this helps

Has anyone been able to determine the root cause or a fix for this issue? We are currently having an issue where a customer using EKS does not see their nodes register correctly once they are scaled up from 0 (zero). Again, taints and labels are used.

We are able to get around this issue by using tags for labels as described here.

I had an issue with zero-instance ASGs and the nodeSelector not targeting the correct node labels (https://github.com/kubernetes/autoscaler/issues/4010), also on EKS.

Scaling with node.kubernetes.io/instance-type works without taints, including a scale-up from 0. But if you add a taint on the ASG, the cluster autoscaler doesn’t scale up and reports an error. If we switch back to the old beta.kubernetes.io/instance-type label, it works and also scales up from 0. I don’t think it’s a problem with the tags on the ASG. We don’t set any tag like “beta.kubernetes.io” or “node.kubernetes.io”; this is not needed.

Hi,

we fixed our issue as follows and the cluster autoscaler is now able to start new instances based on node selectors. In our use case we used self-managed ASGs instead of node groups. That gives us more flexibility to manage our nodes.

We set the following tags on the ASGs, in this case for a group with a taint.

| Tag | Value | Tag new instances |
| --- | --- | --- |
| k8s.io/cluster-autoscaler/enabled | true | Yes |
| k8s.io/cluster-autoscaler/node-template/label/kubernetes.io/arch | amd64 | Yes |
| k8s.io/cluster-autoscaler/node-template/label/kubernetes.io/os | linux | Yes |
| k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/instance-type | g4dn.2xlarge | Yes |
| k8s.io/cluster-autoscaler/node-template/label/node.kubernetes.io/lifecycle | on-demand | Yes |
| k8s.io/cluster-autoscaler/node-template/label/topology.ebs.csi.aws.com/zone | eu-central-1b | Yes |
| k8s.io/cluster-autoscaler/node-template/label/topology.kubernetes.io/zone | eu-central-1b | Yes |
| k8s.io/cluster-autoscaler/node-template/taint/gpu | true:NoSchedule | Yes |
| kubernetes.io/cluster/eks-XXXXXXXXXXXXXX | owned | Yes |
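
To show how a pod would target such a tainted group, here is a minimal sketch in the same style as the repro pods above (the pod name is hypothetical; the gpu taint and the g4dn.2xlarge instance type come from the tags in the table):

apiVersion: v1
kind: Pod
metadata:
  name: test-gpu-2xlarge
spec:
  restartPolicy: OnFailure
  containers:
  - name: test-gpu-2xlarge
    image: radial/busyboxplus
    args:
    - "sh"
  tolerations:
  - key: "gpu"                      # matches the node-template/taint/gpu tag
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  nodeSelector:
    node.kubernetes.io/instance-type: g4dn.2xlarge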

@olahouze - I agree with your thinking, I was also going to update all my helm charts. One point that I missed is that I also have a label in my eksctl nodegroup that matches the tag I just added. I suspect that the cluster autoscaler will need the tag and the scheduler will need the label:

  - name: ng-2-stateless-spot-1a
    spot: true
    tags:
      k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: stateless
    labels:
      nodegroup-type: stateless
      instance-type: spot

There is also an advanced eksctl cluster example here which uses the cluster-autoscaler tags on nodegroups
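
Putting the pieces from this thread together, a tainted eksctl nodegroup would carry both the taint itself and the matching node-template tags, roughly as sketched below (the group name is hypothetical, the disk taint comes from the original repro, and the exact taints syntax depends on the eksctl version; older versions use a key: "value:Effect" map instead of a list):

  - name: ng-3-disk-tainted
    taints:
      - key: disk
        value: "true"
        effect: NoSchedule
    tags:
      # read by cluster-autoscaler to build a correct node template when scaling from 0
      k8s.io/cluster-autoscaler/node-template/taint/disk: "true:NoSchedule"
      k8s.io/cluster-autoscaler/node-template/label/nodegroup-type: stateless
    labels:
      # what the scheduler actually matches once a real node has registered
      nodegroup-type: stateless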