autoscaler: CA does not scale up from zero nodes in group

What happened: Cluster Autoscaler will not scale up a node group from zero nodes. It does scale up from one node. I have a node group whose template uses p2.xlarge GPU instances. With zero running instances in my gpu-nodes node group, I create a new Job that runs 2 pods, each requesting 1 GPU. The pods are unschedulable, and the CA logs show:

I0524 15:30:32.066956 1 factory.go:33] Event(v1.ObjectReference{Kind:"Pod", Namespace:"engine", Name:"distributed-job-2xp8n", UID:"34fa9255-5f67-11e8-bede-068abf0075c0", APIVersion:"v1", ResourceVersion:"98300", FieldPath:""}): type: 'Normal' reason: 'NotTriggerScaleUp' pod didn't trigger scale-up (it wouldn't fit if a new node is added)

The pods never get scheduled, and no GPU instances get spun up.

What you expected to happen: CA should scale up the cluster by adding two p2.xlarge instances to the gpu-nodes group.

How to reproduce it (as minimally and precisely as possible): In a kops cluster on AWS:

  1. Create a node group which includes p2.xlarge as the instance type.
  2. Create a Job that runs 2 pods, each requesting a single nvidia.com/gpu.
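For anyone trying to reproduce step 2, here is a rough sketch of such a Job, written as a small Python script that prints a JSON manifest which could be piped to kubectl apply -f -. The Job name and namespace are taken from the log line in this report. The image name, the container name and the helper function gpu_job_manifest are made up for illustration, and the actual Job used in the report may have looked different.

```python
import json


def gpu_job_manifest(name="distributed-job",
                     namespace="engine",
                     image="example.com/gpu-worker:latest",  # hypothetical image
                     pods=2):
    """Build a batch/v1 Job whose pods each request one nvidia.com/gpu."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Run `pods` pods in parallel and wait for all of them to finish.
            "parallelism": pods,
            "completions": pods,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",  # hypothetical container name
                            "image": image,
                            # Extended resources such as GPUs are requested via limits.
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_job_manifest(), indent=2))
```

With the gpu-nodes group at zero instances, both pods created by this Job should stay Pending, which is the situation described above.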

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.3", GitCommit:"2bba0127d85d5a46ab4b778548be28623b32d0b0", GitTreeState:"clean", BuildDate:"2018-05-21T09:17:39Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.7", GitCommit:"b30876a5539f09684ff9fde266fda10b37738c9c", GitTreeState:"clean", BuildDate:"2018-01-16T21:52:38Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

  • Cloud provider or hardware configuration: AWS

  • OS (e.g. from /etc/os-release): k8s-1.8-debian-jessie-amd64-hvm-ebs-2018-02-08

  • Kernel (e.g. uname -a): Linux ip-172-20-40-189 4.4.115-k8s #1 SMP Thu Feb 8 15:37:40 UTC 2018 x86_64 GNU/Linux

  • Install tools: kops 1.8.1

  • Others: CA image: gcr.io/google_containers/cluster-autoscaler:v1.0.5

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 17 (8 by maintainers)

Most upvoted comments

@icy When scaling up from 0 nodes, CA guesses what a new node would look like and checks whether the pending pods would be able to run on that node. In your case, the node predicted by CA doesn’t have the label that the pod requests through nodeSelector or nodeAffinity. The logic for guessing which labels the first node in a given node group would have is specific to each cloud provider and (unless you’re using hosted k8s such as GKE) requires some manual tagging of the underlying cloud provider autoscaling group. The details are described in the README of each cloud provider.
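To make the label check concrete, here is a simplified Python sketch of the idea, not CA's actual code. It assumes the AWS convention of tagging the autoscaling group with keys of the form k8s.io/cluster-autoscaler/node-template/label/<label-name>, as described in the AWS cloud provider README. Check that README for the CA version you run, since tag support varies by release. The function names and the workload=gpu label are made up for illustration, and the real check also covers resources, taints and other predicates.

```python
# Assumed AWS tag prefix for predicted node labels (see the AWS cloud provider README).
LABEL_TAG_PREFIX = "k8s.io/cluster-autoscaler/node-template/label/"


def template_labels(asg_tags):
    """Derive the labels of the predicted (template) node from ASG tags."""
    return {
        key[len(LABEL_TAG_PREFIX):]: value
        for key, value in asg_tags.items()
        if key.startswith(LABEL_TAG_PREFIX)
    }


def selector_matches(node_selector, node_labels):
    """A nodeSelector matches only if every requested label is present with the same value."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


def would_fit_template(node_selector, asg_tags):
    """Simplified stand-in for CA's check; ignores resources, taints and affinity."""
    return selector_matches(node_selector, template_labels(asg_tags))


if __name__ == "__main__":
    pod_selector = {"workload": "gpu"}  # hypothetical nodeSelector on the pending pod

    untagged_asg = {"Name": "gpu-nodes"}
    tagged_asg = {
        "Name": "gpu-nodes",
        LABEL_TAG_PREFIX + "workload": "gpu",
    }

    print(would_fit_template(pod_selector, untagged_asg))
    print(would_fit_template(pod_selector, tagged_asg))
```

With an empty group and no such tags, the predicted node carries none of the labels the pod selects on, so the pod "wouldn't fit if a new node is added" and no scale-up is triggered. Once at least one real node exists, CA can use that node as the template instead, which would explain why scaling from one node works.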