autoscaler: Node pool scale up timeout

The autoscaler has a timeout for non-ready nodes, which forces it to kill those nodes and potentially select a different node pool in the next iteration. However, when the node pool cannot scale up at all, it happily waits forever, keeping pods in the Pending state without trying to compensate.

For example, configuring multiple AWS Spot node pools with different instance types, or a Spot pool alongside an On-Demand pool, doesn’t really work. We’d expect CA to scale up one of the ASGs, detect a few minutes later that no nodes are coming up (because the corresponding Spot pool has no capacity), and fall back to another pool. What actually happens is that CA scales up the node pool by increasing its desired capacity and then does nothing at all other than printing Upcoming 1 nodes / Failed to find readiness information for ....
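
For reference, the behaviour we’d expect can be sketched in a few lines of Go. This is purely illustrative pseudologic under the assumptions above (a per-group provision timeout plus a short backoff), not the actual Cluster Autoscaler code, and every name in it is made up:

    package main

    import (
        "fmt"
        "time"
    )

    // nodeGroup is a toy stand-in for an ASG / node pool.
    type nodeGroup struct {
        name         string
        backoffUntil time.Time // scale-up disabled until this time
        readyNodes   int
        targetSize   int
    }

    const maxNodeProvisionTime = 15 * time.Minute

    // pickBestGroup returns the first group that is not currently backed off.
    // The real autoscaler scores expansion options; this is just an illustration.
    func pickBestGroup(groups []*nodeGroup, now time.Time) *nodeGroup {
        for _, g := range groups {
            if now.After(g.backoffUntil) {
                return g
            }
        }
        return nil
    }

    // runOnce is one iteration of the expected loop (assuming a pod is still
    // unschedulable): if a previous scale-up has not produced a ready node
    // within maxNodeProvisionTime, back that group off so a different group
    // gets picked, then scale up the best non-backed-off group.
    func runOnce(groups []*nodeGroup, requestedAt map[string]time.Time, now time.Time) {
        for _, g := range groups {
            t, pending := requestedAt[g.name]
            if pending && g.readyNodes < g.targetSize && now.Sub(t) > maxNodeProvisionTime {
                g.targetSize-- // reset to the size before the failed scale-up
                g.backoffUntil = now.Add(5 * time.Minute)
                delete(requestedAt, g.name)
                fmt.Printf("scale-up timed out for %s, backing off\n", g.name)
            }
        }
        if g := pickBestGroup(groups, now); g != nil {
            g.targetSize++
            requestedAt[g.name] = now
            fmt.Printf("scaling up %s to %d\n", g.name, g.targetSize)
        }
    }

    func main() {
        groups := []*nodeGroup{{name: "spot-pool"}, {name: "on-demand-pool"}}
        requested := map[string]time.Time{}
        now := time.Now()
        runOnce(groups, requested, now)                     // picks spot-pool
        runOnce(groups, requested, now.Add(16*time.Minute)) // spot never came up: back off, fall back to on-demand-pool
    }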

About this issue

  • State: closed
  • Created 6 years ago
  • Comments: 30 (18 by maintainers)

Most upvoted comments

Just to add my test results to this issue, if it helps any…

I have 3 ASGs (2 Spot, 1 normal). I have unschedulable pods, so CA triggers a scale-up:

I0417 13:42:04.765399       1 scale_up.go:427] Best option to resize: eu01-stg-spot-2
I0417 13:42:04.765417       1 scale_up.go:431] Estimated 1 nodes needed in eu01-stg-spot-2
I0417 13:42:04.765439       1 scale_up.go:533] Final scale-up plan: [{eu01-stg-spot-2 2->3 (max: 20)}]

But the Spot price is not fulfilled, so no instance is created. Then max-node-provision-time passes:

W0417 14:03:32.422046       1 clusterstate.go:198] Scale-up timed out for node group eu01-stg-spot-1 after 15m8.316405684s
W0417 14:03:32.422109       1 clusterstate.go:221] Disabling scale-up for node group eu01-stg-spot-1 until 2019-04-17 14:08:32.247733959 +0000 UTC m=+4217.943746338
W0417 14:03:32.532256       1 scale_up.go:329] Node group eu01-stg-spot-1 is not ready for scaleup - backoff
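
The log lines above show the whole mechanism: the scale-up request is declared failed once max-node-provision-time (15 minutes here) has passed without a ready node, and the group is then disabled for another five minutes. A minimal sketch of that bookkeeping, with invented names and with the timings taken from this log, could look like:

    package main

    import (
        "fmt"
        "time"
    )

    // Timings taken from the log above: the 15-minute provision timeout
    // (the real max-node-provision-time setting) and the roughly 5-minute
    // backoff window visible in the "Disabling scale-up ... until" line.
    const (
        maxNodeProvisionTime = 15 * time.Minute
        backoffDuration      = 5 * time.Minute
    )

    // scaleUpRequest records when a node group was asked for extra capacity.
    type scaleUpRequest struct {
        group       string
        requestedAt time.Time
        delivered   bool // did a node actually become ready?
    }

    // checkTimeout mirrors the decision visible in the clusterstate log lines:
    // if no node showed up within maxNodeProvisionTime, report a timeout and
    // a deadline until which the group should not be scaled up again.
    func checkTimeout(r scaleUpRequest, now time.Time) (timedOut bool, backoffUntil time.Time) {
        if !r.delivered && now.Sub(r.requestedAt) > maxNodeProvisionTime {
            return true, now.Add(backoffDuration)
        }
        return false, time.Time{}
    }

    func main() {
        req := scaleUpRequest{group: "eu01-stg-spot-1", requestedAt: time.Now().Add(-16 * time.Minute)}
        if timedOut, until := checkTimeout(req, time.Now()); timedOut {
            fmt.Printf("Scale-up timed out for node group %s; disabling scale-up until %s\n", req.group, until)
        }
    }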

At this point I would expect CA to immediately choose one of the other two available ASGs, but it does not:

I0417 14:09:23.451114       1 scale_up.go:412] No need for any nodes in eu01-stg
I0417 14:09:23.451516       1 scale_up.go:412] No need for any nodes in eu01-stg-spot-1
I0417 14:09:23.452883       1 scale_up.go:412] No need for any nodes in eu01-stg-spot-2

And pods are left unschedulable.

CA handles this by resizing the node group back to its original size after a timed-out scale-up. This is done in https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/static_autoscaler.go#L180. However, going through the code, it looks like this may not work for a timed-out scale-from-0. That case would be signified by log lines looking like Readiness for node group <group> not found, which I do see in your log.
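
If that theory holds, the failure mode would be roughly this: the post-timeout cleanup only acts on node groups it has readiness information for, and a group scaling up from 0 has never reported any node, so it has no readiness entry and is silently skipped, leaving the bumped-up desired capacity in place. A toy reconstruction of that suspected path (not the actual clusterstate code; all names are invented for illustration):

    package main

    import "fmt"

    // readiness is a stand-in for the per-node-group readiness bookkeeping.
    type readiness struct {
        ready, unready, notStarted int
    }

    // cleanUpTimedOutScaleUps sketches the suspected bug: cleanup only happens
    // for groups that have a readiness entry. A group scaling up from 0 nodes
    // has never produced one, so it is skipped and its timed-out scale-up is
    // never rolled back.
    func cleanUpTimedOutScaleUps(timedOutGroups []string, readinessByGroup map[string]readiness) {
        for _, group := range timedOutGroups {
            if _, found := readinessByGroup[group]; !found {
                // Matches the "Readiness for node group <group> not found" log line.
                fmt.Printf("Readiness for node group %s not found\n", group)
                continue // nothing is reset for the scale-from-0 group
            }
            fmt.Printf("resizing node group %s back to its original size\n", group)
        }
    }

    func main() {
        // The spot group scaled from 0, so it never produced a readiness entry.
        readinessByGroup := map[string]readiness{"eu01-stg": {ready: 4}}
        cleanUpTimedOutScaleUps([]string{"eu01-stg-spot-1"}, readinessByGroup)
    }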

I’ll try to reproduce later to confirm this theory (not enough time this week, sorry), but I’m fairly confident that’s what’s happening. If I’m right, it’s a bug in clusterstate. I have an idea of how to fix it, but clusterstate is not the easiest thing to reason about, and I need some time to dig into it to make sure I’m not breaking anything.

EDIT: This was not a fault on the Kubernetes side; I simply ran into an IP address quota on the Google side…


Stumbled across this too. I tried to use a scale-to-zero pool alongside my existing cluster for new CI/CD GitLab workers. Running on GKE with 1.13.6-gke.0.

Sadly the pool does not scale above 1 node, even though 5 are allowed. The main cluster is currently running on 4 nodes.

Tried with and without preemptible nodes (I guess the AWS term for this is Spot).

pod description:

Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   16s (x2 over 16s)  default-scheduler   0/5 nodes are available: 4 node(s) didn't match node selector, 5 Insufficient cpu.
  Normal   NotTriggerScaleUp  4s (x2 over 15s)   cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 in backoff after failed scale-up