autoscaler: cluster autoscaler failed to scale up when AWS couldn't start a new instance
We’re running a recent cluster-autoscaler built from the master branch, with AWS EKS and Kubernetes 1.12.7, to manage a cluster based on AWS spot instances.
Recently we hit a situation where CAS failed to scale up the cluster. During scale-up, 2 ASGs were selected to increase capacity, but AWS had no spot capacity for them, so no new instances were actually started (the groups were stuck in a state like requested instances = 5, running instances = 0). We’re using the --max-node-provision-time=10m option, but new nodes in other ASGs (which did have capacity) were not started within that time. CAS was stuck in this state for over an hour.
We found the following in CAS logs:
...
I0501 12:50:26.228744 1 scale_up.go:264] Pod blah-r-ebbf1d5c18a2f81565-7489bfcc6c-7zxp2 is unschedulable
...
I0501 12:50:26.228818 1 scale_up.go:303] Upcoming 3 nodes
...
Pod cg-service-r-1885076a5de3ca3fee-55f44f94f4-tcvcr can't be scheduled on central-eks-ondemand-workers-asg-r4.4xlarge, predicate failed: PodFitsResources predicate mismatch, reason: Insufficient ephemeral-storage
...
I0501 12:50:26.228846 1 scale_up.go:341] Skipping node group central-eks-canary-asg-r5.12xlarge - max size reached
...
I0501 12:50:26.265802 1 scale_up.go:411] No need for any nodes in central-eks-workers-asg-m4.16xlarge
I0501 12:50:26.359310 1 scale_up.go:411] No need for any nodes in central-eks-workers-asg-m5.12xlarge
...
W0501 12:50:26.359343 1 scale_up.go:325] Node group central-eks-workers-asg-m5a.12xlarge is not ready for scaleup - unhealthy
As you can see, some groups couldn’t be used (the unhealthy ones were those where AWS had no spot capacity), but others were perfectly fine (the ones with “No need for any nodes…”).
This situation is hard to reproduce, but can someone please help me review the code?
If an instance is not coming up for more than MaxNodeProvisionTime, it is added to the LongUnregistered counter here.
Then, when calculating upcoming nodes, LongUnregistered is subtracted, as can be seen in clusterstate.go, which is used in scale_up.go.
In our case newNodes := ar.CurrentTarget - (readiness.Ready + readiness.Unready + readiness.LongNotStarted + readiness.LongUnregistered) should correctly evaluate to 0, which means no new nodes are coming.
Yet, our logs show “3 upcoming nodes”.
Is it possible that we had CurrentTarget = X, Ready = Unready = LongNotStarted = LongUnregistered = 0, which set the value to 0, but later it was updated to CurrentTarget = X, Ready = Unready = 0, LongNotStarted = LongUnregistered = X, so the result was negative and the check prevented the counter from updating?
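To make the arithmetic concrete, here is a minimal, self-contained sketch of that calculation (the Readiness struct and upcomingNodes helper below are simplified stand-ins for the clusterstate.go types, not the actual implementation):

```go
package main

import "fmt"

// Readiness is a simplified stand-in for the per-node-group counters tracked
// in clusterstate.go (illustrative only, not the real struct).
type Readiness struct {
	Ready, Unready, LongNotStarted, LongUnregistered int
}

// upcomingNodes mirrors the calculation quoted above: the target size minus
// every instance the cluster state already accounts for.
func upcomingNodes(currentTarget int, r Readiness) int {
	newNodes := currentTarget - (r.Ready + r.Unready + r.LongNotStarted + r.LongUnregistered)
	if newNodes <= 0 {
		// A non-positive result means no new nodes are expected from this group.
		return 0
	}
	return newNodes
}

func main() {
	// If the 3 requested-but-never-created spot instances are not reported at
	// all, LongUnregistered stays 0 and CAS keeps expecting 3 upcoming nodes.
	fmt.Println(upcomingNodes(3, Readiness{})) // prints 3
	// Once the missing instances are tracked and exceed MaxNodeProvisionTime,
	// they land in LongUnregistered and the upcoming count drops to 0.
	fmt.Println(upcomingNodes(3, Readiness{LongUnregistered: 3})) // prints 0
}
```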
This is still an issue, even in the latest version of the cluster autoscaler, v1.21.0
Aaah, that hint is gold, thanks!
The other flags should be fine - those are commonly used options (I guess your expander is not widely used yet 😃).
I think I figured it out though, and it’s not the flags. The handling of unregistered nodes assumes that some representation of non-registered instances is returned by NodeGroup.Nodes(). If it’s not, those nodes will never show up as LongUnregistered, because that state is tracked individually for each identifier returned by NodeGroup.Nodes().
You can check how many nodes you have in each state (LongUnregistered, LongNotStarted, etc.) in each NodeGroup by looking at the CA status configmap (kubectl get configmap cluster-autoscaler-status -o yaml -n kube-system). I bet you’ll find that in your case you have no unregistered nodes. I can’t test this theory as I have no access to AWS, but I’m pretty sure that’s it. The way to fix it would be to change the AWS cloudprovider so NodeGroup.Nodes() returns some representation of the non-existing spot instances (note: those identifiers must later allow deleting the non-existing instances, or just resizing the ASG back down, with NodePool.DeleteInstances()).
cc: @Jeffwan @mvisonneau - I think the above is the proper way to fix the issue you’re trying to address with #1980.
edit: to clarify - you don’t need to implement InstanceStatus for the non-existing spots. The timeout-based error handling will kick in even if it’s always nil. You just need to return a cloudprovider.Instance with some unique Id for each non-existing instance (the Id must be consistent between the loops). I suspect something like “<asg-name>-not-created-1” through “<asg-name>-not-created-N” could do the trick.
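For illustration, a rough sketch of that idea, assuming a simplified Instance type in place of the real cloudprovider.Instance and a hypothetical padWithPlaceholders helper (not actual AWS cloudprovider code):

```go
package main

import "fmt"

// Instance is a simplified stand-in for cloudprovider.Instance: just an Id,
// with the status left out (it can stay nil, as noted above).
type Instance struct {
	Id string
}

// padWithPlaceholders is a hypothetical helper: given an ASG's desired
// capacity and the instances AWS actually created, it appends placeholder
// entries with stable, unique Ids for the ones that never materialized, so
// the cluster state can track them and eventually count them as
// LongUnregistered.
func padWithPlaceholders(asgName string, desired int, realIDs []string) []Instance {
	instances := make([]Instance, 0, desired)
	for _, id := range realIDs {
		instances = append(instances, Instance{Id: id})
	}
	// Placeholder Ids only need to be consistent between loops, e.g.
	// "<asg-name>-not-created-1" through "<asg-name>-not-created-N".
	for i := len(realIDs) + 1; i <= desired; i++ {
		instances = append(instances, Instance{Id: fmt.Sprintf("%s-not-created-%d", asgName, i)})
	}
	return instances
}

func main() {
	// An ASG asked for 5 spot instances but AWS delivered none.
	for _, inst := range padWithPlaceholders("central-eks-workers-asg-m5a.12xlarge", 5, nil) {
		fmt.Println(inst.Id)
	}
}
```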