nomad-autoscaler: bug: scaleutils can select nodes that are down / not part of target AWS ASG
The current implementation of FilterNodes does not filter out nodes in the `down` state, which can lead to selecting a node for scale-in when it is down but, for some reason, still marked eligible. In that situation, if a node that is already down (say, from a prior scale-in) is selected for termination from an AWS Autoscaling group, the operation fails because the node is likely no longer a member of the ASG due to the prior termination. Depending on the node selector strategy, this can lead to recurring scaling errors until the node is eventually GC'ed or an operator intervenes.
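For illustration, a minimal sketch of the status check that appears to be missing, assuming the Nomad Go API's node stubs (`api.NodeListStub`, `api.NodeStatusDown`); the helper name `filterDownNodes` is hypothetical and not the autoscaler's actual code:

```go
package scaleutils

import "github.com/hashicorp/nomad/api"

// filterDownNodes is a hypothetical helper illustrating the missing check:
// a node can be "down" yet still report itself as eligible, so filtering on
// eligibility alone is not enough.
func filterDownNodes(nodes []*api.NodeListStub) []*api.NodeListStub {
	out := make([]*api.NodeListStub, 0, len(nodes))
	for _, n := range nodes {
		// Skip nodes whose status is "down"; they may already have been
		// terminated and removed from the target ASG.
		if n.Status == api.NodeStatusDown {
			continue
		}
		out = append(out, n)
	}
	return out
}
```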
We have been observing this on our clusters for the past couple of days. I can't explain why it started so suddenly, but on numerous occasions one or more nodes that had already been scaled in and terminated by the autoscaler (and thus removed from the target ASG) were still selected for scale-in on subsequent runs. What's even more confusing is that the autoscaler is configured to purge nodes on scale-in, yet they remain registered, and eligible, after being terminated. Relevant target from the scaling policy:
Policy target

```hcl
target "aws-asg" {
  dry-run                = "false"
  aws_asg_name           = "nomad-client-foo-XXXXXX"
  node_class             = "foo"
  node_drain_deadline    = "1h"
  node_purge             = "true"
  node_selector_strategy = "least_busy"
}
```
Two possible solutions to this problem:
- Remove nodes that are `down` when running `FilterNodes`. This is a naive approach that assumes nodes that are down are actually down and are not going to come back online (i.e. they were already terminated and are not suffering a network partition or similar).
- Verify that a given node is still a member of the target ASG before attempting to terminate it; this would presumably make sense to include in `poolID.IsPoolMember` or similar (see the sketch after this list). This solution would likely be more robust against scaling errors caused by attempting to terminate an instance that is no longer part of the ASG. The same logic likely applies to other cloud providers too.
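A minimal sketch of that membership check, assuming the AWS SDK for Go (v1) `autoscaling` client; the function `isASGMember` is hypothetical and not part of the autoscaler's actual `poolID.IsPoolMember`:

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// isASGMember is a hypothetical helper: it returns true only if the given
// EC2 instance ID is still listed in the target Auto Scaling group, so the
// autoscaler can skip instances that a prior scale-in already terminated.
func isASGMember(client *autoscaling.AutoScaling, asgName, instanceID string) (bool, error) {
	out, err := client.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(asgName)},
	})
	if err != nil {
		return false, err
	}
	for _, group := range out.AutoScalingGroups {
		for _, inst := range group.Instances {
			if aws.StringValue(inst.InstanceId) == instanceID {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	sess := session.Must(session.NewSession())
	client := autoscaling.New(sess)

	// Hypothetical instance ID; the ASG name matches the policy above.
	ok, err := isASGMember(client, "nomad-client-foo-XXXXXX", "i-0123456789abcdef0")
	if err != nil {
		fmt.Println("membership check failed:", err)
		return
	}
	fmt.Println("instance still in ASG:", ok)
}
```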
Versions of things
- Nomad: 1.0.4
- Nomad Autoscaler: 0.3.2
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 16 (7 by maintainers)
You guys should just update the docs to let people know this is a dead and broken project until such a time that you do prioritize it. Otherwise you’re just letting people waste their time debugging systems trying to deploy a project that doesn’t actually work until they find these hidden comments, or tweets from hashicorp members, saying it’s not really a priority.
If nomad, and components like this autoscaler, aren’t a priority then just say so. Doing otherwise trashes your reputation and frustrates your users.
Here’s the entire log file from container start-up until it got rotated: nomad_autoscaler_logs.txt. The snippet I sent you was from around line 7400, I believe.