autoscaler: Autoscaler exits after failing to fix node group size (AWS EKS)

I have an AWS EKS cluster (1.13.8) and I’m running the master build of the autoscaler. I use spot instance ASGs, and eu-central-1c is rejecting my requests for an r5.xlarge (which is fine and to be expected). But the autoscaler does not appear to handle this gracefully. I have 86 restarts on the autoscaler component, and `kubectl logs --previous` for the autoscaler pod shows this before it exits:

I0829 14:14:00.057299       1 static_autoscaler.go:192] Starting main loop
I0829 14:14:00.058057       1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-172-30-60-142.eu-central-1.compute.internal
I0829 14:14:00.058200       1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-172-30-25-34.eu-central-1.compute.internal
I0829 14:14:00.058359       1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-172-30-73-124.eu-central-1.compute.internal
I0829 14:14:00.058902       1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-172-30-42-66.eu-central-1.compute.internal
W0829 14:14:00.249993       1 clusterstate.go:585] Failed to get nodegroup for aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-13: wrong id: expected format aws:///<zone>/<name>, got aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-13
W0829 14:14:00.250032       1 clusterstate.go:585] Failed to get nodegroup for aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14: wrong id: expected format aws:///<zone>/<name>, got aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14
W0829 14:14:00.250047       1 clusterstate.go:585] Failed to get nodegroup for aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-15: wrong id: expected format aws:///<zone>/<name>, got aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-15
I0829 14:14:00.250111       1 static_autoscaler.go:260] 3 unregistered nodes present
I0829 14:14:00.250125       1 utils.go:491] Removing unregistered node aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14
W0829 14:14:00.250136       1 utils.go:494] Failed to get node group for aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14: wrong id: expected format aws:///<zone>/<name>, got aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14
W0829 14:14:00.250154       1 static_autoscaler.go:265] Failed to remove unregistered nodes: wrong id: expected format aws:///<zone>/<name>, got aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14
I0829 14:14:00.250178       1 utils.go:538] Decreasing size of K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O, expected=16 current=13 delta=-3
E0829 14:14:00.250190       1 static_autoscaler.go:287] Failed to fix node group sizes: failed to decrease K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O: attempt to delete existing nodes targetSize:16 delta:-3 existingNodes: 16
I0829 14:14:00.656225       1 reflector.go:385] k8s.io/client-go/informers/factory.go:133: Watch close - *v1.Node total 1765 items received
I0829 14:14:01.530893       1 main.go:275] Received signal, attempting cleanup
I0829 14:14:01.543011       1 main.go:277] Cleaned up, exiting...
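
To make the arithmetic in the `Failed to fix node group sizes` line easier to follow, here is a minimal sketch of the kind of guard that would produce that message. It is an illustration under assumptions, not the autoscaler's actual code: `checkDecrease` and its parameters are hypothetical names, and the reading that the 3 placeholder entries are counted among the 16 existing nodes is inferred from the log values above.

```go
package main

import "fmt"

// checkDecrease is a hypothetical helper, not part of the autoscaler API.
// It refuses to lower a group's target size below the number of instances
// the cloud provider currently reports for that group.
func checkDecrease(targetSize, delta, existingNodes int) error {
	if delta >= 0 {
		return fmt.Errorf("delta must be negative, got %d", delta)
	}
	if targetSize+delta < existingNodes {
		return fmt.Errorf("attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
			targetSize, delta, existingNodes)
	}
	return nil
}

func main() {
	// Values from the E0829 log line: 16 + (-3) = 13, which is below the
	// 16 nodes reported as existing, so the decrease is rejected.
	fmt.Println(checkDecrease(16, -3, 16))

	// Assumption: if the 3 placeholders were not counted as existing
	// nodes, the same decrease would be allowed.
	fmt.Println(checkDecrease(16, -3, 13))
}
```

Under this reading, the target size cannot be corrected for as long as the placeholder entries are still listed as members of the group.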

About this issue

State: closed
Created 5 years ago
Reactions: 3
Comments: 16 (10 by maintainers)

Commits related to this issue

add unit tests validating AwsRefFromProviderId In Issue #2285, there are log lines demonstrating that the `AwsRefFromProviderId()` function in the AWS cloud provider was returning an error from the p... — committed to jaypipes/k8s-autoscaler by jaypipes 5 years ago
add unit tests validating AwsRefFromProviderId In Issue #2285, there are log lines demonstrating that the `AwsRefFromProviderId()` function in the AWS cloud provider was returning an error from the p... — committed to aksentyev/autoscaler by jaypipes 5 years ago
add unit tests validating AwsRefFromProviderId In Issue #2285, there are log lines demonstrating that the `AwsRefFromProviderId()` function in the AWS cloud provider was returning an error from the p... — committed to piotrnosek/autoscaler by jaypipes 5 years ago
add unit tests validating AwsRefFromProviderId In Issue #2285, there are log lines demonstrating that the `AwsRefFromProviderId()` function in the AWS cloud provider was returning an error from the p... — committed to arisechurch/autoscaler by jaypipes 5 years ago

Most upvoted comments

Thanks! Compare the AWS cloud provider source in 1.14.5: https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.14.5/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go with the same file in 1.14.6: https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.14.6/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go

As you can see, the fix I made to that placeholder instance validation that @jaypipes mentioned above didn’t make it into a release until 1.14.6. I believe simply upgrading to a later release in the 1.14 branch should resolve your problem. Let us know if it doesn’t.

gjtempleton on Mar 23, 2020
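
For readers following along, the sketch below illustrates what placeholder-aware provider ID validation can look like. It is not the code from the linked file: the two patterns, the `ref` type, and `parseProviderID` are hypothetical and written only for this example. The only details taken from the thread are the `aws:///<zone>/<name>` format, the `i-placeholder` prefix, and the error text.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// placeholderPrefix is the prefix visible in the log lines in this thread.
const placeholderPrefix = "i-placeholder"

// Hypothetical patterns for illustration, not copied from the autoscaler.
var (
	// A regular instance: aws:///<zone>/<name>.
	instanceID = regexp.MustCompile(`^aws:///[-0-9a-z]+/[-0-9a-z]+$`)
	// A placeholder: aws:///<zone>/i-placeholder-<group name>-<index>.
	placeholderID = regexp.MustCompile(`^aws:///[-0-9a-z]+/` + placeholderPrefix + `-.+-\d+$`)
)

// ref is a hypothetical parsed form of a provider ID.
type ref struct {
	Zone        string
	Name        string
	Placeholder bool
}

// parseProviderID accepts both regular and placeholder IDs and rejects
// anything else with the error text seen in the logs.
func parseProviderID(id string) (ref, error) {
	if !instanceID.MatchString(id) && !placeholderID.MatchString(id) {
		return ref{}, fmt.Errorf("wrong id: expected format aws:///<zone>/<name>, got %v", id)
	}
	parts := strings.SplitN(strings.TrimPrefix(id, "aws:///"), "/", 2)
	return ref{
		Zone:        parts[0],
		Name:        parts[1],
		Placeholder: strings.HasPrefix(parts[1], placeholderPrefix),
	}, nil
}

func main() {
	ids := []string{
		"aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-13",
		"aws:///eu-west-1b/i-placeholder-igname.eu-west-1.domain.com-160",
		"aws:///eu-central-1c/i-0123456789abcdef0", // made-up instance ID
		"not-a-provider-id",
	}
	for _, id := range ids {
		r, err := parseProviderID(id)
		fmt.Println(r.Placeholder, err)
	}
}
```

Handling placeholders in a separate pattern keeps the check for regular instance names strict while still allowing group names that contain uppercase letters or dots, as both placeholder IDs quoted in this thread do.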

We have run into a problem similar to this one and #790. While scaling up, we hit the spot instance limit for that region, and a bunch of metadata for placeholder nodes got created and never attached to an instance. These entries couldn’t get cleaned up because the cluster-autoscaler kept erroring out, saying:

E0322 20:29:26.782157       1 static_autoscaler.go:214] Failed to remove unregistered nodes: wrong id: expected format aws:///<zone>/<name>, got aws:///eu-west-1b/i-placeholder-igname.eu-west-1.domain.com-160

Eventually, we get the `Failed to remove unregistered nodes` and `Failed to fix node group sizes` log messages, followed by the exit/restart.

Nyefan on Mar 22, 2020