autoscaler: Autoscaler exits after failing to fix node group size (AWS EKS)
I have an AWS EKS cluster (1.13.8) and I’m running the master build of the autoscaler. I use spot instance ASGs, and eu-central-1c is rejecting my requests for an r5.xlarge (which is fine and to be expected). However, the autoscaler does not appear to handle this gracefully: the autoscaler pod has restarted 86 times, and `kubectl logs --previous` for that pod shows this before it exits:
I0829 14:14:00.057299 1 static_autoscaler.go:192] Starting main loop
I0829 14:14:00.058057 1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-172-30-60-142.eu-central-1.compute.internal
I0829 14:14:00.058200 1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-172-30-25-34.eu-central-1.compute.internal
I0829 14:14:00.058359 1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-172-30-73-124.eu-central-1.compute.internal
I0829 14:14:00.058902 1 utils.go:464] Removing autoscaler soft taint when creating template from node ip-172-30-42-66.eu-central-1.compute.internal
W0829 14:14:00.249993 1 clusterstate.go:585] Failed to get nodegroup for aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-13: wrong id: expected format aws:///<zone>/<name>, got aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-13
W0829 14:14:00.250032 1 clusterstate.go:585] Failed to get nodegroup for aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14: wrong id: expected format aws:///<zone>/<name>, got aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14
W0829 14:14:00.250047 1 clusterstate.go:585] Failed to get nodegroup for aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-15: wrong id: expected format aws:///<zone>/<name>, got aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-15
I0829 14:14:00.250111 1 static_autoscaler.go:260] 3 unregistered nodes present
I0829 14:14:00.250125 1 utils.go:491] Removing unregistered node aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14
W0829 14:14:00.250136 1 utils.go:494] Failed to get node group for aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14: wrong id: expected format aws:///<zone>/<name>, got aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14
W0829 14:14:00.250154 1 static_autoscaler.go:265] Failed to remove unregistered nodes: wrong id: expected format aws:///<zone>/<name>, got aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-14
I0829 14:14:00.250178 1 utils.go:538] Decreasing size of K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O, expected=16 current=13 delta=-3
E0829 14:14:00.250190 1 static_autoscaler.go:287] Failed to fix node group sizes: failed to decrease K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O: attempt to delete existing nodes targetSize:16 delta:-3 existingNodes: 16
I0829 14:14:00.656225 1 reflector.go:385] k8s.io/client-go/informers/factory.go:133: Watch close - *v1.Node total 1765 items received
I0829 14:14:01.530893 1 main.go:275] Received signal, attempting cleanup
I0829 14:14:01.543011 1 main.go:277] Cleaned up, exiting...
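One way to read the final `Failed to fix node group sizes` error in the log above is as a guard that refuses to lower an ASG's target size below the number of instances it already counts. The sketch below only illustrates that arithmetic with the numbers from the log; `decreaseTargetSize` and its parameters are hypothetical names, not the autoscaler's actual code. The assumption is that the 3 unregistered placeholder entries are counted among the 16 existing nodes, so the requested delta of -3 can never be applied.

```go
package main

import "fmt"

// decreaseTargetSize is a hypothetical stand-in for the size check implied by
// the log line "attempt to delete existing nodes". It only lowers the target
// size when doing so would not require deleting instances that already exist.
func decreaseTargetSize(targetSize, delta, existingNodes int) (int, error) {
	if delta >= 0 {
		return targetSize, fmt.Errorf("size decrease must be negative, got delta:%d", delta)
	}
	if targetSize+delta < existingNodes {
		return targetSize, fmt.Errorf(
			"attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
			targetSize, delta, existingNodes)
	}
	return targetSize + delta, nil
}

func main() {
	// Values from the log: expected=16, current=13, delta=-3, existingNodes=16.
	// Assumption: the 3 placeholder entries are included in existingNodes.
	if _, err := decreaseTargetSize(16, -3, 16); err != nil {
		fmt.Println("rejected:", err)
	}

	// If the placeholders were not counted, the same request would succeed.
	if size, err := decreaseTargetSize(16, -3, 13); err == nil {
		fmt.Println("new target size:", size)
	}
}
```

Under that reading, the loop cannot make progress until the placeholder entries are removed, and removing them is exactly what fails earlier in the log with the `wrong id` warnings.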
About this issue
- State: closed
- Created 5 years ago
- Reactions: 3
- Comments: 16 (10 by maintainers)
Commits related to this issue
- add unit tests validating AwsRefFromProviderId In Issue #2285, there are log lines demonstrating that the `AwsRefFromProviderId()` function in the AWS cloud provider was returning an error from the p... — committed to jaypipes/k8s-autoscaler by jaypipes 5 years ago
- add unit tests validating AwsRefFromProviderId In Issue #2285, there are log lines demonstrating that the `AwsRefFromProviderId()` function in the AWS cloud provider was returning an error from the p... — committed to aksentyev/autoscaler by jaypipes 5 years ago
- add unit tests validating AwsRefFromProviderId In Issue #2285, there are log lines demonstrating that the `AwsRefFromProviderId()` function in the AWS cloud provider was returning an error from the p... — committed to piotrnosek/autoscaler by jaypipes 5 years ago
- add unit tests validating AwsRefFromProviderId In Issue #2285, there are log lines demonstrating that the `AwsRefFromProviderId()` function in the AWS cloud provider was returning an error from the p... — committed to arisechurch/autoscaler by jaypipes 5 years ago
Thanks! Compare the 1.14.5 version of the AWS cloud provider, https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.14.5/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go, with the 1.14.6 version, https://github.com/kubernetes/autoscaler/blob/cluster-autoscaler-1.14.6/cluster-autoscaler/cloudprovider/aws/aws_cloud_provider.go
As you can see, the fix I made to that placeholder instance validation that @jaypipes mentioned above didn’t make it into a release until 1.14.6. I believe simply upgrading to a later release in the 1.14 branch should resolve your problem. Let us know if it doesn’t.
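To make the placeholder validation discussed here concrete, this is a minimal sketch of a provider ID parser that accepts both regular instance IDs and the `i-placeholder-...` IDs seen in the log. It is not the code from `aws_cloud_provider.go`: `parseProviderID`, `instanceRef`, `placeholderPrefix`, and the pattern itself are assumptions made for illustration. The one observation taken from the log is that the placeholder IDs embed the ASG name, which contains upper-case letters, so the sketch allows them in the `<name>` part.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// placeholderPrefix is an assumption based on the IDs shown in the log.
const placeholderPrefix = "i-placeholder-"

// providerIDPattern is a hypothetical pattern for aws:///<zone>/<name>.
// Upper-case letters are allowed in <name> because placeholder IDs embed
// the ASG name (for example "K3-EKS-spotr5xl...").
var providerIDPattern = regexp.MustCompile(`^aws:///([-0-9a-z]+)/([-0-9a-zA-Z]+)$`)

// instanceRef is an illustrative type, not the autoscaler's own.
type instanceRef struct {
	Zone        string
	Name        string
	Placeholder bool
}

// parseProviderID splits a provider ID into zone and name and flags
// placeholder entries so callers can treat them separately.
func parseProviderID(id string) (instanceRef, error) {
	m := providerIDPattern.FindStringSubmatch(id)
	if m == nil {
		return instanceRef{}, fmt.Errorf("wrong id: expected format aws:///<zone>/<name>, got %v", id)
	}
	return instanceRef{
		Zone:        m[1],
		Name:        m[2],
		Placeholder: strings.HasPrefix(m[2], placeholderPrefix),
	}, nil
}

func main() {
	ids := []string{
		"aws:///eu-central-1c/i-placeholder-K3-EKS-spotr5xlasgsubnet02af43b02922e710f-10QH9H0C8PG7O-13",
		"aws:///eu-central-1c/i-0123456789abcdef0", // made-up instance ID
	}
	for _, id := range ids {
		ref, err := parseProviderID(id)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("zone=%s name=%s placeholder=%t\n", ref.Zone, ref.Name, ref.Placeholder)
	}
}
```

A parser that rejects placeholder IDs outright would reproduce the `wrong id` warnings from the log, which is why the placeholder case is handled explicitly in this sketch.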
We have run into a problem similar to this one and to #790. When we scaled up while hitting the spot instance limit for that region, a bunch of metadata for placeholder nodes got created and never attached to an instance. These entries couldn’t get cleaned up because the cluster-autoscaler kept erroring out, saying
Eventually, we get the `Failed to remove unregistered nodes` and `Failed to fix nodegroup sizes` log messages followed by the exit/restart.