autoscaler: AWS - CA does not honor balance-similar-node-groups when ASG min and desired capacity are 0

There are 3 Auto Scaling groups tagged with k8s.io/cluster-autoscaler/SandboxEksCluster and k8s.io/cluster-autoscaler/enabled. SandboxEksCluster is the name of my cluster.

Screenshot from 2019-10-31 10-33-09
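
For reference, the groups' tags and size limits can be listed with the AWS CLI (a sketch; the JMESPath filter on the enabled tag is just one way to narrow the output):

aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[?Tags[?Key=='k8s.io/cluster-autoscaler/enabled']].{Name: AutoScalingGroupName, Min: MinSize, Desired: DesiredCapacity, Max: MaxSize}" \
  --output table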

The autoscaler is started with the balance flag enabled:

I1031 08:41:01.352107       1 flags.go:52] FLAG: --balance-similar-node-groups="true"
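
For context, the container command presumably looks roughly like the following; only the balance flag is confirmed by the log above, the expander is inferred from the waste.go lines below, and the remaining flags are assumptions based on a typical AWS auto-discovery setup:

./cluster-autoscaler \
  --cloud-provider=aws \
  --expander=least-waste \
  --balance-similar-node-groups=true \
  --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/SandboxEksCluster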

I am operating on a simple nginx deployment:

kubectl run nginx --image=nginx --replicas=10

and scaling it so that new worker nodes have to be added:

kubectl scale deployment/nginx --replicas xx
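
For example, forcing a few scale-ups in a row (the replica counts here are arbitrary):

kubectl scale deployment/nginx --replicas=20
# wait for CA to add a node, then push further
kubectl scale deployment/nginx --replicas=30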

Each time, CA adds the new node to the already most-loaded ASG:

I1031 09:00:36.572074       1 scale_up.go:263] Pod lky/nginx-7cdbd8cdc9-8zhq2 is unschedulable
I1031 09:00:36.572081       1 scale_up.go:263] Pod lky/nginx-7cdbd8cdc9-7kvwn is unschedulable
I1031 09:00:36.572089       1 scale_up.go:263] Pod lky/nginx-7cdbd8cdc9-skm9v is unschedulable
I1031 09:00:36.572095       1 scale_up.go:263] Pod lky/nginx-7cdbd8cdc9-c6hjx is unschedulable
I1031 09:00:36.572132       1 scale_up.go:300] Upcoming 0 nodes
I1031 09:00:36.572607       1 waste.go:57] Expanding Node Group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F-1RZHULCJIW9D0 would waste 100.00% CPU, 100.00% Memory, 100.00% Blended
I1031 09:00:36.572624       1 waste.go:57] Expanding Node Group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW would waste 100.00% CPU, 100.00% Memory, 100.00% Blended
I1031 09:00:36.572632       1 waste.go:57] Expanding Node Group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949-1V77ZIQFJJY7K would waste 100.00% CPU, 100.00% Memory, 100.00% Blended
I1031 09:00:36.572644       1 scale_up.go:423] Best option to resize: sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F-1RZHULCJIW9D0
I1031 09:00:36.572656       1 scale_up.go:427] Estimated 1 nodes needed in sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F-1RZHULCJIW9D0
I1031 09:00:36.572698       1 scale_up.go:529] Final scale-up plan: [{sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F-1RZHULCJIW9D0 2->3 (max: 5)}] 
I1031 09:00:36.572722       1 scale_up.go:694] Scale-up: setting group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F-1RZHULCJIW9D0 size to 3 
I1031 09:00:36.572751       1 auto_scaling_groups.go:211] Setting asg sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F-1RZHULCJIW9D0 size to 3 

The same happens two more times, until that ASG reaches its maximum size of 5.

I resized the deployment once again, and this time CA picked one of the remaining ASGs, since the previous one was already full:

I1031 10:05:37.255691       1 scale_up.go:263] Pod lky/nginx-7cdbd8cdc9-2mhvj is unschedulable
I1031 10:05:37.255696       1 scale_up.go:263] Pod lky/nginx-7cdbd8cdc9-lk7gx is unschedulable
I1031 10:05:37.255737       1 scale_up.go:300] Upcoming 0 nodes
I1031 10:05:37.255751       1 scale_up.go:338] Skipping node group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F-1RZHULCJIW9D0 - max size reached
I1031 10:05:37.256404       1 waste.go:57] Expanding Node Group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW would waste 100.00% CPU, 100.00% Memory, 100.00% Blended
I1031 10:05:37.256420       1 waste.go:57] Expanding Node Group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949-1V77ZIQFJJY7K would waste 100.00% CPU, 100.00% Memory, 100.00% Blended
I1031 10:05:37.256432       1 scale_up.go:423] Best option to resize: sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW
I1031 10:05:37.256439       1 scale_up.go:427] Estimated 1 nodes needed in sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW
I1031 10:05:37.256492       1 scale_up.go:521] Splitting scale-up between 2 similar node groups: {sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW, sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949-1V77ZIQFJJY7K}
I1031 10:05:37.256503       1 scale_up.go:529] Final scale-up plan: [{sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW 0->1 (max: 5)}]
I1031 10:05:37.256518       1 scale_up.go:694] Scale-up: setting group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW size to 1
I1031 10:05:37.256547       1 auto_scaling_groups.go:211] Setting asg sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW size to 1

Repeating the scenario from the beginning, this time with all 3 ASGs having their min size set to 1.
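
A minimal sketch of raising the minimums with the AWS CLI (the ASG names in angle brackets are placeholders; the change could just as well be made in the console or in CloudFormation):

for asg in <asg-us-east-1a> <asg-us-east-1b> <asg-us-east-1c>; do
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$asg" \
    --min-size 1 --desired-capacity 1
done

With the minimums in place, CA now splits the scale-up across the similar groups: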

I1031 10:23:12.752831       1 scale_up.go:263] Pod lky/nginx-7cdbd8cdc9-wb9p8 is unschedulable
I1031 10:23:12.752847       1 scale_up.go:263] Pod lky/nginx-7cdbd8cdc9-g87s8 is unschedulable
I1031 10:23:12.752914       1 scale_up.go:300] Upcoming 0 nodes
I1031 10:23:12.753948       1 waste.go:57] Expanding Node Group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F-1RZHULCJIW9D0 would waste 100.00% CPU, 100.00% Memory, 100.00% Blended
I1031 10:23:12.754003       1 waste.go:57] Expanding Node Group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW would waste 100.00% CPU, 100.00% Memory, 100.00% Blended
I1031 10:23:12.754026       1 waste.go:57] Expanding Node Group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949-1V77ZIQFJJY7K would waste 100.00% CPU, 100.00% Memory, 100.00% Blended
I1031 10:23:12.754048       1 scale_up.go:423] Best option to resize: sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW
I1031 10:23:12.754080       1 scale_up.go:427] Estimated 1 nodes needed in sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW
I1031 10:23:12.754185       1 scale_up.go:521] Splitting scale-up between 3 similar node groups: {sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW, sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F-1RZHULCJIW9D0, sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949-1V77ZIQFJJY7K}
I1031 10:23:12.754221       1 scale_up.go:529] Final scale-up plan: [{sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW 1->2 (max: 5)}]
I1031 10:23:12.754249       1 scale_up.go:694] Scale-up: setting group sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW size to 2
I1031 10:23:12.754298       1 auto_scaling_groups.go:211] Setting asg sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW size to 2

Please note the log entry:

I1031 10:23:12.754185 1 scale_up.go:521] Splitting scale-up between 3 similar node groups: {sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1BASG4E9ED460-PGHSCBA70FJW, sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1AASGA5AFF92F-1RZHULCJIW9D0, sandbox-eks-cluster-SandboxEksClusterEksClusterAsgUsEast1CASG93B5A949-1V77ZIQFJJY7K}

which could indicate that CA really does respect the --balance-similar-node-groups flag, at least once every group has a minimum size of 1.
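
One way to sanity-check the balancing from the cluster side is to list the nodes together with their zone label (shown here with the beta failure-domain label used by Kubernetes versions of that era; newer clusters use topology.kubernetes.io/zone):

kubectl get nodes -L failure-domain.beta.kubernetes.io/zone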

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Reactions: 5
  • Comments: 29 (10 by maintainers)

Most upvoted comments

I’ve just run into this with our clusters. Even with the random expander, all nodes were placed in a single AZ rather than being distributed between the ASGs in all three. Setting asg_min_size / asg_desired_capacity to 1 makes CA distribute further nodes across the three ASGs correctly, but it means we need to keep a baseline of 3 instances running at times when we could otherwise go lower.
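
For reference, the random expander mentioned above is selected with the CA expander flag; a sketch, to be combined with whatever flags the deployment already passes:

./cluster-autoscaler \
  --cloud-provider=aws \
  --balance-similar-node-groups=true \
  --expander=random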

Seriously, for everyone who uses K8S autoscaler in AWS - try Karpenter (https://karpenter.sh/)

Hello,

any updates on it?