kubernetes: Pods not being evenly scheduled across worker nodes

What happened:

After a straightforward scale test that created several hundred standalone pods (sleep) on a small cluster (9 worker nodes), I realized that the pods were not evenly scheduled across the nodes.

The test was executed without any LimitRange, and the created pods do not define any resource requests either.
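
To double-check that the test pods really carry no requests or limits, one of the sleep-N pods created by the loop below can be inspected (a quick sketch; the pod name sleep-1 and the default namespace are assumptions based on the reproduction steps):

$ kubectl get pod sleep-1 -o jsonpath='{.spec.containers[0].resources}'
# an empty resources field confirms the pod has neither requests nor limits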

What you expected to happen:

Pods are evenly spread across all worker nodes.

How to reproduce it (as minimally and precisely as possible):

Number of pods per node before executing the test:

$ kubectl get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-134-116.eu-west-3.compute.internal   Ready    master   10d     v1.22.0-rc.0+75ee307
ip-10-0-139-16.eu-west-3.compute.internal    Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-146-3.eu-west-3.compute.internal     Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-156-89.eu-west-3.compute.internal    Ready    worker   10d     v1.22.0-rc.0+75ee307
ip-10-0-168-121.eu-west-3.compute.internal   Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-182-174.eu-west-3.compute.internal   Ready    worker   7h47m   v1.22.0-rc.0+75ee307
ip-10-0-187-122.eu-west-3.compute.internal   Ready    worker   10d     v1.22.0-rc.0+75ee307
ip-10-0-187-21.eu-west-3.compute.internal    Ready    master   10d     v1.22.0-rc.0+75ee307
ip-10-0-199-68.eu-west-3.compute.internal    Ready    worker   3d10h   v1.22.0-rc.0+75ee307
ip-10-0-210-1.eu-west-3.compute.internal     Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-218-198.eu-west-3.compute.internal   Ready    worker   7h47m   v1.22.0-rc.0+75ee307
ip-10-0-223-121.eu-west-3.compute.internal   Ready    master   10d     v1.22.0-rc.0+75ee307
$ kubectl get pods -o go-template --template='{{range .items}}{{if eq .status.phase "Running"}}{{.spec.nodeName}}{{"\n"}}{{end}}{{end}}' --all-namespaces | awk '{nodes[$1]++ }END{ for (n in nodes) print n": "nodes[n]}'
ip-10-0-187-21.eu-west-3.compute.internal: 59 <- master node not schedulable
ip-10-0-139-16.eu-west-3.compute.internal: 23
ip-10-0-210-1.eu-west-3.compute.internal: 15
ip-10-0-146-3.eu-west-3.compute.internal: 14
ip-10-0-156-89.eu-west-3.compute.internal: 17
ip-10-0-134-116.eu-west-3.compute.internal: 35 <- master node not schedulable
ip-10-0-218-198.eu-west-3.compute.internal: 15
ip-10-0-168-121.eu-west-3.compute.internal: 14
ip-10-0-182-174.eu-west-3.compute.internal: 14
ip-10-0-199-68.eu-west-3.compute.internal: 15
ip-10-0-223-121.eu-west-3.compute.internal: 32 <- master node not schedulable
ip-10-0-187-122.eu-west-3.compute.internal: 24

Create 1000 pods:

$ for i in {1..1000}; do kubectl run --image=k8s.gcr.io/pause sleep-${i}; done
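
Optionally, before counting pods per node, it can help to confirm that nothing is still Pending so the numbers below reflect the final placement (a small helper sketch, not part of the original reproduction; it assumes the sleep pods are the only Pending pods in the cluster):

$ kubectl get pods --all-namespaces --field-selector=status.phase=Pending --no-headers | wc -l
# 0 means every pod has been scheduled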

Check Running pods per node:

$ kubectl get pods -o go-template --template='{{range .items}}{{if eq .status.phase "Running"}}{{.spec.nodeName}}{{"\n"}}{{end}}{{end}}' --all-namespaces | awk '{nodes[$1]++ }END{ for (n in nodes) print n": "nodes[n]}'
ip-10-0-187-21.eu-west-3.compute.internal: 59 <- master node not schedulable
ip-10-0-139-16.eu-west-3.compute.internal: 224
ip-10-0-210-1.eu-west-3.compute.internal: 78
ip-10-0-146-3.eu-west-3.compute.internal: 71
ip-10-0-156-89.eu-west-3.compute.internal: 250
ip-10-0-134-116.eu-west-3.compute.internal: 35 <- master node not schedulable
ip-10-0-218-198.eu-west-3.compute.internal: 76
ip-10-0-168-121.eu-west-3.compute.internal: 71
ip-10-0-182-174.eu-west-3.compute.internal: 75
ip-10-0-199-68.eu-west-3.compute.internal: 56
ip-10-0-223-121.eu-west-3.compute.internal: 32 <- master node not schedulable
ip-10-0-187-122.eu-west-3.compute.internal: 250

As shown above, some nodes ran out of room to run more pods (max-pods is set to 250) while other nodes ended up with far fewer pods.
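
The per-node pod capacity can be read straight from the node objects, which makes it easy to see how close the hot nodes are to the limit (sketch; it assumes the kubelet's max-pods setting is reflected in the node's allocatable pod count, which is the default behaviour):

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,PODS:.status.allocatable.pods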

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"archive", BuildDate:"2021-03-30T00:00:00Z", GoVersion:"go1.16", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.0-rc.0+75ee307", GitCommit:"75ee3073266f07baaba5db004cde0636425737cf", GitTreeState:"clean", BuildDate:"2021-09-04T12:16:28Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
AWS using m5.xlarge worker nodes

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 31 (25 by maintainers)

Most upvoted comments

Yes, I am suggesting we treat the balanced score differently from the others. As I mentioned above, reducing the values will basically shift the problem rather than solve it.

"but it might be harder to reason about how the scores play together."

In a sense, the balanced score serves a different purpose, which is also evident from the fact that it is not part of the common score plugin we now have.
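
For readers following along, the plugins being discussed are configurable per scheduler profile. Below is a minimal sketch of where the balanced score lives in the component config, assuming the v1beta2 KubeSchedulerConfiguration API served by a v1.22 scheduler and a config file passed to kube-scheduler via --config; it only illustrates the knobs, it is not a fix proposed in the thread:

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          # re-listing a default score plugin lets its weight be adjusted
          - name: NodeResourcesBalancedAllocation
            weight: 1   # default weight; raise or lower to experiment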