kubernetes: Pods not being evenly scheduled across worker nodes
What happened:
After a straightforward scale test consisting of creating several hundred standalone pause ("sleep") pods on a small cluster (9 worker nodes), I noticed that the pods are not evenly scheduled across the worker nodes.
The test was run without any LimitRange in the namespace, and the created pods don't define any resource requests either.
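(A quick sanity check, not part of the original report: with no LimitRange in the namespace, the resources stanza of any of the test pods created below, e.g. sleep-1, should come back empty.)
$ kubectl get pod sleep-1 -o jsonpath='{.spec.containers[0].resources}'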
What you expected to happen:
Pods are evenly spread across all worker nodes.
How to reproduce it (as minimally and precisely as possible):
Number of pods in nodes before executing the test:
$ kubectl get nodes
NAME                                         STATUS   ROLES    AGE     VERSION
ip-10-0-134-116.eu-west-3.compute.internal   Ready    master   10d     v1.22.0-rc.0+75ee307
ip-10-0-139-16.eu-west-3.compute.internal    Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-146-3.eu-west-3.compute.internal     Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-156-89.eu-west-3.compute.internal    Ready    worker   10d     v1.22.0-rc.0+75ee307
ip-10-0-168-121.eu-west-3.compute.internal   Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-182-174.eu-west-3.compute.internal   Ready    worker   7h47m   v1.22.0-rc.0+75ee307
ip-10-0-187-122.eu-west-3.compute.internal   Ready    worker   10d     v1.22.0-rc.0+75ee307
ip-10-0-187-21.eu-west-3.compute.internal    Ready    master   10d     v1.22.0-rc.0+75ee307
ip-10-0-199-68.eu-west-3.compute.internal    Ready    worker   3d10h   v1.22.0-rc.0+75ee307
ip-10-0-210-1.eu-west-3.compute.internal     Ready    worker   7h48m   v1.22.0-rc.0+75ee307
ip-10-0-218-198.eu-west-3.compute.internal   Ready    worker   7h47m   v1.22.0-rc.0+75ee307
ip-10-0-223-121.eu-west-3.compute.internal   Ready    master   10d     v1.22.0-rc.0+75ee307
$ kubectl get pods -o go-template --template='{{range .items}}{{if eq .status.phase "Running"}}{{.spec.nodeName}}{{"\n"}}{{end}}{{end}}' --all-namespaces | awk '{nodes[$1]++ }END{ for (n in nodes) print n": "nodes[n]}'
ip-10-0-187-21.eu-west-3.compute.internal: 59 <- master node not schedulable
ip-10-0-139-16.eu-west-3.compute.internal: 23
ip-10-0-210-1.eu-west-3.compute.internal: 15
ip-10-0-146-3.eu-west-3.compute.internal: 14
ip-10-0-156-89.eu-west-3.compute.internal: 17
ip-10-0-134-116.eu-west-3.compute.internal: 35 <- master node not schedulable
ip-10-0-218-198.eu-west-3.compute.internal: 15
ip-10-0-168-121.eu-west-3.compute.internal: 14
ip-10-0-182-174.eu-west-3.compute.internal: 14
ip-10-0-199-68.eu-west-3.compute.internal: 15
ip-10-0-223-121.eu-west-3.compute.internal: 32 <- master node not schedulable
ip-10-0-187-122.eu-west-3.compute.internal: 24
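(For clarity, the "not schedulable" note on the three master nodes can be confirmed from their taints, e.g.:)
$ kubectl describe node ip-10-0-187-21.eu-west-3.compute.internal | grep -i taints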
Create 1000 pods:
for i in {1..1000}; do kubectl run --image=k8s.gcr.io/pause sleep-${i}; done
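(Optional, and an addition to the original steps: wait until every sleep-* pod reports Running before re-counting, so the per-node numbers are stable. A minimal sketch:)
# Poll until no sleep-* pod is left in a non-Running phase.
until [ "$(kubectl get pods --no-headers | awk '$1 ~ /^sleep-/ && $3 != "Running"' | wc -l)" -eq 0 ]; do
  sleep 10
done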
Check Running pods per node:
$ kubectl get pods -o go-template --template='{{range .items}}{{if eq .status.phase "Running"}}{{.spec.nodeName}}{{"\n"}}{{end}}{{end}}' --all-namespaces | awk '{nodes[$1]++ }END{ for (n in nodes) print n": "nodes[n]}'
ip-10-0-187-21.eu-west-3.compute.internal: 59 <- master node not schedulable
ip-10-0-139-16.eu-west-3.compute.internal: 224
ip-10-0-210-1.eu-west-3.compute.internal: 78
ip-10-0-146-3.eu-west-3.compute.internal: 71
ip-10-0-156-89.eu-west-3.compute.internal: 250
ip-10-0-134-116.eu-west-3.compute.internal: 35 <- master node not schedulable
ip-10-0-218-198.eu-west-3.compute.internal: 76
ip-10-0-168-121.eu-west-3.compute.internal: 71
ip-10-0-182-174.eu-west-3.compute.internal: 75
ip-10-0-199-68.eu-west-3.compute.internal: 56
ip-10-0-223-121.eu-west-3.compute.internal: 32 <- master node not schedulable
ip-10-0-187-122.eu-west-3.compute.internal: 250
As shown above, some nodes ran out of room for more pods (max-pods is set to 250), while other nodes ended up with far fewer pods.
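(For reference, the pod capacity that corresponds to the kubelet's max-pods setting can be read from any node object; the node name below is just one of the saturated workers from the output above.)
$ kubectl get node ip-10-0-156-89.eu-west-3.compute.internal -o jsonpath='{.status.allocatable.pods}'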
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"archive", BuildDate:"2021-03-30T00:00:00Z", GoVersion:"go1.16", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.0-rc.0+75ee307", GitCommit:"75ee3073266f07baaba5db004cde0636425737cf", GitTreeState:"clean", BuildDate:"2021-09-04T12:16:28Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration:
AWS using m5.xlarge worker nodes
About this issue
- State: closed
- Created 3 years ago
- Comments: 31 (25 by maintainers)
Yes, I am suggesting we treat the balanced score differently from the others. As I mentioned above, reducing the values will basically shift the problem, not solve it.
In a sense, balanced serves a different purpose, which is also evident from it not being part of the common score plugin we now have.
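(Context for readers: the weight of the scheduler's scoring plugins, including NodeResourcesBalancedAllocation, is set in a KubeSchedulerConfiguration passed to kube-scheduler via --config. The sketch below only shows where that knob lives, using the v1beta2 config API matching the 1.22 server above; it is not a proposed fix, since, as noted, lowering the value merely shifts the problem.)
# Write an illustrative scheduler config (file name arbitrary).
cat <<'EOF' > kube-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    plugins:
      score:
        enabled:
          # Re-declare the plugin with an explicit weight (1 is the default).
          - name: NodeResourcesBalancedAllocation
            weight: 1
EOF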