kubernetes: Problems with Pod priority and preemption when using resource quotas
What happened:
The pod priority and preemption is not working when applying resource quotas to a namespace.
What you expected to happen:
High priority pods should run first and lower priority pods should be preempted to free enough resources for the higher priority pods to run.
How to reproduce it (as minimally and precisely as possible):
Run a local cluster using Minikube:
minikube start --kubernetes-version=v1.21.0 --driver=docker --force
Create high and low priority classes:
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
description: Low-priority Priority Class
kind: PriorityClass
metadata:
  name: low-priority
value: 100000
EOF
cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
description: High-priority Priority Class
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
EOF
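Both priority classes should now be listed:
kubectl get priorityclasses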
Create a namespace with resource quotas:
kubectl create namespace limited-cpu
cat <<EOF | kubectl apply -n limited-cpu -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: limit-max-cpu
spec:
  hard:
    requests.cpu: "1000m"
EOF
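The quota and its current usage can be inspected with:
kubectl describe resourcequota limit-max-cpu -n limited-cpu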
Spawn 10 low priority jobs with a CPU request of 333m, so there will be a maximum of three of them running at the same time:
for i in $(seq -w 1 10); do
cat <<EOF | kubectl apply -n limited-cpu -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: low-priority-$i
spec:
  template:
    spec:
      containers:
      - name: low-priority-$i
        image: busybox
        command: ["sleep", "60s"]
        resources:
          requests:
            memory: "64Mi"
            cpu: "333m"
      restartPolicy: Never
      priorityClassName: "low-priority"
      terminationGracePeriodSeconds: 10
EOF
done
Wait 20 seconds and spawn a high priority job with a CPU request of 500m:
cat <<EOF | kubectl apply -n limited-cpu -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: high-priority-1
spec:
  template:
    spec:
      containers:
      - name: high-priority-1
        image: busybox
        command: ["sleep", "30s"]
        resources:
          requests:
            memory: "128Mi"
            cpu: "500m"
      restartPolicy: Never
      priorityClassName: "high-priority"
EOF
Two low priority jobs should be preempted so the high priority one can run. Instead, all the low priority jobs run to completion first and the high priority one runs last.
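The behaviour can be observed with standard kubectl commands while the jobs run:
# Pod status in the quota-limited namespace
kubectl get pods -n limited-cpu --watch
# Events in the namespace, including any quota or scheduling errors
kubectl get events -n limited-cpu --sort-by=.metadata.creationTimestamp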
Anything else we need to know?:
When a pod limit is set via Minikube's kubelet configuration instead of using resource quotas, preemption works as expected:
minikube start --kubernetes-version=v1.21.0 --driver=docker --force --extra-config=kubelet.max-pods=10
Then create the 10 low priority jobs in the default namespace; no resource requests are needed. Because Minikube already has a few pods running, not all of those jobs can run at the same time.
Wait 20 seconds again and spawn a high priority job. This time, a low priority job is terminated to let the high priority one take its place.
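For completeness, the low priority Jobs used in this variant are the same as above, just without the resources section and created in the default namespace, roughly:
for i in $(seq -w 1 10); do
cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: low-priority-$i
spec:
  template:
    spec:
      containers:
      - name: low-priority-$i
        image: busybox
        command: ["sleep", "60s"]
      restartPolicy: Never
      priorityClassName: "low-priority"
      terminationGracePeriodSeconds: 10
EOF
done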
Environment:
- Kubernetes version (use kubectl version): 1.21
- Cloud provider or hardware configuration: Docker in Docker container
- OS (e.g. cat /etc/os-release): Alpine Linux 3.13.5 (Docker image)
- Kernel (e.g. uname -a): Linux e8701ccb89d7 3.10.0-1160.24.1.el7.x86_64
- Install tools: kubectl from https://dl.k8s.io/release/v1.21.0/bin/linux/amd64/kubectl and Minikube from https://storage.googleapis.com/minikube/releases/v1.22.0/minikube-linux-amd64
- Others: Kubernetes and Minikube running in a Docker container (docker:dind image)
About this issue
- State: closed
- Created 3 years ago
- Comments: 15 (8 by maintainers)
As @yuchen-sun said, ElasticQuota may be able to solve your issue. You can set Max in the ElasticQuota; the max is the upper bound of the namespace's resource consumption during scheduling, so it supports priority preemption. High priority jobs get priority to consume resources, but the total amount of resources cannot exceed the max. @gageffroy /cc @ahg-g
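A rough sketch of what an ElasticQuota for the limited-cpu namespace might look like, assuming the ElasticQuota CRD from the kubernetes-sigs/scheduler-plugins CapacityScheduling plugin is installed; the API group, version and field names should be verified against that project's documentation, and the name and limits below are only illustrative:
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: limited-cpu-quota   # hypothetical name
  namespace: limited-cpu
spec:
  # min: resources guaranteed to the namespace
  min:
    cpu: "1000m"
  # max: upper bound the namespace can consume during scheduling (illustrative value)
  max:
    cpu: "2000m"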