kubernetes: Problems with Pod priority and preemption when using resource quotas

What happened:

Pod priority and preemption do not work when a resource quota is applied to a namespace.

What you expected to happen:

High-priority pods should run first, and lower-priority pods should be preempted to free enough resources for the higher-priority pods to run.

How to reproduce it (as minimally and precisely as possible):

Run a local cluster using Minikube: minikube start --kubernetes-version=v1.21.0 --driver=docker --force

Create high and low priority classes:

cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
description: Low-priority Priority Class
kind: PriorityClass
metadata:
  name: low-priority
value: 100000
EOF

cat <<EOF | kubectl apply -f -
apiVersion: scheduling.k8s.io/v1
description: High-priority Priority Class
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
EOF
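
To confirm both classes were created (an optional sanity check):

kubectl get priorityclass low-priority high-priority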

Create a namespace with resource quotas:

kubectl create namespace limited-cpu
cat <<EOF | kubectl apply -n limited-cpu -f -
apiVersion: v1
kind: ResourceQuota
metadata:
  name: limit-max-cpu
spec:
  hard:
    requests.cpu: "1000m"
EOF
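
To verify the quota is active and watch its usage (optional):

kubectl describe resourcequota limit-max-cpu -n limited-cpu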

Spawn 10 low-priority jobs with a CPU request of 333m each, so at most three of them can run at the same time:

for i in $(seq -w 1 10); do
    cat <<EOF | kubectl apply -n limited-cpu -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: low-priority-$i
spec:
  template:
    spec:
      containers:
      - name: low-priority-$i
        image: busybox
        command: ["sleep", "60s"]
        resources:
          requests:
            memory: "64Mi"
            cpu: "333m"
      restartPolicy: Never
      priorityClassName: "low-priority"
      terminationGracePeriodSeconds: 10
EOF
done
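
Because of the quota, at most three low-priority pods should be shown at any time by:

kubectl get pods -n limited-cpu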

Wait 20 seconds, then spawn a high-priority job with a CPU request of 500m:

cat <<EOF | kubectl apply -n limited-cpu -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: high-priority-1
spec:
  template:
    spec:
      containers:
      - name: high-priority-1
        image: busybox
        command: ["sleep", "30s"]
        resources:
          requests:
            memory: "128Mi"
            cpu: "500m"
      restartPolicy: Never
      priorityClassName: "high-priority"
EOF

Two low-priority jobs should be preempted to let the high-priority one run. Instead, all the low-priority jobs run first and the high-priority one runs last.
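
The scheduling order can be watched live with, for example:

kubectl get pods -n limited-cpu -w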

Anything else we need to know?:

When a pod limit is defined through Minikube instead of a resource quota, preemption works: minikube start --kubernetes-version=v1.21.0 --driver=docker --force --extra-config=kubelet.max-pods=10

Then, create the 10 low-priority jobs in the default namespace; no resource requests are needed. Because Minikube already has a few pods running, not all of those jobs can run at the same time.

Wait 20 seconds again and spawn a high-priority job. This time, a low-priority job is terminated to let the high-priority one take its place.
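
The preemption can be confirmed from the events (the exact event wording may vary by Kubernetes version):

kubectl get events -n default --sort-by=.lastTimestamp | grep -i preempt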

Environment:

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

As @yuchen-sun said, ElasticQuota may be able to solve your issue. You can set Max in ElasticQuota; the max is the upper bound of the namespace's resource consumption during scheduling, so it supports priority preemption. High-priority jobs get priority when consuming resources, but the total amount of resources cannot exceed max. @gageffroy /cc @ahg-g
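
For reference, a minimal ElasticQuota sketch for the namespace above, assuming the scheduler-plugins CRD (scheduling.sigs.k8s.io/v1alpha1) is installed; field names follow that project's API, and the object name and values here are illustrative:

cat <<EOF | kubectl apply -f -
apiVersion: scheduling.sigs.k8s.io/v1alpha1
kind: ElasticQuota
metadata:
  name: limited-cpu-quota
  namespace: limited-cpu
spec:
  # max bounds the namespace's consumption at scheduling time rather
  # than at admission, so the scheduler can still preempt lower-priority pods.
  max:
    cpu: "1000m"
  # min is the guaranteed minimum; set equal to max here for simplicity.
  min:
    cpu: "1000m"
EOF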