volcano: Preemption not working properly for high priority job

What happened: Running low-priority jobs are not preempted by pending high-priority jobs when resources are insufficient.

What you expected to happen: The running low-priority job should be evicted, and then the high-priority job should start running.

How to reproduce it (as minimally and precisely as possible): volcano-scheduler.conf

    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack

I created two priority classes:

$ kubectl get priorityClass -o wide
NAME                      VALUE        GLOBAL-DEFAULT   AGE
high-priority             1000000      false            4d5h
low-priority              10000        false            4d5h
system-cluster-critical   2000000000   false            168d
system-node-critical      2000001000   false            168d

and two jobs with different priorities, both using the default queue:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vc-low-job
  namespace: preempt
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  priorityClassName: low-priority
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: nginx
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          priorityClassName: low-priority
          containers:
            - command:
              - sleep
              - 10m
              image: nginx:latest
              name: nginx
              resources:
                requests:
                  cpu: 2
                limits:
                  cpu: 2
          restartPolicy: OnFailure
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vc-high-job
  namespace: preempt
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  priorityClassName: high-priority
  policies:
    - event: PodEvicted
      action: RestartJob
  tasks:
    - replicas: 1
      name: nginx
      policies:
      - event: TaskCompleted
        action: CompleteJob
      template:
        spec:
          priorityClassName: high-priority
          containers:
            - command:
              - sleep
              - 10m
              image: nginx:latest
              name: nginx
              resources:
                requests:
                  cpu: 2
                limits:
                  cpu: 2
          restartPolicy: OnFailure

I ran this case on minikube on a Mac; the CPU count should be more than 4.

I started the low-priority job first, and it ran properly. When I then created the high-priority job, the phase of its PodGroup was stuck at Inqueue:

$ kubectl describe pg vc-high-job -n preempt
Name:         vc-high-job
Namespace:    preempt
Labels:       <none>
Annotations:  <none>
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2022-02-22T08:19:20Z
  Generation:          11
  Managed Fields:
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:ownerReferences:
          .:
          k:{"uid":"a5b46c10-4f69-465c-a48a-c9f597992e2f"}:
      f:spec:
        .:
        f:minMember:
        f:minResources:
          .:
          f:cpu:
        f:minTaskMember:
          .:
          f:nginx:
        f:priorityClassName:
        f:queue:
      f:status:
    Manager:      vc-controller-manager
    Operation:    Update
    Time:         2022-02-22T08:19:20Z
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:phase:
    Manager:    vc-scheduler
    Operation:  Update
    Time:       2022-02-22T08:19:21Z
  Owner References:
    API Version:           batch.volcano.sh/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Job
    Name:                  vc-high-job
    UID:                   a5b46c10-4f69-465c-a48a-c9f597992e2f
  Resource Version:        76666
  UID:                     c3f3d5e1-16eb-4197-bd1d-d81caf01d879
Spec:
  Min Member:  1
  Min Resources:
    Cpu:  2
  Min Task Member:
    Nginx:              1
  Priority Class Name:  high-priority
  Queue:                default
Status:
  Conditions:
    Last Transition Time:  2022-02-22T08:28:18Z
    Message:               1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         b64ec226-264c-447a-aca9-f61995efc277
    Type:                  Unschedulable
  Phase:                   Inqueue
Events:
  Type     Reason         Age                      From     Message
  ----     ------         ----                     ----     -------
  Warning  Unschedulable  9m21s                    volcano  0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
  Warning  Unschedulable  4m21s (x299 over 9m20s)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable

And the pod was pending:

$ kubectl get pod -n preempt
NAME                  READY   STATUS    RESTARTS   AGE
vc-high-job-nginx-0   0/1     Pending   0          89s
vc-low-job-nginx-0    1/1     Running   0          2m47s

Here are some logs from the scheduler:

I0222 08:21:26.092579       1 session.go:168] Open Session b9ee403d-034c-4f5e-861c-fa2dc99462dc with <2> Job and <2> Queues
I0222 08:21:26.093523       1 enqueue.go:44] Enter Enqueue ...
I0222 08:21:26.094134       1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0222 08:21:26.094277       1 enqueue.go:103] Leaving Enqueue ...
I0222 08:21:26.094330       1 allocate.go:43] Enter Allocate ...
I0222 08:21:26.094363       1 allocate.go:96] Try to allocate resource to 1 Namespaces
I0222 08:21:26.094486       1 allocate.go:163] Try to allocate resource to Jobs in Namespace <preempt> Queue <default>
I0222 08:21:26.094622       1 allocate.go:197] Try to allocate resource to 1 tasks of Job <preempt/vc-high-job>
I0222 08:21:26.094853       1 proportion.go:299] Queue <default>: deserved <cpu 4000.00, memory 0.00>, allocated <cpu 2000.00, memory 0.00>, share <0.5>, underUsedResName [cpu]
I0222 08:21:26.094961       1 allocate.go:212] There are <1> nodes for Job <preempt/vc-high-job>
I0222 08:21:26.095204       1 predicate_helper.go:73] Predicates failed for task <preempt/vc-high-job-nginx-0> on node <minikube>: task preempt/vc-high-job-nginx-0 on node minikube fit failed: node(s) resource fit failed
I0222 08:21:26.095376       1 statement.go:354] Discarding operations ...
I0222 08:21:26.095400       1 allocate.go:163] Try to allocate resource to Jobs in Namespace <preempt> Queue <default>
I0222 08:21:26.095664       1 allocate.go:197] Try to allocate resource to 0 tasks of Job <preempt/vc-low-job>
I0222 08:21:26.095735       1 statement.go:380] Committing operations ...
I0222 08:21:26.095954       1 allocate.go:159] Namespace <preempt> have no queue, skip it
I0222 08:21:26.096006       1 allocate.go:283] Leaving Allocate ...
I0222 08:21:26.096197       1 backfill.go:40] Enter Backfill ...
I0222 08:21:26.096577       1 backfill.go:90] Leaving Backfill ...
I0222 08:21:26.096654       1 preempt.go:41] Enter Preempt ...
I0222 08:21:26.096805       1 preempt.go:63] Added Queue <default> for Job <preempt/vc-high-job>
I0222 08:21:26.097128       1 statement.go:380] Committing operations ...
I0222 08:21:26.097145       1 preempt.go:189] Leaving Preempt ...
I0222 08:21:26.098257       1 session.go:190] Close Session b9ee403d-034c-4f5e-861c-fa2dc99462dc

Anything else we need to know?:

Environment:

  • Volcano Version: 1.5.0(latest)
  • Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:38:26Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:39:34Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: minikube on macbook
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a): Darwin macdeMacBook-Pro.local 20.5.0 Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64 x86_64
  • Install tools:
  • Others:

About this issue

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 15 (7 by maintainers)

Most upvoted comments

There are several ways to make this work, depending on your situation:

  1. Mark the low-priority job as preemptable by adding the annotation volcano.sh/preemptable: "true".
  2. If the high-priority job cannot be enqueued because of the proportion plugin (as in #1772), move the overcommit plugin to the first tier to mask the effect of the proportion plugin.
  3. If the low-priority job cannot be preempted because the gang plugin does not permit it, move gang to the second tier or use the suggestion in https://github.com/volcano-sh/volcano/issues/2034#issuecomment-1049451521 ; the two approaches work the same way.

The principle behind 2 and 3 is that both the enqueue and preempt actions consider only the first tier's result when that tier reaches a decision (for preempt, when it can select a victim set):
https://github.com/volcano-sh/volcano/blob/42702f7179f3ce7796c5020d5f264bd8c6c2d948/pkg/scheduler/framework/session_plugins.go#L403-L407
https://github.com/volcano-sh/volcano/blob/42702f7179f3ce7796c5020d5f264bd8c6c2d948/pkg/scheduler/framework/session_plugins.go#L240-L243

The final working config could be:

    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: conformance
      - name: overcommit
        arguments:
          overcommit-factor: 10.0
    - plugins:
      - name: drf
      - name: gang
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack