volcano: Preemption not working properly for high priority job
What happened: Running low-priority jobs are not preempted by pending high-priority jobs when resources are insufficient.
What you expected to happen: The running low-priority job should be evicted so that the pending high-priority job can start running.
How to reproduce it (as minimally and precisely as possible):

volcano-scheduler.conf (the data of the scheduler ConfigMap):

actions: "enqueue, allocate, backfill, preempt"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
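The issue does not say where this ConfigMap lives; with a default Volcano deployment it is usually volcano-scheduler-configmap in the volcano-system namespace (an assumption about the install), so editing it could look like:

# Assumed ConfigMap name and namespace for a default Volcano install; adjust to your deployment.
$ kubectl -n volcano-system edit configmap volcano-scheduler-configmap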
I created two priority classes:

$ kubectl get priorityClass -o wide
NAME                      VALUE        GLOBAL-DEFAULT   AGE
high-priority             1000000      false            4d5h
low-priority              10000        false            4d5h
system-cluster-critical   2000000000   false            168d
system-node-critical      2000001000   false            168d
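For completeness, a minimal sketch of the two custom PriorityClass manifests, reconstructed from the names and values in the listing above (the descriptions are assumptions):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority class used by vc-high-job"  # assumed description
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 10000
globalDefault: false
description: "Low priority class used by vc-low-job"  # assumed description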
I also created two jobs with different priorities in the default queue:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vc-low-job
  namespace: preempt
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  priorityClassName: low-priority
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: nginx
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        priorityClassName: low-priority
        containers:
        - command:
          - sleep
          - 10m
          image: nginx:latest
          name: nginx
          resources:
            requests:
              cpu: 2
            limits:
              cpu: 2
        restartPolicy: OnFailure
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vc-high-job
  namespace: preempt
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  priorityClassName: high-priority
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: nginx
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        priorityClassName: high-priority
        containers:
        - command:
          - sleep
          - 10m
          image: nginx:latest
          name: nginx
          resources:
            requests:
              cpu: 2
            limits:
              cpu: 2
        restartPolicy: OnFailure
I ran the case on minikube on a Mac; the node should have more than 4 CPUs.
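A sketch of the reproduction steps (the file names and the exact CPU count are assumptions; the issue only says more than 4 CPUs are needed):

# start minikube with enough CPUs (assumed count)
$ minikube start --cpus=6
$ kubectl create namespace preempt
# vc-low-job.yaml / vc-high-job.yaml hold the two manifests above (hypothetical file names)
$ kubectl apply -f vc-low-job.yaml
# wait until vc-low-job-nginx-0 is Running, then submit the high-priority job
$ kubectl apply -f vc-high-job.yaml
$ kubectl get podgroup -n preempt
$ kubectl get pod -n preempt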
I started the low-priority job first, and it ran properly. When I then created the high-priority job, the phase of its PodGroup was stuck at Inqueue:
$kubectl describe pg vc-high-job -n preempt
Name: vc-high-job
Namespace: preempt
Labels: <none>
Annotations: <none>
API Version: scheduling.volcano.sh/v1beta1
Kind: PodGroup
Metadata:
Creation Timestamp: 2022-02-22T08:19:20Z
Generation: 11
Managed Fields:
API Version: scheduling.volcano.sh/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:annotations:
.:
f:kubectl.kubernetes.io/last-applied-configuration:
f:ownerReferences:
.:
k:{"uid":"a5b46c10-4f69-465c-a48a-c9f597992e2f"}:
f:spec:
.:
f:minMember:
f:minResources:
.:
f:cpu:
f:minTaskMember:
.:
f:nginx:
f:priorityClassName:
f:queue:
f:status:
Manager: vc-controller-manager
Operation: Update
Time: 2022-02-22T08:19:20Z
API Version: scheduling.volcano.sh/v1beta1
Fields Type: FieldsV1
fieldsV1:
f:status:
f:conditions:
f:phase:
Manager: vc-scheduler
Operation: Update
Time: 2022-02-22T08:19:21Z
Owner References:
API Version: batch.volcano.sh/v1alpha1
Block Owner Deletion: true
Controller: true
Kind: Job
Name: vc-high-job
UID: a5b46c10-4f69-465c-a48a-c9f597992e2f
Resource Version: 76666
UID: c3f3d5e1-16eb-4197-bd1d-d81caf01d879
Spec:
Min Member: 1
Min Resources:
Cpu: 2
Min Task Member:
Nginx: 1
Priority Class Name: high-priority
Queue: default
Status:
Conditions:
Last Transition Time: 2022-02-22T08:28:18Z
Message: 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
Reason: NotEnoughResources
Status: True
Transition ID: b64ec226-264c-447a-aca9-f61995efc277
Type: Unschedulable
Phase: Inqueue
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unschedulable 9m21s volcano 0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
Warning Unschedulable 4m21s (x299 over 9m20s) volcano 1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
And the pod was pending:
$ kubectl get pod -n preempt
NAME                  READY   STATUS    RESTARTS   AGE
vc-high-job-nginx-0   0/1     Pending   0          89s
vc-low-job-nginx-0    1/1     Running   0          2m47s
Here are some logs from the scheduler:
I0222 08:21:26.092579 1 session.go:168] Open Session b9ee403d-034c-4f5e-861c-fa2dc99462dc with <2> Job and <2> Queues
I0222 08:21:26.093523 1 enqueue.go:44] Enter Enqueue ...
I0222 08:21:26.094134 1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0222 08:21:26.094277 1 enqueue.go:103] Leaving Enqueue ...
I0222 08:21:26.094330 1 allocate.go:43] Enter Allocate ...
I0222 08:21:26.094363 1 allocate.go:96] Try to allocate resource to 1 Namespaces
I0222 08:21:26.094486 1 allocate.go:163] Try to allocate resource to Jobs in Namespace <preempt> Queue <default>
I0222 08:21:26.094622 1 allocate.go:197] Try to allocate resource to 1 tasks of Job <preempt/vc-high-job>
I0222 08:21:26.094853 1 proportion.go:299] Queue <default>: deserved <cpu 4000.00, memory 0.00>, allocated <cpu 2000.00, memory 0.00>, share <0.5>, underUsedResName [cpu]
I0222 08:21:26.094961 1 allocate.go:212] There are <1> nodes for Job <preempt/vc-high-job>
I0222 08:21:26.095204 1 predicate_helper.go:73] Predicates failed for task <preempt/vc-high-job-nginx-0> on node <minikube>: task preempt/vc-high-job-nginx-0 on node minikube fit failed: node(s) resource fit failed
I0222 08:21:26.095376 1 statement.go:354] Discarding operations ...
I0222 08:21:26.095400 1 allocate.go:163] Try to allocate resource to Jobs in Namespace <preempt> Queue <default>
I0222 08:21:26.095664 1 allocate.go:197] Try to allocate resource to 0 tasks of Job <preempt/vc-low-job>
I0222 08:21:26.095735 1 statement.go:380] Committing operations ...
I0222 08:21:26.095954 1 allocate.go:159] Namespace <preempt> have no queue, skip it
I0222 08:21:26.096006 1 allocate.go:283] Leaving Allocate ...
I0222 08:21:26.096197 1 backfill.go:40] Enter Backfill ...
I0222 08:21:26.096577 1 backfill.go:90] Leaving Backfill ...
I0222 08:21:26.096654 1 preempt.go:41] Enter Preempt ...
I0222 08:21:26.096805 1 preempt.go:63] Added Queue <default> for Job <preempt/vc-high-job>
I0222 08:21:26.097128 1 statement.go:380] Committing operations ...
I0222 08:21:26.097145 1 preempt.go:189] Leaving Preempt ...
I0222 08:21:26.098257 1 session.go:190] Close Session b9ee403d-034c-4f5e-861c-fa2dc99462dc
Anything else we need to know?:
Environment:
- Volcano Version: 1.5.0 (latest)
- Kubernetes version (use kubectl version):
  Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:38:26Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
  Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:39:34Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: minikube on a MacBook
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a): Darwin macdeMacBook-Pro.local 20.5.0 Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64 x86_64
- Install tools:
- Others:
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 1
- Comments: 15 (7 by maintainers)
There are several ways you could try to make it work, depending on your situation:

1. Mark the workload as preemptable with the volcano.sh/preemptable: "true" annotation.
2. For the proportion plugin, like #1772, you could move the overcommit plugin to the first tier to hide the effects of the proportion plugin.
3. For gang not permitting the eviction, you could try to move gang to the second tier, or use the suggestion in https://github.com/volcano-sh/volcano/issues/2034#issuecomment-1049451521; they work the same way.

The principle behind 2 and 3 is that both the enqueue and preempt actions only consider the first-tier result if it can select a victim set:
https://github.com/volcano-sh/volcano/blob/42702f7179f3ce7796c5020d5f264bd8c6c2d948/pkg/scheduler/framework/session_plugins.go#L403-L407
https://github.com/volcano-sh/volcano/blob/42702f7179f3ce7796c5020d5f264bd8c6c2d948/pkg/scheduler/framework/session_plugins.go#L240-L243

The final working config could be:
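A sketch of what applying suggestions 2 and 3 to the original volcano-scheduler.conf could look like (an assumption, not the exact config posted in the issue):

actions: "enqueue, allocate, backfill, preempt"
tiers:
- plugins:
  - name: priority
  - name: conformance
  - name: overcommit   # moved into the first tier per suggestion 2
- plugins:
  - name: gang         # moved into the second tier per suggestion 3
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack

For suggestion 1, one way to apply the annotation is on the low-priority job's pod template so its pods are treated as preemption victims (the placement here is an assumption):

tasks:
- replicas: 1
  name: nginx
  template:
    metadata:
      annotations:
        volcano.sh/preemptable: "true"   # marks these pods as preemptable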