volcano: Preemption not working properly for high priority job
What happened: Running low-priority jobs are not preempted by pending high-priority jobs when resources are insufficient.
What you expected to happen: The running low-priority job should be evicted so that the high-priority job can start running.
How to reproduce it (as minimally and precisely as possible):

My volcano-scheduler.conf:
actions: "enqueue, allocate, backfill, preempt"
tiers:
- plugins:
  - name: priority
  - name: gang
  - name: conformance
- plugins:
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack
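For context, this configuration is typically held in the scheduler's ConfigMap. A sketch assuming a default installation, where the ConfigMap is named volcano-scheduler-configmap in the volcano-system namespace (names may differ in your deployment):

apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap  # assumed default name
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack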
I created two priority classes:
$ kubectl get priorityClass -o wide
NAME                      VALUE        GLOBAL-DEFAULT   AGE
high-priority             1000000      false            4d5h
low-priority              10000        false            4d5h
system-cluster-critical   2000000000   false            168d
system-node-critical      2000001000   false            168d
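For completeness, the two custom classes can be created with manifests like the following (values taken from the output above; the two system-* classes ship with Kubernetes):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 10000
globalDefault: false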
and two jobs with different priorities using the default queue:
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vc-low-job
  namespace: preempt
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  priorityClassName: low-priority
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: nginx
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        priorityClassName: low-priority
        containers:
        - command:
          - sleep
          - 10m
          image: nginx:latest
          name: nginx
          resources:
            requests:
              cpu: 2
            limits:
              cpu: 2
        restartPolicy: OnFailure
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vc-high-job
  namespace: preempt
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  priorityClassName: high-priority
  policies:
  - event: PodEvicted
    action: RestartJob
  tasks:
  - replicas: 1
    name: nginx
    policies:
    - event: TaskCompleted
      action: CompleteJob
    template:
      spec:
        priorityClassName: high-priority
        containers:
        - command:
          - sleep
          - 10m
          image: nginx:latest
          name: nginx
          resources:
            requests:
              cpu: 2
            limits:
              cpu: 2
        restartPolicy: OnFailure
I ran this case on minikube on a Mac; the minikube node should have more than 4 CPUs.
I started with the low-priority job, and it ran properly. When I created the high-priority job, the PodGroup phase was stuck at Inqueue:
$ kubectl describe pg vc-high-job -n preempt
Name:         vc-high-job
Namespace:    preempt
Labels:       <none>
Annotations:  <none>
API Version:  scheduling.volcano.sh/v1beta1
Kind:         PodGroup
Metadata:
  Creation Timestamp:  2022-02-22T08:19:20Z
  Generation:          11
  Managed Fields:
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:kubectl.kubernetes.io/last-applied-configuration:
        f:ownerReferences:
          .:
          k:{"uid":"a5b46c10-4f69-465c-a48a-c9f597992e2f"}:
      f:spec:
        .:
        f:minMember:
        f:minResources:
          .:
          f:cpu:
        f:minTaskMember:
          .:
          f:nginx:
        f:priorityClassName:
        f:queue:
      f:status:
    Manager:      vc-controller-manager
    Operation:    Update
    Time:         2022-02-22T08:19:20Z
    API Version:  scheduling.volcano.sh/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:phase:
    Manager:      vc-scheduler
    Operation:    Update
    Time:         2022-02-22T08:19:21Z
  Owner References:
    API Version:           batch.volcano.sh/v1alpha1
    Block Owner Deletion:  true
    Controller:            true
    Kind:                  Job
    Name:                  vc-high-job
    UID:                   a5b46c10-4f69-465c-a48a-c9f597992e2f
  Resource Version:  76666
  UID:               c3f3d5e1-16eb-4197-bd1d-d81caf01d879
Spec:
  Min Member:  1
  Min Resources:
    Cpu:  2
  Min Task Member:
    Nginx:  1
  Priority Class Name:  high-priority
  Queue:                default
Status:
  Conditions:
    Last Transition Time:  2022-02-22T08:28:18Z
    Message:               1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
    Reason:                NotEnoughResources
    Status:                True
    Transition ID:         b64ec226-264c-447a-aca9-f61995efc277
    Type:                  Unschedulable
  Phase:  Inqueue
Events:
  Type     Reason         Age                      From     Message
  ----     ------         ----                     ----     -------
  Warning  Unschedulable  9m21s                    volcano  0/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable
  Warning  Unschedulable  4m21s (x299 over 9m20s)  volcano  1/1 tasks in gang unschedulable: pod group is not ready, 1 Pending, 1 minAvailable; Pending: 1 Unschedulable
And the pod was pending:
$ kubectl get pod -n preempt
NAME                  READY   STATUS    RESTARTS   AGE
vc-high-job-nginx-0   0/1     Pending   0          89s
vc-low-job-nginx-0    1/1     Running   0          2m47s
Here are some scheduler logs:
I0222 08:21:26.092579 1 session.go:168] Open Session b9ee403d-034c-4f5e-861c-fa2dc99462dc with <2> Job and <2> Queues
I0222 08:21:26.093523 1 enqueue.go:44] Enter Enqueue ...
I0222 08:21:26.094134 1 enqueue.go:78] Try to enqueue PodGroup to 0 Queues
I0222 08:21:26.094277 1 enqueue.go:103] Leaving Enqueue ...
I0222 08:21:26.094330 1 allocate.go:43] Enter Allocate ...
I0222 08:21:26.094363 1 allocate.go:96] Try to allocate resource to 1 Namespaces
I0222 08:21:26.094486 1 allocate.go:163] Try to allocate resource to Jobs in Namespace <preempt> Queue <default>
I0222 08:21:26.094622 1 allocate.go:197] Try to allocate resource to 1 tasks of Job <preempt/vc-high-job>
I0222 08:21:26.094853 1 proportion.go:299] Queue <default>: deserved <cpu 4000.00, memory 0.00>, allocated <cpu 2000.00, memory 0.00>, share <0.5>, underUsedResName [cpu]
I0222 08:21:26.094961 1 allocate.go:212] There are <1> nodes for Job <preempt/vc-high-job>
I0222 08:21:26.095204 1 predicate_helper.go:73] Predicates failed for task <preempt/vc-high-job-nginx-0> on node <minikube>: task preempt/vc-high-job-nginx-0 on node minikube fit failed: node(s) resource fit failed
I0222 08:21:26.095376 1 statement.go:354] Discarding operations ...
I0222 08:21:26.095400 1 allocate.go:163] Try to allocate resource to Jobs in Namespace <preempt> Queue <default>
I0222 08:21:26.095664 1 allocate.go:197] Try to allocate resource to 0 tasks of Job <preempt/vc-low-job>
I0222 08:21:26.095735 1 statement.go:380] Committing operations ...
I0222 08:21:26.095954 1 allocate.go:159] Namespace <preempt> have no queue, skip it
I0222 08:21:26.096006 1 allocate.go:283] Leaving Allocate ...
I0222 08:21:26.096197 1 backfill.go:40] Enter Backfill ...
I0222 08:21:26.096577 1 backfill.go:90] Leaving Backfill ...
I0222 08:21:26.096654 1 preempt.go:41] Enter Preempt ...
I0222 08:21:26.096805 1 preempt.go:63] Added Queue <default> for Job <preempt/vc-high-job>
I0222 08:21:26.097128 1 statement.go:380] Committing operations ...
I0222 08:21:26.097145 1 preempt.go:189] Leaving Preempt ...
I0222 08:21:26.098257 1 session.go:190] Close Session b9ee403d-034c-4f5e-861c-fa2dc99462dc
Anything else we need to know?:
Environment:
- Volcano Version: 1.5.0 (latest)
- Kubernetes version (use kubectl version): Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:38:26Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:39:34Z", GoVersion:"go1.16.7", Compiler:"gc", Platform:"linux/amd64"}
- Cloud provider or hardware configuration: minikube on a MacBook
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a): Darwin macdeMacBook-Pro.local 20.5.0 Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64 x86_64
- Install tools:
- Others:
From the discussion in the comments, there are several ways you could try to make this work, depending on your situation:
1. Mark the low-priority tasks as preemptable with the annotation volcano.sh/preemptable: "true" (see the sketch after this list).
2. If preemption is blocked by the proportion plugin, as in #1772, you could move the overcommit plugin to the first tier to hide the effects of the proportion plugin.
3. If it is the gang plugin that is not permitting preemption, you could try moving gang to the second tier, or use the suggestion in https://github.com/volcano-sh/volcano/issues/2034#issuecomment-1049451521; they work the same way.
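A minimal sketch of option 1, assuming the annotation is placed on the low-priority job so its tasks are eligible as victims; this is an illustration rather than the maintainer's verbatim fix, and the restart/complete policies from the original manifest are omitted for brevity:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vc-low-job
  namespace: preempt
  annotations:
    # Marks this job's tasks as candidates for preemption (assumed placement;
    # the annotation may also be set on the pod template).
    volcano.sh/preemptable: "true"
spec:
  minAvailable: 1
  schedulerName: volcano
  queue: default
  priorityClassName: low-priority
  tasks:
  - replicas: 1
    name: nginx
    template:
      spec:
        priorityClassName: low-priority
        containers:
        - name: nginx
          image: nginx:latest
          command: ["sleep", "10m"]
          resources:
            requests:
              cpu: 2
            limits:
              cpu: 2
        restartPolicy: OnFailure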
The principle behind options 2 and 3 is that both the enqueue and preempt actions only consider the first-tier result if it can select a set of victims:
https://github.com/volcano-sh/volcano/blob/42702f7179f3ce7796c5020d5f264bd8c6c2d948/pkg/scheduler/framework/session_plugins.go#L403-L407
https://github.com/volcano-sh/volcano/blob/42702f7179f3ce7796c5020d5f264bd8c6c2d948/pkg/scheduler/framework/session_plugins.go#L240-L243

The final working config could be:
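(The config itself did not survive extraction; the following is a sketch of option 3 applied to the reporter's original config, with gang moved out of the first tier so that first-tier victim selection is driven by priority alone. It is an assumption, not the verbatim config from the thread.)

actions: "enqueue, allocate, backfill, preempt"
tiers:
- plugins:
  - name: priority
  - name: conformance
- plugins:
  - name: gang        # moved from the first to the second tier (option 3)
  - name: drf
  - name: predicates
  - name: proportion
  - name: nodeorder
  - name: binpack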