volcano: Failed to launch mpijob after installing volcano
Hi everyone, I am trying to use gang scheduling in my k8s/Kubeflow cluster, and I installed Volcano following the tutorials here and here.
$ kubectl get all -n volcano-system
NAME                                       READY   STATUS      RESTARTS   AGE
pod/volcano-admission-5bd5756f79-5rxkh     1/1     Running     0          24h
pod/volcano-admission-init-nf2mc           0/1     Completed   0          24h
pod/volcano-controllers-687948d9c8-xclv7   1/1     Running     0          24h
pod/volcano-scheduler-79f569766f-bxgnf     1/1     Running     0          24h

NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/volcano-admission-service   ClusterIP   10.107.67.206   <none>        443/TCP   24h

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           24h
deployment.apps/volcano-controllers   1/1     1            1           24h
deployment.apps/volcano-scheduler     1/1     1            1           24h

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/volcano-admission-5bd5756f79     1         1         1       24h
replicaset.apps/volcano-controllers-687948d9c8   1         1         1       24h
replicaset.apps/volcano-scheduler-79f569766f     1         1         1       24h

NAME                               COMPLETIONS   DURATION   AGE
job.batch/volcano-admission-init   1/1           24s        24h
However, some error messages came up when I launched the mpijob. It seems the job queue is not working properly.
$ kubectl logs -n volcano-system volcano-controllers-687948d9c8-xclv7 --tail 10
I0917 02:26:57.418937 1 queue_controller.go:158] Begin sync queue default
I0917 02:26:57.418960 1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
I0917 02:43:37.419076 1 queue_controller.go:158] Begin sync queue default
I0917 02:43:37.419106 1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
I0917 03:00:17.419234 1 queue_controller.go:158] Begin sync queue default
I0917 03:00:17.419268 1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
I0917 03:16:57.419408 1 queue_controller.go:158] Begin sync queue default
I0917 03:16:57.419431 1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
I0917 03:33:37.419563 1 queue_controller.go:158] Begin sync queue default
I0917 03:33:37.419590 1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
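A minimal sketch for summarizing controller output like the above: it counts the failed queue syncs and prints the gap in seconds between consecutive failures. The sample lines are copied from the log above; the variable names (`controller_log`, `failed_syncs`, `retry_gaps`) are made up for illustration, and the gap calculation assumes all timestamps fall on the same day.

```shell
# Sample lines copied from the controller log above. In practice, the output
# of `kubectl logs` could be piped into the same filters instead.
controller_log=$(cat <<'EOF'
I0917 02:26:57.418937 1 queue_controller.go:158] Begin sync queue default
I0917 02:26:57.418960 1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
I0917 02:43:37.419076 1 queue_controller.go:158] Begin sync queue default
I0917 02:43:37.419106 1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
I0917 03:00:17.419234 1 queue_controller.go:158] Begin sync queue default
I0917 03:00:17.419268 1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
EOF
)

# Count how many sync attempts ended in an error.
failed_syncs=$(printf '%s\n' "$controller_log" | grep -c 'Error syncing queues')

# Print the number of seconds between consecutive failures.
# Field 2 is the HH:MM:SS.micro timestamp; midnight rollover is not handled.
retry_gaps=$(printf '%s\n' "$controller_log" | awk '/Error syncing queues/ {
  split($2, t, ":")
  secs = t[1] * 3600 + t[2] * 60 + int(t[3])
  if (seen) print secs - prev
  prev = secs
  seen = 1
}')

echo "failed syncs: $failed_syncs"
echo "seconds between failures:"
echo "$retry_gaps"
```

On the sample above, the failures are evenly spaced, which suggests a fixed retry interval rather than retries triggered by new events.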
The pods are all stuck in the “Pending” state:
$ kubectl get pods
NAME                               READY   STATUS    RESTARTS   AGE
mxnet-horovod-job-launcher-7pncv   0/1     Pending   0          159m
mxnet-horovod-job-worker-0         0/1     Pending   0          159m
mxnet-horovod-job-worker-1         0/1     Pending   0          159m
mxnet-horovod-job-worker-2         0/1     Pending   0          159m
mxnet-horovod-job-worker-3         0/1     Pending   0          159m
The volcano-scheduler output looks like this:
$ kubectl logs -n volcano-system volcano-scheduler-79f569766f-bxgnf --tail 20
I0917 03:38:21.543470 1 enqueue.go:75] Try to enqueue PodGroup to 0 Queues
I0917 03:38:21.543496 1 enqueue.go:122] Leaving Enqueue ...
I0917 03:38:21.543509 1 allocate.go:43] Enter Allocate ...
I0917 03:38:21.543523 1 allocate.go:94] Try to allocate resource to 0 Namespaces
I0917 03:38:21.543544 1 allocate.go:247] Leaving Allocate ...
I0917 03:38:21.543552 1 backfill.go:42] Enter Backfill ...
I0917 03:38:21.543562 1 backfill.go:91] Leaving Backfill ...
I0917 03:38:21.547705 1 session.go:154] Close Session 989f0526-d8fc-11e9-af2b-46b0d5a5c4cd
I0917 03:38:22.548180 1 cache.go:771] There are <1> Jobs, <1> Queues and <7> Nodes in total for scheduling.
I0917 03:38:22.548205 1 session.go:135] Open Session 99386113-d8fc-11e9-af2b-46b0d5a5c4cd with <1> Job and <1> Queues
I0917 03:38:22.548540 1 enqueue.go:43] Enter Enqueue ...
I0917 03:38:22.548553 1 enqueue.go:58] Added Queue <default> for Job <default/mxnet-horovod-job>
I0917 03:38:22.548564 1 enqueue.go:75] Try to enqueue PodGroup to 0 Queues
I0917 03:38:22.548593 1 enqueue.go:122] Leaving Enqueue ...
I0917 03:38:22.548606 1 allocate.go:43] Enter Allocate ...
I0917 03:38:22.548621 1 allocate.go:94] Try to allocate resource to 0 Namespaces
I0917 03:38:22.548642 1 allocate.go:247] Leaving Allocate ...
I0917 03:38:22.548651 1 backfill.go:42] Enter Backfill ...
I0917 03:38:22.548662 1 backfill.go:91] Leaving Backfill ...
I0917 03:38:22.552921 1 session.go:154] Close Session 99386113-d8fc-11e9-af2b-46b0d5a5c4cd
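Since the scheduler log reports one queue and one job but then tries to enqueue to "0 Queues", a possible next step is to look at the Queue and PodGroup objects and at which scheduler the pending pods request. The following is only a sketch under assumptions: the function name `inspect_gang_scheduling` and the `KUBECTL` override are hypothetical, and the resource names `queues` and `podgroups` assume the Volcano CRDs are registered under those names in the cluster.

```shell
# Allow the kubectl binary to be overridden; defaults to plain kubectl.
KUBECTL=${KUBECTL:-kubectl}

# Hypothetical helper: print the objects involved in gang scheduling
# for the namespace given as the first argument.
inspect_gang_scheduling() {
  job_ns=$1

  # Queue custom resources (assumed to be cluster-scoped, so no namespace).
  "$KUBECTL" get queues

  # PodGroup custom resources created for jobs in the namespace.
  "$KUBECTL" get podgroups -n "$job_ns"

  # Show which scheduler each pod asks for and its current phase.
  "$KUBECTL" get pods -n "$job_ns" \
    -o custom-columns=NAME:.metadata.name,SCHEDULER:.spec.schedulerName,PHASE:.status.phase
}
```

For the job in this report, the call would be `inspect_gang_scheduling default`. This only gathers state and does not by itself explain why the pods stay pending.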
I would really appreciate it if someone could offer some help!
About this issue
- State: closed
- Created 5 years ago
- Comments: 41 (11 by maintainers)
@hzxuzhonghu Oh, I didn’t realize Volcano supported such functions. Maybe you could add more details or a doc URL to the README. Anyway, I will let you know whether it works after I try it.