volcano: Failed to launch mpijob after installing volcano

Hi everyone, I am trying to use the gang scheduler in my k8s/Kubeflow cluster and installed Volcano following the tutorials here and here.

$ kubectl get all -n volcano-system
NAME                                       READY   STATUS      RESTARTS   AGE
pod/volcano-admission-5bd5756f79-5rxkh     1/1     Running     0          24h
pod/volcano-admission-init-nf2mc           0/1     Completed   0          24h
pod/volcano-controllers-687948d9c8-xclv7   1/1     Running     0          24h
pod/volcano-scheduler-79f569766f-bxgnf     1/1     Running     0          24h


NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/volcano-admission-service   ClusterIP   10.107.67.206   <none>        443/TCP   24h


NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/volcano-admission     1/1     1            1           24h
deployment.apps/volcano-controllers   1/1     1            1           24h
deployment.apps/volcano-scheduler     1/1     1            1           24h

NAME                                             DESIRED   CURRENT   READY   AGE
replicaset.apps/volcano-admission-5bd5756f79     1         1         1       24h
replicaset.apps/volcano-controllers-687948d9c8   1         1         1       24h
replicaset.apps/volcano-scheduler-79f569766f     1         1         1       24h



NAME                               COMPLETIONS   DURATION   AGE
job.batch/volcano-admission-init   1/1           24s        24h

However, error messages appeared in the volcano-controllers log when I launched the MPIJob. It seems the job queue is not working properly:

$ kubectl logs -n volcano-system volcano-controllers-687948d9c8-xclv7 --tail 10
I0917 02:26:57.418937       1 queue_controller.go:158] Begin sync queue default
I0917 02:26:57.418960       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
I0917 02:43:37.419076       1 queue_controller.go:158] Begin sync queue default
I0917 02:43:37.419106       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
I0917 03:00:17.419234       1 queue_controller.go:158] Begin sync queue default
I0917 03:00:17.419268       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
I0917 03:16:57.419408       1 queue_controller.go:158] Begin sync queue default
I0917 03:16:57.419431       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
I0917 03:33:37.419563       1 queue_controller.go:158] Begin sync queue default
I0917 03:33:37.419590       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
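
To see at a glance which queue the controller keeps failing on, the sync errors can be tallied per queue name. This is only a minimal sketch, not part of the original report: the function name `count_queue_sync_errors` is made up for illustration, and the sample input is three lines copied from the controller log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Volcano): read volcano-controllers log
# text on stdin and count "Error syncing queues" lines per queue name.
count_queue_sync_errors() {
  grep 'Error syncing queues' \
    | sed -E 's/.*Error syncing queues "([^"]+)".*/\1/' \
    | sort \
    | uniq -c \
    | awk '{ print $2 ": " $1 }'
}

# Sample input copied from the controller log above (one "Begin sync" line
# and two error lines).
controller_sample='I0917 02:26:57.418937       1 queue_controller.go:158] Begin sync queue default
I0917 02:26:57.418960       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted
I0917 02:43:37.419106       1 queue_controller.go:133] Error syncing queues "default", retrying. Error: queue default has not been seen or deleted'

# Prints one "<queue>: <count>" line per queue; here: default: 2
printf '%s\n' "$controller_sample" | count_queue_sync_errors
```

Against a live cluster, the same function could read from `kubectl logs -n volcano-system volcano-controllers-687948d9c8-xclv7` instead of the sample string.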

The pods are all stuck in the “Pending” state:

$ kubectl get pods
NAME                                      READY   STATUS    RESTARTS   AGE
mxnet-horovod-job-launcher-7pncv          0/1     Pending   0          159m
mxnet-horovod-job-worker-0                0/1     Pending   0          159m
mxnet-horovod-job-worker-1                0/1     Pending   0          159m
mxnet-horovod-job-worker-2                0/1     Pending   0          159m
mxnet-horovod-job-worker-3                0/1     Pending   0          159m

The output of the volcano-scheduler is shown below:

$ kubectl logs -n volcano-system volcano-scheduler-79f569766f-bxgnf --tail 20
I0917 03:38:21.543470       1 enqueue.go:75] Try to enqueue PodGroup to 0 Queues
I0917 03:38:21.543496       1 enqueue.go:122] Leaving Enqueue ...
I0917 03:38:21.543509       1 allocate.go:43] Enter Allocate ...
I0917 03:38:21.543523       1 allocate.go:94] Try to allocate resource to 0 Namespaces
I0917 03:38:21.543544       1 allocate.go:247] Leaving Allocate ...
I0917 03:38:21.543552       1 backfill.go:42] Enter Backfill ...
I0917 03:38:21.543562       1 backfill.go:91] Leaving Backfill ...
I0917 03:38:21.547705       1 session.go:154] Close Session 989f0526-d8fc-11e9-af2b-46b0d5a5c4cd
I0917 03:38:22.548180       1 cache.go:771] There are <1> Jobs, <1> Queues and <7> Nodes in total for scheduling.
I0917 03:38:22.548205       1 session.go:135] Open Session 99386113-d8fc-11e9-af2b-46b0d5a5c4cd with <1> Job and <1> Queues
I0917 03:38:22.548540       1 enqueue.go:43] Enter Enqueue ...
I0917 03:38:22.548553       1 enqueue.go:58] Added Queue <default> for Job <default/mxnet-horovod-job>
I0917 03:38:22.548564       1 enqueue.go:75] Try to enqueue PodGroup to 0 Queues
I0917 03:38:22.548593       1 enqueue.go:122] Leaving Enqueue ...
I0917 03:38:22.548606       1 allocate.go:43] Enter Allocate ...
I0917 03:38:22.548621       1 allocate.go:94] Try to allocate resource to 0 Namespaces
I0917 03:38:22.548642       1 allocate.go:247] Leaving Allocate ...
I0917 03:38:22.548651       1 backfill.go:42] Enter Backfill ...
I0917 03:38:22.548662       1 backfill.go:91] Leaving Backfill ...
I0917 03:38:22.552921       1 session.go:154] Close Session 99386113-d8fc-11e9-af2b-46b0d5a5c4cd
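
The scheduler log is noisy, so it can help to keep only the lines that report how many jobs, queues and namespaces each session works with. In the log above, the cache reports one job and one queue, yet the enqueue and allocate steps operate on 0 queues and 0 namespaces. The following is a rough sketch under stated assumptions: `summarize_scheduler_session` is a hypothetical name, and the sample input is four lines copied from the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Volcano): keep only the lines that report
# the job/queue/namespace counts of a scheduling session, and strip the log
# prefix (everything up to and including the first "] ").
summarize_scheduler_session() {
  grep -E 'cache\.go|Try to enqueue|Try to allocate' \
    | sed -E 's/^[^]]*\] //'
}

# Sample input copied from the scheduler log above.
scheduler_sample='I0917 03:38:22.548180       1 cache.go:771] There are <1> Jobs, <1> Queues and <7> Nodes in total for scheduling.
I0917 03:38:22.548553       1 enqueue.go:58] Added Queue <default> for Job <default/mxnet-horovod-job>
I0917 03:38:22.548564       1 enqueue.go:75] Try to enqueue PodGroup to 0 Queues
I0917 03:38:22.548621       1 allocate.go:94] Try to allocate resource to 0 Namespaces'

# Prints the three count-bearing messages without timestamps.
printf '%s\n' "$scheduler_sample" | summarize_scheduler_session
```

The filter only reshapes what the scheduler already printed; it does not explain why the counts drop to zero.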

I would really appreciate it if someone could offer some help!

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 41 (11 by maintainers)

Most upvoted comments

@hzxuzhonghu Oh, I didn’t realize Volcano supported such functions. Maybe you could add more details or a doc URL to the README. Anyway, I will let you know whether it works after I try it.