kubedl: [BUG] DAGScheduling and GangScheduling (volcano) conflict in mpijob

What happened: The mpijob worker pods are stuck in Pending, and no launcher pod is created.

mpi-demo-worker-0             0/1     Pending     0          13s
mpi-demo-worker-1             0/1     Pending     0          13s

The events of the worker pods are as follows:

Events:
  Type     Reason            Age   From     Message
  ----     ------            ----  ----     -------
  Warning  FailedScheduling  65s   volcano  3/2 tasks in gang unschedulable: pod group is not ready, 2 Pending, 3 minAvailable.

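I think the root cause is a conflict between DAGScheduling and GangScheduling (volcano) in mpijob. With DAGScheduling enabled, the launcher pod apparently is not created until the workers are running, but volcano will not schedule the workers until the gang reaches minAvailable (3 pods, while only the 2 workers exist), so neither side can make progress.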
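To illustrate the suspected deadlock, here is a toy model in Python. It is only a sketch and not KubeDL or Volcano code: the names `JobState`, `step`, and `simulate` are hypothetical, and the two rules it encodes (launcher created only after all workers are running when DAGScheduling is on; pods scheduled only once the number of created pods reaches minAvailable) are my assumptions based on the events above.

```python
from dataclasses import dataclass, astuple


@dataclass
class JobState:
    workers_created: int = 0
    workers_running: int = 0
    launcher_created: bool = False
    launcher_running: bool = False


def step(state, workers, min_available, dag_enabled):
    """Run one controller + scheduler round; return True if the state changed."""
    before = astuple(state)

    # Controller (assumed behaviour): worker pods are always created.
    state.workers_created = workers

    # Controller (assumed behaviour): with DAG scheduling on, the launcher
    # is only created once every worker is running.
    if not state.launcher_created:
        if not dag_enabled or state.workers_running == workers:
            state.launcher_created = True

    # Gang scheduler (assumed behaviour): nothing is scheduled until the
    # number of created pods reaches min_available.
    created = state.workers_created + int(state.launcher_created)
    if created >= min_available:
        state.workers_running = state.workers_created
        state.launcher_running = state.launcher_created

    return astuple(state) != before


def simulate(workers=2, min_available=3, dag_enabled=True, max_rounds=10):
    """Iterate until the state stops changing or max_rounds is reached."""
    state = JobState()
    for _ in range(max_rounds):
        if not step(state, workers, min_available, dag_enabled):
            break
    return state


if __name__ == "__main__":
    print("DAG on :", simulate(dag_enabled=True))
    print("DAG off:", simulate(dag_enabled=False))
```

In this model, `simulate(dag_enabled=True)` ends with both workers created but never running and no launcher, which matches what I see in the cluster, while `simulate(dag_enabled=False)` lets all three pods run.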

I can work around this problem by adding these args to the kubedl deployment:

- --feature-gates
- DAGScheduling=false

What you expected to happen:

No pods stuck in Pending; the launcher and worker pods should all be scheduled.

How to reproduce it: enable both DAGScheduling and GangScheduling (volcano), then run an mpijob.

Anything else we need to know?:

Environment:

  • KubeDL version:
  • Kubernetes version (use kubectl version):
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

@HeGaoYuan I posted an issue and will refactor it soon: https://github.com/kubedl-io/kubedl/issues/194