kueue: MPI job example cannot find /home/mpiuser/.ssh

I’m trying to reproduce the example here: https://github.com/kubernetes-sigs/kueue/blob/main/site/static/examples/sample-mpijob.yaml

I first hit this from Python, but I have reproduced the same failure by applying that YAML file directly. Basically, the launcher isn’t able to find the .ssh directory at /home/mpiuser/.ssh:

Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  3s    default-scheduler  Successfully assigned default/pi-launcher-kg2jp to kind-control-plane
  Normal   Pulled     3s    kubelet            Container image "mpioperator/mpi-pi:openmpi" already present on machine
  Warning  Failed     3s    kubelet            Error: cannot find volume "ssh-auth" to mount into container "mpi-launcher"

As a result, the launcher pods seem to terminate and get recreated ad infinitum. I am testing with Kind; perhaps that is related? Or the example here may be out of sync with a change to the MPI operator. Once I get this working, I have a full example of doing this in Python to contribute here. Thank you!
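
For anyone debugging the same event: the kubelet message means a container lists a `volumeMounts` entry whose name has no matching entry under the pod's `volumes`. Below is a minimal sketch of that check in Python, run against a pod spec loaded as a plain dict. The helper name `missing_volumes` and the sample spec are hypothetical, and the mount path is assumed from the issue title rather than copied from the real launcher pod.

```python
def missing_volumes(pod_spec):
    """Return (container, volume) pairs whose volumeMount has no declared volume."""
    # Names of all volumes declared at the pod level.
    declared = {v["name"] for v in pod_spec.get("volumes", [])}
    missing = []
    # Init containers and regular containers can both mount volumes.
    containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    for container in containers:
        for mount in container.get("volumeMounts", []):
            if mount["name"] not in declared:
                missing.append((container["name"], mount["name"]))
    return missing


# Hypothetical launcher spec mirroring the event above: the container
# mounts "ssh-auth", but the pod declares no volume with that name.
launcher_spec = {
    "containers": [
        {
            "name": "mpi-launcher",
            "image": "mpioperator/mpi-pi:openmpi",
            "volumeMounts": [
                # Mount path assumed from the issue title.
                {"name": "ssh-auth", "mountPath": "/home/mpiuser/.ssh"}
            ],
        }
    ],
    "volumes": [],
}

print(missing_volumes(launcher_spec))  # [('mpi-launcher', 'ssh-auth')]
```

On a live cluster, the same function could be fed the `spec` field of `kubectl get pod <name> -o json` after parsing it with `json.loads`.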

About this issue

  • State: closed
  • Created a year ago
  • Comments: 16 (16 by maintainers)

Most upvoted comments

IIUC, the kueue-manager only launches controllers for CRDs (such as MPIJob) pre-installed in the cluster.

https://github.com/kubernetes-sigs/kueue/blob/b2e8c9d0632c25c75b3ee8dfeecdce2bb6037464/main.go#L323-L334
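
If that reading is right, one thing worth checking is whether the MPIJob CRD is present in the cluster at all. Here is a rough Python sketch that inspects the parsed output of `kubectl get crd -o json`. The helper name `crd_installed` and the sample data are hypothetical; `mpijobs.kubeflow.org` is the CRD name the MPI operator registers.

```python
import json


def crd_installed(crd_list, name="mpijobs.kubeflow.org"):
    """Return True if a CRD with the given name appears in a CRD list object."""
    items = crd_list.get("items", [])
    return any(item.get("metadata", {}).get("name") == name for item in items)


# Hypothetical, trimmed stand-in for `kubectl get crd -o json` on a
# cluster where Kueue is installed but the MPI operator is not.
sample_output = json.loads("""
{
  "kind": "List",
  "items": [
    {"metadata": {"name": "clusterqueues.kueue.x-k8s.io"}},
    {"metadata": {"name": "localqueues.kueue.x-k8s.io"}}
  ]
}
""")

print(crd_installed(sample_output))  # False
```

This only reports whether the CRD exists. It does not tell you whether it was installed before or after the kueue-manager started, so the manager may still need a restart to pick it up.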