kueue: MPI job example cannot find /home/mpiuser/.ssh
I’m trying to reproduce the example here: https://github.com/kubernetes-sigs/kueue/blob/main/site/static/examples/sample-mpijob.yaml
I first tried this from Python, but I can reproduce the same failure by applying that YAML file directly. The launcher pod is meant to mount its SSH configuration at /home/mpiuser/.ssh, but the volume behind that mount cannot be found:
Events:
  Type     Reason     Age   From               Message
  ----     ------     ----  ----               -------
  Normal   Scheduled  3s    default-scheduler  Successfully assigned default/pi-launcher-kg2jp to kind-control-plane
  Normal   Pulled     3s    kubelet            Container image "mpioperator/mpi-pi:openmpi" already present on machine
  Warning  Failed     3s    kubelet            Error: cannot find volume "ssh-auth" to mount into container "mpi-launcher"
As a result, the launcher pods seem to terminate and then get recreated, ad infinitum. I am testing with Kind; perhaps that is related? Or it could be that a change to the MPI operator is out of sync with the example here. Once I get this working, I have a full example of doing this in Python to contribute here. Thank you!
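For anyone debugging the same event, here is a minimal Python sketch of the check the kubelet error implies: every volumeMount in a container must name a volume declared in the pod's volumes list. The `find_missing_volumes` helper and the `launcher_spec` dict are hypothetical and written only for illustration; the real launcher pod spec is generated by the operator and will contain more fields.

```python
def find_missing_volumes(pod_spec):
    """Return (container, volume) pairs whose volumeMount has no matching pod volume."""
    declared = {v["name"] for v in pod_spec.get("volumes", [])}
    missing = []
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key, []):
            for mount in container.get("volumeMounts", []):
                if mount["name"] not in declared:
                    missing.append((container["name"], mount["name"]))
    return missing


# Hypothetical launcher pod spec (illustration only): the container mounts
# "ssh-auth", but no volume with that name is declared on the pod.
launcher_spec = {
    "containers": [
        {
            "name": "mpi-launcher",
            "image": "mpioperator/mpi-pi:openmpi",
            "volumeMounts": [
                {"name": "ssh-auth", "mountPath": "/home/mpiuser/.ssh"}
            ],
        }
    ],
    "volumes": [],
}

print(find_missing_volumes(launcher_spec))
```

The same function could be pointed at the `spec` section of the failing pod's JSON to see which mounts have no backing volume.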
About this issue
- State: closed
- Created a year ago
- Comments: 16 (16 by maintainers)
IIUC, the kueue-manager only launches controllers for CRDs (such as MPIJob) pre-installed in the cluster.
https://github.com/kubernetes-sigs/kueue/blob/b2e8c9d0632c25c75b3ee8dfeecdce2bb6037464/main.go#L323-L334
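To illustrate the behaviour described above, here is a rough Python sketch of the gating idea: a job controller is only started when its CRD is already present in the cluster. This is not Kueue's actual Go code. The `integrations_to_start` helper and the `REQUIRED_CRDS` mapping are hypothetical, and the CRD name `mpijobs.kubeflow.org` is an assumption about how the MPI operator registers MPIJob.

```python
# Hypothetical mapping from an integration name to the CRD it depends on.
# The CRD name below is an assumption, not taken from the Kueue source.
REQUIRED_CRDS = {
    "kubeflow.org/mpijob": "mpijobs.kubeflow.org",
}


def integrations_to_start(installed_crds, required=REQUIRED_CRDS):
    """Return the integrations whose CRD is present in the cluster."""
    installed = set(installed_crds)
    return sorted(name for name, crd in required.items() if crd in installed)


# Without the MPIJob CRD installed, no MPIJob controller would be started.
print(integrations_to_start(["clusterqueues.kueue.x-k8s.io"]))

# With the CRD installed before the manager starts, the controller is enabled.
print(integrations_to_start(["mpijobs.kubeflow.org"]))
```

If this reading is right, the order of installation would matter: the MPI operator and its CRD would need to be in place before the kueue-manager starts.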