training-operator: MPIJob doesn't support exitcode restartPolicy
- Add restartPolicy: ExitCode to the launcher.
- Delete one of the running workers; the launcher then fails with exit code 137.
- The worker is re-created, but the launcher never restarts (see the command sketch below).
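For reference, a minimal sketch of the commands used to trigger and observe this (pod names are taken from the log and YAML below; the exact commands and jsonpath are mine, not copied from the cluster):

# Delete one of the running workers:
kubectl delete pod tensorflow-mnist-worker-0

# The launcher fails; its exit code can be read from the pod status:
kubectl get pod tensorflow-mnist-launcher \
  -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}{"\n"}'

# The worker is re-created by the operator, but the launcher is never restarted.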
Launcher log:
[tensorflow-mnist-launcher:00001] Warning: could not find environment variable "LD_LIBRARY_PATH"
+ POD_NAME=tensorflow-mnist-worker-1
+ shift
+ /opt/kube/kubectl exec tensorflow-mnist-worker-1 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3314941952" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-mnist-launcher,tensorflow-mnist-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "3314941952.0;tcp://10.244.48.135:48777" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "3314941952.0;tcp://10.244.48.135:48777" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=tensorflow-mnist-worker-0
+ shift
+ /opt/kube/kubectl exec tensorflow-mnist-worker-0 -- /bin/sh -c PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3314941952" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-mnist-launcher,tensorflow-mnist-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "3314941952.0;tcp://10.244.48.135:48777" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "3314941952.0;tcp://10.244.48.135:48777" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.
HNP daemon : [[50582,0],0] on node tensorflow-mnist-launcher
Remote daemon: [[50582,0],1] on node tensorflow-mnist-worker-0
This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
command terminated with exit code 137
MPIJob YAML:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 3
  mpiReplicaSpecs:
    Launcher:
      restartPolicy: ExitCode
      replicas: 1
      template:
        spec:
          containers:
            - image: horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
              name: mpi
              command:
                - mpirun
              args:
                - -np
                - "2"
                - --allow-run-as-root
                - -bind-to
                - none
                - -map-by
                - slot
                - -x
                - LD_LIBRARY_PATH
                - -x
                - PATH
                - -mca
                - pml
                - ob1
                - -mca
                - btl
                - ^openib
                - python
                - /examples/tensorflow2_mnist.py
              resources:
                limits:
                  cpu: 1
                  memory: 2Gi
    Worker:
      restartPolicy: ExitCode
      replicas: 2
      template:
        spec:
          containers:
            - image: horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
              name: mpi
              resources:
                limits:
                  cpu: 2
                  memory: 4Gi
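To reproduce, the spec above can be applied as-is (the file name below is just an assumption):

kubectl apply -f tensorflow-mnist-mpijob.yaml
kubectl get mpijob tensorflow-mnist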
Does MPIJob support the ExitCode restart policy?
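As I understand the documented ExitCode semantics, exit codes 1-127 are treated as permanent errors and 128-255 as retryable ones; 137 is 128 + 9 (SIGKILL), so the launcher would be expected to come back here. A sketch of how to check whether the operator ever restarts or re-creates it (the jsonpath expressions are mine, not from the original report):

# Inspect the launcher pod status and the MPIJob conditions after the 137 exit:
kubectl get pod tensorflow-mnist-launcher -o jsonpath='{.status.phase}{"\n"}'
kubectl get mpijob tensorflow-mnist \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{" "}{end}{"\n"}'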
About this issue
- State: open
- Created a year ago
- Comments: 22 (17 by maintainers)
Let’s move the discussion about deprecation to a new issue #1906
@andreyvelich Sorry for the late reply (I forgot to check the message). I would like to work on this.
/assign I will do it this week.
We should wait for responses from other owners.