training-operator: MPIJob doesn't support exitcode restartPolicy

  1. Set restartPolicy: ExitCode on the launcher.
  2. Delete one of the running worker pods; the launcher fails with exit code 137.
  3. The worker is then re-created, but the launcher never restarts.
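
A minimal sketch of the reproduction commands, assuming the MPIJob manifest shown further below is saved as tensorflow-mnist-mpijob.yaml (hypothetical filename):

# Apply the MPIJob manifest (filename is an assumption)
kubectl apply -f tensorflow-mnist-mpijob.yaml
# Kill one of the running workers; the launcher then fails with exit code 137
kubectl delete pod tensorflow-mnist-worker-0
# The worker pod is re-created, but the launcher stays failed and is never restarted
kubectl get pods -w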

Launcher log:

[tensorflow-mnist-launcher:00001] Warning: could not find environment variable "LD_LIBRARY_PATH"
+ POD_NAME=tensorflow-mnist-worker-1
+ shift
+ /opt/kube/kubectl exec tensorflow-mnist-worker-1 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3314941952" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-mnist-launcher,tensorflow-mnist-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "3314941952.0;tcp://10.244.48.135:48777" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "3314941952.0;tcp://10.244.48.135:48777" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=tensorflow-mnist-worker-0
+ shift
+ /opt/kube/kubectl exec tensorflow-mnist-worker-0 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3314941952" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-mnist-launcher,tensorflow-mnist-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "3314941952.0;tcp://10.244.48.135:48777" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "3314941952.0;tcp://10.244.48.135:48777" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[50582,0],0] on node tensorflow-mnist-launcher
  Remote daemon: [[50582,0],1] on node tensorflow-mnist-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
command terminated with exit code 137
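
Exit code 137 is 128 + 9, i.e. mpirun was killed with SIGKILL when the worker pod went away. Under the documented ExitCode policy convention, exit codes 1-127 indicate a permanent error, while codes such as 130 (SIGINT), 137 (SIGKILL) and 143 (SIGTERM) indicate a retryable error, so the expectation is that the launcher gets restarted. A minimal illustration of that classification (not the operator's actual code):

# Illustration only: classify an exit code per the documented ExitCode convention
EXIT_CODE=137
case "$EXIT_CODE" in
  130|137|143) echo "retryable error: the launcher should be restarted" ;;
  *)           echo "permanent or undefined error: no restart expected" ;;
esac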

YAML:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 3
  mpiReplicaSpecs:
    Launcher:
      restartPolicy: ExitCode
      replicas: 1
      template:
        spec:
          containers:
          - image: horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
            name: mpi
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /examples/tensorflow2_mnist.py
            resources:
              limits:
                cpu: 1
                memory: 2Gi
    Worker:
      restartPolicy: ExitCode
      replicas: 2
      template:
        spec:
          containers:
          - image: horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
            name: mpi
            resources:
              limits:
                cpu: 2
                memory: 4Gi
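
To confirm what the operator sees after the worker deletion, the launcher pod's terminal state and the MPIJob conditions can be inspected; a sketch (pod and job names match the log above):

# Exit code and restart count of the failed launcher pod
kubectl get pod tensorflow-mnist-launcher -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
kubectl get pod tensorflow-mnist-launcher -o jsonpath='{.status.containerStatuses[0].restartCount}'
# Conditions reported by the training operator on the MPIJob
kubectl get mpijob tensorflow-mnist -o jsonpath='{.status.conditions}'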

Does MPIJob support the ExitCode restart policy?

About this issue

  • State: open
  • Created a year ago
  • Comments: 22 (17 by maintainers)

Most upvoted comments

Let’s move the discussion about deprecation to a new issue #1906

Hi @Syulin7, did you get a chance to work on this? Thank you for your contributions!

@andreyvelich Sorry for the late reply (I forgot to check the message). I would like to work on this.

/assign
I will do it this week.

We should wait for responses from other owners.