training-operator: MPIJob doesn't support exitcode restartPolicy

  1. Set restartPolicy: ExitCode on the launcher.
  2. Delete one of the running worker pods; the launcher fails with exit code 137.
  3. The worker is then re-created, but the launcher never restarts.
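
A minimal sketch of the reproduction commands, assuming the MPIJob manifest shown further below is saved as tensorflow-mnist-mpijob.yaml (hypothetical filename):

# Apply the MPIJob manifest (filename is an assumption)
kubectl apply -f tensorflow-mnist-mpijob.yaml
# Kill one of the running workers; the launcher then fails with exit code 137
kubectl delete pod tensorflow-mnist-worker-0
# The worker pod is re-created, but the launcher stays failed and is never restarted
kubectl get pods -w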

Launcher log:

[tensorflow-mnist-launcher:00001] Warning: could not find environment variable "LD_LIBRARY_PATH"
+ POD_NAME=tensorflow-mnist-worker-1
+ shift
+ /opt/kube/kubectl exec tensorflow-mnist-worker-1 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3314941952" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-mnist-launcher,tensorflow-mnist-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "3314941952.0;tcp://10.244.48.135:48777" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "3314941952.0;tcp://10.244.48.135:48777" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
+ POD_NAME=tensorflow-mnist-worker-0
+ shift
+ /opt/kube/kubectl exec tensorflow-mnist-worker-0 -- /bin/sh -c     PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "3314941952" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "tensorflow-mnist-launcher,tensorflow-mnist-worker-[1:0-1]@0(3)" -mca orte_hnp_uri "3314941952.0;tcp://10.244.48.135:48777" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "3314941952.0;tcp://10.244.48.135:48777" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[50582,0],0] on node tensorflow-mnist-launcher
  Remote daemon: [[50582,0],1] on node tensorflow-mnist-worker-0

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
command terminated with exit code 137
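
Exit code 137 is 128 + 9, i.e. mpirun was killed with SIGKILL when the worker pod went away. Under the documented ExitCode policy convention, exit codes 1-127 indicate a permanent error, while codes such as 130 (SIGINT), 137 (SIGKILL) and 143 (SIGTERM) indicate a retryable error, so the expectation is that the launcher gets restarted. A minimal illustration of that classification (not the operator's actual code):

# Illustration only: classify an exit code per the documented ExitCode convention
EXIT_CODE=137
case "$EXIT_CODE" in
  130|137|143) echo "retryable error: the launcher should be restarted" ;;
  *)           echo "permanent or undefined error: no restart expected" ;;
esac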

YAML:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-mnist
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 3
  mpiReplicaSpecs:
    Launcher:
      restartPolicy: ExitCode
      replicas: 1
      template:
        spec:
          containers:
          - image: horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
            name: mpi
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /examples/tensorflow2_mnist.py
            resources:
              limits:
                cpu: 1
                memory: 2Gi
    Worker:
      restartPolicy: ExitCode
      replicas: 2
      template:
        spec:
          containers:
          - image: horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.5.0-py3.7-cpu
            name: mpi
            resources:
              limits:
                cpu: 2
                memory: 4Gi
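
To confirm what the operator sees after the worker deletion, the launcher pod's terminal state and the MPIJob conditions can be inspected; a sketch (pod and job names match the log above):

# Exit code and restart count of the failed launcher pod
kubectl get pod tensorflow-mnist-launcher -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}'
kubectl get pod tensorflow-mnist-launcher -o jsonpath='{.status.containerStatuses[0].restartCount}'
# Conditions reported by the training operator on the MPIJob
kubectl get mpijob tensorflow-mnist -o jsonpath='{.status.conditions}'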

Does MPIJob support the ExitCode restart policy?

About this issue

  • State: open
  • Created a year ago
  • Comments: 22 (17 by maintainers)

Most upvoted comments

Let’s move the discussion about deprecation to a new issue #1906

Hi @Syulin7, did you get a chance to work on this? Thank you for your contributions!

@andreyvelich Sorry for the late reply (I forgot to check the message). I would like to work on this.

/assign
I will do it this week.

We should wait for responses from other owners.