argo-workflows: OOM error not caught by the `emissary` executor, causing the workflow to hang in the "Running" state

I am opening a new issue, but you can check https://github.com/argoproj/argo-workflows/issues/8456#issuecomment-1120206141 for context.


The error below has been reproduced on master (07/05/2022), 3.3.5, and 3.2.11.

When a workflow gets OOM-killed by Kubernetes, the emissary executor sometimes fails to detect it. As a consequence, the workflow hangs in the "Running" state forever.

The error does not occur when using the pns or docker executors. This is a major regression for us, since the previous executors were working just fine. For now, we are falling back to docker.
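
For context, switching back is just a matter of the executor setting in the workflow-controller ConfigMap; a minimal sketch, assuming a default install in the argo namespace (names and namespace may differ in your setup):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # Fall back to the docker executor until the emissary issue is resolved.
  # Other values supported in 3.2/3.3: pns, emissary, k8sapi, kubelet.
  containerRuntimeExecutor: docker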

I have been able to make Argo detect the killed process by manually ssh-ing into the pod and sending a SIGTERM to the `/var/run/argo/argoexec emissary -- bash --login /argo/staging/script` process. When doing that, the main container is killed immediately, and so is the workflow. The workflow is then correctly marked as failed with the expected OOMKilled (exit code 137) error (the same error reported when using the pns and docker executors).
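
For anyone who wants to try the same workaround, the gist is the following (the pod name is a placeholder, and this assumes pkill/procps is available in the main container image; otherwise look up the PID with ps and use kill -TERM):

# Send SIGTERM to the emissary wrapper inside the stuck pod's main container.
# <stuck-workflow-pod> is a placeholder for the actual workflow pod name.
kubectl exec <stuck-workflow-pod> -c main -- pkill -TERM -f 'argoexec emissary'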

Unfortunately, so far all my attempts to reproduce it using openly available code, images, and packages have been unsuccessful (I'll keep trying). I can only reproduce it with our private internal stack and images.

The workload is deeply nested machine learning code that relies heavily on the Python and PyTorch multiprocessing and distributed modules. My guess is that some zombie child processes prevent the Argo executor or workflow controller from detecting the main container as completed.
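
To illustrate the pattern I have in mind (a simplified sketch, not our actual workload): the main Python process forks a worker and is then killed hard, leaving the child running under the emissary wrapper.

import multiprocessing
import os
import signal
import time


def worker():
    # Stands in for a torch.multiprocessing / distributed worker.
    time.sleep(600)


if __name__ == "__main__":
    p = multiprocessing.Process(target=worker)
    p.start()
    # Simulate the kernel OOM-killing the main process: SIGKILL bypasses
    # Python's cleanup, so the worker keeps running as an orphan, which is
    # what I suspect confuses the executor / workflow controller.
    os.kill(os.getpid(), signal.SIGKILL)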

I will be happy to provide more information, logs, or config if it can help you make sense of this (meanwhile, I'll keep trying to build a workflow reproducing the bug that I can share).

While this bug currently affects us, I am quite confident that other people running ML workloads with Python on Argo will hit it at some point.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 41 (19 by maintainers)

Most upvoted comments

Using dev-kill (versus v3.3.5) does indeed seem to fix the bug.

I tested with this simple workflow and the controller correctly marks it as failed without waiting for the sleep subprocess to finish:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-oom-reproduce-simple-emissary-
  labels:
    organization: internal
    project: debug-emissary-oom

spec:
  entrypoint: entrypoint

  templates:
    - name: entrypoint
      script:
        image: ubuntu:22.04

        command: ["bash", "--login"]
        source: |
          set -e

          apt-get update
          apt-get install -y python3 wget bzip2 htop psmisc
          rm -rf /var/lib/apt/lists/*

          # Python script: spawn a long-lived subprocess, then crash the main process
          PYTHON_CODE=$(cat <<END

          import subprocess

          print("Spawn a subprocess that will terminate in 600s")
          subprocess.Popen(["sleep", "600s"])

          # Deliberately undefined: the Python process exits with an error while the sleep child keeps running
          do_something_that_will_fail()

          END
          )
          echo "$PYTHON_CODE" > /tmp/python_code.py

          echo "START"
          python3 /tmp/python_code.py
          echo "DONE"

I also tested with a workflow that spawns a never-ending subprocess and also triggers an OOM error in the main process (see above in the thread), and I confirm the controller can detect it, terminate the workflow, and mark it as failed, showing the correct OOMKilled (exit code 137) error.
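
That OOM variant is essentially the same script with the undefined-function call replaced by an allocation loop; roughly the following (a sketch, not the exact workflow from the thread, assuming the pod's memory limit is low enough to be hit quickly):

import subprocess

# Long-lived child that outlives the main process.
subprocess.Popen(["sleep", "600"])

# Allocate memory until the kernel OOM-kills this process.
hog = []
while True:
    hog.append(bytearray(100 * 1024 * 1024))  # ~100 MiB per iteration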

Hi. I've pushed a new version which I hope/expect will fix the issue. Would you please test it?

That works. Can you send an invite to alex_collins@intuit.com?

I just tested it again with our problematic workflow and everything is the same as before. The logs show that the script has been killed, but the workflow is stuck in the "Running" phase.