argo-workflows: OOM error not caught by the `emissary` executor, causing the workflow to hang in the "Running" state

I am opening a new issue, but you can check https://github.com/argoproj/argo-workflows/issues/8456#issuecomment-1120206141 for context.


The error below has been reproduced on master (07/05/2022), 3.3.5, and 3.2.11.

When a workflow gets OOM-killed by Kubernetes, the emissary executor sometimes fails to detect it. As a consequence, the workflow hangs in the "Running" state forever.

The error does not occur when using the pns or docker executors. This is a major regression for us, since the previous executors were working just fine. For now, we are falling back to docker.
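
For context, switching back is just a matter of the executor setting in the workflow-controller ConfigMap; a minimal sketch, assuming a default install in the argo namespace (names and namespace may differ in your setup):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  # Fall back to the docker executor until the emissary issue is resolved.
  # Other values supported in 3.2/3.3: pns, emissary, k8sapi, kubelet.
  containerRuntimeExecutor: docker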

I have been able to make Argo detect the killed process by manually ssh-ing into the pod and sending a SIGTERM to the `/var/run/argo/argoexec emissary -- bash --login /argo/staging/script` process. When doing that, the main container is killed immediately, and so is the workflow. The workflow is then correctly marked as failed with the expected OOMKilled (exit code 137) error (the same error reported when using the pns and docker executors).
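
For anyone who wants to try the same workaround, the gist is the following (the pod name is a placeholder, and this assumes pkill/procps is available in the main container image; otherwise look up the PID with ps and use kill -TERM):

# Send SIGTERM to the emissary wrapper inside the stuck pod's main container.
# <stuck-workflow-pod> is a placeholder for the actual workflow pod name.
kubectl exec <stuck-workflow-pod> -c main -- pkill -TERM -f 'argoexec emissary'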

Unfortunately, so far all my attempts to reproduce it using openly available code, images, and packages have been unsuccessful (I'll keep trying). I can only reproduce it with our private internal stack and images.

The workload is deeply nested machine learning code that relies heavily on the Python and PyTorch multiprocessing and distributed modules. My guess is that some zombie child processes prevent the Argo executor or workflow controller from detecting the main container as completed.
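
To illustrate the pattern I have in mind (a simplified sketch, not our actual workload): the main Python process forks a worker and is then killed hard, leaving the child running under the emissary wrapper.

import multiprocessing
import os
import signal
import time


def worker():
    # Stands in for a torch.multiprocessing / distributed worker.
    time.sleep(600)


if __name__ == "__main__":
    p = multiprocessing.Process(target=worker)
    p.start()
    # Simulate the kernel OOM-killing the main process: SIGKILL bypasses
    # Python's cleanup, so the worker keeps running as an orphan, which is
    # what I suspect confuses the executor / workflow controller.
    os.kill(os.getpid(), signal.SIGKILL)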

I will be happy to provide more information, logs, or config if it can help you make sense of this (meanwhile, I'll keep trying to build a workflow reproducing the bug that I can share).

While this bug currently affects us, I am quite confident that other people running ML workloads with Python on Argo will hit it at some point.

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 41 (19 by maintainers)

Most upvoted comments

Using dev-kill (versus v3.3.5) does indeed seem to fix the bug.

I tested with this simple workflow and the controller correctly marks it as failed without waiting for the sleep subprocess to finish:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-oom-reproduce-simple-emissary-
  labels:
    organization: internal
    project: debug-emissary-oom

spec:
  entrypoint: entrypoint

  templates:
    - name: entrypoint
      script:
        image: ubuntu:22.04

        command: ["bash", "--login"]
        source: |
          set -e

          apt-get update
          apt-get install -y python3 wget bzip2 htop psmisc
          rm -rf /var/lib/apt/lists/*

          # Python script: spawn a long-lived subprocess, then crash the main process
          PYTHON_CODE=$(cat <<END

          import subprocess

          print("Spawn a subprocess that will terminate in 600s")
          subprocess.Popen(["sleep", "600s"])

          # Deliberately undefined: the Python process exits with an error while the sleep child keeps running
          do_something_that_will_fail()

          END
          )
          echo "$PYTHON_CODE" > /tmp/python_code.py

          echo "START"
          python3 /tmp/python_code.py
          echo "DONE"

I also tested with a workflow that spawns a never-ending subprocess and also triggers an OOM error in the main process (see above in the thread), and I confirm the controller can detect it, terminate the workflow, and mark it as failed, showing the correct OOMKilled (exit code 137) error.
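
That OOM variant is essentially the same script with the undefined-function call replaced by an allocation loop; roughly the following (a sketch, not the exact workflow from the thread, assuming the pod's memory limit is low enough to be hit quickly):

import subprocess

# Long-lived child that outlives the main process.
subprocess.Popen(["sleep", "600"])

# Allocate memory until the kernel OOM-kills this process.
hog = []
while True:
    hog.append(bytearray(100 * 1024 * 1024))  # ~100 MiB per iteration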

Hi. I've pushed a new version which I hope/expect will fix the issue. Would you please test it?

That works. Can you send an invite to alex_collins@intuit.com?

I just tested it again with our problematic workflow and everything is the same as before. The logs show that the script has been killed, but the workflow is stuck in the "Running" phase.