argo-workflows: emissary - tasks with orphan background processes hang forever

Summary

Downstream KFP issue found in https://github.com/kubeflow/pipelines/pull/5926

What happened/what you expected to happen?

I reproduced the problem using this minimal workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: background-
spec:
  # serviceAccountName: pipeline-runner
  entrypoint: bg
  templates:
  - name: bg
    container:
      image: alpine
      command:
      - sh
      - -c
      - |
        set -ex
        (sleep 10 && echo "wake up") &  # sleep for 10 seconds in the background and then echo "wake up"
        echo "exit" # after echoing "exit", the main script should finish

With the emissary executor, I got these logs:

background-jvqfm: + echo exit
background-jvqfm: + sleep 10
background-jvqfm: exit
background-jvqfm: + echo 'wake up'
background-jvqfm: wake up

And if we change the background process to sleep for a very long time, the step hangs forever.

But with the docker executor, I got these logs:

background-mxp9q: exit
background-mxp9q: + echo exit

So the container is killed as soon as the main script finishes.

Expectation: when the main script finishes, the step should stop. For clarification, I’m only logging the issue because it took a while for me to debug and understand this difference between the docker and emissary executors.

In this simple case the difference in behavior doesn’t really matter, but here is a real-world use case: I have a test step that runs several background services forever. https://github.com/kubeflow/pipelines/blob/74d27e7e7ec88c62154a3dc5fb19cb27fc2922e6/test/frontend-integration-test/run_test.sh#L63 With the docker executor, the script can be very simple: just start the services and forget about them. They are killed automatically when the main script finishes.
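
For illustration, a rough sketch of what that kind of test script looks like under the docker executor (the service commands here are placeholders, not the actual run_test.sh):

#!/bin/bash
set -ex

# start long-running helper services in the background (placeholder commands)
kubectl port-forward svc/ml-pipeline-ui 3000:80 &
node mock-backend.js &

# run the integration tests in the foreground
npm test

# no cleanup needed with the docker executor: when this script exits, the
# container is killed and the background services die with it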

However, with the emissary executor, I need to kill the background services myself, otherwise the step is stuck forever.

Diagnostics

What Kubernetes provider are you using? GKE

What version of Argo Workflows are you running? v3.1.2

What executor are you running? Docker/K8SAPI/Kubelet/PNS/Emissary (comparing Docker and Emissary)


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

About this issue

  • Original URL
  • State: closed
  • Created 3 years ago
  • Reactions: 3
  • Comments: 16 (15 by maintainers)

Most upvoted comments

@Bobgy I’ve tried to reproduce this again today, but failed to do so. Your sample workflow exited successfully. I think this bug is fixed.

So far, I’m stuck on understanding why command.Wait() waits until both the script and its background processes have finished. I’m not seeing this behavior elsewhere.
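
For what it’s worth, the same blocking can be demonstrated with a plain shell pipe; my assumption is that Wait() is blocked on EOF from the stdout/stderr pipes, and the orphaned background process inherits the write ends and keeps them open:

# the reader of the pipe (cat here) only sees EOF once every process holding
# the write end has exited, so this command line takes ~10 seconds to return
# even though the main script prints "exit" and exits immediately
sh -c '(sleep 10 && echo "wake up") & echo "exit"' | cat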

I’m not sure how many users will actually hit this, so I’d like to wait for a bit more feedback before continuing to work on this.

That’s a good catch, discovering that I used different shells! But I tried again, comparing both sh and bash, and both of them print "wake up" at the end when running the Go code on both Linux and macOS. Did I miss anything?

+1 on using dumb-init, that sounds like a robust solution.
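
An untested sketch of what that could look like in the sample workflow above (as I understand dumb-init, it runs the script in its own session and signals whatever is left in that session once the script exits, so the orphaned sleep should get cleaned up; the package name and flags are from memory):

# inside the alpine container, install dumb-init first
apk add --no-cache dumb-init

# then run the script under dumb-init instead of invoking sh directly
dumb-init -- sh -c '
  (sleep 10 && echo "wake up") &
  echo "exit"
'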

Another simple workaround that doesn’t add a dependency is adding a trap to the shell script, as in https://github.com/kubeflow/pipelines/pull/5926/commits/a80061f0ba8a951d96ed7f2585d4c08419e0c1d2:

function clean_up() {
  set +e # no longer error on command failure, because we want to go through all the jobs

  # in this example, I always have two background jobs, adjust the lines based on your script
  echo "Stopping background jobs..."
  kill -15 %1
  kill -15 %2
}
trap clean_up EXIT SIGINT SIGTERM
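
A slightly more generic variant of the same idea (my own sketch, not from the linked commit) kills whatever background jobs exist instead of hardcoding %1 and %2:

clean_up() {
  set +e  # keep going even if some jobs have already exited
  echo "Stopping background jobs..."
  # kill every background job started by this shell, however many there are
  kill -15 $(jobs -p) 2>/dev/null
  wait  # reap the jobs so the step can exit cleanly
}
trap clean_up EXIT INT TERM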