argo-workflows: emissary - tasks with orphan background processes hang forever
Summary
Downstream KFP issue found in https://github.com/kubeflow/pipelines/pull/5926
What happened/what you expected to happen?
I reproduced the problem using this minimal workflow:
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: background-
spec:
  # serviceAccountName: pipeline-runner
  entrypoint: bg
  templates:
    - name: bg
      container:
        image: alpine
        command:
          - sh
          - -c
          - |
            set -ex
            (sleep 10 && echo "wake up") & # sleep for 10 seconds in the background and then echo "wake up"
            echo "exit" # after echoing "exit", the main script should finish
With emissary executor, I got logs:
background-jvqfm: + echo exit
background-jvqfm: + sleep 10
background-jvqfm: exit
background-jvqfm: + echo 'wake up'
background-jvqfm: wake up
If the background sleep is changed to a very long duration, the step hangs forever.
But with docker executor, I got logs:
background-mxp9q: exit
background-mxp9q: + echo exit
So the container is killed as soon as the main script finishes.
Expectation: when the main script finishes, the step should stop. For clarification, I’m only logging the issue, because it took a while for me to debug and understand this difference between the docker and emissary executors.
The difference between the two behaviors is harmless in this simple case, but it matters in my real-world use case: I have a test step that runs several background services forever. https://github.com/kubeflow/pipelines/blob/74d27e7e7ec88c62154a3dc5fb19cb27fc2922e6/test/frontend-integration-test/run_test.sh#L63 With the docker executor, the script can be very simple: just run the services and forget about them. They are killed automatically when the main script finishes.
However, with the emissary executor, I need to kill the background services myself, otherwise the step is stuck forever.
Diagnostics
What Kubernetes provider are you using? GKE
What version of Argo Workflows are you running? v3.1.2
What executor are you running? Comparing Docker and Emissary
About this issue
- State: closed
- Created 3 years ago
- Reactions: 3
- Comments: 16 (15 by maintainers)
Commits related to this issue
- test: Add e2e test for background procs. Fixes #6375 Signed-off-by: Alex Collins <alex_collins@intuit.com> — committed to alexec/argo-workflows by alexec 3 years ago
@Bobgy I’ve tried to reproduce this again today, but failed to do so. Your sample workflow exited successfully. I think this bug is fixed.
So far, I’m stuck on understanding why command.Wait() waits until both the script and its background processes have finished. I’m not seeing this behavior elsewhere.
I’m not sure how many users will actually hit this, so I’d like to wait for a bit more feedback before continuing to work on this.
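One possible explanation, offered as an assumption rather than a confirmed diagnosis of the emissary code: Go's os/exec documentation says that when a command's Stdout or Stderr is not an *os.File, Wait also waits for the I/O copying from the process to complete. A background child inherits the write end of that pipe, so the copy would only reach end-of-file once the child exits. The shell sketch below shows the same pipe behavior using a command substitution; demo_pipe_wait is a hypothetical name used only for illustration.

```shell
# Sketch: a pipe reader only sees end-of-file once every process holding
# the write end has exited, including background children.
demo_pipe_wait() {
  # The backgrounded subshell inherits stdout (the pipe's write end).
  (sleep 2 && echo "wake up") &
  echo "exit"   # the "main script" is done at this point
}

start=$(date +%s)
# The command substitution reads from a pipe, so it cannot return until the
# background subshell has exited and closed its copy of the write end.
output=$(demo_pipe_wait)
end=$(date +%s)

echo "$output"
echo "elapsed: $((end - start))s"
```

The captured output contains both lines, and the elapsed time is at least the length of the background sleep, even though the function body itself returns immediately.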
That’s a good catch, I had indeed used different shells! But I tried again with both sh and bash, and both of them print
wake up
at the end when running the Go code, on both Linux and Mac. Did I miss anything?
+1 on using dumb-init, that sounds like a robust solution.
Another simple workaround that avoids adding a dependency is to add a trap to the shell script, as in https://github.com/kubeflow/pipelines/pull/5926/commits/a80061f0ba8a951d96ed7f2585d4c08419e0c1d2
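A minimal sketch of the trap approach, not the exact change from the linked commit: the step records the PID of its background process and kills it from an EXIT trap, so nothing is left holding the step's output open. The run_step function and the sleep command are hypothetical stand-ins for a real test script and its background service.

```shell
# Sketch of the trap workaround (names are illustrative, not from the issue).
run_step() (
  # The function body runs in a subshell, so this EXIT trap fires when the
  # step ends. kill may fail if the process has already exited; ignore that.
  trap 'kill "$bg_pid" 2>/dev/null || true' EXIT

  sleep 30 &      # stand-in for a service that would otherwise keep running
  bg_pid=$!

  echo "exit"     # the main work of the step
)

start=$(date +%s)
# Capturing the output mimics an executor that reads the step's stdout
# through a pipe: it returns promptly only if the background process is gone.
output=$(run_step)
end=$(date +%s)

echo "$output"
echo "elapsed: $((end - start))s"
```

With several background services, the same idea applies by collecting each PID and killing all of them in the trap.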