argo-workflows: Wait container stuck in multi-root workflows in 2.5-rc8 to 2.5.0
Checklist:
- I’ve included the version.
- I’ve included reproduction steps.
- I’ve included the workflow YAML.
- I’ve included the logs.
What happened: For multi-root workflows running in a single pod, the wait container gets stuck. The workflow cannot be terminated; the only way to stop it is to delete it.
The logs of the stuck wait container:
$ kubectl logs workflow-zdkm8-552438975 -n argo -c wait
time="2020-02-19T06:06:48Z" level=info msg="Creating a docker executor"
time="2020-02-19T06:06:48Z" level=info msg="Executor (version: v0.0.0+unknown, build_date: 1970-01-01T00:00:00Z) initialized (pod: <redacted> )}}}"
time="2020-02-19T06:06:48Z" level=info msg="Waiting on main container"
time="2020-02-19T06:06:49Z" level=info msg="main container started with container ID: 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting annotations monitor"
time="2020-02-19T06:06:49Z" level=info msg="Execution control set from API: {2020-02-19 10:06:47 +0000 UTC false}"
time="2020-02-19T06:06:49Z" level=info msg="docker wait 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting deadline monitor"
time="2020-02-19T06:06:53Z" level=info msg="Main container completed"
time="2020-02-19T06:06:53Z" level=info msg="Saving logs"
time="2020-02-19T06:06:53Z" level=info msg="[docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6]"
time="2020-02-19T06:06:53Z" level=info msg="Annotations monitor stopped"
time="2020-02-19T06:06:53Z" level=info msg="Deadline monitor stopped"
The processes running in the container are:
root@pod-m8-552438975:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.4 145816 34196 ? Ssl 06:06 0:00 argoexec wait
root 22 0.0 0.3 45672 28684 ? Sl 06:06 0:00 docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6
root 41 0.0 0.0 18136 3188 pts/0 Ss 06:11 0:00 /bin/bash
root 54 0.0 0.0 36636 2844 pts/0 R+ 06:12 0:00 ps aux
If I kill the docker logs process (PID 22), it exits successfully.
root@pod-zdkm8-552438975:/# kill 22
root@pod-zdkm8-552438975:/# command terminated with exit code 137
Environment:
- Argo version: 2.5-rc8 to 2.5.0
- Kubernetes version:
clientVersion:
buildDate: "2019-09-18T14:36:53Z"
compiler: gc
gitCommit: 2bd9643cee5b3b3a5ecbd3af49d09018f0773c77
gitTreeState: clean
gitVersion: v1.16.0
goVersion: go1.12.9
major: "1"
minor: "16"
platform: linux/amd64
serverVersion:
buildDate: "2019-11-13T11:11:50Z"
compiler: gc
gitCommit: 7015f71e75f670eb9e7ebd4b5749639d42e20079
gitTreeState: clean
gitVersion: v1.15.6
goVersion: go1.12.12
major: "1"
minor: "15"
platform: linux/
This may have been introduced by this PR: https://github.com/argoproj/argo/pull/2136
Right, I think I’m getting closer. MultiReader reads all of its inputs sequentially, so I think once we exceed the pipe buffer on stderr, docker logs blocks on the write and doesn’t exit, so we never finish reading stdout. (Stupidly, I didn’t check the MultiReader docs; it has been a while since I used it.) I need to try to recreate this, but should have a fix real soon.
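To make that failure mode concrete, here is a minimal, self-contained Go sketch of the deadlock (a hypothetical reconstruction for illustration, not Argo’s actual executor code): the child process blocks writing to a full stderr pipe while the parent, reading through io.MultiReader, waits forever for EOF on stdout.

```go
// Minimal sketch of the MultiReader deadlock described above; a hypothetical
// reconstruction for illustration, not Argo's actual executor code.
package main

import (
	"io"
	"os/exec"
)

func main() {
	// The child writes far more to stderr than an OS pipe buffer holds
	// (typically 64 KiB on Linux) and only touches stdout afterwards.
	cmd := exec.Command("sh", "-c", "head -c 1000000 /dev/zero >&2; echo done")
	stdout, _ := cmd.StdoutPipe()
	stderr, _ := cmd.StderrPipe()
	_ = cmd.Start()

	// io.MultiReader drains its readers strictly in order: stderr is not
	// read until stdout returns EOF. But stdout only reaches EOF when the
	// child exits, and the child is stuck writing to the full stderr pipe,
	// so this io.Copy never returns, just like the hung "docker logs".
	_, _ = io.Copy(io.Discard, io.MultiReader(stdout, stderr))
	_ = cmd.Wait()
}
```

A common remedy for this pattern is to drain both streams concurrently (e.g. one goroutine per pipe) rather than sequentially.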
I’m having this as well, running Argo 2.6.0-rc2. In the wait container logs, the last line just repeats…
EDIT: Disabling archiveLogs seems to “fix it” for some workflows. I agree with @MrSaints: writing too much output seems to interfere with the wait container, since the workflows that remained Running were the ones that wrote a lot of logs.

I can confirm that v2.6.2 fixed this issue for us, thanks!
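For anyone needing the interim workaround mentioned above, log archiving can be disabled per workflow via the spec-level archiveLogs flag. A sketch (names and the template are placeholders; adjust to your setup):

```yaml
# Sketch of the archiveLogs workaround: disable log archiving for one
# workflow so the wait container does not upload container logs.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: no-archive-
spec:
  archiveLogs: false        # skip saving container logs to the artifact repo
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.11
        command: [sh, -c]
        args: ["echo hello"]
```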