argo-workflows: Wait container stuck in multi-root workflows in 2.5-rc8 to 2.5.0

Checklist:

  • I’ve included the version.
  • I’ve included reproduction steps.
  • I’ve included the workflow YAML.
  • I’ve included the logs.

What happened: In multi-root workflows running in a single pod, the wait container gets stuck, and the workflow cannot be terminated. The only way to stop it is to delete it.

The logs of the stuck wait container:

$ kubectl logs workflow-zdkm8-552438975 -n argo -c wait
time="2020-02-19T06:06:48Z" level=info msg="Creating a docker executor"
time="2020-02-19T06:06:48Z" level=info msg="Executor (version: v0.0.0+unknown, build_date: 1970-01-01T00:00:00Z) initialized (pod: <redacted> )}}}"
time="2020-02-19T06:06:48Z" level=info msg="Waiting on main container"
time="2020-02-19T06:06:49Z" level=info msg="main container started with container ID: 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting annotations monitor"
time="2020-02-19T06:06:49Z" level=info msg="Execution control set from API: {2020-02-19 10:06:47 +0000 UTC false}"
time="2020-02-19T06:06:49Z" level=info msg="docker wait 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6"
time="2020-02-19T06:06:49Z" level=info msg="Starting deadline monitor"
time="2020-02-19T06:06:53Z" level=info msg="Main container completed"
time="2020-02-19T06:06:53Z" level=info msg="Saving logs"
time="2020-02-19T06:06:53Z" level=info msg="[docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6]"
time="2020-02-19T06:06:53Z" level=info msg="Annotations monitor stopped"
time="2020-02-19T06:06:53Z" level=info msg="Deadline monitor stopped"

The processes running in the container:

root@pod-zdkm8-552438975:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.4 145816 34196 ?        Ssl  06:06   0:00 argoexec wait
root        22  0.0  0.3  45672 28684 ?        Sl   06:06   0:00 docker logs 05940c7081a1f3641302e6bf7d488853acfc4823f87892e4f21b72bf932a87b6
root        41  0.0  0.0  18136  3188 pts/0    Ss   06:11   0:00 /bin/bash
root        54  0.0  0.0  36636  2844 pts/0    R+   06:12   0:00 ps aux

If I kill the docker logs process (PID 22), the wait container exits successfully.

root@pod-zdkm8-552438975:/# kill 22
root@pod-zdkm8-552438975:/# command terminated with exit code 137

Environment:

• Argo version: 2.5-rc8 to 2.5.0

  • Kubernetes version:

clientVersion:
  buildDate: "2019-09-18T14:36:53Z"
  compiler: gc
  gitCommit: 2bd9643cee5b3b3a5ecbd3af49d09018f0773c77
  gitTreeState: clean
  gitVersion: v1.16.0
  goVersion: go1.12.9
  major: "1"
  minor: "16"
  platform: linux/amd64
serverVersion:
  buildDate: "2019-11-13T11:11:50Z"
  compiler: gc
  gitCommit: 7015f71e75f670eb9e7ebd4b5749639d42e20079
  gitTreeState: clean
  gitVersion: v1.15.6
  goVersion: go1.12.12
  major: "1"
  minor: "15"
  platform: linux/

It may have been introduced by this PR: https://github.com/argoproj/argo/pull/2136

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 26 (11 by maintainers)

Most upvoted comments

Right, I think I’m getting closer. MultiReader reads all of its inputs sequentially, so once the container’s stderr output exceeds the pipe buffer, docker logs blocks on the write and doesn’t exit, and we never finish reading stdout. (Stupidly, I didn’t check the MultiReader docs; it has been a while since I used it.) I need to try to recreate this, but should have a fix real soon.
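For context, here is a minimal Go sketch of that failure mode. It is not Argo’s actual executor code, and the helper names and the reproduction command in main are made up for the example, but it shows why an io.MultiReader over a child process’s stdout and stderr pipes can hang exactly the way the wait container does:

package main

import (
	"io"
	"os"
	"os/exec"
)

// Deadlock-prone: io.MultiReader drains its readers strictly in order,
// so stderr is only read after stdout reaches EOF. If the child writes
// more to stderr than the OS pipe buffer holds (~64 KiB on Linux) while
// stdout is still open, the child blocks in write(2) and never exits,
// stdout never reaches EOF, and both sides wait on each other forever.
func copyLogsSequential(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	stderr, err := cmd.StderrPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	// BUG: stderr is not drained until stdout is exhausted.
	if _, err := io.Copy(os.Stdout, io.MultiReader(stdout, stderr)); err != nil {
		return err
	}
	return cmd.Wait()
}

// Safe variant: both streams are drained concurrently (here by wiring
// them straight to our own stdout), so neither pipe can fill up.
func copyLogsConcurrent(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stdout
	return cmd.Run()
}

func main() {
	// Reproduces the hang: the child writes ~1 MiB to stderr, far more
	// than the pipe buffer holds, while its stdout stays open.
	_ = copyLogsSequential("sh", "-c", "head -c 1000000 /dev/zero >&2")
}

Run against the same command, the sequential variant hangs just like the wait container above, while the concurrent variant returns as soon as the child exits.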

I’m having this as well. Wait container logs:

time="2020-02-25T01:02:00Z" level=info msg="Waiting on main container"
time="2020-02-25T01:02:05Z" level=info msg="main container started with container ID: e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394"
time="2020-02-25T01:02:05Z" level=info msg="Starting annotations monitor"
time="2020-02-25T01:02:06Z" level=info msg="docker wait e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394"
time="2020-02-25T01:02:06Z" level=info msg="Starting deadline monitor"
time="2020-02-25T01:07:00Z" level=info msg="Alloc=3200 TotalAlloc=9568 Sys=68674 NumGC=5 Goroutines=11"
time="2020-02-25T01:10:51Z" level=info msg="Main container completed"
time="2020-02-25T01:10:51Z" level=info msg="Saving logs"
time="2020-02-25T01:10:51Z" level=info msg="Annotations monitor stopped"
time="2020-02-25T01:10:51Z" level=info msg="[docker logs e0c64b7eff38ddfdc5ac705044c685b3ee8f260ef610f2cd8b83a4190ad20394]"
time="2020-02-25T01:10:52Z" level=info msg="Deadline monitor stopped"
time="2020-02-25T01:12:00Z" level=info msg="Alloc=3309 TotalAlloc=9685 Sys=70080 NumGC=7 Goroutines=6"
time="2020-02-25T01:17:00Z" level=info msg="Alloc=3217 TotalAlloc=9686 Sys=70080 NumGC=10 Goroutines=6"

Running Argo 2.6.0-rc2.

Then the last log line just repeats…

EDIT: Disabling archiveLogs seems to “fix” it for some workflows. I agree with @MrSaints: writing too much output seems to interfere with the wait container, since the workflows that remained Running were the ones that wrote a lot of logs.
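For anyone trying the same workaround: if your version exposes it, archiving can be switched off for a single workflow via the archiveLogs field of the workflow spec, rather than globally in the controller config. A minimal sketch, with a made-up, deliberately log-heavy workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: noisy-logs-
spec:
  entrypoint: main
  archiveLogs: false          # opt this workflow out of log archiving
  templates:
  - name: main
    container:
      image: alpine:3.11
      command: [sh, -c]
      args: ["yes hello | head -n 200000"]   # writes a lot of log output

Note this only sidesteps the archival path that triggers the hang; the actual fix landed in the executor, per the maintainer comment above.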

I can confirm that v2.6.2 fixed this issue for us, thanks!