go: testing: test process hangs beyond `-timeout` if a child process holds I/O streams open
What version of Go are you using (go version
)?
go version go1.10 linux/amd64
What operating system and processor architecture are you using (go env
)?
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/psanford/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/psanford/projects/go"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build953969108=/tmp/go-build -gno-record-gcc-switches"
What did you do?
After upgrading to 1.10 we had one test that started to hang intermittently. The test in question starts a child process which it kills by canceling a context object at the end of the test method. It does not do an explicit cmd.Wait()
.
Here is a minimal test case that demonstrates the problem:
https://play.golang.org/p/8rq41A5Khsm
I can get this to hang consistently by running it in a bash while loop:
$ while true; do go test -timeout 5s -v -count 1 .; sleep 0.1; done
=== RUN TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok _/tmp 0.012s
=== RUN TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok _/tmp 0.012s
=== RUN TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok _/tmp 0.012s
=== RUN TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
<hangs here indefinitely>
If I explicitly call cmd.Wait()
the test does not hang. If I don’t attach the child process’ Stdout and Stderr to os.Std{out,err} the test does not hang.
On 1.9.4 the test does not hang.
Its also interesting that even though I specified -timeout 5s
the test runner hangs forever.
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Reactions: 2
- Comments: 22 (16 by maintainers)
Commits related to this issue
- Avoid dangling child procs in test suite. In go 1.10.1, these seem to make "go test" hang sometimes. https://github.com/golang/go/issues/24050 No issue # Arvados-DCO-1.1-Signed-off-by: Tom Clegg <... — committed to arvados/arvados by tomclegg 6 years ago
- [tests] Move integration/utils lib to pkg/testing/utils; fix liveexec hangs on windows Recursive call to `ginkgo -r` should not interract with utils package. Liveexec now uses pkg/testing/utils/gexe... — committed to werf/werf by distorhead 5 years ago
- cmd/go/internal/test: wrap os.Stdout always There is an issue where 'go test' will hang after the tests complete if a test starts a sub-process that does not exit (see #24050). However, go test only... — committed to golang/go by adg 2 years ago
- cmd/go/internal/test: scale the WaitDelay for test subprocesses Prior to CL 456116 we had an arbitrary 5-second delay after a test times out before we kill the test. In CL 456116, I reused that arbit... — committed to golang/go by bcmills a year ago
- cmd/go/internal/test: scale the WaitDelay for test subprocesses Prior to CL 456116 we had an arbitrary 5-second delay after a test times out before we kill the test. In CL 456116, I reused that arbit... — committed to Pryz/go by bcmills a year ago
I’ve encountered the same problem; I was even thinking that our CI was completely broken, as I didn’t understand how
go test ./...
, with a default timeout of 10m, could possibly run for six hours.I have a way to reproduce this reliably on Linux:
https://play.golang.org/p/nvycpKfjzkL
It seems to me like the elements are:
go test
with package arguments, so that the output is buffered for caching etcSo it seems to be exactly the same bug found previously in this issue. At a conceptual level, this seems fairly straightforward; if the test executable timeout is hit, all children processes should be killed, and any goroutines waiting for their output should be cancelled or stopped. But I assume the actual fix might be a bit tricky.
@bcmills @jayconrod @matloob as owners of cmd/go, any thoughts?
@josharian, if your subprocess is a Go program I suggest sending
SIGQUIT
on timeout instead ofSIGTERM
(so that you get a goroutine dump), but that’s currently a bit complicated.(My proposal #50436 aims to make that somewhat smoother, but the fact that
SIGQUIT
is not defined on all platforms remains a bit of a rough edge.)This just bit me as well. Ran through a fair few hours of GitHub CI time before I noticed something was wrong. (Tip: set
timeout-minutes
in your Actions even if you don’t need it.)For those finding this thread and hoping for some code to copy/paste to work around the issue, here’s the suggestion from @mvdan above, spelled out.
Change:
to something along these lines:
(@mvdan @bcmills @ianlancetaylor feel free to edit this comment directly if you’d suggest a different formulation)
The problem with a general “signal process and its children” mechanism on Unix systems is that the only options are “send signal to process” and “send signal to thread” and “send signal to process group”. And we definitely don’t want to start every new process in a different process group, as that will have surprising effects on the use of
^Z
from the terminal. And of course we have no idea what processes a child process may start.That said we could in principle invoke the
ps
program or (on GNU/Linux) read the/proc
file system to look for all processes, build a process tree, and use that to identify the children of our child process. But that seems like a rather complex mechanism to implement a standard library function.For testing specifically, it might be somewhat acceptable for
go test
to start the test binary in a separate process group, in which case it would be straightforward for the go command to kill the entire test binary process group.Also, before anyone spends significant time into a fix, I think we could just fix https://github.com/golang/go/issues/23019 instead. It would save far more time in the long run, because it’s a change we likely want to do anyway and should fix other cascading bugs. But it’s also a bit more controversial.