go: testing: test process hangs beyond `-timeout` if a child process holds I/O streams open

What version of Go are you using (go version)?

go version go1.10 linux/amd64

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/psanford/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/psanford/projects/go"
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build953969108=/tmp/go-build -gno-record-gcc-switches"

What did you do?

After upgrading to 1.10 we had one test that started to hang intermittently. The test in question starts a child process which it kills by canceling a context object at the end of the test method. It does not do an explicit cmd.Wait().

Here is a minimal test case that demonstrates the problem:

https://play.golang.org/p/8rq41A5Khsm

I can get this to hang consistently by running it in a bash while loop:

$ while true; do go test -timeout 5s -v -count 1 .; sleep 0.1; done
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok      _/tmp   0.012s
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok      _/tmp   0.012s
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS
ok      _/tmp   0.012s
=== RUN   TestOSExecNoWait
start
done
--- PASS: TestOSExecNoWait (0.01s)
PASS

<hangs here indefinitely>

If I explicitly call cmd.Wait() the test does not hang. If I don’t attach the child process’ Stdout and Stderr to os.Std{out,err} the test does not hang.

On 1.9.4 the test does not hang.

Its also interesting that even though I specified -timeout 5s the test runner hangs forever.

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 2
  • Comments: 22 (16 by maintainers)

Commits related to this issue

Most upvoted comments

I’ve encountered the same problem; I was even thinking that our CI was completely broken, as I didn’t understand how go test ./..., with a default timeout of 10m, could possibly run for six hours.

I have a way to reproduce this reliably on Linux:

https://play.golang.org/p/nvycpKfjzkL

$ go test -timeout=1s
panic: test timed out after 1s
[...]
[exits]
$ go test -timeout=1s .
[hangs for a long time]

It seems to me like the elements are:

  • Running go test with package arguments, so that the output is buffered for caching etc
  • The test executes a program which hangs or runs for a long time
  • The executed program’s output is wired to the test’s own output

So it seems to be exactly the same bug found previously in this issue. At a conceptual level, this seems fairly straightforward; if the test executable timeout is hit, all children processes should be killed, and any goroutines waiting for their output should be cancelled or stopped. But I assume the actual fix might be a bit tricky.

@bcmills @jayconrod @matloob as owners of cmd/go, any thoughts?

@josharian, if your subprocess is a Go program I suggest sending SIGQUIT on timeout instead of SIGTERM (so that you get a goroutine dump), but that’s currently a bit complicated.

(My proposal #50436 aims to make that somewhat smoother, but the fact that SIGQUIT is not defined on all platforms remains a bit of a rough edge.)

This just bit me as well. Ran through a fair few hours of GitHub CI time before I noticed something was wrong. (Tip: set timeout-minutes in your Actions even if you don’t need it.)

For those finding this thread and hoping for some code to copy/paste to work around the issue, here’s the suggestion from @mvdan above, spelled out.

Change:

cmd := exec.Command("YOUR_PROGRAM")

to something along these lines:

ctx := context.Background()
deadline, ok := t.Deadline()
if ok {
	// Give ourselves 100ms to kill the command before the actual deadline arrives.
	deadline = deadline.Add(-100 * time.Millisecond)
	var cancel func()
	ctx, cancel = context.WithDeadline(ctx, deadline)
	t.Cleanup(cancel)
}
cmd := exec.CommandContext(ctx, "YOUR_PROGRAM")

(@mvdan @bcmills @ianlancetaylor feel free to edit this comment directly if you’d suggest a different formulation)

The problem with a general “signal process and its children” mechanism on Unix systems is that the only options are “send signal to process” and “send signal to thread” and “send signal to process group”. And we definitely don’t want to start every new process in a different process group, as that will have surprising effects on the use of ^Z from the terminal. And of course we have no idea what processes a child process may start.

That said we could in principle invoke the ps program or (on GNU/Linux) read the /proc file system to look for all processes, build a process tree, and use that to identify the children of our child process. But that seems like a rather complex mechanism to implement a standard library function.

For testing specifically, it might be somewhat acceptable for go test to start the test binary in a separate process group, in which case it would be straightforward for the go command to kill the entire test binary process group.

Also, before anyone spends significant time into a fix, I think we could just fix https://github.com/golang/go/issues/23019 instead. It would save far more time in the long run, because it’s a change we likely want to do anyway and should fix other cascading bugs. But it’s also a bit more controversial.