go: runtime: hangs in TestGdbBacktrace on linux
2020-02-22T04:31:41-059a5ac/linux-mips64le-mengzhuo
goroutine 23401 [syscall, 11 minutes]:
syscall.Syscall6(0x1475, 0x1, 0x3b6f, 0xc000cad968, 0x1000004, 0x0, 0x0, 0x120153878, 0xc000cad960, 0x120080418)
/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/syscall/asm_linux_mips64x.s:40 +0x10 fp=0xc000cad910 sp=0xc000cad908 pc=0x1200cc678
os.(*Process).blockUntilWaitable(0xc00001ab70, 0x0, 0x1200103ec, 0x0)
/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/wait_waitid.go:31 +0x88 fp=0xc000cad9f8 sp=0xc000cad910 pc=0x1200e49e8
os.(*Process).wait(0xc00001ab70, 0x1202d17d0, 0x1202d17d8, 0x1202d17c8)
/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec_unix.go:22 +0x4c fp=0xc000cada68 sp=0xc000cad9f8 pc=0x1200df43c
os.(*Process).Wait(...)
/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec.go:125
os/exec.(*Cmd).Wait(0xc000ad8f20, 0x0, 0x0)
/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec/exec.go:502 +0x68 fp=0xc000cadad8 sp=0xc000cada68 pc=0x120153e18
os/exec.(*Cmd).Run(0xc000ad8f20, 0xc00009aff0, 0xc000ad8f20)
/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec/exec.go:340 +0x74 fp=0xc000cadaf8 sp=0xc000cadad8 pc=0x12015340c
os/exec.(*Cmd).CombinedOutput(0xc000ad8f20, 0x3, 0xc000cade78, 0xf, 0xf, 0xc000ad8f20)
/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/os/exec/exec.go:562 +0xbc fp=0xc000cadb20 sp=0xc000cadaf8 pc=0x1201541dc
runtime_test.TestGdbBacktrace(0xc00022a5a0)
/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/runtime/runtime-gdb_test.go:388 +0x6c4 fp=0xc000cadf80 sp=0xc000cadb20 pc=0x120202e64
testing.tRunner(0xc00022a5a0, 0x1202d2e00)
/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/testing/testing.go:992 +0xf8 fp=0xc000cadfc8 sp=0xc000cadf80 pc=0x12010a978
runtime.goexit()
/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/runtime/asm_mips64x.s:646 +0x4 fp=0xc000cadfc8 sp=0xc000cadfc8 pc=0x120084354
created by testing.(*T).Run
/tmp/workdir-host-linux-mipsle-mengzhuo/go/src/testing/testing.go:1043 +0x378
About this issue
- State: closed
- Created 4 years ago
- Comments: 24 (15 by maintainers)
Commits related to this issue
- testenv: abstract run-with-timeout into testenv This lifts the logic to run a subcommand with a timeout in a test from the runtime's runTestProg into testenv. The implementation is unchanged in this ... — committed to golang/go by aclements 3 years ago
- testenv: kill subprocess if SIGQUIT doesn't do it This makes testenv.RunWithTimeout first attempt to SIGQUIT the subprocess to get a useful Go traceback, but if that doesn't work, it sends a SIGKILL ... — committed to golang/go by aclements 3 years ago
- runtime: run gdb with a timeout for TestGdbBacktrace This sometimes times out and we don't have any useful output for debugging it. Hopefully this will help. For #37405. Change-Id: I79074e6fbb9bd16... — committed to golang/go by aclements 3 years ago
- internal/testenv: remove RunWithTimout For most tests, the test's deadline itself is more appropriate than an arbitrary timeout layered atop of it (especially once #48157 is implemented), and testenv... — committed to golang/go by bcmills 2 years ago
- internal/testenv: remove RunWithTimout For most tests, the test's deadline itself is more appropriate than an arbitrary timeout layered atop of it (especially once #48157 is implemented), and testenv... — committed to TroutSoftware/go by bcmills 2 years ago
- runtime: eliminate arbitrary timeouts in runBuiltTestProg and TestGdbBacktrace This may fix the TestEINTR failures that have been frequent on the riscv64 builders since CL 445597. Updates #37405. Up... — committed to golang/go by bcmills 2 years ago
Digging around a bit, it looks like signaling a zombie process will not change its exit status, so this is GDB failing to exit when its inferior exits.
Given the “[Inferior 1 (process 1173835) exited normally]” at the end of the GDB output, this is either a bug in GDB where it doesn’t properly exit, or the test is somehow missing the fact that GDB is exiting. I think the “gdb exited with error: signal: killed” indicates that the GDB process was still around to be killed, but I’m not entirely sure what happens if you send a signal to a zombie.
If this is a GDB bug, that’s unfortunate. We could work around it by looking at the GDB output as it’s running and killing it if it looks complete enough, or by just using a short timeout and accepting correct output even if it timed out.
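The first workaround (watch the output as it arrives and kill the process once it looks complete) could be sketched roughly as below. This is only an illustration, not code from the runtime tests: the helper name `runUntilMarker` is hypothetical, the marker string is taken from the GDB line quoted earlier in this thread, and a small `sh` script stands in for a GDB that prints its final line and then fails to exit.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strings"
)

// runUntilMarker is a hypothetical helper (not part of testenv or the
// runtime tests). It runs the command, reads its combined stdout and
// stderr line by line, and kills the process as soon as a line
// containing marker is seen. The caller can then accept the output
// even though the process did not exit on its own.
func runUntilMarker(marker, name string, args ...string) (output string, sawMarker bool, err error) {
	cmd := exec.Command(name, args...)
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return "", false, err
	}
	// Send stderr down the same pipe so the output stays interleaved.
	cmd.Stderr = cmd.Stdout
	if err := cmd.Start(); err != nil {
		return "", false, err
	}

	var out strings.Builder
	sc := bufio.NewScanner(stdout)
	for sc.Scan() {
		line := sc.Text()
		out.WriteString(line)
		out.WriteByte('\n')
		if !sawMarker && strings.Contains(line, marker) {
			sawMarker = true
			// The output looks complete; do not wait for a clean exit.
			cmd.Process.Kill()
		}
	}
	// All reads are finished, so it is now safe to call Wait.
	err = cmd.Wait()
	return out.String(), sawMarker, err
}

func main() {
	// Stand-in for a debugger that prints its last line and then hangs.
	out, saw, err := runUntilMarker("exited normally]", "sh", "-c",
		`echo "[Inferior 1 (process 123) exited normally]"; exec sleep 30`)
	fmt.Printf("saw marker: %v, wait error: %v\noutput:\n%s", saw, err, out)
}
```

When the marker was seen, a caller would presumably ignore the resulting "signal: killed" error and judge the run by its output alone. One caveat of this sketch: if a grandchild process keeps the pipe open after the kill, the read loop would still block.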
Since we’ve at least made tangible progress on diagnosing the problem during the 1.18 cycle, I think it would be ok to move this back to the Backlog milestone and/or mark it WaitingForInfo while we wait for another repro.
It’s unfortunate but not terribly surprising for flaky tests not to reproduce as often during the code freeze, because the rate of test runs (especially for fast and/or scalable builders) tends to be much higher during the active development window.
We could implement our own timeout in TestGdbBacktrace so it can fail cleanly and print the output it has so far from GDB.
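A minimal sketch of such a timeout, assuming a hypothetical helper named `runWithTimeout` (this is not the real testenv API, and the durations are arbitrary). It follows the idea described in the related commits: send SIGQUIT first, fall back to SIGKILL if the process is still around after a grace period, and return whatever output was collected so the test can print it.

```go
package main

import (
	"bytes"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// runWithTimeout is a hypothetical helper for illustration only. It runs
// the command and collects combined stdout and stderr. If the command is
// still running after timeout, it sends SIGQUIT, waits up to grace, and
// then sends SIGKILL. The output gathered so far is always returned.
func runWithTimeout(timeout, grace time.Duration, name string, args ...string) (out []byte, timedOut bool, err error) {
	cmd := exec.Command(name, args...)
	var buf bytes.Buffer
	cmd.Stdout = &buf
	cmd.Stderr = &buf // same writer, so os/exec uses a single pipe
	if err := cmd.Start(); err != nil {
		return nil, false, err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err = <-done:
		return buf.Bytes(), false, err
	case <-time.After(timeout):
	}

	// Timed out: ask for termination with SIGQUIT first.
	cmd.Process.Signal(syscall.SIGQUIT)
	select {
	case err = <-done:
	case <-time.After(grace):
		// SIGQUIT was not enough; force the process down.
		cmd.Process.Kill()
		err = <-done
	}
	// Wait has returned, so buf is no longer being written to.
	return buf.Bytes(), true, err
}

func main() {
	// Stand-in for a subprocess that prints something and then hangs.
	out, timedOut, err := runWithTimeout(time.Second, time.Second,
		"sh", "-c", "echo partial output; exec sleep 30")
	fmt.Printf("timed out: %v, error: %v\noutput so far:\n%s", timedOut, err, out)
}
```

In a test, the timed-out case would likely be reported with something like `t.Fatalf` including the captured output, which is the information that was missing from the original hang reports.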
This turns out not to be specific to the mips64le builder. See also #39228 (occasional failures instead of hangs).
2021-01-23T19:46:06-9897655/linux-amd64-sid
2020-11-02T03:03:16-0387bed/linux-386-softfloat
2020-06-25T12:02:38-334752d/linux-amd64-staticlockranking