go: runtime: Windows binaries built with -race occasionally deadlock
Starting in November, there appears to be a dramatic uptick in the number of test timeouts on the windows-* builders.
Many of these are for tests that normally run nearly instantaneously, such as archive/tar and bufio.
2019-11-22T03:06:22-0e02cfb/windows-amd64-race
2019-11-21T22:20:17-94e9a5e/windows-amd64-race
2019-11-21T19:27:16-f4a8bf1/windows-amd64-longtest
2019-11-21T19:09:24-2434869/windows-amd64-longtest
2019-11-21T19:09:24-2434869/windows-amd64-race
2019-11-21T16:56:47-37715cc/windows-amd64-longtest
2019-11-21T16:56:47-37715cc/windows-amd64-race
2019-11-21T16:01:14-c7e73ef/windows-amd64-race
2019-11-20T20:51:13-9852b4b/windows-amd64-race
2019-11-13T19:15:27-7ad2748/windows-amd64-longtest
2019-11-12T22:09:05-a56d755/windows-amd64-2016
2019-11-12T01:07:15-ec73263/windows-amd64-2012
2019-11-08T17:01:05-a5a6f61/windows-amd64-2012
2019-11-07T19:18:12-1b0b980/windows-amd64-2012
2019-11-07T05:52:34-3eabdd2/windows-amd64-longtest
2019-11-06T09:09:59-0c5d545/windows-amd64-2008
2019-11-06T02:52:51-f71bd51/windows-amd64-2016
2019-11-05T16:31:48-414c1d4/windows-amd64-2016
2019-11-05T14:44:56-e457cc3/windows-amd64-race
2019-11-05T05:19:08-d51f7f3/windows-amd64-longtest
2019-11-05T03:50:54-979d65d/windows-amd64-2016
2019-11-05T00:19:10-6cbd737/windows-amd64-race
2019-11-01T14:48:28-a570fcf/windows-amd64-2012
2019-03-19T08:30:50-451a2eb/windows-amd64-2008
2018-12-05T21:54:54-6454a09/windows-amd64-race
About this issue
- State: closed
- Created 5 years ago
- Comments: 17 (16 by maintainers)
Alrighty thanks to @aclements and a Windows laptop we have a reproducer, a theory, and a partial fix.
The problem is a race between SuspendThread and ExitProcess on Windows. The order of events is as follows:

- Thread 1: Suspend (asynchronously) Thread 2.
- Thread 2: Call ExitProcess, which terminates all threads except Thread 2.
- Thread 2: In ExitProcess, receives asynchronous notification to suspend, and stops.

This race is already handled in the runtime for the usual exits by putting a lock around suspending a thread (and effectively disallowing it in certain cases, like exit), but in race mode __tsan_fini (called by racefini) calls ExitProcess instead. The fix is to just grab this lock before calling into __tsan_fini.

Unfortunately this raises a bigger issue: what if C code, called from Go, calls ExitProcess on Windows? We have no way to synchronize asynchronous preemption with that like we do with exits we can actually control. One thought is that ExitProcess already calls a bunch of DLL hooks; could we throw in our own to side-step this issue maybe? More thought on this problem is required.
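
For readers who want the locking idea in concrete form, here is a rough, portable sketch of "take the suspend lock before exiting". It is a toy model, not the runtime's code: exitGuard, withSuspend, and beginExit are hypothetical names, goroutines stand in for OS threads, and a flag stands in for the point of no return that ExitProcess represents. Unlike a real exit path, the model releases the lock after marking the exit so that the demo can finish.

```go
package main

import (
	"fmt"
	"sync"
)

// exitGuard is a hypothetical stand-in for a lock that serializes
// "suspend another thread" against "start exiting the process".
type exitGuard struct {
	mu      sync.Mutex
	exiting bool // set once an exit has begun; never cleared
}

// withSuspend runs fn, which stands in for suspending, inspecting, and
// resuming another thread, while holding the lock. Once an exit has begun
// it refuses to run fn. It reports whether fn ran.
func (g *exitGuard) withSuspend(fn func()) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.exiting {
		return false
	}
	fn()
	return true
}

// beginExit is meant to be called before any call that tears the process
// down. When it returns, no suspension is in flight and none can start.
func (g *exitGuard) beginExit() {
	g.mu.Lock()
	g.exiting = true
	g.mu.Unlock()
}

func main() {
	var g exitGuard
	var wg sync.WaitGroup

	// Several "preemptors" keep suspending until they are refused.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for g.withSuspend(func() {}) {
			}
		}()
	}

	g.beginExit() // the exiting thread wins the lock at some point
	wg.Wait()     // every preemptor observes the exit and stops

	fmt.Println("suspend allowed after exit began:", g.withSuspend(func() {}))
	// prints "suspend allowed after exit began: false"
}
```

The point of the sketch is only the ordering guarantee: a suspension either completes entirely before the exit begins or is refused. It does nothing for the harder case raised above, where C code calls ExitProcess directly and never goes through beginExit.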