go: runtime: Windows binaries built with -race occasionally deadlock
Starting in November, there appears to have been a dramatic uptick in the number of test timeouts on the windows-* builders.
Many of these are for tests that normally run nearly instantaneously, such as archive/tar and bufio.
2019-11-22T03:06:22-0e02cfb/windows-amd64-race
2019-11-21T22:20:17-94e9a5e/windows-amd64-race
2019-11-21T19:27:16-f4a8bf1/windows-amd64-longtest
2019-11-21T19:09:24-2434869/windows-amd64-longtest
2019-11-21T19:09:24-2434869/windows-amd64-race
2019-11-21T16:56:47-37715cc/windows-amd64-longtest
2019-11-21T16:56:47-37715cc/windows-amd64-race
2019-11-21T16:01:14-c7e73ef/windows-amd64-race
2019-11-20T20:51:13-9852b4b/windows-amd64-race
2019-11-13T19:15:27-7ad2748/windows-amd64-longtest
2019-11-12T22:09:05-a56d755/windows-amd64-2016
2019-11-12T01:07:15-ec73263/windows-amd64-2012
2019-11-08T17:01:05-a5a6f61/windows-amd64-2012
2019-11-07T19:18:12-1b0b980/windows-amd64-2012
2019-11-07T05:52:34-3eabdd2/windows-amd64-longtest
2019-11-06T09:09:59-0c5d545/windows-amd64-2008
2019-11-06T02:52:51-f71bd51/windows-amd64-2016
2019-11-05T16:31:48-414c1d4/windows-amd64-2016
2019-11-05T14:44:56-e457cc3/windows-amd64-race
2019-11-05T05:19:08-d51f7f3/windows-amd64-longtest
2019-11-05T03:50:54-979d65d/windows-amd64-2016
2019-11-05T00:19:10-6cbd737/windows-amd64-race
2019-11-01T14:48:28-a570fcf/windows-amd64-2012
2019-03-19T08:30:50-451a2eb/windows-amd64-2008
2018-12-05T21:54:54-6454a09/windows-amd64-race
About this issue
- State: closed
- Created 5 years ago
- Comments: 17 (16 by maintainers)
Alrighty thanks to @aclements and a Windows laptop we have a reproducer, a theory, and a partial fix.
The problem is a race between `SuspendThread` and `ExitProcess` on Windows. The order of events is as follows:

1. Thread 1: Suspend (asynchronously) Thread 2.
2. Thread 2: Call `ExitProcess`, which terminates all threads except Thread 2.
3. Thread 2: In `ExitProcess`, receives asynchronous notification to suspend, and stops.

This race is already handled in the runtime for the usual exits by putting a lock around suspending a thread (and effectively disallowing it in certain cases, like exit), but in race mode `__tsan_fini` (called by `racefini`) calls `ExitProcess` instead. The fix is to just grab this lock before calling into `__tsan_fini`.

Unfortunately this raises a bigger issue: what if C code, called from Go, calls `ExitProcess` on Windows? We have no way to synchronize asynchronous preemption with that like we do with exits we can actually control. One thought is that `ExitProcess` already calls a bunch of DLL hooks; could we throw in our own to side-step this issue maybe? More thought on this problem is required.
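
To make the locking idea concrete, here is a minimal user-level sketch in Go of serializing "suspend a thread" against "exit the process" with one shared lock. It is only a model and not the runtime's code: the `exitGuard` type and its `Preempt` and `Exit` methods are hypothetical names invented for illustration, the suspend and exit steps are plain callbacks instead of `SuspendThread` and `ExitProcess`, and the simulated exit returns so that the program can print a result.

```go
package main

import (
	"fmt"
	"sync"
)

// exitGuard is a hypothetical stand-in for the runtime lock that serializes
// thread suspension against process exit. It is not the runtime's real type.
type exitGuard struct {
	mu      sync.Mutex
	exiting bool // set once an exit has started; never cleared
}

// Preempt models the suspending side (Thread 1). The whole
// suspend/inspect/resume sequence runs under the lock, so it can never
// overlap with an exit. It reports whether the sequence was run.
func (g *exitGuard) Preempt(suspendAndResume func()) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.exiting {
		return false // an exit has already started: suspend nothing
	}
	suspendAndResume()
	return true
}

// Exit models the exiting side (Thread 2). It takes the same lock before
// calling the exit routine, so any in-flight suspension has finished and
// no new one can start afterwards.
func (g *exitGuard) Exit(exitProcess func()) {
	g.mu.Lock()
	defer g.mu.Unlock() // reached only because the simulated exit returns
	g.exiting = true
	exitProcess()
}

func main() {
	var g exitGuard
	var wg sync.WaitGroup

	// Several "Thread 1" goroutines race to preempt while "Thread 2" exits.
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			g.Preempt(func() {}) // may or may not run, but never during Exit
		}()
	}

	exited := false
	g.Exit(func() { exited = true }) // stand-in for the call into __tsan_fini
	wg.Wait()

	fmt.Println("exited:", exited, "late preempt allowed:", g.Preempt(func() {}))
}
```

The sketch also shows why this approach does not cover the C-code case raised above: it only works when the exiting side cooperates by taking the lock, which code outside the runtime's control has no reason to do.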