go: runtime: race between stack shrinking and channel send/recv leads to bad sudog values
Internally we’ve seen a rare crash arise in the runtime since Go 1.14. The error message is typically sudog with non-nil elem
stemming from a call to releaseSudog
from chansend
or chanrecv
.
The issue here is a race between a mark worker and a channel operation. Consider the following sequence of events. GW is a worker G. GS is a G trying to send on a channel. GR is a G trying to receive on that same channel.
- GW wants to suspend GS to scan its stack. It calls
suspendG
. - GS is about to
gopark
inchansend
. It calls intogopark
, and changes its status to_Gwaiting
BEFORE calling itsunlockf
, which setsgp.activeStackChans
. - GW observes
_Gwaiting
and returns fromsuspendG
. It continues intoscanstack
where it checks if it’s safe to shrink the stack. In this case, it’s fine. So, it readsgp.activeStackChans
, and sees it as false. It begins adjustingsudog
pointers without synchronization. It reads thesudog
’selem
pointer from thechansend
, but has not written it back yet. - GS continues on its merry way and sets
gp.activeStackChans
and parks. It doesn’t really matter when this happens at this point. - GR comes in and wants to
chanrecv
on channel. It grabs the channel lock, reads from thesudog
’selem
field, and clears it. GR readies GS. - GW then writes the updated sudog’s elem pointer and continues on its merry way.
- Sometime later, GS wakes up because it was readied by GR, and tries to release the
sudog
, which has a non-nilelem
field.
The fix here, I believe, is to set gp.activeStackChans
before the unlockf
is called. Doing this ensures that the value is updated before any worker that could shrink GS’s stack observes a useful G status in suspendG
. This could alternatively be fixed by changing the G status after unlockf
is called, but I worry that will break a lot of things.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 26 (19 by maintainers)
Commits related to this issue
- runtime: expand gopark documentation unlockf is called after the G is put into _Gwaiting, meaning another G may have readied this one before unlockf is called. This is implied by the current doc, bu... — committed to golang/go by prattmic 4 years ago
- [release-branch.go1.15] runtime: disable stack shrinking in activeStackChans race window Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gop... — committed to golang/go by mknyszek 4 years ago
- [release-branch.go1.14] runtime: disable stack shrinking in activeStackChans race window Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gop... — committed to golang/go by mknyszek 4 years ago
- [release-branch.go1.15] runtime: disable stack shrinking in activeStackChans race window Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gop... — committed to claucece/go by mknyszek 4 years ago
- golang: v↑ go1.14 (1.14.9 -> 1.14.10) Going Go1.14.9 -> Go1.14.10 brings in compiler and runtime fixes including fix for crash in garbage-collector due to race condition: https://github.com/golang/g... — committed to SlapOS/slapos by navytux 4 years ago
OK I think I have a fix (trybots look good so far, 🤞), but still no working test. The goal of the test I’m trying to write is to create a situation where a goroutine whose stack is ripe for shrinking keeps going to sleep and waking up on channels. Meanwhile, a GC is triggered to try and get a mark worker to race with that goroutine going to sleep. I wrote a test which I’ve confirmed races stack shrinks with channel sends and receives, but I’m noticing a couple problems in trying to reproduce the issue.
export_test.go
which doessuspendG
followed byshrinkstack
in a loop. This way we don’t need to wait for the whole GC to end before trying again.As a result of these problems, I’m not certain that this is feasible to write a test for, at least not a short-running one, if we want it to fail reliably. But, maybe I’m just going about this the wrong way (i.e. a stress test is just the wrong choice here)? Any advice?
By the way, I look forward to the test case.