go: runtime: race between stack shrinking and channel send/recv leads to bad sudog values
Internally we’ve seen a rare crash arise in the runtime since Go 1.14. The error message is typically sudog with non-nil elem stemming from a call to releaseSudog from chansend or chanrecv.
The issue here is a race between a mark worker and a channel operation. Consider the following sequence of events. GW is a worker G. GS is a G trying to send on a channel. GR is a G trying to receive on that same channel.
- GW wants to suspend GS to scan its stack. It calls
suspendG. - GS is about to
goparkinchansend. It calls intogopark, and changes its status to_GwaitingBEFORE calling itsunlockf, which setsgp.activeStackChans. - GW observes
_Gwaitingand returns fromsuspendG. It continues intoscanstackwhere it checks if it’s safe to shrink the stack. In this case, it’s fine. So, it readsgp.activeStackChans, and sees it as false. It begins adjustingsudogpointers without synchronization. It reads thesudog’selempointer from thechansend, but has not written it back yet. - GS continues on its merry way and sets
gp.activeStackChansand parks. It doesn’t really matter when this happens at this point. - GR comes in and wants to
chanrecvon channel. It grabs the channel lock, reads from thesudog’selemfield, and clears it. GR readies GS. - GW then writes the updated sudog’s elem pointer and continues on its merry way.
- Sometime later, GS wakes up because it was readied by GR, and tries to release the
sudog, which has a non-nilelemfield.
The fix here, I believe, is to set gp.activeStackChans before the unlockf is called. Doing this ensures that the value is updated before any worker that could shrink GS’s stack observes a useful G status in suspendG. This could alternatively be fixed by changing the G status after unlockf is called, but I worry that will break a lot of things.
About this issue
- Original URL
- State: closed
- Created 4 years ago
- Comments: 26 (19 by maintainers)
Commits related to this issue
- runtime: expand gopark documentation unlockf is called after the G is put into _Gwaiting, meaning another G may have readied this one before unlockf is called. This is implied by the current doc, bu... — committed to golang/go by prattmic 4 years ago
- [release-branch.go1.15] runtime: disable stack shrinking in activeStackChans race window Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gop... — committed to golang/go by mknyszek 4 years ago
- [release-branch.go1.14] runtime: disable stack shrinking in activeStackChans race window Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gop... — committed to golang/go by mknyszek 4 years ago
- [release-branch.go1.15] runtime: disable stack shrinking in activeStackChans race window Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gop... — committed to claucece/go by mknyszek 4 years ago
- golang: v↑ go1.14 (1.14.9 -> 1.14.10) Going Go1.14.9 -> Go1.14.10 brings in compiler and runtime fixes including fix for crash in garbage-collector due to race condition: https://github.com/golang/g... — committed to SlapOS/slapos by navytux 4 years ago
OK I think I have a fix (trybots look good so far, 🤞), but still no working test. The goal of the test I’m trying to write is to create a situation where a goroutine whose stack is ripe for shrinking keeps going to sleep and waking up on channels. Meanwhile, a GC is triggered to try and get a mark worker to race with that goroutine going to sleep. I wrote a test which I’ve confirmed races stack shrinks with channel sends and receives, but I’m noticing a couple problems in trying to reproduce the issue.
export_test.gowhich doessuspendGfollowed byshrinkstackin a loop. This way we don’t need to wait for the whole GC to end before trying again.As a result of these problems, I’m not certain that this is feasible to write a test for, at least not a short-running one, if we want it to fail reliably. But, maybe I’m just going about this the wrong way (i.e. a stress test is just the wrong choice here)? Any advice?
By the way, I look forward to the test case.