go: runtime: race between stack shrinking and channel send/recv leads to bad sudog values

Internally we’ve seen a rare crash arise in the runtime since Go 1.14. The error message is typically sudog with non-nil elem stemming from a call to releaseSudog from chansend or chanrecv.

The issue here is a race between a mark worker and a channel operation. Consider the following sequence of events. GW is a worker G. GS is a G trying to send on a channel. GR is a G trying to receive on that same channel.

GW wants to suspend GS to scan its stack. It calls suspendG.
GS is about to gopark in chansend. It calls into gopark, and changes its status to _Gwaiting BEFORE calling its unlockf, which sets gp.activeStackChans.
GW observes _Gwaiting and returns from suspendG. It continues into scanstack where it checks if it’s safe to shrink the stack. In this case, it’s fine. So, it reads gp.activeStackChans, and sees it as false. It begins adjusting sudog pointers without synchronization. It reads the sudog’s elem pointer from the chansend, but has not written it back yet.
GS continues on its merry way and sets gp.activeStackChans and parks. It doesn’t really matter when this happens at this point.
GR comes in and wants to chanrecv on channel. It grabs the channel lock, reads from the sudog’s elem field, and clears it. GR readies GS.
GW then writes the updated sudog’s elem pointer and continues on its merry way.
Sometime later, GS wakes up because it was readied by GR, and tries to release the sudog, which has a non-nil elem field.

The fix here, I believe, is to set gp.activeStackChans before the unlockf is called. Doing this ensures that the value is updated before any worker that could shrink GS’s stack observes a useful G status in suspendG. This could alternatively be fixed by changing the G status after unlockf is called, but I worry that will break a lot of things.

CC @aclements @prattmic

About this issue

Original URL
State: closed
Created 4 years ago
Comments: 26 (19 by maintainers)

Commits related to this issue

runtime: expand gopark documentation unlockf is called after the G is put into _Gwaiting, meaning another G may have readied this one before unlockf is called. This is implied by the current doc, bu... — committed to golang/go by prattmic 4 years ago
[release-branch.go1.15] runtime: disable stack shrinking in activeStackChans race window Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gop... — committed to golang/go by mknyszek 4 years ago
[release-branch.go1.14] runtime: disable stack shrinking in activeStackChans race window Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gop... — committed to golang/go by mknyszek 4 years ago
[release-branch.go1.15] runtime: disable stack shrinking in activeStackChans race window Currently activeStackChans is set before a goroutine blocks on a channel operation in an unlockf passed to gop... — committed to claucece/go by mknyszek 4 years ago
golang: v↑ go1.14 (1.14.9 -> 1.14.10) Going Go1.14.9 -> Go1.14.10 brings in compiler and runtime fixes including fix for crash in garbage-collector due to race condition: https://github.com/golang/g... — committed to SlapOS/slapos by navytux 4 years ago

Most upvoted comments

OK I think I have a fix (trybots look good so far, 🤞), but still no working test. The goal of the test I’m trying to write is to create a situation where a goroutine whose stack is ripe for shrinking keeps going to sleep and waking up on channels. Meanwhile, a GC is triggered to try and get a mark worker to race with that goroutine going to sleep. I wrote a test which I’ve confirmed races stack shrinks with channel sends and receives, but I’m noticing a couple problems in trying to reproduce the issue.

It’s relatively expensive to generate a stack shrink in the runtime, so stress testing this isn’t very effective. I could maybe get around this by writing an unsafe function in export_test.go which does suspendG followed by shrinkstack in a loop. This way we don’t need to wait for the whole GC to end before trying again.
The race window is really, really small compared to everything around it. This fact is evidenced not only by looking at the code, but also by the fact that this crash was fairly rare.

As a result of these problems, I’m not certain that this is feasible to write a test for, at least not a short-running one, if we want it to fail reliably. But, maybe I’m just going about this the wrong way (i.e. a stress test is just the wrong choice here)? Any advice?

mknyszek on Aug 11, 2020

By the way, I look forward to the test case.

ianlancetaylor on Aug 10, 2020