go: x/sync/errgroup: propagate panics and Goexits through Wait

Background

The handling of panics and calls to runtime.Goexit in x/sync/errgroup has come up several times in its history:

  • #40484 asked why we don’t recover panics in goroutines.
  • #49802 proposed a separate panicgroup API to propagate or handle panics
  • CL 131815 pointed out that calls to t.Fatal and/or t.Skip within a Group in a test will generally result in either a hard-to-diagnose deadlock or an awkward half-aborted test, instead of skipping or failing the test immediately as expected.
  • In my GopherCon 2018 talk, “Rethinking Classical Concurrency Patterns”, I recommended that API authors “[m]ake concurrency an internal detail.” In multiple discussions after the talk, folks asked me how to handle panics in goroutines, and I realized that making concurrency an internal detail requires that we propagate panics (and runtime.Goexit calls) back to the caller’s goroutine. (Otherwise, a concurrent call that panics would terminate the program, while a sequential call that panics would be recoverable!)

Proposal

I propose that:

  • The (*Group).Wait method should continue to wait for all goroutines in the group to exit, However, once that condition is met, if any of the goroutines in the group terminated with an unrecovered panic, Wait should panic with a value wrapping the first panic-value recovered from a goroutine in the group. Otherwise, if any of the goroutines exited via runtime.Goexit Wait should invoke runtime.Goexit on its own goroutine.

    • Because the runtime does not support saving and restoring the stack trace of a recovered panic, the value passed to panic by Wait should include a best-effort stack dump for the goroutine that initiated the panic.
    • Because some packages may use recover for error-handling (despite our advice to the contrary), if the recovered value implements the error interface, the value passed to panic by Wait should also implement the error interface, and should wrap the recovered error (so that it can be retrieved by errors.Unwrap).
  • The Context value returned by errgroup.WithContext should be canceled as soon as any function call in the group returns a non-nil error, panics, or exits via runtime.Goexit.

    • All of these conditions indicate that Wait has an abnormal status to report, and thus should shut down all work associated with the Group so that the abnormal status can be reported quickly.

Specifically, if Wait panics, the panic-value would have either type PanicValue or type PanicError, defined as follows:

// A PanicError wraps an error recovered from an unhandled panic
// when calling a function passed to Go or TryGo.
type PanicError struct {
	Recovered error
	Stack     []byte
}

func (p PanicError) Error() string {
	// A Go Error method conventionally does not include a stack dump, so omit it
	// here. (Callers who care can extract it from the Stack field.)
	return fmt.Sprintf("recovered from errgroup.Group: %v", p.Recovered)
}

func (p PanicError) Unwrap() error { return p.Recovered }

// A PanicValue wraps a value that does not implement the error interface,
// recovered from an unhandled panic when calling a function passed to Go or
// TryGo.
type PanicValue struct {
	Recovered interface{}
	Stack     []byte
}

func (p PanicValue) String() string {
	if len(p.Stack) > 0 {
		return fmt.Sprintf("recovered from errgroup.Group: %v\n%s", p.Recovered, p.Stack)
	}
	return fmt.Sprintf("recovered from errgroup.Group: %v", p.Recovered)
}

Compatibility

Any program that today initiates an unrecovered panic within a Go or TryGo callback terminates due to that unrecovered panic, So recovering and propagating such a panic can only change broken programs into non-broken ones; it cannot break any program that was not already broken.

A valid program could in theory call runtime.Goexit from within a Go callback today. However, the vast majority of calls to runtime.Goexit are via testing.T methods, and according to the documentation for those methods today they “must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test.” Moreover, it would be possible to implement the documented errgroup.Group API today in a way that would cause Wait to always deadlock if runtime.Goexit were called, so any caller relying on the existing runtime.Goexit behavior is assuming an implementation detail that is not guaranteed.

In light of the above, I believe that the proposed changes are backward-compatible.

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Reactions: 17
  • Comments: 30 (27 by maintainers)

Commits related to this issue

Most upvoted comments

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

@bcmills To me, while both “concurrently running goroutines will continue to execute while a panic unwinds the stack” and “the program will continue to run when something upstack from Wait recovers the panic” both create a situation where a program runs in an undefined state for some amount of time, the quantitative difference between these two is large enough to be a qualitative one.

I’m not vehemently opposed to this change. I think the cases where it would lead to real, practical problems are very rare. But I don’t think the change in behavior can just be swiped away as “eh, it’s broken anyways”.

Moreover, I’m personally not convinced that this is a good change in and off itself.

CL 131815 pointed out that calls to t.Fatal and/or t.Skip within a Group in a test will generally result in either a hard-to-diagnose deadlock or an awkward half-aborted test, instead of skipping or failing the test immediately as expected.

I would argue that this is simply observing that calling t.Fatal from a goroutine is a bug and that this bug is hard to diagnose in some cases. The fact that it is a bug, is called out in the docs.

This change would make it not-a-bug in some cases, but we certainly can’t make it not-a-bug in all cases. So the restriction would have to stay in the docs and people will continue to have to be aware of it. Arguably, to someone who is aware of it, t.Skip working when done in an errgroup is making the program behave less as expected, not more.

In my GopherCon 2018 talk, “Rethinking Classical Concurrency Patterns”, I recommended that API authors “[m]ake concurrency an internal detail.” In multiple discussions after the talk, folks asked me how to handle panics in goroutines, and I realized that making concurrency an internal detail requires that we propagate panics (and runtime.Goexit calls) back to the caller’s goroutine. (Otherwise, a concurrent call that panics would terminate the program, while a sequential call that panics would be recoverable!)

I don’t understand the scenario of concern.

  • Is the scenario that a library author takes a callback (e.g. as an interface) from the user and moves to calling that (multiple times) from an errgroup? Because in that case, this change isn’t an “internal detail” anyways, concurrent calls can introduce races, so if a callback can be called multiple times concurrently, that has to be part of the API.
  • Is the scenario that a library author calls library code concurrently and, by doing that change, changes from a recovering program to a crashing program? In this case, yes, the behavior would change, but it would only do so because of a bug in the library. I feel that “maintaining backwards compatibility in the presence of bugs” is a bit of an unrealistic stretch goal.
  • Is the scenario that a library user moves from calling the library sequentially to calling it concurrently? In that case, yes, the behavior of the program would change if the library panics, but that seems hardly the concern of the library author and whether concurrency is internal to the library.
  • So, the only scenario left I can imagine is that the library author takes a callback and moves it into an errgroup, calling it exactly once. That seems a niche scenario. It certainly happens, but ISTM that if the library author is concerned about the behavior of their library in the presence of panics and wants to make the propagating, they can do so by recovering manually, right?

In general, ISTM that there are two different mindsets regarding panic/recover in the community: One of “a panic should ~always crash the program” and one of “a panic should ~always be recovered on a best-effort basis”. I think the core problem is there is no consistent application of either of these mindsets in the community or the standard library and it leads to the worst situation for both kinds of people. As some libraries work from one mindset and some work from the other, an operator (or the author of an end-user program) can neither rely on panics crashing the program, nor rely on panics being recovered. Neither can a library author use panic as a reliable signal that something is wrong.

I find that frustrating. And to me, this change muddies the waters even further. But, as I said, I’m not vehemently opposed. If it happens, it happens. Without a consistent decision about which mindset to apply, the waters will stay muddy either way, this change on its own won’t really move the situation much in either direction.

The proposal has been accepted and there is a work-in-progress change at https://go.dev/cl/416555.

I support the addition of the “recover” behavior to the errgroup as an option. if it was set, we can replace f() with recoveredFn(f)()

func recoveredFn(f func() error) func() error {
	return func() (err error) {
		defer func() {
			if r := recover(); r != nil {
				buf := make([]byte, 64<<10)
				buf = buf[:runtime.Stack(buf, false)]
				err = fmt.Errorf("errgroup: panic recovered: %s\n%s", r, buf)
			}
		}()
		return f()
	}
}

Nothing should be using Goexit. Is the problem that errgroup is being used and the functions call t.Fatal or t.FailNow?

Not that it is, but that it isn’terrgroup could be a really useful way to coordinate, say, goroutines running test-helpers (such as mock servers) in a way that cleanly isolates them between independent tests.

But today errgroup is only of limited use in tests, because the callbacks passed to errgroup cannot safely invoke any test-helper that might call Skip or Fatal.