x/sync/errgroup: propagate panics and Goexits through Wait

### Background

The handling of panics and calls to `runtime.Goexit` in `x/sync/errgroup` has come up several times in its history:
* #40484 asked why we don't recover panics in goroutines.
* #49802 proposed a separate `panicgroup` API to propagate or handle panics.
* [CL 131815](https://go.dev/cl/131815) pointed out that calls to `t.Fatal` and/or `t.Skip` within a `Group` in a test will generally result in either a hard-to-diagnose deadlock or an awkward half-aborted test, instead of skipping or failing the test immediately as expected.
    * (Compare #15758, #3800.)
* In my GopherCon 2018 talk, “[Rethinking Classical Concurrency Patterns](https://youtu.be/5zXAHh5tJqQ)”, I recommended that API authors “[m]ake concurrency an internal detail.” In multiple discussions after the talk, folks asked me how to handle panics in goroutines, and I realized that making concurrency an internal detail requires that we propagate panics (and `runtime.Goexit` calls) back to the caller's goroutine. (Otherwise, a concurrent call that panics would terminate the program, while a sequential call that panics would be recoverable!)

### Proposal

I propose that:

* The `(*Group).Wait` method should continue to wait for all goroutines in the group to exit. However, once that condition is met, if any of the goroutines in the group terminated with an unrecovered `panic`, `Wait` should panic with a value wrapping the first panic-value recovered from a goroutine in the group. Otherwise, if any of the goroutines exited via `runtime.Goexit`, `Wait` should invoke `runtime.Goexit` on its own goroutine.
    * Because the runtime does not support saving and restoring the stack trace of a recovered panic, the value passed to `panic` by `Wait` should include a best-effort stack dump for the goroutine that initiated the panic.
    * Because some packages may use `recover` for error-handling (despite our advice to the contrary), if the recovered value implements the `error` interface, the value passed to `panic` by `Wait` should also implement the `error` interface, and should wrap the recovered error (so that it can be retrieved by `errors.Unwrap`).

* The `Context` value returned by `errgroup.WithContext` should be canceled as soon as any function call in the group returns a non-nil error, panics, or exits via `runtime.Goexit`.
   * All of these conditions indicate that `Wait` has an abnormal status to report, and thus should shut down all work associated with the `Group` so that the abnormal status can be reported quickly.

Specifically, if `Wait` panics, the panic-value would have either type `PanicValue` or type `PanicError`, defined as follows:
```go
// A PanicError wraps an error recovered from an unhandled panic
// when calling a function passed to Go or TryGo.
type PanicError struct {
	Recovered error
	Stack     []byte
}

func (p PanicError) Error() string {
	// A Go Error method conventionally does not include a stack dump, so omit it
	// here. (Callers who care can extract it from the Stack field.)
	return fmt.Sprintf("recovered from errgroup.Group: %v", p.Recovered)
}

func (p PanicError) Unwrap() error { return p.Recovered }

// A PanicValue wraps a value that does not implement the error interface,
// recovered from an unhandled panic when calling a function passed to Go or
// TryGo.
type PanicValue struct {
	Recovered interface{}
	Stack     []byte
}

func (p PanicValue) String() string {
	if len(p.Stack) > 0 {
		return fmt.Sprintf("recovered from errgroup.Group: %v\n%s", p.Recovered, p.Stack)
	}
	return fmt.Sprintf("recovered from errgroup.Group: %v", p.Recovered)
}
```

### Compatibility

Any program that today initiates an unrecovered `panic` within a `Go` or `TryGo` callback terminates due to that unrecovered panic, so recovering and propagating such a `panic` can only change broken programs into non-broken ones; it cannot break any program that was not already broken.

A valid program _could_ in theory call `runtime.Goexit` from within a `Go` callback today. However, the vast majority of calls to `runtime.Goexit` are via `testing.T` methods, and according to the documentation for those methods today they “must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test.” Moreover, it would be possible to implement the documented `errgroup.Group` API today in a way that would cause `Wait` to always deadlock if `runtime.Goexit` were called, so any caller relying on the existing `runtime.Goexit` behavior is assuming an implementation detail that is not guaranteed.

In light of the above, I believe that the proposed changes are backward-compatible.

x/sync/errgroup: propagate panics and Goexits through Wait #53757
