x/sync/errgroup: propagate panics and Goexits through Wait #53757

Closed
@bcmills

Background

The handling of panics and calls to runtime.Goexit in x/sync/errgroup has come up several times in its history.

Proposal

I propose that:

  • The (*Group).Wait method should continue to wait for all goroutines in the group to exit. However, once that condition is met, if any of the goroutines in the group terminated with an unrecovered panic, Wait should panic with a value wrapping the first panic value recovered from a goroutine in the group. Otherwise, if any of the goroutines exited via runtime.Goexit, Wait should invoke runtime.Goexit on its own goroutine.

    • Because the runtime does not support saving and restoring the stack trace of a recovered panic, the value passed to panic by Wait should include a best-effort stack dump for the goroutine that initiated the panic.
    • Because some packages may use recover for error-handling (despite our advice to the contrary), if the recovered value implements the error interface, the value passed to panic by Wait should also implement the error interface, and should wrap the recovered error (so that it can be retrieved by errors.Unwrap).
  • The Context value returned by errgroup.WithContext should be canceled as soon as any function call in the group returns a non-nil error, panics, or exits via runtime.Goexit.

    • All of these conditions indicate that Wait has an abnormal status to report, so the Group should shut down all of its associated work and allow that status to be reported quickly.

Specifically, if Wait panics, the panic-value would have either type PanicValue or type PanicError, defined as follows:

// A PanicError wraps an error recovered from an unhandled panic
// when calling a function passed to Go or TryGo.
type PanicError struct {
	Recovered error
	Stack     []byte
}

func (p PanicError) Error() string {
	// A Go Error method conventionally does not include a stack dump, so omit it
	// here. (Callers who care can extract it from the Stack field.)
	return fmt.Sprintf("recovered from errgroup.Group: %v", p.Recovered)
}

func (p PanicError) Unwrap() error { return p.Recovered }

// A PanicValue wraps a value that does not implement the error interface,
// recovered from an unhandled panic when calling a function passed to Go or
// TryGo.
type PanicValue struct {
	Recovered interface{}
	Stack     []byte
}

func (p PanicValue) String() string {
	if len(p.Stack) > 0 {
		return fmt.Sprintf("recovered from errgroup.Group: %v\n%s", p.Recovered, p.Stack)
	}
	return fmt.Sprintf("recovered from errgroup.Group: %v", p.Recovered)
}

Compatibility

Any program that today initiates an unrecovered panic within a Go or TryGo callback terminates due to that unrecovered panic, so recovering and propagating such a panic can only change broken programs into non-broken ones; it cannot break any program that was not already broken.

A valid program could in theory call runtime.Goexit from within a Go callback today. However, the vast majority of calls to runtime.Goexit are via testing.T methods, and according to the documentation for those methods today they “must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test.” Moreover, it would be possible to implement the documented errgroup.Group API today in a way that would cause Wait to always deadlock if runtime.Goexit were called, so any caller relying on the existing runtime.Goexit behavior is assuming an implementation detail that is not guaranteed.

In light of the above, I believe that the proposed changes are backward-compatible.

Status: Accepted
