
runtime: GoroutineProfile causes latency spikes in high-goroutine-count application #33250

Closed

@dpolansky

Description

What version of Go are you using (go version)?

$ go version
go version devel +60f14fddfe Wed Jun 26 16:35:14 2019 +0000 darwin/amd64

Does this issue reproduce with the latest release?

The issue reproduces on go1.12.7 and go1.13beta1.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/dpolans/Library/Caches/go-build"
GOENV="/Users/dpolans/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/dpolans/dev"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/usr/local/go/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/cm/rp_0xvp15fn5mp4d672t2qsxkl2s83/T/go-build643112776=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I have a (production, latency-sensitive) Go service with a median goroutine count of about 300,000 during steady-state operation. The program periodically collects profiling data, including goroutine profiles.
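
A minimal sketch of what such a periodic collector might look like, assuming the standard runtime/pprof API (collectGoroutineProfiles, the one-minute interval, and the sink parameter are all illustrative, not the production code). For debug levels 0 and 1, pprof.Lookup("goroutine").WriteTo calls runtime.GoroutineProfile internally:

package sketch

import (
	"bytes"
	"log"
	"runtime/pprof"
	"time"
)

// collectGoroutineProfiles is a hypothetical stand-in for the service's
// collection loop; the interval and sink are placeholders.
func collectGoroutineProfiles(sink func([]byte)) {
	for range time.Tick(time.Minute) {
		var buf bytes.Buffer
		// debug=0 writes the gzip-compressed protobuf format.
		if err := pprof.Lookup("goroutine").WriteTo(&buf, 0); err != nil {
			log.Printf("goroutine profile: %v", err)
			continue
		}
		sink(buf.Bytes())
	}
}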

What did you expect to see?

I expected the application’s performance behavior to be only slightly affected by collecting the goroutine profile (as is the case for collecting a heap profile, mutex contention profile, or CPU profile). I expected the magnitude of the latency impact not to scale with the number of goroutines (just as the performance impact of heap profiling does not change significantly with the size of the heap).

I expected some latency impact from the stop-the-world pause needed to prepare for the profile. I did not expect the duration of that pause to scale in proportion to the number of goroutines.
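
One way to observe the pause scaling directly is to time a single call to runtime.GoroutineProfile (a sketch, separate from the reproducer below; the record count matches the one used there):

package sketch

import (
	"fmt"
	"runtime"
	"time"
)

// timeGoroutineProfile measures one profile collection. On go1.12/go1.13,
// the bulk of this duration is spent with the world stopped.
func timeGoroutineProfile() {
	records := make([]runtime.StackRecord, 2e6)
	start := time.Now()
	// ok is false if records is too small to hold every goroutine.
	n, ok := runtime.GoroutineProfile(records)
	fmt.Printf("profiled %d goroutines (ok=%v) in %v\n", n, ok, time.Since(start))
}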

What did you see instead?

When goroutine profiling is enabled, the 99th-percentile latency of certain operations increases significantly. Other programs of mine that collect goroutine profiles but run far fewer goroutines do not incur the same latency spikes.

The benchmark below attempts to replicate the latency spike by emulating the work done by the real application, which has one goroutine that periodically requests a goroutine profile while the other goroutines do some operation and report how long it took.

The benchmark sets up a scenario with one goroutine that continuously collects goroutine profiles, n idle goroutines, and one goroutine that reports the duration of sleeping for 1 millisecond. It runs with several values of n to demonstrate the latency impact as n grows. The results below show that while a goroutine profile is being collected, the p90, p99, and p99.9 wall-clock time the reproducer takes to sleep for 1ms increases with the number of live goroutines.

$ go test -bench=. -benchtime 1000x
goos: darwin
goarch: amd64
pkg: playground/runtime-slow-goroutineprofile
BenchmarkGoroutineProfile/1e5_other-8         	    1000	   1369474 ns/op	         4.97 ms-max	         0.477 ms-p90	         0.507 ms-p99	         0.532 ms-p99.9	      1001 work-count
BenchmarkGoroutineProfile/2e5_other-8         	    1000	   1376884 ns/op	         0.538 ms-max	         0.477 ms-p90	         0.516 ms-p99	         0.536 ms-p99.9	       999 work-count
BenchmarkGoroutineProfile/5e5_other-8         	    1000	   1365024 ns/op	         4.89 ms-max	         0.476 ms-p90	         0.507 ms-p99	         0.544 ms-p99.9	      1001 work-count
BenchmarkGoroutineProfile/1e6_other-8         	    1000	   1377884 ns/op	         3.91 ms-max	         0.478 ms-p90	         0.522 ms-p99	         0.900 ms-p99.9	      1002 work-count
BenchmarkGoroutineProfile/1e5_profile-8       	    1000	  83836454 ns/op	       344 ms-max	       148 ms-p90	       284 ms-p99	       338 ms-p99.9	       579 work-count
BenchmarkGoroutineProfile/2e5_profile-8       	    1000	 111897245 ns/op	       558 ms-max	       261 ms-p90	       342 ms-p99	       514 ms-p99.9	       430 work-count
BenchmarkGoroutineProfile/5e5_profile-8       	    1000	 238983513 ns/op	      1383 ms-max	       629 ms-p90	      1205 ms-p99	      1318 ms-p99.9	       379 work-count
BenchmarkGoroutineProfile/1e6_profile-8       	    1000	 129128664 ns/op	      2409 ms-max	      1086 ms-p90	      1234 ms-p99	      2405 ms-p99.9	       106 work-count
PASS
ok  	playground/runtime-slow-goroutineprofile	607.172s

Reproducer:

package issue

import (
	"context"
	"runtime"
	"sort"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func BenchmarkGoroutineProfile(b *testing.B) {
	testcase := func(n int, profile bool) func(b *testing.B) {
		return func(b *testing.B) {
			var wg sync.WaitGroup
			ctx, cancel := context.WithCancel(context.Background())
			sr := make([]runtime.StackRecord, 2e6) // large enough to hold a record for every goroutine

			// create bystanders
			for i := 0; i < n; i++ {
				wg.Add(1)
				go func() {
					<-ctx.Done()
					wg.Done()
				}()
			}
			// give the bystanders a moment to start up
			time.Sleep(1000 * time.Millisecond)

			// our villain
			var workCount uint64
			wg.Add(1)
			go func() {
				defer wg.Done()

				for ctx.Err() == nil {
					atomic.AddUint64(&workCount, 1)

					if profile {
						runtime.GoroutineProfile(sr) // this function iterates over runtime.allgs while all other work is stopped
					}
					time.Sleep(1 * time.Millisecond)
				}
			}()

			// record how long it takes to sleep for 1 millisecond b.N times
			times := make([]float64, b.N)
			sleep := 1 * time.Millisecond

			b.ResetTimer()
			prev := time.Now()
			for i := 0; i < b.N; i++ {
				time.Sleep(sleep)
				now := time.Now()
				times[i] = (now.Sub(prev) - sleep).Seconds() / time.Millisecond.Seconds() // overshoot beyond the requested sleep, in milliseconds
				prev = now
			}
			b.StopTimer()

			cancel()
			wg.Wait()

			// times is sorted largest-first, so times[len/10] is the p90
			// value, times[len/100] the p99, and so on.
			sort.Sort(sort.Reverse(sort.Float64Slice(times)))
			b.ReportMetric(times[len(times)/10], "ms-p90")
			b.ReportMetric(times[len(times)/100], "ms-p99")
			b.ReportMetric(times[len(times)/1000], "ms-p99.9")
			b.ReportMetric(times[0], "ms-max")
			b.ReportMetric(float64(workCount), "work-count")
		}
	}

	b.Run("1e5 other", testcase(1e5, false))
	b.Run("2e5 other", testcase(2e5, false))
	b.Run("5e5 other", testcase(5e5, false))
	b.Run("1e6 other", testcase(1e6, false))
	b.Run("1e5 profile", testcase(1e5, true))
	b.Run("2e5 profile", testcase(2e5, true))
	b.Run("5e5 profile", testcase(5e5, true))
	b.Run("1e6 profile", testcase(1e6, true))
}

CC @aclements @rhysh

Labels: FrozenDueToAge, NeedsInvestigation, Performance