Description
What version of Go are you using (go version)?
$ go version
go version devel +60f14fddfe Wed Jun 26 16:35:14 2019 +0000 darwin/amd64
Does this issue reproduce with the latest release?
The issue reproduces on go1.12.7 and go1.13beta1.
What operating system and processor architecture are you using (go env)?
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/dpolans/Library/Caches/go-build"
GOENV="/Users/dpolans/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/dpolans/dev"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/usr/local/go/src/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/cm/rp_0xvp15fn5mp4d672t2qsxkl2s83/T/go-build643112776=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
I have a latency-sensitive production Go service with a median goroutine count of about 300,000 during steady-state operation. The program periodically collects profiling data, including goroutine profiles.
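For context, a minimal sketch of what that periodic collection might look like, assuming the standard runtime/pprof lookup (the interval and the discard sink here are illustrative, not the production values):

package main

import (
	"io"
	"io/ioutil"
	"runtime/pprof"
	"time"
)

// collectGoroutineProfiles writes a goroutine profile to w once per
// interval; debug=0 selects the compact pprof wire format.
func collectGoroutineProfiles(w io.Writer, interval time.Duration) {
	for range time.Tick(interval) {
		pprof.Lookup("goroutine").WriteTo(w, 0)
	}
}

func main() {
	// Illustrative values only: the real service uses its own interval
	// and writes the profile somewhere useful rather than discarding it.
	collectGoroutineProfiles(ioutil.Discard, time.Minute)
}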
What did you expect to see?
I expected the application’s performance behavior to be only slightly affected by collecting the goroutine profile (as is the case for collecting a heap profile, mutex contention profile, or CPU profile). I expected the magnitude of the latency impact to not scale with the number of goroutines (as the performance impact of heap profiling doesn’t change significantly with the size of the heap).
I expected some latency impact from the stop-the-world pause that prepares for the profile, but I did not expect the duration of that pause to scale in proportion to the number of goroutines.
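For reference, runtime.GoroutineProfile fills in one StackRecord per live goroutine, which is why the snapshot work grows with the goroutine count. A minimal sketch of the documented two-call sizing pattern (the +100 headroom is an arbitrary choice for illustration):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Ask the runtime how many records the profile needs, then retry with
	// a buffer of that size plus headroom, since goroutines may have been
	// created in the meantime.
	n, _ := runtime.GoroutineProfile(nil)
	records := make([]runtime.StackRecord, n+100)
	if n, ok := runtime.GoroutineProfile(records); ok {
		records = records[:n] // one StackRecord per live goroutine
		fmt.Println("captured", len(records), "goroutine stacks")
	}
}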
What did you see instead?
When goroutine profiling is enabled, the 99th-percentile latency of certain operations increases significantly. Other programs of mine that collect goroutine profiles but run far fewer goroutines do not see the same latency spikes.
The benchmark below replicates the latency spike by emulating the work the real application does. The real application has one goroutine that periodically requests a goroutine profile while the other goroutines do some operation and report how long it took. The benchmark sets up one goroutine that continuously collects goroutine profiles, n idle bystander goroutines, and one goroutine that repeatedly sleeps for 1 millisecond and reports how long the sleep actually took. It runs with several values of n to show how the latency impact grows. The results show that while a goroutine profile is being collected, the p90, p99, and p99.9 wall-clock times for the 1ms sleep increase with the number of live goroutines.
$ go test -bench=. -benchtime 1000x
goos: darwin
goarch: amd64
pkg: playground/runtime-slow-goroutineprofile
BenchmarkGoroutineProfile/1e5_other-8 1000 1369474 ns/op 4.97 ms-max 0.477 ms-p90 0.507 ms-p99 0.532 ms-p99.9 1001 work-count
BenchmarkGoroutineProfile/2e5_other-8 1000 1376884 ns/op 0.538 ms-max 0.477 ms-p90 0.516 ms-p99 0.536 ms-p99.9 999 work-count
BenchmarkGoroutineProfile/5e5_other-8 1000 1365024 ns/op 4.89 ms-max 0.476 ms-p90 0.507 ms-p99 0.544 ms-p99.9 1001 work-count
BenchmarkGoroutineProfile/1e6_other-8 1000 1377884 ns/op 3.91 ms-max 0.478 ms-p90 0.522 ms-p99 0.900 ms-p99.9 1002 work-count
BenchmarkGoroutineProfile/1e5_profile-8 1000 83836454 ns/op 344 ms-max 148 ms-p90 284 ms-p99 338 ms-p99.9 579 work-count
BenchmarkGoroutineProfile/2e5_profile-8 1000 111897245 ns/op 558 ms-max 261 ms-p90 342 ms-p99 514 ms-p99.9 430 work-count
BenchmarkGoroutineProfile/5e5_profile-8 1000 238983513 ns/op 1383 ms-max 629 ms-p90 1205 ms-p99 1318 ms-p99.9 379 work-count
BenchmarkGoroutineProfile/1e6_profile-8 1000 129128664 ns/op 2409 ms-max 1086 ms-p90 1234 ms-p99 2405 ms-p99.9 106 work-count
PASS
ok playground/runtime-slow-goroutineprofile 607.172s
Reproducer:
package issue

import (
	"context"
	"runtime"
	"sort"
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func BenchmarkGoroutineProfile(b *testing.B) {
	testcase := func(n int, profile bool) func(b *testing.B) {
		return func(b *testing.B) {
			var wg sync.WaitGroup
			ctx, cancel := context.WithCancel(context.Background())
			sr := make([]runtime.StackRecord, 2e6)

			// create bystanders
			for i := 0; i < n; i++ {
				wg.Add(1)
				go func() {
					<-ctx.Done()
					wg.Done()
				}()
			}

			// give the bystanders a moment to start up
			time.Sleep(1000 * time.Millisecond)

			// our villain
			var workCount uint64
			wg.Add(1)
			go func() {
				defer wg.Done()
				for ctx.Err() == nil {
					atomic.AddUint64(&workCount, 1)
					if profile {
						runtime.GoroutineProfile(sr) // this function iterates over runtime.allgs while all other work is stopped
					}
					time.Sleep(1 * time.Millisecond)
				}
			}()

			// record how long it takes to sleep for 1 millisecond b.N times
			times := make([]float64, b.N)
			sleep := 1 * time.Millisecond
			b.ResetTimer()
			prev := time.Now()
			for i := 0; i < b.N; i++ {
				time.Sleep(sleep)
				now := time.Now()
				times[i] = (now.Sub(prev) - sleep).Seconds() / time.Millisecond.Seconds()
				prev = now
			}
			b.StopTimer()
			cancel()
			wg.Wait()

			sort.Sort(sort.Reverse(sort.Float64Slice(times)))
			b.ReportMetric(times[len(times)/10], "ms-p90")
			b.ReportMetric(times[len(times)/100], "ms-p99")
			b.ReportMetric(times[len(times)/1000], "ms-p99.9")
			b.ReportMetric(times[0], "ms-max")
			b.ReportMetric(float64(workCount), "work-count")
		}
	}

	b.Run("1e5 other", testcase(1e5, false))
	b.Run("2e5 other", testcase(2e5, false))
	b.Run("5e5 other", testcase(5e5, false))
	b.Run("1e6 other", testcase(1e6, false))
	b.Run("1e5 profile", testcase(1e5, true))
	b.Run("2e5 profile", testcase(2e5, true))
	b.Run("5e5 profile", testcase(5e5, true))
	b.Run("1e6 profile", testcase(1e6, true))
}
CC @aclements @rhysh