
Commit 51ca977

Squash NaN bugs and prevent them from coming back. (#6375)
Addresses some rare (but damaging) issues in production that appear to be due to NaNs in aggregator responses, very likely due to sending 0-rps updates at unlucky times and/or update-ticks running more quickly than 1s (which may occur if an update takes >2s, as tickers queue up a "tick" and their time continues elapsing whether it is read or not).

These NaNs do two very problematic things:
- NaNs beget more NaNs, so they tend to "stick" for large periods of time
- NaNs turn into `INT_MIN` when you `int(NaN)`, which related code tends to treat as `0` (e.g. the underlying ratelimiter does this for negative limits) - and this rejects all requests

The fix is essentially two changes:
- fix the math to not produce NaNs/etc - pretty obviously needed
- detect NaNs/etc arriving over RPC and reject them up-front - bad data like this should now switch to using fallbacks, rather than misbehaving
  - this implies we are distrusting our own service, but... that's probably correct when deploys roll out. it's better to be preventative, when the blast radius is large and spreads rapidly.

And the rest of the changes are devoted to preventing this from coming back in the future:
- probably-excessive "detect unexpected values and log them" checks scattered everywhere
  - these would likely be fine to trim to just the final "assign or return" check, I don't feel strongly about keeping them everywhere like this.
- a relatively fancy fuzz test that runs multiple rounds of updates and checks all data for "bad" values at every step
  - I started out with a simpler single-round one... but multiple rounds found another flaw, and multiple-round naturally covers single-round too, so it didn't seem worth keeping.
  - I had originally thought that this probably needed fuzzing... and yep, it did.
- some additional tests to cover more error cases, and the new branches

---

While I believe this change is good enough to fix the problem, and worth hot-fixing ASAP, this all deserves a later change to add some "odd behavior detected, log details" and "log all data about key X" code, for improving observability when things act weird. If we had this logging earlier, we probably could've spotted this in less than an hour from the first instance, rather than weeks later.

It would also have been helpful to explain some strong load-imbalance events; I suspect they are "valid" and reasonable behavior (weights heavily skewed because requests were heavily skewed), but it's hard to prove or disprove that without fine-grained data, and that also means it's hard to help our customers understand things.

---

On fuzz testing specifically: I've included some of the failed runs from while I was developing it, to seed the fuzzer, and these are run as part of normal `go test` runs so they also help ensure coverage. Some of the scenarios took ~20 minutes of fuzzing to catch though, so as is normal with fuzz testing: if you make changes and you are not totally confident in them, fuzzing for *hours* is sorta the baseline expectation. Though anything is better than nothing.

And please commit any failed runs, if they're not super (unnecessarily) large!
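Both NaN behaviors called out above are easy to reproduce in isolation. This is a minimal standalone demo, not part of the commit; note that converting a NaN to an integer is implementation-defined in Go, but on common 64-bit platforms it yields the minimum int value, which downstream "treat negative limits as 0" logic then turns into a zero limit:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	nan := math.NaN()

	// NaNs beget more NaNs: any arithmetic involving a NaN yields NaN,
	// so once one appears it "sticks" in every derived value.
	fmt.Println(nan * 0.5)       // NaN
	fmt.Println((nan + 100) / 2) // NaN

	// Converting NaN to an int is implementation-defined in Go; on amd64/arm64
	// it typically produces the minimum int64 value. Code that clamps negative
	// limits to 0 then ends up with a 0-rps limit, rejecting every request.
	fmt.Println(int(nan))     // e.g. -9223372036854775808
	fmt.Println(int(nan) < 0) // true on those platforms
}
```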
1 parent ef546af commit 51ca977

24 files changed (+737, -47 lines)
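The multi-round fuzz test mentioned in the commit message lives in another of the changed files and is not shown in this excerpt. A rough, self-contained sketch of the technique it describes (several update rounds per input, checking every intermediate value for NaN/Inf/negatives) might look like the following; the names and the simplified rate math are illustrative assumptions, not the actual test:

```go
package ratelimit_test

import (
	"math"
	"testing"
	"time"
)

// perSecondRate mirrors the shape of the fixed math in this commit: elapsed time
// is capped at 1s before dividing, so zero or sub-second elapsed values cannot
// produce Inf or NaN. This is a simplified stand-in for the real aggregator math.
func perSecondRate(count int64, elapsed time.Duration) float64 {
	e := math.Max(float64(elapsed), float64(time.Second))
	return float64(count) / (e / float64(time.Second))
}

// FuzzMultiRoundRates runs several "update rounds" per fuzz input and fails if
// any intermediate value is NaN, Inf, or negative.
func FuzzMultiRoundRates(f *testing.F) {
	// seed with the kinds of inputs that caused trouble in production:
	// zero counts and zero / sub-second elapsed times
	f.Add(int64(0), int64(0), int64(0))
	f.Add(int64(100), int64(5), int64(time.Millisecond))
	f.Fuzz(func(t *testing.T, accepted, rejected, elapsedNanos int64) {
		if accepted < 0 || rejected < 0 || elapsedNanos < 0 {
			t.Skip("counts and durations are never negative in practice")
		}
		const weight = 0.5 // running weighted-average factor, as in the aggregator
		prev := 0.0
		for round := 0; round < 5; round++ {
			aps := perSecondRate(accepted, time.Duration(elapsedNanos))
			rps := perSecondRate(rejected, time.Duration(elapsedNanos))
			if aps+rps == 0 {
				continue // zero updates are skipped entirely in the fix
			}
			prev = weight*(aps+rps) + (1-weight)*prev
			for _, v := range []float64{aps, rps, prev} {
				if math.IsNaN(v) || math.IsInf(v, 0) || v < 0 {
					t.Fatalf("bad value %v in round %d", v, round)
				}
			}
		}
	})
}
```

Committing any corpus entries the fuzzer finds, as the commit message asks, makes them run on every plain `go test` as regression cases.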

common/quotas/global/algorithm/requestweighted.go

Lines changed: 62 additions & 17 deletions
@@ -130,7 +130,10 @@ import (
 
 	"github.com/uber/cadence/common/clock"
 	"github.com/uber/cadence/common/dynamicconfig"
+	"github.com/uber/cadence/common/log"
+	"github.com/uber/cadence/common/log/tag"
 	"github.com/uber/cadence/common/metrics"
+	"github.com/uber/cadence/common/quotas/global/shared"
 )
 
 type (
@@ -155,8 +158,9 @@ type (
 	impl struct {
 		// intentionally value-typed so caller cannot mutate the fields.
 		// manually copy data if this changes.
-		cfg   Config
-		scope metrics.Scope
+		cfg    Config
+		scope  metrics.Scope
+		logger log.Logger
 
 		// mut protects usage, as it is the only mutable data
 		mut sync.Mutex
@@ -266,6 +270,11 @@ type (
 const (
 	guessNumKeys   = 1024 // guesstimate at num of ratelimit keys in a cluster
 	guessHostCount = 32   // guesstimate at num of frontend hosts in a cluster that receive traffic for each key
+
+	// Some event-RPSes exceed 100k, but none get close to 1m.
+	// So, to add some buffer, 10m per second is considered "impossible".
+	// This is quite vague and can be changed, it essentially just serves as the logging threshold.
+	guessImpossibleRps = 10_000_000
 )
 
 func (p UpdateParams) Validate() error {
@@ -356,11 +365,12 @@ func (c configSnapshot) missedUpdateScalar(dataAge time.Duration) PerSecond {
 //
 // This instance is effectively single-threaded, but a small sharding wrapper should allow better concurrent
 // throughput if needed (bound by CPU cores, as it's moderately CPU-costly).
-func New(met metrics.Client, cfg Config) (RequestWeighted, error) {
+func New(met metrics.Client, logger log.Logger, cfg Config) (RequestWeighted, error) {
 	i := &impl{
-		cfg:   cfg,
-		scope: met.Scope(metrics.GlobalRatelimiterAggregator),
-		usage: make(map[Limit]map[Identity]requests, guessNumKeys), // start out relatively large
+		cfg:    cfg,
+		scope:  met.Scope(metrics.GlobalRatelimiterAggregator),
+		logger: logger.WithTags(tag.ComponentGlobalRatelimiter),
+		usage:  make(map[Limit]map[Identity]requests, guessNumKeys), // start out relatively large
 
 		clock: clock.NewRealTimeSource(),
 	}
@@ -394,8 +404,29 @@ func (a *impl) Update(p UpdateParams) error {
 
 		var next requests
 		prev := ih[p.ID]
-		aps := PerSecond(float64(req.Accepted) / float64(p.Elapsed/time.Second))
-		rps := PerSecond(float64(req.Rejected) / float64(p.Elapsed/time.Second))
+
+		// sanity check: elapsed time should not be less than 1s, so just cap it.
+		// in practice this should always be safe with a >=1s configured rate, as
+		// the caller should not send *more* frequently than every 1s (monotonic time).
+		//
+		// but this is rather easy to trigger in tests / fuzzing,
+		// and extreme values lead to irrational math either way.
+		elapsed := math.Max(float64(p.Elapsed), float64(time.Second))
+		aps := shared.SanityLogFloat(0, PerSecond(float64(req.Accepted)/(elapsed/float64(time.Second))), guessImpossibleRps, "accepted rps", a.logger)
+		rps := shared.SanityLogFloat(0, PerSecond(float64(req.Rejected)/(elapsed/float64(time.Second))), guessImpossibleRps, "rejected rps", a.logger)
+
+		// zeros are not worth recording, and this also simplifies math elsewhere
+		// for two major reasons:
+		// - it prevents some divide-by-zero scenarios by simply not having actual zeros
+		// - it prevents weights from perpetually lowering if zeros are repeatedly sent, where they may eventually reach zero
+		//
+		// these keys will eventually gc, just leave them alone until that happens.
+		// currently this gc relies on the assumption that HostUsage will be called with the same set of keys "soon",
+		// but that is fairly easy to fix if needed.
+		if rps+aps == 0 {
+			continue
+		}
+
 		if prev.lastUpdate.IsZero() {
 			initialized++
 			next = requests{
@@ -425,8 +456,8 @@ func (a *impl) Update(p UpdateParams) error {
 				// TODO: max(1, actual) so this does not lead to <1 rps allowances? or maybe just 1+actual and then reduce in used-responses?
 				// otherwise currently this may lead to rare callers getting 0.0001 rps,
 				// and never recovering, despite steady and fair usage.
-				accepted: weighted(aps, prev.accepted*reduce, snap.weight),
-				rejected: weighted(rps, prev.rejected*reduce, snap.weight),
+				accepted: shared.SanityLogFloat(0, weighted(aps, prev.accepted*reduce, snap.weight), guessImpossibleRps, "weighted accepted rps", a.logger),
+				rejected: shared.SanityLogFloat(0, weighted(rps, prev.rejected*reduce, snap.weight), guessImpossibleRps, "weighted rejected rps", a.logger),
 			}
 		}
 	}
@@ -464,7 +495,10 @@ func (a *impl) getWeightsLocked(key Limit, snap configSnapshot) (weights map[Ide
 			continue
 		}
 
-		reduce := snap.missedUpdateScalar(age)
+		// should never be zero, `shouldGC` takes care of that.
+		reduce := shared.SanityLogFloat(0, snap.missedUpdateScalar(age), 1, "missed update", a.logger)
+		// similarly: should never be zero, accepted + rejected must be nonzero or they are not inserted.
+		// this may be reduced to very low values, but still far from == 0.
 		actual := HostWeight((reqs.accepted + reqs.rejected) * reduce)
 
 		weights[id] = actual // populate with the reduced values so it doesn't have to be calculated again
@@ -480,10 +514,20 @@ func (a *impl) getWeightsLocked(key Limit, snap configSnapshot) (weights map[Ide
 		return nil, 0, met
 	}
 
+	// zeros anywhere here should not be possible - they are prevented from being inserted,
+	// and anything simply "losing weight" will only become "rather low", not zero,
+	// before enough passes have occurred to garbage collect it.
+	//
+	// specifically, 1rps -> 0rps takes over 1,000 halving cycles, and a single 1rps event
+	// during that will immediately re-set it above 0.5 and will need 1,000+ more cycles.
+	//
+	// if gc period / weight amount is set extreme enough this is "possible",
+	// but we are unlikely to ever cause it.
 	for id := range ir {
-		// scale by the total.
-		// this also ensures all values are between 0 and 1 (inclusive)
-		weights[id] = weights[id] / total
+		// normalize by the total.
+		// this also ensures all values are between 0 and 1 (inclusive),
+		// though zero should be impossible.
+		weights[id] = shared.SanityLogFloat(0, weights[id]/total, 1, "normalized weight", a.logger)
 	}
 	met.Limits = 1
 	return weights, usedRPS, met
@@ -511,11 +555,12 @@ func (a *impl) HostUsage(host Identity, limits []Limit) (usage map[Limit]HostUsa
 
 		if len(hosts) > 0 {
 			usage[lim] = HostUsage{
-				// limit is known, has some usage on at least one host
-				Used: used,
+				// limit is known, has some usage on at least one host.
+				// usage has an "upper limit" because it is only the accepted RPS, not all requests received.
+				Used: shared.SanityLogFloat(0, used, guessImpossibleRps, "used rps", a.logger),
 				// either known weight if there is info for this host, or zero if not.
 				// zeros are interpreted as "unknown", the same as "not present".
-				Weight: hosts[host],
+				Weight: shared.SanityLogFloat(0, hosts[host], 1, "computed weight", a.logger),
 			}
 		}
 	}
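The `shared.SanityLogFloat` helper used throughout this file is defined in `common/quotas/global/shared`, one of the other changed files, and is not shown above. Judging only from its call sites (`shared.SanityLogFloat(min, value, max, description, logger)`, with the value assigned straight back to the data path), a guard of roughly this shape would fit; this is an assumed sketch using the stdlib logger to stay self-contained, not the real implementation:

```go
package sanity

import (
	"log"
	"math"
)

// sanityLogFloat passes the value through unchanged, but logs whenever it is
// NaN, Inf, or outside [min, max]. This is only a guess at the shape of the
// real shared.SanityLogFloat, which takes a cadence log.Logger instead.
func sanityLogFloat[T ~float64](min, actual, max T, what string, logger *log.Logger) T {
	f := float64(actual)
	if math.IsNaN(f) || math.IsInf(f, 0) || actual < min || actual > max {
		logger.Printf("unexpected %q value: %v (expected %v..%v)", what, f, min, max)
	}
	return actual
}
```

Assuming that pass-through behavior, a zero lower bound and an "impossible" upper bound like `guessImpossibleRps` let the calls above flag suspicious values without altering the data path, which is what makes the scattered checks cheap to keep.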
