
Optimize DynamicRateLimiter to not constantly re-evaluate RPS #6842


Merged
1 commit merged on Apr 28, 2025

Conversation

natemort
Member

While dynamic config lookups are relatively cheap, they're certainly not free and perform several allocations, further contributing to GC times. Making matters worse, quotas.RateLimiter has some strange TTL logic such that the result of evaluating the dynamic config value isn't used more than once a minute unless it's lower than the current value.

Delete quotas.RateLimiter in favor of clock.RateLimiter and move the TTL to DynamicRateLimiter. Reduce the TTL to a more reasonable value (5s) and only evaluate the function once that time has elapsed.

Remove the logic allowing a rate change to bypass the TTL when it's lower than the current rate, since that requires evaluating the RPS value constantly. Instead, we've shortened the TTL so that we'll reliably pick up changes within a few seconds regardless of the direction of the change.
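As a rough usage sketch (hedged; the option names follow the revision under review, shown in the test snippet later in this conversation, and were reworked into an options struct during review):

rps := func() float64 {
	// hypothetical dynamic-config lookup; only re-evaluated once per TTL
	return getDispatchRPSFromDynamicConfig()
}
limiter := NewDynamicRateLimiter(rps, WithTTL(5*time.Second))
// Between refreshes the limiter reuses the previously evaluated rate, so the
// hot path no longer hits dynamic config (or allocates) on every call.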

The main place that the TTL logic seems to be relevant is the Task Dispatch limiter within Matching. Each poller includes an RPS limit and we'd attempt to update the RPS on each request. This is the only place that explicitly provided a TTL to quotas.RateLimiter (60s) rather than relying on the default.

Change the Matching Rate limiter to use DynamicRateLimiter so that it also only updates according to its TTL. This is a change in behavior and will make Matching less responsive to changes specified by user requests. It still complies with the "take most recent value" behavior that is advertised.

What changed?

  • Reduced the frequency at which we evaluate dynamic rate limits
  • Deleted quotas.RateLimiter in favor of clock.RateLimiter and quotas.DynamicRateLimiter
  • Changed Task Matcher rate limiter to use quotas.DynamicRateLimiter, removing the behavior that it adjusts downwards immediately but only upwards once every 60 seconds. It now updates to the last received value every 60 seconds.

Why?

  • Reduce Matching CPU usage by 2-3%, plus whatever GC/runtime gains we get.
  • Simplify codebase by removing redundant RateLimiter wrapper

How did you test it?

  • Unit/integration tests

Potential risks

  • Rate limiting is used widely across Cadence and is complex; errors here are bad.

Release notes

Documentation Changes

Comment on lines 53 to 55
return func(limiter *DynamicRateLimiter) {
limiter.ttl = ttl
}
Member
@Groxx Groxx Apr 21, 2025

tbh I think func-closure-options are the worst option :\ they can't be introspected for tests or logs or comparison, and they're extremely hard to discover or go-to-source/impl/etc due to the weak typing they force.

could we change this to just an opts-struct? static typing is a heck of a lot more useful for stuff like this, we don't need years-long-API-stability-at-the-cost-of-ergonomics stuff.

Member Author

Done
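For reference, the opts-struct shape being suggested would look roughly like this (a sketch; the field names are illustrative, not the final API):

type DynamicRateLimiterOpts struct {
	TTL        time.Duration    // how long an evaluated RPS value is reused
	TimeSource clock.TimeSource // injectable clock, mainly for tests
	MinBurst   int              // lower bound on the underlying limiter's burst
}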

func (d *DynamicRateLimiter) maybeRefreshRps() {
now := d.timeSource.Now()
lastUpdated := d.lastUpdateTime.Load().(time.Time)
if now.After(lastUpdated.Add(d.ttl-1)) && d.lastUpdateTime.CompareAndSwap(lastUpdated, now) {
Member
@Groxx Groxx Apr 21, 2025

tbh I'd just drop the -1 for simplicity, I doubt we care about nanosecond accuracy :) and if you move the tests to jump to some time in the middle, rather than exact boundaries, it won't matter if this changes slightly (which is good because it doesn't matter if it changes slightly - tests don't need to lock down behavior like that)

generally tho LGTM, and I don't think we care about interleaving calls - if getLimitAndBurst is "slow" then all values are valid-enough and the next update will fix it anyway.

Member Author

I like the -1 because it allows for setting a TTL of 0, which fully disables this behavior. Otherwise you'd need to set the TTL to -1, which seems counterintuitive.

Member

yea, 0 is a pretty reasonable value for that 🤔 sounds good enough to me.
(I guess the alternative would be to switch to now.Compare(lastUpdated.Add(d.ttl)) > 0? that's not terrible, but also not obviously worth it, and it wouldn't change the -1/+1 fiddling in tests)

Comment on lines +52 to +55
if primaryLimit := f.primary(domain); primaryLimit > 0 {
return float64(primaryLimit)
}
return float64(f.secondary())
Member

I kinda hate this behavior, but yea. reasonable rewrite 👍

rl clock.Ratelimiter
timeSource clock.TimeSource
ttl time.Duration
lastUpdateTime atomic.Value
Member
@Groxx Groxx Apr 21, 2025

I think we have a generics version of this in go.uber.org/atomic or similar

Member Author

Switched to atomic.Pointer. atomic.Value seems to use a pointer under the hood (since there isn't a way to atomically rewrite that much memory at once) and it gives us generics. There's an atomic.Time but it doesn't provide CAS support.

We could alternatively use an atomic uint64 for millis since epoch to eliminate the pointers, but that's probably negligible from a performance standpoint.
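A minimal sketch of the atomic.Pointer pattern described above (illustrative, not the exact diff):

var lastUpdateTime atomic.Pointer[time.Time] // sync/atomic, Go 1.19+
start := timeSource.Now()
lastUpdateTime.Store(&start)

// on each call:
now := timeSource.Now()
last := lastUpdateTime.Load()
if now.After(last.Add(ttl)) && lastUpdateTime.CompareAndSwap(last, &now) {
	// exactly one concurrent caller wins the CAS and refreshes the RPS
}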

Member

yea, perf difference tends to be quite minor, and pointers prevent a LOT of accidentally-racy stuff from being racy (copies, ==, others). and (strangely) go.uber.org/atomic doesn't embed a nocopy, only nocmp, so those aren't prevented strongly enough imo. the safety's a bit more worth it unless that changes.

Comment on lines 221 to 261
cases := []struct {
name string
ttl time.Duration
elapsed time.Duration
initialRps int
newRps int
expected rate.Limit
}{
{
name: "success",
ttl: time.Second,
elapsed: time.Second - 1,
initialRps: 1,
newRps: 0,
expected: rate.Limit(1),
},
{
name: "refreshed",
ttl: time.Second,
elapsed: time.Second,
initialRps: 1,
newRps: 0,
expected: rate.Limit(0),
},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
mockTime := clock.NewMockedTimeSource()
rps := tc.initialRps
rpsFunc := func() float64 {
res := rps
rps = tc.newRps
return float64(res)
}
limiter := NewDynamicRateLimiter(rpsFunc, WithTimeSource(mockTime), WithTTL(tc.ttl), WithMinBurst(1))

mockTime.Advance(tc.elapsed)
res := limiter.Limit()
assert.Equal(t, tc.expected, res)
})
}
Member
@Groxx Groxx Apr 21, 2025

not actually requesting any changes, the tests achieve what they need to. just a "maybe rethink this pattern" note.


this is essentially "update changes the limit", yea?

so it's

var rps float64
limiter := NewDynamicRateLimiter(func() float64 {
	return rps
}, WithTimeSource(mockTime), WithTTL(time.Second), WithMinBurst(1))
assert.Equal(t, limiter.Limit(), 0)
mockTime.Advance(time.Second)
rps = 1
assert.Equal(t, limiter.Limit(), 1)

but with loooots of boilerplate and unused fields.

(I think having this separate from the repeated-calls is reasonable - it helps narrow down where an issue is, if one or both fail. this is just A Lot™ for no apparent reason, and mismatched types for no apparent reason either (int vs float64, requiring casts, though this is true on all of them))

Member Author

Done, I think that's a good simplification.

Comment on lines -114 to -117
if rl.maxDispatchPerSecond != nil {
return rate.Limit(*rl.maxDispatchPerSecond)
}
return rate.Inf
Member
@Groxx Groxx Apr 21, 2025

I'm reasonably sure this is not ever used in current code, but to call it out as a caution: you've definitely checked this behavior specifically?

for the rest of this file: 🎉

Member Author

Other than Matcher, every use case uses DynamicRateLimiter, which has a function returning float64, so there's no chance of nil. Matcher, meanwhile, has a default value specified in the config for when a poller doesn't provide a rate (Decision workers never provide one), so it won't ever pass nil.

- limiter *quotas.RateLimiter
+ limiter quotas.Limiter
+ // The most recently received Dispatch rate from a poller
+ lastReceivedRate atomic.Value
Member

same generics-note

- if rps == nil {
- 	return
+ if rps != nil {
+ 	tm.lastReceivedRate.Store(*rps)
Member
@Groxx Groxx Apr 22, 2025

hmmm. tbh this file's changes feel kinda odd imo (not just this line, I just had to pick somewhere). Rate() float64 as a signature feels like it's implying a rather lightweight op and probably reusable, but it's doing quite a lot like

numReadPartitionsFn := func(cfg *config.TaskListConfig) int {
	if cfg.EnableGetNumberOfPartitionsFromCache() {
		partitionConfig := tlMgr.TaskListPartitionConfig()

// which calls
func (c *taskListManagerImpl) TaskListPartitionConfig() *types.TaskListPartitionConfig {
	c.partitionConfigLock.RLock()

and... that means it's both exposed in a "probably reusable" way and called in an uncontrolled / unknown way by the ratelimiter itself (the interface doesn't imply "this is definitely cached and not called super frequently", but it's relying on that behavior)

maybe this'd be a better place to use a clock.Ratelimiter (so it's settable), where it'd be almost a drop-in replacement? or is this particular call stack very high frequency?

Member Author

The Matcher is the only place that was actually using quotas.RateLimiter directly, so it's definitely unique compared to the other changes.

UpdateRatelimit is actually called for every single poll request, passing in the TaskList rate limit. I think this is the motivation for having the debounce behavior in quotas.RateLimiter originally, and it's the only scenario where a TTL is specified.

If we update the rate limit every time it'll swing back and forth as changes to pollers roll out. We definitely want some debouncing here.

Rate is only called by TaskListManager#describeTaskList and now also passed to the DynamicRateLimiter, so it gets called much less often than UpdateRatelimit, even if we had a low TTL.

Member
@Groxx Groxx Apr 22, 2025

yea, definitely agreed on debouncing, and doing that in Rate makes sense because it'll definitely eventually update. the behavior looks good, it just feels like a strange / forgettable control flow and expectations, unless it's documented somewhere that it's hard to miss.

we don't seem to have rate/rps/limit exposed almost anywhere except in matching here though so maybe that's fine. I thought there was more of a pattern of "current-limit-getters are lightweight" but that might just be because I'm familiar with the quotas/... package, and it isn't done anywhere else.

so I guess:

  • I'm game to keep it
  • it still feels a bit surprising, maybe some docs would be worth adding?
  • no complaint if nothing changes 👍

Member
@Groxx Groxx left a comment

(bleh, clicked the wrong one, sorry)

Member
@Groxx Groxx left a comment

Broadly very good and glad to see it, just some smaller stuff to consider (not force!)...

... except maybe that tasklist/matcher.go change. that feels potentially riskier than we should keep, and prone to missed expectations in future rewrites.
(and generic-atomics, that seems obviously preferred)

so marking 'no' just to push discussion and mark myself down for the next round. likely easy 👍

Member
@Groxx Groxx left a comment

yea, seems good. left some comments but I think it's good to go as-is too 👍

assert.ErrorIs(t, err, clock.ErrCannotWait)

mockTime.Advance(time.Second)
err = limiter.Wait(ctx)
Member

I believe there's a non-obvious assumption in here because of how the clock stuff interacts, which is... annoying and probably not worth addressing, but pointing it out in case you get any ideas:

because the Wait-internal sleep is using the mock-time-source, this call will block forever if there isn't a full token available (i.e. if the call isn't fully non-blocking). mock-time won't advance while it's waiting, so it'll never un-sleep :|

I've wanted to make a mock-time-Context to address this, because it's a nasty surprise when writing these tests and suddenly getting long delays and timeouts rather than useful errors. but we don't have one yet.

the manual way to handle this is to launch a goroutine to advance mock-time if it delays for more than like 100ms, but it's quite a lot of boilerplate.
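For illustration, that manual workaround might look roughly like this (hypothetical test code, not part of this PR):

// Advance the mocked clock from a background goroutine if Wait blocks, so a
// call that isn't fully non-blocking can't hang the test forever.
done := make(chan struct{})
go func() {
	select {
	case <-done:
	case <-time.After(100 * time.Millisecond): // real time, not the mocked source
		mockTime.Advance(time.Second) // wakes the mocked sleep inside Wait
	}
}()
err = limiter.Wait(ctx)
close(done)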

@natemort natemort merged commit 6e55181 into cadence-workflow:master Apr 28, 2025
23 checks passed