
Optimize DynamicRateLimiter to not constantly re-evaluate RPS #6842


Merged
1 commit merged on Apr 28, 2025

Conversation

natemort
Member

While dynamic config lookups are relatively cheap, they're certainly not free and perform several allocations, further contributing to GC times. Making matters worse, quotas.RateLimiter has some strange TTL logic such that the result of evaluating the dynamic config value isn't used more than once a minute unless it's lower than the current value.

Delete quotas.RateLimiter in favor of clock.RateLimiter and move the TTL to DynamicRateLimiter. Reduce the TTL to a more reasonable value (5s) and only evaluate the function once that time has elapsed.

Remove the logic allowing a rate change to bypass the TTL when it's lower than the current rate, since that requires evaluating the RPS value constantly. Instead, we've shortened the TTL so that we'll reliably pick up changes within a few seconds regardless of the direction of the change.
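As a rough usage sketch (hedged; the option names follow the revision under review, shown in the test snippet later in this conversation, and were reworked into an options struct during review):

rps := func() float64 {
	// hypothetical dynamic-config lookup; only re-evaluated once per TTL
	return getDispatchRPSFromDynamicConfig()
}
limiter := NewDynamicRateLimiter(rps, WithTTL(5*time.Second))
// Between refreshes the limiter reuses the previously evaluated rate, so the
// hot path no longer hits dynamic config (or allocates) on every call.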

The main place that the TTL logic seems to be relevant is the Task Dispatch limiter within Matching. Each poller includes an RPS limit and we'd attempt to update the RPS on each request. This is the only place that explicitly provided a TTL to quotas.RateLimiter (60s) rather than relying on the default.

Change the Matching Rate limiter to use DynamicRateLimiter so that it also only updates according to its TTL. This is a change in behavior and will make Matching less responsive to changes specified by user requests. It still complies with the "take most recent value" behavior that is advertised.

What changed?

  • Reduced the frequency at which we evaluate dynamic rate limits
  • Deleted quotas.RateLimiter in favor of clock.RateLimiter and quotas.DynamicRateLimiter
  • Changed Task Matcher rate limiter to use quotas.DynamicRateLimiter, removing the behavior that it adjusts downwards immediately but only upwards once every 60 seconds. It now updates to the last received value every 60 seconds.

Why?

  • Reduce Matching CPU usage by 2-3%, plus whatever GC/runtime gains we get.
  • Simplify codebase by removing redundant RateLimiter wrapper

How did you test it?

  • Unit/integration tests

Potential risks

  • Rate limiting is used widely across Cadence and is complex; errors here are bad.

Release notes

Documentation Changes

Comment on lines 53 to 55
return func(limiter *DynamicRateLimiter) {
limiter.ttl = ttl
}
Member
@Groxx Groxx Apr 21, 2025

tbh I think func-closure-options are the worst option :\ they can't be introspected for tests or logs or comparison, and they're extremely hard to discover or go-to-source/impl/etc due to the weak typing they force.

could we change this to just an opts-struct? static typing is a heck of a lot more useful for stuff like this, we don't need years-long-API-stability-at-the-cost-of-ergonomics stuff.

Member Author

Done
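For reference, the opts-struct shape being suggested would look roughly like this (a sketch; the field names are illustrative, not the final API):

type DynamicRateLimiterOpts struct {
	TTL        time.Duration    // how long an evaluated RPS value is reused
	TimeSource clock.TimeSource // injectable clock, mainly for tests
	MinBurst   int              // lower bound on the underlying limiter's burst
}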

func (d *DynamicRateLimiter) maybeRefreshRps() {
now := d.timeSource.Now()
lastUpdated := d.lastUpdateTime.Load().(time.Time)
if now.After(lastUpdated.Add(d.ttl-1)) && d.lastUpdateTime.CompareAndSwap(lastUpdated, now) {
Member
@Groxx Groxx Apr 21, 2025

tbh I'd just drop the -1 for simplicity, I doubt we care about nanosecond accuracy :) and if you move the tests to jump to some time in the middle, rather than exact boundaries, it won't matter if this changes slightly (which is good because it doesn't matter if it changes slightly - tests don't need to lock down behavior like that)

generally tho LGTM, and I don't think we care about interleaving calls - if getLimitAndBurst is "slow" then all values are valid-enough and the next update will fix it anyway.

Member Author

I like the -1 because it allows for setting a TTL of 0, which fully disables this behavior. Otherwise you'd need to set the TTL to -1, which seems counterintuitive.

Member

yea, 0 is a pretty reasonable value for that 🤔 sounds good enough to me.
(I guess the alternative would be to switch to now.Compare(lastUpdated.Add(d.ttl)) > 0? that's not terrible, but also not obviously worth it, and it wouldn't change the -1/+1 fiddling in tests)

Comment on lines +52 to +55
if primaryLimit := f.primary(domain); primaryLimit > 0 {
return float64(primaryLimit)
}
return float64(f.secondary())
Member

I kinda hate this behavior, but yea. reasonable rewrite 👍

rl clock.Ratelimiter
timeSource clock.TimeSource
ttl time.Duration
lastUpdateTime atomic.Value
Member
@Groxx Groxx Apr 21, 2025

I think we have a generics version of this in go.uber.org/atomic or similar

Member Author

Switched to atomic.Pointer. atomic.Value seems to use a pointer under the hood (since there isn't a way to atomically rewrite that much memory at once) and it gives us generics. There's an atomic.Time but it doesn't provide CAS support.

We could alternatively use an atomic uint64 for millis since epoch to eliminate the pointers, but that's probably negligible from a performance standpoint.
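A minimal sketch of the atomic.Pointer pattern described above (illustrative, not the exact diff):

var lastUpdateTime atomic.Pointer[time.Time] // sync/atomic, Go 1.19+
start := timeSource.Now()
lastUpdateTime.Store(&start)

// on each call:
now := timeSource.Now()
last := lastUpdateTime.Load()
if now.After(last.Add(ttl)) && lastUpdateTime.CompareAndSwap(last, &now) {
	// exactly one concurrent caller wins the CAS and refreshes the RPS
}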

Member

yea, perf difference tends to be quite minor, and pointers prevent a LOT of accidentally-racy stuff from being racy (copies, ==, others). and (strangely) go.uber.org/atomic doesn't embed a nocopy, only nocmp, so those aren't prevented strongly enough imo. the safety's a bit more worth it unless that changes.

Comment on lines 221 to 261
cases := []struct {
name string
ttl time.Duration
elapsed time.Duration
initialRps int
newRps int
expected rate.Limit
}{
{
name: "success",
ttl: time.Second,
elapsed: time.Second - 1,
initialRps: 1,
newRps: 0,
expected: rate.Limit(1),
},
{
name: "refreshed",
ttl: time.Second,
elapsed: time.Second,
initialRps: 1,
newRps: 0,
expected: rate.Limit(0),
},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
mockTime := clock.NewMockedTimeSource()
rps := tc.initialRps
rpsFunc := func() float64 {
res := rps
rps = tc.newRps
return float64(res)
}
limiter := NewDynamicRateLimiter(rpsFunc, WithTimeSource(mockTime), WithTTL(tc.ttl), WithMinBurst(1))

mockTime.Advance(tc.elapsed)
res := limiter.Limit()
assert.Equal(t, tc.expected, res)
})
}
Member
@Groxx Groxx Apr 21, 2025

not actually requesting any changes, the tests achieve what they need to. just a "maybe rethink this pattern" note.


this is essentially "update changes the limit", yea?

so it's

var rps float64
limiter := NewDynamicRateLimiter(func() float64 {
	return rps
}, WithTimeSource(mockTime), WithTTL(time.Second), WithMinBurst(1))
assert.Equal(t, limiter.Limit(), 0)
mockTime.Advance(time.Second)
rps = 1
assert.Equal(t, limiter.Limit(), 1)

but with loooots of boilerplate and unused fields.

(I think having this separate from the repeated-calls is reasonable - it helps narrow down where an issue is, if one or both fail. this is just A Lot™ for no apparent reason, and mismatched types for no apparent reason either (int vs float64, requiring casts, though this is true on all of them))

Member Author

Done, I think that's a good simplification.

Comment on lines -114 to -117
if rl.maxDispatchPerSecond != nil {
return rate.Limit(*rl.maxDispatchPerSecond)
}
return rate.Inf
Member
@Groxx Groxx Apr 21, 2025

I'm reasonably sure this is not ever used in current code, but to call it out as a caution: you've definitely checked this behavior specifically?

for the rest of this file: 🎉

Member Author

Other than Matcher, every use case uses DynamicRateLimiter, which has a function returning float64, so there's no chance of nil. Matcher, meanwhile, has a default value specified in the config for when a poller doesn't provide a rate (Decision workers never provide one), so it won't ever pass nil.

- limiter *quotas.RateLimiter
+ limiter quotas.Limiter
+ // The most recently received Dispatch rate from a poller
+ lastReceivedRate atomic.Value
Member

same generics-note

- if rps == nil {
- 	return
+ if rps != nil {
+ 	tm.lastReceivedRate.Store(*rps)
Member
@Groxx Groxx Apr 22, 2025

hmmm. tbh this file's changes feel kinda odd imo (not just this line, I just had to pick somewhere). Rate() float64 as a signature feels like it's implying a rather lightweight op and probably reusable, but it's doing quite a lot like

numReadPartitionsFn := func(cfg *config.TaskListConfig) int {
	if cfg.EnableGetNumberOfPartitionsFromCache() {
		partitionConfig := tlMgr.TaskListPartitionConfig()

// which calls
func (c *taskListManagerImpl) TaskListPartitionConfig() *types.TaskListPartitionConfig {
	c.partitionConfigLock.RLock()

and... that means it's both exposed in a "probably reusable" way and called in an uncontrolled / unknown way by the ratelimiter itself (the interface doesn't imply "this is definitely cached and not called super frequently", but it's relying on that behavior)

maybe this'd be a better place to use a clock.Ratelimiter (so it's settable), where it'd be almost a drop-in replacement? or is this particular call stack very high frequency?

Member Author

The Matcher is the only place that was actually using quotas.RateLimiter directly, so it's definitely unique compared to the other changes.

UpdateRatelimit is actually called for every single poll request, passing in the TaskList rate limit. I think this is the motivation for having the debounce behavior in quotas.RateLimiter originally, and it's the only scenario where a TTL is specified.

If we update the rate limit every time it'll swing back and forth as changes to pollers roll out. We definitely want some debouncing here.

Rate is only called by TaskListManager#describeTaskList and now also passed to the DynamicRateLimiter, so it gets called much less often than UpdateRatelimit, even if we had a low TTL.

Member
@Groxx Groxx Apr 22, 2025

yea, definitely agreed on debouncing, and doing that in Rate makes sense because it'll definitely eventually update. the behavior looks good, it just feels like a strange / forgettable control flow and expectations, unless it's documented somewhere that it's hard to miss.

we don't seem to have rate/rps/limit exposed almost anywhere except in matching here though so maybe that's fine. I thought there was more of a pattern of "current-limit-getters are lightweight" but that might just be because I'm familiar with the quotas/... package, and it isn't done anywhere else.

so I guess:

  • I'm game to keep it
  • it still feels a bit surprising, maybe some docs would be worth adding?
  • no complaint if nothing changes 👍

Member
@Groxx Groxx left a comment

(bleh, clicked the wrong one, sorry)

Member
@Groxx Groxx left a comment

Broadly very good and glad to see it, just some smaller stuff to consider (not force!)...

... except maybe that tasklist/matcher.go change. that feels potentially riskier than we should keep, and prone to missed expectations in future rewrites.
(and generic-atomics, that seems obviously preferred)

so marking 'no' just to push discussion and mark myself down for the next round. likely easy 👍

Member
@Groxx Groxx left a comment

yea, seems good. left some comments but I think it's good to go as-is too 👍

assert.ErrorIs(t, err, clock.ErrCannotWait)

mockTime.Advance(time.Second)
err = limiter.Wait(ctx)
Member

I believe there's a non-obvious assumption in here because of how the clock stuff interacts, which is... annoying and probably not worth addressing, but pointing it out in case you get any ideas:

because the Wait-internal sleep is using the mock-time-source, this call will block forever if there isn't a full token available (i.e. if the call isn't fully non-blocking). mock-time won't advance while it's waiting, so it'll never un-sleep :|

I've wanted to make a mock-time-Context to address this, because it's a nasty surprise when writing these tests and suddenly getting long delays and timeouts rather than useful errors. but we don't have one yet.

the manual way to handle this is to launch a goroutine to advance mock-time if it delays for more than like 100ms, but it's quite a lot of boilerplate.
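For illustration, that manual workaround might look roughly like this (hypothetical test code, not part of this PR):

// Advance the mocked clock from a background goroutine if Wait blocks, so a
// call that isn't fully non-blocking can't hang the test forever.
done := make(chan struct{})
go func() {
	select {
	case <-done:
	case <-time.After(100 * time.Millisecond): // real time, not the mocked source
		mockTime.Advance(time.Second) // wakes the mocked sleep inside Wait
	}
}()
err = limiter.Wait(ctx)
close(done)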

@natemort natemort merged commit 6e55181 into cadence-workflow:master Apr 28, 2025
23 checks passed