
Ratelimiter-wrapper improvement: don't release the lock when synchronously rejecting a Wait #6721

Merged
1 commit merged into cadence-workflow:master, Mar 8, 2025

Conversation


@Groxx Groxx commented Mar 7, 2025

We're seeing some odd ratelimiting issues with Wait-based usage in our benchmarking cluster (significant under-allowing, like <10%), and a careful read through the code turned up a small issue here which could be a cause.

The issue:

Because the lock was released and then re-acquired when canceling a reserved token (in the case where there is insufficient deadline to wait), it's possible for another goroutine to acquire the lock in between, advance now, and cause the res.CancelAt to fail to return the token.
This could lead to over-limit waiting (negative tokens) and e.g. arbitrarily long delays that reject many short-deadline requests, if bad-enough sequences were triggered.
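
Roughly, the racy shape looks like this. This is a minimal sketch, not the wrapper's actual code: the names, the structure, and the direct use of golang.org/x/time/rate are assumptions suggested by the res.CancelAt call.

```go
package ratelimiter // hypothetical package, just to make the sketch self-contained

import (
	"context"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// wrapper is a stand-in for the real mutex-guarded limiter wrapper.
type wrapper struct {
	mu  sync.Mutex
	lim *rate.Limiter
}

// waitRacy shows the pre-fix shape: the reserved token is canceled only
// after the lock has been released and re-acquired.
func (w *wrapper) waitRacy(ctx context.Context) error {
	w.mu.Lock()
	now := time.Now() // the real wrapper uses an injected time source
	res := w.lim.ReserveN(now, 1)
	delay := res.DelayFrom(now)
	w.mu.Unlock()

	if deadline, ok := ctx.Deadline(); ok && now.Add(delay).After(deadline) {
		// Synchronous rejection: there isn't enough deadline left to wait,
		// so try to give the reserved token back.
		//
		// Between the Unlock above and the Lock below, another goroutine can
		// reserve with a later "now" and advance the limiter's last event,
		// after which this CancelAt restores nothing and the limiter is left
		// a token short.
		w.mu.Lock()
		res.CancelAt(now)
		w.mu.Unlock()
		return context.DeadlineExceeded
	}

	// ... otherwise sleep for delay (cancellably) and return.
	return nil
}
```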

I have unfortunately NOT been able to confirm this, though: even fairly extreme local benchmarking with goroutine yielding in the critical section seems to work fine. It slows down a bit under very high contention (thousands of contending goroutines), but that's not surprising - the underlying ratelimiter does too. And it doesn't truly match what the benchmarking cluster does: concurrency there is rather low, and at low-ish concurrency this wrapper appears to behave perfectly (unlike the underlying limiter).

But it's a clear possibility.

The fix

Luckily this gap can be completely eliminated: cancel the token without releasing the lock.
I'm not sure why I didn't do this earlier tbh - it seems obvious now.
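
Sketched against the same hypothetical wrapper as above, the fixed shape keeps the reserve, the deadline check, and the cancel inside a single critical section (again illustrative, not the exact code in this PR):

```go
// wait cancels the reservation without ever dropping the lock, so no other
// goroutine can advance the limiter between ReserveN and CancelAt.
func (w *wrapper) wait(ctx context.Context) error {
	w.mu.Lock()
	now := time.Now() // again, the real wrapper uses an injected time source
	res := w.lim.ReserveN(now, 1)
	delay := res.DelayFrom(now)
	if deadline, ok := ctx.Deadline(); ok && now.Add(delay).After(deadline) {
		// Still holding the lock: nothing can have moved the limiter's state
		// since the reservation, so the cancel reliably returns the token.
		res.CancelAt(now)
		w.mu.Unlock()
		return context.DeadlineExceeded
	}
	w.mu.Unlock()

	// ... sleep for delay (cancellably) and return, as before.
	return nil
}
```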

That's easy and worth doing, and we can rerun the bigger benchmark that's showing the weird behavior to see if the issue goes away. If it doesn't, that's at least a sign that the problem isn't in the limiter.

The same "unable to cancel" issue does still exist if a wait is canceled while waiting, but I believe that is fundamentally unavoidable with the current implementation because cancels are unreliable. It'd also require racing cancels near every Wait-completion to produce a noticeable dip in throughput, so I don't think anyone will be able to observe this one IRL.

Perf changes / benchmark result changes

None that I can see. I'm a bit surprised that it isn't slightly faster, but I see no actual evidence of it.

@@ -40,7 +40,7 @@ type (
}
)

-func TesteventTimerGeteSuite(t *testing.T) {
+func TestEventTimerGeteSuite(t *testing.T) {
Groxx (Member, Author) commented:

Go 1.24 complains about "Test[lowercase]", but Go 1.23 (in CI) does not.

So this is just a fix for 1.24 users, and it'll be enforced when we update the go.work toolchain / ci / etc.

@@ -303,7 +303,7 @@ func BenchmarkLimiter(b *testing.B) {
// b.Run seems to target <1s, but very fast or very slow things can extend it,
// and up to about 2s seems normal for these.
// because of that, 5s occasionally timed out, but 10s seems fine.
-timeout := 10 * time.Second
+timeout := 20 * time.Second
Groxx (Member, Author) commented:

runs of this benchmark group regularly took ~10s locally, so I'm kinda surprised it hasn't been failing more. maybe 1.24 tries longer to stabilize?

@Groxx Groxx enabled auto-merge (squash) March 8, 2025 00:20
@Groxx Groxx merged commit a9996c4 into cadence-workflow:master Mar 8, 2025
22 checks passed