
Ratelimiter-wrapper improvement: don't release the lock when synchronously rejecting a Wait #6721

Merged
1 commit merged into cadence-workflow:master, Mar 8, 2025

Conversation


@Groxx Groxx commented Mar 7, 2025

We're seeing some odd ratelimiting issues with Wait-based usage in our benchmarking cluster (significant under-allowing, like <10%), and a careful read through the code turned up a small issue here which could be a cause.

The issue:

Because the lock was released and then re-acquired when canceling a reserved token (in the case where there is insufficient deadline to wait), it's possible for another goroutine to acquire the lock in between, advance now, and cause the res.CancelAt to fail to return the token.
This could lead to over-limit waiting (negative tokens) and e.g. arbitrarily long delays that reject many short-deadline requests, if bad-enough sequences were triggered.
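
Roughly, the racy shape looks like this. This is a minimal sketch, not the wrapper's actual code: the names, the structure, and the direct use of golang.org/x/time/rate are assumptions suggested by the res.CancelAt call.

```go
package ratelimiter // hypothetical package, just to make the sketch self-contained

import (
	"context"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// wrapper is a stand-in for the real mutex-guarded limiter wrapper.
type wrapper struct {
	mu  sync.Mutex
	lim *rate.Limiter
}

// waitRacy shows the pre-fix shape: the reserved token is canceled only
// after the lock has been released and re-acquired.
func (w *wrapper) waitRacy(ctx context.Context) error {
	w.mu.Lock()
	now := time.Now() // the real wrapper uses an injected time source
	res := w.lim.ReserveN(now, 1)
	delay := res.DelayFrom(now)
	w.mu.Unlock()

	if deadline, ok := ctx.Deadline(); ok && now.Add(delay).After(deadline) {
		// Synchronous rejection: there isn't enough deadline left to wait,
		// so try to give the reserved token back.
		//
		// Between the Unlock above and the Lock below, another goroutine can
		// reserve with a later "now" and advance the limiter's last event,
		// after which this CancelAt restores nothing and the limiter is left
		// a token short.
		w.mu.Lock()
		res.CancelAt(now)
		w.mu.Unlock()
		return context.DeadlineExceeded
	}

	// ... otherwise sleep for delay (cancellably) and return.
	return nil
}
```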

I have unfortunately NOT been able to confirm this, though: even fairly extreme local benchmarking with goroutine yielding in the critical section seems to work fine. It slows down a bit under very high contention (thousands of contending goroutines), but that's not surprising - the underlying ratelimiter does too. And it doesn't truly match what the benchmarking cluster does: concurrency there is rather low, and at low-ish concurrency this wrapper appears to behave perfectly (unlike the underlying limiter).

But it's a clear possibility.

The fix

Luckily this gap can be completely eliminated: cancel the token without releasing the lock.
I'm not sure why I didn't do this earlier tbh - it seems obvious now.
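
Sketched against the same hypothetical wrapper as above, the fixed shape keeps the reserve, the deadline check, and the cancel inside a single critical section (again illustrative, not the exact code in this PR):

```go
// wait cancels the reservation without ever dropping the lock, so no other
// goroutine can advance the limiter between ReserveN and CancelAt.
func (w *wrapper) wait(ctx context.Context) error {
	w.mu.Lock()
	now := time.Now() // again, the real wrapper uses an injected time source
	res := w.lim.ReserveN(now, 1)
	delay := res.DelayFrom(now)
	if deadline, ok := ctx.Deadline(); ok && now.Add(delay).After(deadline) {
		// Still holding the lock: nothing can have moved the limiter's state
		// since the reservation, so the cancel reliably returns the token.
		res.CancelAt(now)
		w.mu.Unlock()
		return context.DeadlineExceeded
	}
	w.mu.Unlock()

	// ... sleep for delay (cancellably) and return, as before.
	return nil
}
```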

That's easy and worth doing, and we can rerun the bigger benchmark that's showing the weird behavior to see if the issue goes away. If it doesn't, that's at least a sign that the problem isn't in the limiter.

The same "unable to cancel" issue does still exist if a wait is canceled while waiting, but I believe that is fundamentally unavoidable with the current implementation because cancels are unreliable. It'd also require racing cancels near every Wait-completion to produce a noticeable dip in throughput, so I don't think anyone will be able to observe this one IRL.

Perf changes / benchmark result changes

None that I can see. I'm a bit surprised that it isn't slightly faster, but I see no actual evidence of it.

@@ -40,7 +40,7 @@ type (
}
)

-func TesteventTimerGeteSuite(t *testing.T) {
+func TestEventTimerGeteSuite(t *testing.T) {
Groxx (Member, Author) commented:

Go 1.24 complains about "Test[lowercase]", but Go 1.23 (in CI) does not.

So this is just a fix for 1.24 users, and it'll be enforced when we update the go.work toolchain / ci / etc.

@@ -303,7 +303,7 @@ func BenchmarkLimiter(b *testing.B) {
// b.Run seems to target <1s, but very fast or very slow things can extend it,
// and up to about 2s seems normal for these.
// because of that, 5s occasionally timed out, but 10s seems fine.
-timeout := 10 * time.Second
+timeout := 20 * time.Second
Groxx (Member, Author) commented:

runs of this benchmark group regularly took ~10s locally, so I'm kinda surprised it hasn't been failing more. maybe 1.24 tries longer to stabilize?

@Groxx Groxx enabled auto-merge (squash) March 8, 2025 00:20
@Groxx Groxx merged commit a9996c4 into cadence-workflow:master Mar 8, 2025
22 checks passed