Ratelimiter-wrapper improvement: don't release the lock when synchronously rejecting a Wait #6721
We're getting some odd `Wait`-using ratelimiting issues in our benchmarking cluster (significant under-allowing, like <10%), and reading through the code carefully turned up a small issue here which could be a cause.
The issue
Because the lock was released and then re-acquired when canceling a reserved token (when there is insufficient deadline remaining to wait), it's possible for another goroutine to acquire the lock, advance `now`, and cause the `res.CancelAt` call to fail to return a token. This could lead to over-limit waiting (negative tokens), and e.g. arbitrarily long delays that reject many short-deadline requests, if bad-enough sequences were triggered.
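Roughly, the problematic shape looked something like this. This is a minimal sketch only, not the real wrapper code: it assumes a plain `golang.org/x/time/rate` limiter guarded by a mutex, whereas the real wrapper also injects a mockable clock and has more going on.

```go
package ratelimit

import (
	"context"
	"errors"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

// wrapper is a hypothetical stand-in for the real ratelimiter wrapper,
// here only to illustrate the race.
type wrapper struct {
	mu      sync.Mutex
	limiter *rate.Limiter
}

// Wait, roughly as it behaved before this change: the lock is dropped and
// re-acquired around the synchronous rejection, which is the racy window.
func (w *wrapper) Wait(ctx context.Context) error {
	w.mu.Lock()
	now := time.Now()
	res := w.limiter.ReserveN(now, 1)

	deadline, hasDeadline := ctx.Deadline()
	if hasDeadline && now.Add(res.DelayFrom(now)).After(deadline) {
		w.mu.Unlock() // lock released before canceling...

		// ...so another goroutine can ReserveN with a later "now" here,
		// advancing the limiter's internal time past our reservation...

		w.mu.Lock()
		res.CancelAt(time.Now()) // ...and this cancel may restore nothing
		w.mu.Unlock()
		return errors.New("ratelimit: wait would exceed context deadline")
	}
	w.mu.Unlock()

	// (actually sleeping for the reservation's delay, watching ctx, etc.
	// is elided; it isn't relevant to the race)
	return nil
}
```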
I have unfortunately NOT been able to confirm this though: even fairly extreme local benchmarking with goroutine-yielding in the critical section seems to work fine. It slows down a bit under very high contention (thousands of contending goroutines), but that's not surprising - the underlying ratelimiter does too. And it doesn't truly match what the benchmarking cluster does; concurrency there is rather low, and at low-ish concurrency this wrapper appears to behave perfectly (unlike the underlying limiter).
But, it's a clear possibility.
The fix
Luckily this gap can be completely eliminated: cancel the token without releasing the lock.
I'm not sure why I didn't do this earlier tbh, it seems obvious now.
That's easy and worth doing, and we can rerun the bigger benchmark that's showing the weird behavior to see if the issue goes away. If it doesn't, that's at least a sign that the problem isn't in this limiter.
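Continuing the hypothetical sketch from above, the fixed path just cancels at the same `now` while still holding the lock:

```go
// WaitFixed reuses the hypothetical wrapper from the earlier sketch; the only
// change is that the reservation is canceled at the same "now", while still
// holding the lock, so no other goroutine can advance the limiter's time in
// between and the token is reliably returned.
func (w *wrapper) WaitFixed(ctx context.Context) error {
	w.mu.Lock()
	now := time.Now()
	res := w.limiter.ReserveN(now, 1)

	deadline, hasDeadline := ctx.Deadline()
	if hasDeadline && now.Add(res.DelayFrom(now)).After(deadline) {
		res.CancelAt(now) // same lock hold, same timestamp: the cancel cannot be raced
		w.mu.Unlock()
		return errors.New("ratelimit: wait would exceed context deadline")
	}
	w.mu.Unlock()

	// (waiting out the reservation's delay is unchanged and elided)
	return nil
}
```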
The same "unable to cancel" issue does still exist if a `Wait` is canceled while it is waiting, but I believe that is fundamentally unavoidable with the current implementation because cancels are unreliable. It'd also require racing cancels near every `Wait`-completion to produce a noticeable dip in throughput, so I don't think anyone will be able to observe this one IRL.
Perf changes / benchmark result changes
None that I can see. I'm a bit surprised that it isn't slightly faster, but I see no actual evidence of it.