default LBP: Warn on misconfigured local DC #1373

wprzytula · 2025-06-05T06:33:10Z

What's done

DefaultPolicy now issues respective warnings:

if the preferred datacenter is not present in the cluster;
if all nodes in the preferred datacenter are disabled by the HostFilter.

This helps avoid confusion when the user expects requests to be sent to a specific datacenter, but it is not available.

Note

As warnings are going to be emitted on every request (because the configuration is most likely session-wide), the messages will quickly flood the logs and the user should notice.

Alternatively, we could implement some frequency limiting of those warnings, but we prefer to:

keep it simple for now,
let the user notice the misconfiguration as quickly as possible.

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
~~[ ] I have provided docstrings for the public items that I want to introduce.~~
~~[ ] I have adjusted the documentation in ./docs/source/.~~
~~[ ] I added appropriate Fixes: annotations to PR description.~~

github-actions · 2025-06-05T06:35:40Z

cargo semver-checks found no API-breaking changes in this PR.
Checked commit: 1458c5a

scylla/src/policies/load_balancing/default.rs

Lorak-mmk · 2025-06-06T12:05:52Z

Alternatively, we could implement some frequency limiting of those warnings, but we prefer to:
keep it simple for now,
let the user notice the misconfiguration as quickly as possible.

Rate-limited warnings will also let the user know about the issue. They however won't totally break down the driver by logging a message for each request.
There may be scenarios, when the situation we warn about intentionally happens for a bit of time: for example removing entire DC from cluster.
If a user forgot to restart clients in order to change their preferred DC, we do want to remind them, but not totally destroy their application.

`DefaultPolicy::pick()` begins with some sanity checks, whose goal is to warn the user about misconfiguration. Those checks are extracted into a separate method, `DefaultPolicy::pick_sanity_checks()`, so that `pick()` is decluttered. The next commits introduce more sanity checks.

This makes the next commit cleaner.

DefaultPolicy now issues respective warnings: 1. if the preferred datacenter is not present in the cluster; 2. if all nodes in the preferred datacenter are disabled by the HostFilter. This helps avoid confusion when the user expects requests to be sent to a specific datacenter, but it is not available. As warnings are going to be emitted on every request (because the configuration is most likely session-wide), the messages will quickly flood the logs and the user should notice. Alternatively, we could implement some frequency limiting of those warnings, but we rather prefer to: - keep it simple for now, - let the user notice the misconfiguration as quickly as possible.

wprzytula · 2025-06-08T18:09:58Z

Rebased on main.

wprzytula · 2025-06-08T18:12:10Z

Added 3 new commits:

Introduces RateLimiter - a generic solution to in-driver rate limiting, and associated macros.
and 3. Make warnings in LBPs rate limited.

RateLimiter is a new utility that allows rate-limiting arbitrary actions. It provides one central method, `try_acquire`, which checks if the action should be performed based on a specified interval. If the action has not been performed: 1. ever, or 2. in the last `interval`, it updates the last-performed timestamp and returns `true`, allowing the action to proceed. Otherwise, it returns `false`, indicating the action should be skipped due to rate limiting. This utility is designed to be efficient and thread-safe, using atomic operations to ensure that multiple threads can safely check and update the last-performed timestamp without locking. The `RateLimiter` is intended for use in scenarios where actions need to be performed at a controlled rate, such as logging, warning messages, or other operations that should not be executed too frequently to avoid spamming logs or overwhelming resources. For now, it's used to rate-limit warning messages emitted by the driver, in the implementation of `warn_rate_limited!` macro. NOTE to reviewers: this code has been generated with major help of the Claude Sonnet 4 AI model. Beware of common pitfalls of AI-generated code.

In case of misconfiguration, the default load balancing policy can throw a lot of warnings, which flood the logs and overwhelm the machine. To prevent this, we rate limit the warnings there to one per second. NOTE: each warning is rate limited separately, so if there are multiple misconfigurations, each will still be logged.

There were some more warnings that were not rate-limited, which could flood the logs in case of misconfiguration. This commit adds rate limiting there, too.

Lorak-mmk

I think the API of the newly introduced RateLimiter could be better. My idea:

Duration should be stored inside the limiter, because I find it unlikely to that we'll need to use multiple durations with single limiter.
Instead of having a method returning bool, we can have maybe_execute taking a closure that will be called if timer passed.
After the above, the macros won't really give much value over having the limiter stored explicitly inside DefaultPolicy (or having it explicitly as a static). Let's avoid macros unless they really do help in something.
Can you also move timestamp generator to use the new rate limiter?

wprzytula · 2025-06-11T08:55:09Z

I think the API of the newly introduced RateLimiter could be better. My idea:

Duration should be stored inside the limiter, because I find it unlikely to that we'll need to use multiple durations with single limiter.

Sure, I've also been considering that. Will move this setting to the constructor (and name the constructor new_with_interval()).

Instead of having a method returning bool, we can have maybe_execute taking a closure that will be called if timer passed.

This is limiting. This will, for instance, disallow for executing async code. Therefore, I'm in favour of the simple boolean output.
Now that I noticed this, I see that the way I implemented the macros has the same limitation. Well, but the macros are designed to be convenient for use, while the very RateLimiter should be maximally flexible, intended more as a building block than a usable as a stand-alone.

After the above, the macros won't really give much value over having the limiter stored explicitly inside DefaultPolicy (or having it explicitly as a static). Let's avoid macros unless they really do help in something.

The idea I had with the macros there was to make rate-limited warnings maximally welcoming, so that there's minimum effort and code bloat involved when employing them. Keep in mind that the way macro encapsulates the static instance of the RateLimiter makes it immune to accidental sharing of the same limiter for two separate warnings.

Can you also move timestamp generator to use the new rate limiter?

I believe I can. Will try.

Lorak-mmk · 2025-06-11T09:04:38Z

This is limiting. This will, for instance, disallow for executing async code.

This can be solved by having also an async version of this method. What are the other limitations?

Lorak-mmk · 2025-06-11T09:06:01Z

The idea I had with the macros there was to make rate-limited warnings maximally welcoming, so that there's minimum effort and code bloat involved when employing them. Keep in mind that the way macro encapsulates the static instance of the RateLimiter makes it immune to accidental sharing of the same limiter for two separate warnings.

In case of DefaultPolicy I don't see any reason to have a separate limiters for those 4 warnings. Each limiter is another global object that is always present. Macros also hide this overhead.

wprzytula · 2025-06-11T10:58:26Z

The idea I had with the macros there was to make rate-limited warnings maximally welcoming, so that there's minimum effort and code bloat involved when employing them. Keep in mind that the way macro encapsulates the static instance of the RateLimiter makes it immune to accidental sharing of the same limiter for two separate warnings.

In case of DefaultPolicy I don't see any reason to have a separate limiters for those 4 warnings. Each limiter is another global object that is always present. Macros also hide this overhead.

These warnings are different. We do want the user to get each of them, not only one. If we rate limit them together, then if multiple of them are triggered in close time, only the first one is printed.

wprzytula · 2025-06-11T11:00:01Z

This is limiting. This will, for instance, disallow for executing async code.

This can be solved by having also an async version of this method. What are the other limitations?

WDYM by the async version?
To have one method accepting a closure and one accepting an impl Future?

This seems bloaty. Why not just give users a flexible API that lets them utilize the RateLimiter as they want?

wprzytula · 2025-06-11T11:02:18Z

Each limiter is another global object that is always present. Macros also hide this overhead.

This overhead is justified: each warning string is, similarly, another static object that is always present. As the (I assume you mean the memory) overhead is tiny and static, I believe it's not anything to bikeshed about.

Lorak-mmk · 2025-06-11T11:34:37Z

This seems bloaty. Why not just give users a flexible API that lets them utilize the RateLimiter as they want?

Because it introduces macros for something that can be done without them, without much more verbosity.

Lorak-mmk · 2025-06-11T11:35:48Z

These warnings are different. We do want the user to get each of them, not only one. If we rate limit them together, then if multiple of them are triggered in close time, only the first one is printed.

It is extremely unlikely for multiple of those warnings to be triggered, I don't see much sense in caring about this case.

dkropachev · 2025-06-11T11:35:59Z

Each limiter is another global object that is always present. Macros also hide this overhead.

This overhead is justified: each warning string is, similarly, another static object that is always present. As the (I assume you mean the memory) overhead is tiny and static, I believe it's not anything to bikeshed about.

What concerns me is that the driver performs these checks on every request, even though they’re usually unnecessary.
In effect, we’re sacrificing performance in the common case just to show a helpful warning when the user misconfigures the driver.
What if we would do that only when driver get into undesired state, say empty query plan or failover to another DC ?

wprzytula · 2025-06-11T14:20:22Z

This seems bloaty. Why not just give users a flexible API that lets them utilize the RateLimiter as they want?

Because it introduces macros for something that can be done without them, without much more verbosity.

You seem to mix two things. I'm arguing about exposing a boolean from the RateLimiter's method, in contrast to only accepting a closure/Future. You are arguing that macros are unneeded. These are two distinct things.

Lorak-mmk · 2025-06-11T14:31:46Z

You seem to mix two things. I'm arguing about exposing a boolean from the RateLimiter's method, in contrast to only accepting a closure/Future. You are arguing that macros are unneeded. These are two distinct things.

I'm actually arguing against both.
Boolean API imo makes no sense, because there is only one purpose of this boolean: to execute some action if it is true. In that case, taking an action to execute more directly aligns with the use-case.
Currently the biggest imo benefit of macros is avoiding writing the condition separately. If we provide action-API instead of boolean-API, then this advantage disappears. Declaring static object is imo not enough to warrant a macro - it is just one additional line.

wprzytula · 2025-06-11T16:28:54Z

Boolean API imo makes no sense, because there is only one purpose of this boolean: to execute some action if it is true. In that case, taking an action to execute more directly aligns with the use-case.

I can agree with this reasoning, but I still believe that there's no generic enough way to represent "an action" to create such API with just one method.

wprzytula · 2025-06-11T16:35:55Z

Currently the biggest imo benefit of macros is avoiding writing the condition separately. If we provide action-API instead of boolean-API, then this advantage disappears. Declaring static object is imo not enough to warrant a macro - it is just one additional line.

I see it much differently. If I just wanted to avoid writing the condition, I would employ a function, not a macro. It's exactly declaring the exclusive and encapsulated static object at the call site that I need the macro for. I've learned this pattern in TockOS, more specifically in board components. See this for reference. Such macros are the only way to provide the following combination of guarantees:

enforce static allocation of the backing data,
encapsulate and enforce exclusive use of the backing data by the call site.

Lorak-mmk · 2025-06-11T16:36:10Z

I can agree with this reasoning, but I still believe that there's no generic enough way to represent "an action" to create such API with just one method.

I really don't see the problem with having 2 - try_execute and try_execute_async.
Rust doesn't have effect system / keyword generics, and afaik this is the most typical way to deal with having sync / async version of something.

Lorak-mmk · 2025-06-11T16:53:37Z

encapsulate and enforce exclusive use of the backing data by the call site.

I don't see why do you need such strict guarantees for this use case. Manually written static will be just as good here. It is not some kernel code where such security level matters.

Lorak-mmk · 2025-06-11T17:06:30Z

I also don't understand how macro is better than a scope:

    {
        static TEST: i32 = 42;
        let x = TEST;
    }
    let y = TEST; // doesn't work

If some code really needs to guarantee that the static won't be accessed outside, then it can introduce a scope. There is no way to access the static without editing code in the scope.
I still also think that such strong guarantees are not necessary for logging code.

wprzytula · 2025-06-16T07:52:05Z

Each limiter is another global object that is always present. Macros also hide this overhead.

This overhead is justified: each warning string is, similarly, another static object that is always present. As the (I assume you mean the memory) overhead is tiny and static, I believe it's not anything to bikeshed about.

What concerns me is that the driver performs these checks on every request, even though they’re usually unnecessary. In effect, we’re sacrificing performance in the common case just to show a helpful warning when the user misconfigures the driver. What if we would do that only when driver get into undesired state, say empty query plan or failover to another DC ?

I understand your concerns.

However, I think these checks should be cheap. @Lorak-mmk do you agree?

Lorak-mmk · 2025-06-16T08:01:49Z

I understand your concerns.

However, if think these checks should be cheap. @Lorak-mmk do you agree?

I think so. It is mostly simple comparisons. The most expensive part is querying a hashmap, which I don't think will be too slow here. If it turns out to be too slow, we can think of some workarounds (for example, performing whole logic of pick, and only performing the checks if it returned None, or a node from non-preferred DC).

wprzytula requested a review from Lorak-mmk June 5, 2025 06:33

wprzytula self-assigned this Jun 5, 2025

wprzytula added the area/load-balancing label Jun 5, 2025

wprzytula added this to the 1.3.0 milestone Jun 5, 2025

Lorak-mmk requested changes Jun 6, 2025

View reviewed changes

scylla/src/policies/load_balancing/default.rs Outdated Show resolved Hide resolved

scylla/src/policies/load_balancing/default.rs Outdated Show resolved Hide resolved

wprzytula force-pushed the warn-on-bad-local-dc branch from 35fc56e to 084f127 Compare June 6, 2025 11:54

wprzytula requested a review from Lorak-mmk June 6, 2025 11:54

wprzytula added 3 commits June 8, 2025 20:08

default LBP: pick_sanity_checks() slight refactor

c6edc74

This makes the next commit cleaner.

wprzytula force-pushed the warn-on-bad-local-dc branch from 084f127 to e2ba54b Compare June 8, 2025 18:09

wprzytula force-pushed the warn-on-bad-local-dc branch 4 times, most recently from e0598e2 to d2e5fdc Compare June 8, 2025 19:27

wprzytula added 3 commits June 8, 2025 21:37

LBPs: rate limit various warnings

1458c5a

There were some more warnings that were not rate-limited, which could flood the logs in case of misconfiguration. This commit adds rate limiting there, too.

wprzytula force-pushed the warn-on-bad-local-dc branch from d2e5fdc to 1458c5a Compare June 8, 2025 19:37

Lorak-mmk requested changes Jun 10, 2025

View reviewed changes

default LBP: Warn on misconfigured local DC #1373

Are you sure you want to change the base?

default LBP: Warn on misconfigured local DC #1373

Uh oh!

Conversation

wprzytula commented Jun 5, 2025

What's done

Note

Pre-review checklist

Uh oh!

github-actions bot commented Jun 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Lorak-mmk commented Jun 6, 2025

Uh oh!

wprzytula commented Jun 8, 2025

Uh oh!

wprzytula commented Jun 8, 2025

Uh oh!

Lorak-mmk left a comment

Choose a reason for hiding this comment

Uh oh!

wprzytula commented Jun 11, 2025

Uh oh!

Lorak-mmk commented Jun 11, 2025

Uh oh!

Lorak-mmk commented Jun 11, 2025

Uh oh!

wprzytula commented Jun 11, 2025

Uh oh!

wprzytula commented Jun 11, 2025

Uh oh!

wprzytula commented Jun 11, 2025

Uh oh!

Lorak-mmk commented Jun 11, 2025

Uh oh!

Lorak-mmk commented Jun 11, 2025

Uh oh!

dkropachev commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wprzytula commented Jun 11, 2025

Uh oh!

Lorak-mmk commented Jun 11, 2025

Uh oh!

wprzytula commented Jun 11, 2025

Uh oh!

wprzytula commented Jun 11, 2025

Uh oh!

Lorak-mmk commented Jun 11, 2025

Uh oh!

Lorak-mmk commented Jun 11, 2025

Uh oh!

Lorak-mmk commented Jun 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wprzytula commented Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Lorak-mmk commented Jun 16, 2025

Uh oh!

Uh oh!

github-actions bot commented Jun 5, 2025 •

edited

Loading

dkropachev commented Jun 11, 2025 •

edited

Loading

Lorak-mmk commented Jun 11, 2025 •

edited

Loading

wprzytula commented Jun 16, 2025 •

edited

Loading