Skip to content

default LBP: Warn on misconfigured local DC #1373

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

wprzytula
Copy link
Collaborator

Ref: #1371

What's done

DefaultPolicy now issues respective warnings:

  1. if the preferred datacenter is not present in the cluster;
  2. if all nodes in the preferred datacenter are disabled by the HostFilter.

This helps avoid confusion when the user expects requests to be sent to a specific datacenter, but it is not available.

Note

As warnings are going to be emitted on every request (because the configuration is most likely session-wide), the messages will quickly flood the logs and the user should notice.

Alternatively, we could implement some frequency limiting of those warnings, but we prefer to:

  • keep it simple for now,
  • let the user notice the misconfiguration as quickly as possible.

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • [ ] I have provided docstrings for the public items that I want to introduce.
  • [ ] I have adjusted the documentation in ./docs/source/.
  • [ ] I added appropriate Fixes: annotations to PR description.

@wprzytula wprzytula requested a review from Lorak-mmk June 5, 2025 06:33
@wprzytula wprzytula self-assigned this Jun 5, 2025
@wprzytula wprzytula added this to the 1.3.0 milestone Jun 5, 2025
Copy link

github-actions bot commented Jun 5, 2025

cargo semver-checks found no API-breaking changes in this PR.
Checked commit: 1458c5a

@wprzytula wprzytula force-pushed the warn-on-bad-local-dc branch from 35fc56e to 084f127 Compare June 6, 2025 11:54
@wprzytula wprzytula requested a review from Lorak-mmk June 6, 2025 11:54
@Lorak-mmk
Copy link
Collaborator

Alternatively, we could implement some frequency limiting of those warnings, but we prefer to:
keep it simple for now,
let the user notice the misconfiguration as quickly as possible.

Rate-limited warnings will also let the user know about the issue. They however won't totally break down the driver by logging a message for each request.
There may be scenarios, when the situation we warn about intentionally happens for a bit of time: for example removing entire DC from cluster.
If a user forgot to restart clients in order to change their preferred DC, we do want to remind them, but not totally destroy their application.

wprzytula added 3 commits June 8, 2025 20:08
`DefaultPolicy::pick()` begins with some sanity checks, whose goal is
to warn the user about misconfiguration. Those checks are extracted
into a separate method, `DefaultPolicy::pick_sanity_checks()`, so that
`pick()` is decluttered. The next commits introduce more sanity checks.
This makes the next commit cleaner.
DefaultPolicy now issues respective warnings:
1. if the preferred datacenter is not present in the cluster;
2. if all nodes in the preferred datacenter are disabled by
   the HostFilter.

This helps avoid confusion when the user expects requests to be sent
to a specific datacenter, but it is not available.

As warnings are going to be emitted on every request (because
the configuration is most likely session-wide), the messages will
quickly flood the logs and the user should notice.
Alternatively, we could implement some frequency limiting of those
warnings, but we rather prefer to:
- keep it simple for now,
- let the user notice the misconfiguration as quickly as possible.
@wprzytula wprzytula force-pushed the warn-on-bad-local-dc branch from 084f127 to e2ba54b Compare June 8, 2025 18:09
@wprzytula
Copy link
Collaborator Author

Rebased on main.

@wprzytula
Copy link
Collaborator Author

Added 3 new commits:

  1. Introduces RateLimiter - a generic solution to in-driver rate limiting, and associated macros.
  2. and 3. Make warnings in LBPs rate limited.

@wprzytula wprzytula force-pushed the warn-on-bad-local-dc branch 4 times, most recently from e0598e2 to d2e5fdc Compare June 8, 2025 19:27
wprzytula added 3 commits June 8, 2025 21:37
RateLimiter is a new utility that allows rate-limiting arbitrary
actions. It provides one central method, `try_acquire`, which checks if
the action should be performed based on a specified interval. If the
action has not been performed:
1. ever, or
2. in the last `interval`,
it updates the last-performed timestamp and returns `true`, allowing the
action to proceed. Otherwise, it returns `false`, indicating the action
should be skipped due to rate limiting.

This utility is designed to be efficient and thread-safe, using atomic
operations to ensure that multiple threads can safely check and update
the last-performed timestamp without locking.

The `RateLimiter` is intended for use in scenarios where actions need to
be performed at a controlled rate, such as logging, warning messages,
or other operations that should not be executed too frequently to avoid
spamming logs or overwhelming resources.

For now, it's used to rate-limit warning messages emitted by
the driver, in the implementation of `warn_rate_limited!` macro.

NOTE to reviewers: this code has been generated with major help of
the Claude Sonnet 4 AI model. Beware of common pitfalls of AI-generated
code.
In case of misconfiguration, the default load balancing policy can throw
a lot of warnings, which flood the logs and overwhelm the machine.
To prevent this, we rate limit the warnings there to one per second.

NOTE: each warning is rate limited separately, so if there are multiple
misconfigurations, each will still be logged.
There were some more warnings that were not rate-limited, which could
flood the logs in case of misconfiguration. This commit adds rate
limiting there, too.
@wprzytula wprzytula force-pushed the warn-on-bad-local-dc branch from d2e5fdc to 1458c5a Compare June 8, 2025 19:37
Copy link
Collaborator

@Lorak-mmk Lorak-mmk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the API of the newly introduced RateLimiter could be better. My idea:

  • Duration should be stored inside the limiter, because I find it unlikely to that we'll need to use multiple durations with single limiter.
  • Instead of having a method returning bool, we can have maybe_execute taking a closure that will be called if timer passed.
  • After the above, the macros won't really give much value over having the limiter stored explicitly inside DefaultPolicy (or having it explicitly as a static). Let's avoid macros unless they really do help in something.
  • Can you also move timestamp generator to use the new rate limiter?

@wprzytula
Copy link
Collaborator Author

I think the API of the newly introduced RateLimiter could be better. My idea:

  • Duration should be stored inside the limiter, because I find it unlikely to that we'll need to use multiple durations with single limiter.

Sure, I've also been considering that. Will move this setting to the constructor (and name the constructor new_with_interval()).

  • Instead of having a method returning bool, we can have maybe_execute taking a closure that will be called if timer passed.

This is limiting. This will, for instance, disallow for executing async code. Therefore, I'm in favour of the simple boolean output.
Now that I noticed this, I see that the way I implemented the macros has the same limitation. Well, but the macros are designed to be convenient for use, while the very RateLimiter should be maximally flexible, intended more as a building block than a usable as a stand-alone.

  • After the above, the macros won't really give much value over having the limiter stored explicitly inside DefaultPolicy (or having it explicitly as a static). Let's avoid macros unless they really do help in something.

The idea I had with the macros there was to make rate-limited warnings maximally welcoming, so that there's minimum effort and code bloat involved when employing them. Keep in mind that the way macro encapsulates the static instance of the RateLimiter makes it immune to accidental sharing of the same limiter for two separate warnings.

  • Can you also move timestamp generator to use the new rate limiter?

I believe I can. Will try.

@Lorak-mmk
Copy link
Collaborator

This is limiting. This will, for instance, disallow for executing async code.

This can be solved by having also an async version of this method. What are the other limitations?

@Lorak-mmk
Copy link
Collaborator

The idea I had with the macros there was to make rate-limited warnings maximally welcoming, so that there's minimum effort and code bloat involved when employing them. Keep in mind that the way macro encapsulates the static instance of the RateLimiter makes it immune to accidental sharing of the same limiter for two separate warnings.

In case of DefaultPolicy I don't see any reason to have a separate limiters for those 4 warnings. Each limiter is another global object that is always present. Macros also hide this overhead.

@wprzytula
Copy link
Collaborator Author

The idea I had with the macros there was to make rate-limited warnings maximally welcoming, so that there's minimum effort and code bloat involved when employing them. Keep in mind that the way macro encapsulates the static instance of the RateLimiter makes it immune to accidental sharing of the same limiter for two separate warnings.

In case of DefaultPolicy I don't see any reason to have a separate limiters for those 4 warnings. Each limiter is another global object that is always present. Macros also hide this overhead.

These warnings are different. We do want the user to get each of them, not only one. If we rate limit them together, then if multiple of them are triggered in close time, only the first one is printed.

@wprzytula
Copy link
Collaborator Author

This is limiting. This will, for instance, disallow for executing async code.

This can be solved by having also an async version of this method. What are the other limitations?

WDYM by the async version?
To have one method accepting a closure and one accepting an impl Future?

This seems bloaty. Why not just give users a flexible API that lets them utilize the RateLimiter as they want?

@wprzytula
Copy link
Collaborator Author

Each limiter is another global object that is always present. Macros also hide this overhead.

This overhead is justified: each warning string is, similarly, another static object that is always present. As the (I assume you mean the memory) overhead is tiny and static, I believe it's not anything to bikeshed about.

@Lorak-mmk
Copy link
Collaborator

This seems bloaty. Why not just give users a flexible API that lets them utilize the RateLimiter as they want?

Because it introduces macros for something that can be done without them, without much more verbosity.

@Lorak-mmk
Copy link
Collaborator

These warnings are different. We do want the user to get each of them, not only one. If we rate limit them together, then if multiple of them are triggered in close time, only the first one is printed.

It is extremely unlikely for multiple of those warnings to be triggered, I don't see much sense in caring about this case.

@dkropachev
Copy link
Collaborator

dkropachev commented Jun 11, 2025

Each limiter is another global object that is always present. Macros also hide this overhead.

This overhead is justified: each warning string is, similarly, another static object that is always present. As the (I assume you mean the memory) overhead is tiny and static, I believe it's not anything to bikeshed about.

What concerns me is that the driver performs these checks on every request, even though they’re usually unnecessary.
In effect, we’re sacrificing performance in the common case just to show a helpful warning when the user misconfigures the driver.
What if we would do that only when driver get into undesired state, say empty query plan or failover to another DC ?

@wprzytula
Copy link
Collaborator Author

This seems bloaty. Why not just give users a flexible API that lets them utilize the RateLimiter as they want?

Because it introduces macros for something that can be done without them, without much more verbosity.

You seem to mix two things. I'm arguing about exposing a boolean from the RateLimiter's method, in contrast to only accepting a closure/Future. You are arguing that macros are unneeded. These are two distinct things.

@Lorak-mmk
Copy link
Collaborator

You seem to mix two things. I'm arguing about exposing a boolean from the RateLimiter's method, in contrast to only accepting a closure/Future. You are arguing that macros are unneeded. These are two distinct things.

I'm actually arguing against both.
Boolean API imo makes no sense, because there is only one purpose of this boolean: to execute some action if it is true. In that case, taking an action to execute more directly aligns with the use-case.
Currently the biggest imo benefit of macros is avoiding writing the condition separately. If we provide action-API instead of boolean-API, then this advantage disappears. Declaring static object is imo not enough to warrant a macro - it is just one additional line.

@wprzytula
Copy link
Collaborator Author

Boolean API imo makes no sense, because there is only one purpose of this boolean: to execute some action if it is true. In that case, taking an action to execute more directly aligns with the use-case.

I can agree with this reasoning, but I still believe that there's no generic enough way to represent "an action" to create such API with just one method.

@wprzytula
Copy link
Collaborator Author

Currently the biggest imo benefit of macros is avoiding writing the condition separately. If we provide action-API instead of boolean-API, then this advantage disappears. Declaring static object is imo not enough to warrant a macro - it is just one additional line.

I see it much differently. If I just wanted to avoid writing the condition, I would employ a function, not a macro. It's exactly declaring the exclusive and encapsulated static object at the call site that I need the macro for. I've learned this pattern in TockOS, more specifically in board components. See this for reference. Such macros are the only way to provide the following combination of guarantees:

  • enforce static allocation of the backing data,
  • encapsulate and enforce exclusive use of the backing data by the call site.

@Lorak-mmk
Copy link
Collaborator

I can agree with this reasoning, but I still believe that there's no generic enough way to represent "an action" to create such API with just one method.

I really don't see the problem with having 2 - try_execute and try_execute_async.
Rust doesn't have effect system / keyword generics, and afaik this is the most typical way to deal with having sync / async version of something.

@Lorak-mmk
Copy link
Collaborator

encapsulate and enforce exclusive use of the backing data by the call site.

I don't see why do you need such strict guarantees for this use case. Manually written static will be just as good here. It is not some kernel code where such security level matters.

@Lorak-mmk
Copy link
Collaborator

Lorak-mmk commented Jun 11, 2025

I also don't understand how macro is better than a scope:

    {
        static TEST: i32 = 42;
        let x = TEST;
    }
    let y = TEST; // doesn't work

If some code really needs to guarantee that the static won't be accessed outside, then it can introduce a scope. There is no way to access the static without editing code in the scope.
I still also think that such strong guarantees are not necessary for logging code.

@wprzytula
Copy link
Collaborator Author

wprzytula commented Jun 16, 2025

Each limiter is another global object that is always present. Macros also hide this overhead.

This overhead is justified: each warning string is, similarly, another static object that is always present. As the (I assume you mean the memory) overhead is tiny and static, I believe it's not anything to bikeshed about.

What concerns me is that the driver performs these checks on every request, even though they’re usually unnecessary. In effect, we’re sacrificing performance in the common case just to show a helpful warning when the user misconfigures the driver. What if we would do that only when driver get into undesired state, say empty query plan or failover to another DC ?

I understand your concerns.

However, I think these checks should be cheap. @Lorak-mmk do you agree?

@Lorak-mmk
Copy link
Collaborator

I understand your concerns.

However, if think these checks should be cheap. @Lorak-mmk do you agree?

I think so. It is mostly simple comparisons. The most expensive part is querying a hashmap, which I don't think will be too slow here. If it turns out to be too slow, we can think of some workarounds (for example, performing whole logic of pick, and only performing the checks if it returned None, or a node from non-preferred DC).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants