Is your feature request related to a problem? Please describe.
My colleague chatted with you on Discord about this. Our main use case is disparate clusters, where each one is behind a load balancer. We want to use one of those clusters as the primary location, and only swap to the other cluster when the main cluster is completely down/unavailable. We think this could easily be achieved with a little more input from externally pluggable logic when setting up the ConnectionFactory.
Describe the solution you'd like
Ultimately, something like the RetryListener, but for Connections and not just for topology recovery. Alternatively, it could be done with lambdas such as Predicates, with access to the connection that failed. A connection retry count could also be passed in to help make judgments.
We envision setting cluster tags in our servers that inform the client about which cluster they're connected to, and perhaps additionally a cluster tag to indicate that the address used was behind a load balancer. So, we could check to see if the server tags indicate a load balancer address, combined with the reason the connection was shut down.
Maybe an easy way to plug this in today is an interface that returns an AddressResolver; the default implementation would return the current AddressResolver unconditionally, preserving current behavior.
Then we could send a non-shuffling list of [secondary, primary] when there's an unexpected issue, or if retry count goes higher than some tolerable level, otherwise ask the system to attempt [primary, secondary] in standard scenarios. Or even skip sending primary/secondary together and let the new implementation determine whether the primary or secondary should be tried by itself. I.e. try primary three times, then try secondary three times, then give up.
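The "try primary a few times, then fall back to secondary" ordering above can be sketched roughly as follows. Everything here is notional: the class name, the address strings, and the three-attempt threshold are illustrative, not existing client API.

```java
import java.util.List;

// Hypothetical sketch: choose the (non-shuffled) address order based on
// how many consecutive failures the primary cluster has seen.
public class FailoverOrdering {

    static final int MAX_PRIMARY_ATTEMPTS = 3;

    // Returns the candidate list, in order of preference, for the next attempt.
    static List<String> addressesFor(int primaryFailures) {
        if (primaryFailures >= MAX_PRIMARY_ATTEMPTS) {
            // Primary looks down: put the secondary cluster first.
            return List.of("secondary-lb:5672", "primary-lb:5672");
        }
        // Standard scenario: prefer the primary cluster.
        return List.of("primary-lb:5672", "secondary-lb:5672");
    }

    public static void main(String[] args) {
        System.out.println(addressesFor(0)); // [primary-lb:5672, secondary-lb:5672]
        System.out.println(addressesFor(3)); // [secondary-lb:5672, primary-lb:5672]
    }
}
```

A real implementation would also decide when to reset the failure counter and when to give up entirely, as described above.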
I have not yet looked at the downstream impacts of wiring this through the existing code. First we just want to hash out ideas on what you guys like / don't like. We're willing to do the legwork to contribute.
Describe alternatives you've considered
Currently we override AddressResolver to always return a fixed list and skip shuffling, which works mostly well but there are edge cases where a client may cascade to the more distant cluster when their primary is still up.
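The current workaround looks roughly like this. The Resolver interface below is a minimal stand-in so the example is self-contained; the client's real AddressResolver works with Address objects and checked exceptions, and the names here are illustrative.

```java
import java.util.List;

// Minimal stand-in for the client's resolver abstraction.
interface Resolver {
    List<String> getAddresses();
    // Skip shuffling: return the list in the caller's order.
    default List<String> maybeShuffle(List<String> input) { return input; }
}

// Workaround sketch: always return the same fixed, ordered list.
public class FixedListResolver implements Resolver {
    private final List<String> addresses;

    FixedListResolver(List<String> addresses) {
        this.addresses = List.copyOf(addresses); // preserve caller's order
    }

    @Override
    public List<String> getAddresses() {
        return addresses; // primary first, always
    }

    public static void main(String[] args) {
        Resolver r = new FixedListResolver(
                List.of("primary-lb:5672", "secondary-lb:5672"));
        System.out.println(r.maybeShuffle(r.getAddresses()));
    }
}
```

The edge case described above follows from this being stateless: the resolver cannot tell a transient primary hiccup from a real outage, so a client can cascade to the distant cluster too eagerly.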
Additional context
No response
Sounds like a problem for a stateful AddressResolver, plus an AddressResolver extension that would tell it the state/name/tag of the current connection.
This may require introducing a way to tag a Connection, either when a ConnectionFactory creates one or afterwards, to help the AddressResolver above identify the current destination.
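The stateful-resolver-plus-tagging idea could look something like this. All names (ClusterAwareResolver, connectionOpened, the tag values) are invented for illustration; none of this exists in the client.

```java
import java.util.List;

// Notional sketch: a stateful resolver fed by a connection-tagging hook.
public class ClusterAwareResolver {

    private volatile String currentClusterTag = "primary";

    // Hook the ConnectionFactory (or a recovery listener) would call once a
    // connection is established and the server's cluster tag is known.
    void connectionOpened(String clusterTag) {
        this.currentClusterTag = clusterTag;
    }

    // Offer the cluster we are currently attached to first.
    List<String> getAddresses() {
        if ("secondary".equals(currentClusterTag)) {
            return List.of("secondary-lb:5672", "primary-lb:5672");
        }
        return List.of("primary-lb:5672", "secondary-lb:5672");
    }

    public static void main(String[] args) {
        ClusterAwareResolver r = new ClusterAwareResolver();
        System.out.println(r.getAddresses());
        r.connectionOpened("secondary"); // e.g. after failing over
        System.out.println(r.getAddresses());
    }
}
```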
I would not modify the client beyond that for this exotic use case, where clusters disappear and clients should not even try to reconnect to the same cluster.
It could be indeed an extension to AddressResolver (stateful or not, a new method would have to provide the state for the latter), but a new hook is fine as well, especially if what you introduce does not exactly fit the AddressResolver. Whatever we end up with, having a default "passthrough" implementation to avoid any breaking changes is what we prefer.
In your ConnectionRetryListener example, I would use a single Context/State argument (it can be an inner interface in the main interface) that contains the information you need. This makes it easier to add new information in the future, instead of breaking the interface by adding a new parameter to the main method.
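The single-Context-argument shape could be sketched as below. The interface and accessor names are hypothetical; the point is only that new fields land on the inner Context, so implementors of the listener method never break.

```java
// Notional sketch of a listener whose only parameter is a context object.
public interface ConnectionRetryListener {

    // Everything the listener might want to inspect, in one place.
    interface Context {
        int retryCount();
        String failedAddress();
        Throwable failure();
        // Future information can be added here (ideally as default
        // methods) without changing shouldRetry's signature.
    }

    // Decide whether another connection attempt should be made.
    boolean shouldRetry(Context context);
}
```

Because the listener is a single-method interface, it can be supplied as a lambda, e.g. `ctx -> ctx.retryCount() < 3`.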
You can create a PR to start iterating. I'd be curious to learn more about what you need the failed connection for and how you would handle "cluster tags". Maybe you could add a test that simulates a simplified version of your use case, if that is not too much work (it is always good to validate the design).