Expose request coordinator #1287

Merged
merged 12 commits into from
May 9, 2025

Conversation

wprzytula
Collaborator

@wprzytula wprzytula commented Mar 20, 2025

This is the second approach to #1030.

Coordinator

Coordinator is defined as follows:

/// The coordinator of a CQL request, i.e., the node+shard that receives
/// and processes the request, and hopefully eventually sends a response.
#[derive(Debug, Clone)]
pub struct Coordinator {
    /// Translated address, i.e., one that the connection is opened against.
    connection_address: SocketAddr,
    /// The node that served as coordinator.
    node: Arc<Node>,
    /// Number of the shard, if applicable (present for ScyllaDB nodes, absent for Cassandra).
    shard: Option<Shard>,
}

It has pub getters for all of its fields.

Coordinator is built from NodeRef (untranslated_address and host_id), Shard (shard), and Connection (connection_address).
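For illustration, a minimal sketch of the struct together with its getters. `Node` and `Shard` here are simplified stand-ins for the driver's types, and the constructor and getter names are assumptions, not necessarily the ones used in the PR.

```rust
use std::net::SocketAddr;
use std::sync::Arc;

/// Simplified stand-in for the driver's `Node` (assumption: the real type has more fields).
#[derive(Debug)]
pub struct Node {
    pub host_id: u128,
    pub address: SocketAddr,
}

/// Stand-in for the driver's shard number type.
pub type Shard = u32;

#[derive(Debug, Clone)]
pub struct Coordinator {
    /// Translated address, i.e., one that the connection is opened against.
    connection_address: SocketAddr,
    /// The node that served as coordinator.
    node: Arc<Node>,
    /// Shard number, if applicable.
    shard: Option<Shard>,
}

impl Coordinator {
    /// Hypothetical constructor: combines the node, the shard and the connection address.
    pub fn new(node: Arc<Node>, shard: Option<Shard>, connection_address: SocketAddr) -> Self {
        Self { connection_address, node, shard }
    }

    pub fn connection_address(&self) -> SocketAddr {
        self.connection_address
    }

    pub fn node(&self) -> &Arc<Node> {
        &self.node
    }

    pub fn shard(&self) -> Option<Shard> {
        self.shard
    }
}
```

Keeping the fields private and exposing getters leaves room to change the internal representation later without breaking the public API.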

Use

Coordinator is stored in QueryResult (as well as QueryRowsResult) and QueryPager (together with TypedRowStream). QueryPager exposes an iterator of references to Coordinators, each of which served one page.
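A rough sketch of the pager side, assuming a simplified `Coordinator` and a hypothetical `Pager` type (neither is the driver's actual definition): the pager records one coordinator per fetched page and hands out references to them.

```rust
use std::net::SocketAddr;

/// Simplified stand-in for the real `Coordinator`.
#[derive(Debug, Clone, PartialEq)]
pub struct Coordinator {
    pub connection_address: SocketAddr,
    pub shard: Option<u32>,
}

/// Hypothetical pager that remembers which coordinator served each page.
pub struct Pager {
    request_coordinators: Vec<Coordinator>,
}

impl Pager {
    pub fn new() -> Self {
        Self { request_coordinators: Vec::new() }
    }

    /// Called once per fetched page (hypothetical hook).
    pub fn record_page(&mut self, coordinator: Coordinator) {
        self.request_coordinators.push(coordinator);
    }

    /// Iterates over references to coordinators, in page order.
    pub fn request_coordinators(&self) -> impl Iterator<Item = &Coordinator> + '_ {
        self.request_coordinators.iter()
    }
}
```

Returning an iterator of references avoids cloning the coordinators when the caller only wants to inspect them.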

Testing

No tests yet.

Fixes: #1030

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@wprzytula wprzytula requested review from muzarski and Lorak-mmk March 20, 2025 07:46
@wprzytula wprzytula added this to the 1.1.0 milestone Mar 20, 2025
@wprzytula wprzytula added cpp-rust-driver-p1 Functionality required by cpp-rust-driver QoL labels Mar 20, 2025
github-actions bot commented Mar 20, 2025

cargo semver-checks found no API-breaking changes in this PR.
Checked commit: b5a744e

@wprzytula wprzytula force-pushed the store-request-coordinator branch from 7a45481 to c2860f5 Compare March 20, 2025 07:49
@Lorak-mmk
Collaborator

Why did you introduce new struct (Coordinator) instead of using Node?

@wprzytula
Collaborator Author

Why did you introduce new struct (Coordinator) instead of using Node?

  1. I did not realise that we do have access to Node in all places where we need to create Coordinator -- this fortunately seems to be true.
  2. Coordinator is not a struct. It's just a type alias that is used for convenience. I imagined it could be a pair (Uuid, SocketAddr) in the final version of this PR.

@muzarski
Contributor

Why did you introduce new struct (Coordinator) instead of using Node?

Just Node is not enough for cpp-rust-driver purposes. In cpp-driver, the coordinator is the target (i.e. node + shard) identified by SocketAddr.

@Lorak-mmk
Collaborator

Why did you introduce new struct (Coordinator) instead of using Node?

Just Node is not enough for cpp-rust-driver purposes. In cpp-driver, the coordinator is the target (i.e. node + shard) identified by SocketAddr.

Is there really an API in cpp-driver to retrieve a shard that the request was sent to?

@muzarski
Contributor

Why did you introduce new struct (Coordinator) instead of using Node?

Just Node is not enough for cpp-rust-driver purposes. In cpp-driver, the coordinator is the target (i.e. node + shard) identified by SocketAddr.

Is there really an API in cpp-driver to retrieve a shard that the request was sent to?

No, I might have phrased it wrong. Now that I think of it, I did, because there are no shards in Cassandra. What I meant is just that the request coordinator is identified via ip + port (SocketAddr).

@Lorak-mmk
Collaborator

What I meant is just that the request coordinator is identified via ip + port (SocketAddr).

Then I don't understand why Node is not enough. It has an addr field which contains a SocketAddr inside: https://docs.rs/scylla/latest/scylla/cluster/enum.NodeAddr.html

@Lorak-mmk
Collaborator

We can of course consider wrapping Node (or whatever else we need) in a new Coordinator struct if we may want to add new stuff there (maybe a Option?).

@muzarski
Contributor

What I meant is just that the request coordinator is identified via ip + port (SocketAddr).

Then I don't understand why Node is not enough. It has an addr field which contains a SocketAddr inside: https://docs.rs/scylla/latest/scylla/cluster/enum.NodeAddr.html

But the port in this SocketAddr is not always the port of the connection we routed the request to. AFAIU, this port is set to self.control_connection_endpoint.address().port() obtained here: https://github.com/scylladb/scylla-rust-driver/blob/v1.0.0/scylla/src/cluster/metadata.rs#L608. This port is later propagated for other peers as well (is it even correct???). Notice that the queried rpc_address from system.local/peers is an ip address without any port information.
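To make the concern concrete, here is a small sketch under the assumption described in the comment above: peers are known only by IP, so their port is borrowed from the control connection. The function name `peer_address` is hypothetical and does not come from the driver.

```rust
use std::net::{IpAddr, SocketAddr};

/// Builds a peer's address from its bare IP (as read from `rpc_address`)
/// and the port of the control connection. Hypothetical helper.
pub fn peer_address(rpc_address: IpAddr, control_connection: SocketAddr) -> SocketAddr {
    SocketAddr::new(rpc_address, control_connection.port())
}
```

If the request was actually sent over a connection opened against a different port of the same node, the address produced this way differs from the connection address, which is why the two are kept apart.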

@wprzytula

This comment was marked as off-topic.

@wprzytula

This comment was marked as off-topic.

@Lorak-mmk

This comment was marked as off-topic.

@wprzytula

This comment was marked as off-topic.

@Lorak-mmk

This comment was marked as off-topic.

@wprzytula

This comment was marked as off-topic.

@Lorak-mmk Lorak-mmk modified the milestones: 1.1.0, 1.2.0 Apr 3, 2025
@wprzytula
Collaborator Author

Why did you introduce new struct (Coordinator) instead of using Node?

Node is not clonable. Arc<Node> is clonable quite cheaply, but not as cheaply as custom structs filled with a bunch of Copy types are. Therefore, I still believe a tuple (be it a named tuple or a struct) is a better fit here.
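A sketch of the difference, using simplified stand-in types (`Node` and `PlainCoordinator` are illustrative, not the driver's definitions): cloning an `Arc` touches a shared atomic counter, while a struct of `Copy` fields is duplicated bitwise.

```rust
use std::net::SocketAddr;
use std::sync::Arc;

/// Simplified stand-in for the driver's `Node`, which is not `Clone`.
pub struct Node {
    pub address: SocketAddr,
}

/// Hypothetical coordinator made only of `Copy` fields.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PlainCoordinator {
    pub connection_address: SocketAddr,
    pub host_id: u128,
    pub shard: Option<u32>,
}

/// Shares the node once and reports the strong count while the share is alive.
pub fn strong_count_after_sharing(node: &Arc<Node>) -> usize {
    let _shared = Arc::clone(node); // atomic increment of the reference count
    Arc::strong_count(node)
    // `_shared` is dropped here: atomic decrement
}

/// Duplicates the coordinator without any reference counting.
pub fn duplicate(coordinator: PlainCoordinator) -> (PlainCoordinator, PlainCoordinator) {
    (coordinator, coordinator) // plain bitwise copies
}
```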

@wprzytula wprzytula force-pushed the store-request-coordinator branch 2 times, most recently from c885a61 to e72ed77 Compare April 20, 2025 11:05
@wprzytula wprzytula marked this pull request as ready for review April 20, 2025 11:11
Contributor

@muzarski muzarski left a comment

Looks good

Comment on lines 418 to 425
pub struct Coordinator {
    /// Translated address, i.e., one that the connection is opened against.
    connection_address: SocketAddr,
    /// Untranslated address, i.e., one that the node broadcast.
    node_address: SocketAddr,
    /// Unique ID of the node in the cluster.
    host_id: Uuid,
    /// Number of the shard, if applicable (present for ScyllaDB nodes, absent for Cassandra).
    shard: Option<Shard>,
}
Contributor

commit: session: introduce Coordinator

  1. @muzarski - please make sure that Coordinator's API meets the needs
    of cpp-rust-driver.

I think this is more than enough. The main use case of cass_future_coordinator is something like

// CassNode would be a wrapper over `Coordinator`
CassNode *n = cass_future_coordinator(cass_session_execute(...));
// This would enforce `SingleTargetLoadBalancingPolicy` using the host_id/node_address 
// (and most probably the shard).
cass_statement_set_node(some_statement, n);

The API you proposed, along with SingleTargetLoadBalancingPolicy is more than enough to implement this on cpp-rust-driver side.
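On the Rust side, the idea of pinning a later statement to the coordinator of an earlier one could look roughly like this. This is only a sketch: `Target` and `SingleTarget` are hypothetical names, and the real SingleTargetLoadBalancingPolicy may work differently.

```rust
/// Hypothetical description of where to send a request.
#[derive(Debug, Clone, PartialEq)]
pub struct Target {
    pub host_id: u128,
    pub shard: Option<u32>,
}

/// Hypothetical policy that always picks the one remembered target.
pub struct SingleTarget {
    target: Target,
}

impl SingleTarget {
    /// Remembers the host ID and shard taken from a previous request's coordinator.
    pub fn from_coordinator(host_id: u128, shard: Option<u32>) -> Self {
        Self { target: Target { host_id, shard } }
    }

    /// Returns the remembered target if its node is still known to the cluster.
    pub fn pick(&self, known_hosts: &[u128]) -> Option<Target> {
        if known_hosts.contains(&self.target.host_id) {
            Some(self.target.clone())
        } else {
            None
        }
    }
}
```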

Comment on lines 464 to 478
/// The mock coordinator that is used wherever a [Coordinator] is needed but possibly unknown
/// (e.g., on a control connection before the metadata is fetched),
/// including non-[Session] APIs (`MetadataReader::query_metadata()`, `Connection::query_iter()`, etc.).
///
/// Care should be taken not to leak this [Coordinator] to any user-facing API.
pub(crate) const MOCK_COORDINATOR: Coordinator = Coordinator {
    host_id: Uuid::from_u128(0x_feed_dead_deaf_deed_cafe),
    connection_address: SocketAddr::new(std::net::IpAddr::V6(std::net::Ipv6Addr::LOCALHOST), 42),
    node_address: SocketAddr::new(std::net::IpAddr::V4(std::net::Ipv4Addr::LOCALHOST), 2137),
    shard: Some(2137),
};

Contributor

commit: codewide: introduce Coordinator to QueryResult

Ok, so I believe it's either this or holding Coordinator under Option<> in QueryResult. TBH, I don't have a strong opinion on this. Both have their pros and cons. Maybe I'm leaning slightly towards Option<Coordinator>, as it is not as bug-prone as the MOCK_COORDINATOR solution.

Contributor

If we decide to stay with the MOCK_COORDINATOR solution, I'd at least 0-initialize the struct.

Collaborator Author

Hmm. I thought that using some crazy values would bring it to our (or someone else's) attention quicker should anything go wrong.

Collaborator Author

@wprzytula wprzytula Apr 22, 2025

But you're right in that zero IP address and zero host ID is obviously not a valid state. Maybe let's stay with 2137 shard (or u32::MAX), though.

Collaborator Author

Ok, so I believe it's either this or holding Coordinator under Option<> in QueryResult. TBH, I don't have a strong opinion on this. Both have their pros and cons. Maybe I'm leaning slightly towards Option<Coordinator>, as it is not as bug-prone as the MOCK_COORDINATOR solution.

If we put Option<Coordinator> into QueryResult, shall we return Option<Coordinator> from QueryResult::request_coordinator() or rather panic on None?

My point was that the user should never be able to observe a None Coordinator. This is because the only requests whose Coordinator is unknown are driver-private, and their results are not exposed to the user. Thus the user should not be told to even consider the None case. That left two options:

  1. to assert the correct behaviour at runtime by panicking at the point of invalid use (request_coordinator() is called on a QueryResult without a known Coordinator), or
  2. to hide the type-level possibility of incorrect behaviour by introducing a mock. This lets us pretend on the QueryResult level that there's no hazard. I went with this one because I thought there would be no benefit from such a visual threat.

Contributor

But you're right in that zero IP address and zero host ID is obviously not a valid state. Maybe let's stay with 2137 shard (or u32::MAX), though.

u32::MAX sounds good

@wprzytula wprzytula force-pushed the store-request-coordinator branch from e72ed77 to 7e9f696 Compare April 22, 2025 08:17
@github-actions github-actions bot removed the semver-checks-breaking cargo-semver-checks reports that this PR introduces breaking API changes label May 8, 2025
@wprzytula
Collaborator Author

Solved the problem around lack of UnwindSafe autoimplementation.

Collaborator

Commit: codewide: introduce Coordinator to QueryResult

One possible solution (which we should definitely not pursue now, maybe in the distant future if this becomes a problem) is to not use QueryResult internally, and instead have some InternalQueryResult that does not contain Coordinator.

Collaborator Author

@wprzytula wprzytula May 8, 2025

I also imagined that. However, the amount of boilerplate that this would require scared me too much to consider it further.

Comment on lines +59 to +63
fn into_query_result_and_paging_state_with_maybe_unknown_coordinator(
self,
request_coordinator: Option<Coordinator>,
) -> Result<(QueryResult, PagingStateResponse), RequestAttemptError> {
let (raw_rows, paging_state_response) = match self.response {
let Self {
Collaborator

I hate names that are this long, but I don't see a better solution.

wprzytula added 5 commits May 9, 2025 06:50
It was a leftover from the deserialization refactor.
A number of Connection's private API methods:
- `query_single_page()`,
- `query_single_page_with_consistency()`,
- `execute_unpaged()`,
- `batch()`,
were not used at all and were decorated with `#[allow(dead_code)]`.
The funny part is that it was I who decorated them as such.
I decided it's time to say goodbye to them. There's no use in keeping
them, especially since internal refactors often require adjusting them,
too (which was the case in this PR).

Additionally, `query_raw_unpaged()` and `execute_raw_unpaged()` had
their visibility reduced to `pub(self)`, and `execute_raw_unpaged()` was
made `#[cfg(test)]`. These measures decrease risk of misusing those APIs
from outside `Connection`. One exemplary hazard that justifies this
action is that those functions ignore execution profiles.
`QueryResponse` -> `QueryResult` can be performed by calling
`query_response.into_non_error_query_response().into_query_result()`,
so there's no need for direct `into_query_result()` conversion method
on `QueryResponse`. It's removed for easier refactor in further commits.
This is a tiny adjustment to only call `Connection::get_connect_address`
once, not unnecessarily twice.
`Session` started to have two separate `impl` blocks after
deserialization refactor, as a legacy. Those blocks are now merged.
@wprzytula wprzytula force-pushed the store-request-coordinator branch from 2ad215a to 19beb7a Compare May 9, 2025 04:50
@wprzytula
Collaborator Author

Rebased on main.

wprzytula added 7 commits May 9, 2025 06:54
`Coordinator` represents the coordinator of a CQL request. It's intended
to be built by `Session` and `QueryPager` upon requests issued to a
given target (node+shard), and exposed in both `QueryResult` and
`QueryPager`.
In order to keep structs that are going to contain Coordinator implement
auto traits (in our case, UnwindSafe and RefUnwindSafe are important),
we need to have all their substructs implement those traits, too.
Otherwise, we would break the public API by implicitly removing
implementation of those traits from structs that are going to hold
Coordinator.

Coordinator contains Node, which contains NodeConnectionPool, which in
turn contains tokio::mpsc::Sender. This implements UnwindSafe and
RefUnwindSafe only since tokio 1.40, so we bump tokio dependency.
When two years ago I added the `endpoint` field to `NodeConnectionPool`,
I forgot to include it in its custom `Debug` impl. Now it's fixed.
These implementations are a temporary solution to the following problem:
`QueryResult` used to implement `(Ref)UnwindSafe`, but then we wanted it
to store a reference to `Node`. This, transitively, made it store
`NodeConnectionPool`, which did not implement those traits.
Thus, they would no longer be auto-implemented for `QueryResult`,
breaking the public API. Not to introduce an API breakage in a minor
release, we decided to manually hint that `NodeConnectionPool` is indeed
unwind-safe. Even if we are wrong and the hint is misleading,
the documentation of those traits, not being `unsafe` traits, considers
them merely guidelines, not strong guarantees.
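A self-contained sketch of the mechanism, with stand-in types rather than the driver's own (`Pool` and this `QueryResult` are illustrative): a `Cell` inside `Pool` prevents `RefUnwindSafe` from being auto-implemented, which would in turn make `Arc<Pool>` and any struct holding it lose `UnwindSafe`. A manual impl restores it.

```rust
use std::cell::Cell;
use std::panic::{RefUnwindSafe, UnwindSafe};
use std::sync::Arc;

/// Stand-in for a pool whose internals are not `RefUnwindSafe` (here: a `Cell`).
pub struct Pool {
    pub requests: Cell<u32>,
}

// Manual hint. Without this line, `Arc<Pool>` would not be `UnwindSafe`,
// and neither would any struct that holds it.
impl RefUnwindSafe for Pool {}

/// Stand-in for a public result type that transitively holds the pool.
pub struct QueryResult {
    pub pool: Arc<Pool>,
}

/// Compiles only if `T` implements both traits.
pub fn assert_unwind_safe<T: UnwindSafe + RefUnwindSafe>() {}

/// Compile-time check that the public type keeps its auto traits.
pub fn check() {
    assert_unwind_safe::<QueryResult>();
}
```

A compile-time check like `check()` can guard against accidentally dropping those auto traits again in a later refactor.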
`QueryResult` now contains `Option<Coordinator>`. For this,
`NonErrorQueryResponse::into_query_result()` now accepts `Coordinator`.
It is constructed in `Session::run_request()` for each target from
`Plan` that is tried.

Note that `QueryResult::request_coordinator()` returns `&Coordinator`
even though `QueryResult` may possibly hold `None`. If `None` is there,
the driver panics when the getter is called. Rationale: some private
`Connection` APIs (namely, `query_raw_unpaged()`) return `QueryResult`
even though they don't have access to `Node`, which is required to
construct a `Coordinator`. Even if we redesigned this particular API
to be able to create a `Coordinator`, we would still be left with
the fundamental chicken-and-egg problem that occurs when we use
the `Connection` APIs on the control connection to issue the initial
metadata fetch. In such a case we have 0 chance to have the `Node`
accessible (as it's only constructed based on the first metadata fetch).

To solve the problem, those APIs construct `QueryResult` using a
dedicated constructor: `new_with_unknown_coordinator()`.
It should be used only in APIs that do not leak `QueryResult` to the
user, so the panic is not triggered.
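The shape of this design can be sketched as follows. The types are simplified stand-ins, and the constructor name `new` and the panic message are assumptions; only `new_with_unknown_coordinator()` and `request_coordinator()` are names taken from the commit message.

```rust
use std::net::SocketAddr;

/// Simplified stand-in for the real `Coordinator`.
#[derive(Debug, Clone, PartialEq)]
pub struct Coordinator {
    pub connection_address: SocketAddr,
}

/// Simplified stand-in for the real `QueryResult`.
pub struct QueryResult {
    request_coordinator: Option<Coordinator>,
}

impl QueryResult {
    /// Used on the regular path, where the coordinator is known.
    pub fn new(coordinator: Coordinator) -> Self {
        Self { request_coordinator: Some(coordinator) }
    }

    /// Driver-private constructor for results that never reach the user.
    pub(crate) fn new_with_unknown_coordinator() -> Self {
        Self { request_coordinator: None }
    }

    /// Panics if the result was built without a coordinator, which should
    /// only be possible for driver-internal results.
    pub fn request_coordinator(&self) -> &Coordinator {
        self.request_coordinator
            .as_ref()
            .expect("coordinator requested on a driver-internal QueryResult")
    }
}
```

The user-facing getter stays free of `Option`, at the cost of a runtime panic if an internal result ever leaks.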
Similarly to `QueryResult` (and `QueryRowsResult`), `QueryPager`
(as well as `TypedRowStream`) stores coordinators of CQL requests it
performed.
In the special case of `Connection::{query,execute}_iter()`, Coordinator
is unknown. Fortunately, we can simply leave the `Vec<Coordinator>`
empty, returning empty Iterator when asked.
`make static` didn't check the code when built with `cpp_rust_unstable`
cfg parameter.
@wprzytula wprzytula force-pushed the store-request-coordinator branch from 19beb7a to b5a744e Compare May 9, 2025 05:28
@wprzytula
Collaborator Author

v1.3:

  • Moved Coordinator to response/coordinator.rs.

@wprzytula wprzytula requested review from Lorak-mmk and muzarski May 9, 2025 05:30
@wprzytula wprzytula merged commit 32d179c into scylladb:main May 9, 2025
12 checks passed
@wprzytula wprzytula deleted the store-request-coordinator branch May 9, 2025 09:40
@wprzytula wprzytula mentioned this pull request May 27, 2025
Labels
cpp-rust-driver-p1 Functionality required by cpp-rust-driver enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose request coordinator id (host_id, address) in QueryResult
3 participants