Skip to content

Handle ChainClientErrors in communicate_with_quorum. #2871

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

afck
Copy link
Contributor

@afck afck commented Nov 12, 2024

Motivation

communicate_with_quorum currently takes a call operator that returns a Result<_, NodeError> (async). However, we pass in tasks that use the local node, too, so not all potential errors are NodeErrors.

Proposal

Make communicate_with_quorum deal with ChainClientErrors instead. Internally, only NodeErrors are counted by vote weight; other errors cause the call to fail immediately.

Also don't convert every ChainClientError into a NodeError anymore. Instead, distinguish whether this could possibly be the fault of the remote node (then it must be a NodeError) or not.

Finally, I'm moving handle_optimized_certificate to RemoteNode and next_block_heights to LocalNode, because they only use that, respectively. And the checks for the remote node's blob errors are moved out of LocalNode::find_missing_blobs.

Test Plan

CI should catch regressions.

Release Plan

  • Nothing to do / These changes follow the usual release cycle.

Links

@afck afck marked this pull request as ready for review November 12, 2024 16:05
.update_local_node_with_blobs_from(blob_ids, remote_node)
.await
{
warn!("Error updating local node with blobs: {error}");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we don't propagate this error anymore? Why is that safe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before, we were converting all kinds of errors (including local ones) that would happen in update_local_node_with_blobs_from into NodeErrors.

I agree this here isn't ideal either. I can look into this more closely tomorrow and see if there's a way to clearly distinguish local vs. remote errors here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, weren't we converting errors from update_local_node_with_blobs_from to ChainClientErrors? Wasn't what we were doing fine already (propagating the error)? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But then in synchronize_chain_state we converted them all to NodeErrors, regardless of whether they were actually the remote node's fault or our own. Now that we don't do that anymore, we need to be very confident that any error other than NodeErrors returned here can't be triggered by a single faulty validator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will be quite a large PR in itself, so I created #2875 for it.

Comment on lines 213 to 236
// Find the missing blobs locally and retry.
let required = match certificate.value() {
CertificateValue::ConfirmedBlock { executed_block, .. }
| CertificateValue::ValidatedBlock { executed_block, .. } => {
executed_block.required_blob_ids()
}
CertificateValue::Timeout { .. } => HashSet::new(),
};
for blob_id in blob_ids {
if !required.contains(blob_id) {
warn!(
"validator requested blob {:?} but it is not required",
blob_id
);
return Err(NodeError::UnexpectedEntriesInBlobsNotFound.into());
}
}
let unique_missing_blob_ids = blob_ids.iter().cloned().collect::<HashSet<_>>();
if blob_ids.len() > unique_missing_blob_ids.len() {
warn!("blobs requested by validator contain duplicates");
return Err(NodeError::DuplicatesInBlobsNotFound.into());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part isn't mentioned in the PR description ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this was moved out of LocalNode::find_missing_blobs, because these ones are the fault of the remote node.

.await;

match &result {
Err(original_err @ NodeError::BlobsNotFound(blob_ids)) => {
// Find the missing blobs locally and retry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this isn't done in this PR but I'm not sure I understand the flow here – we try to handle_certificate but if it returns an error about missing blobs, we look for them locally (?) and retry. If they are found locally, then why would the first call fail with the error at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We make the remote node handle the certificate. If the remote node is missing blocks, we try to find them in the local node and send them along.

@afck afck force-pushed the communicate-errors branch 2 times, most recently from 2ca4ad8 to 6834046 Compare November 13, 2024 10:32
@afck afck changed the title Communicate errors Handle ChainClientErrors in communicate_with_quorum. Nov 13, 2024
@afck afck force-pushed the communicate-errors branch from 6834046 to 9ad2d2c Compare November 14, 2024 12:09
@afck
Copy link
Contributor Author

afck commented Dec 4, 2024

Closing this for now, since it's partly done and otherwise has lots of merge conflicts; will revisit later.

@afck afck closed this Dec 4, 2024
@afck afck deleted the communicate-errors branch December 4, 2024 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Distinguish local and remote errors in communicate_with_quorum
3 participants