Handle `ChainClientError`s in `communicate_with_quorum`. #2871
Conversation
.update_local_node_with_blobs_from(blob_ids, remote_node)
.await
{
    warn!("Error updating local node with blobs: {error}");
So we don't propagate this error anymore? Why is that safe?
Before, we were converting all kinds of errors (including local ones) that would happen in `update_local_node_with_blobs_from` into `NodeError`s.
I agree this here isn't ideal either. I can look into this more closely tomorrow and see if there's a way to clearly distinguish local vs. remote errors here.
Wait, weren't we converting errors from `update_local_node_with_blobs_from` to `ChainClientError`s? Wasn't what we were doing fine already (propagating the error)? 🤔
But then in `synchronize_chain_state` we converted them all to `NodeError`s, regardless of whether they were actually the remote node's fault or our own. Now that we don't do that anymore, we need to be very confident that any error other than `NodeError`s returned here can't be triggered by a single faulty validator.
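For context, a minimal sketch of the distinction under discussion; the types below are hypothetical simplifications for illustration, not the actual linera-core definitions:

```rust
// Hypothetical, simplified types for illustration; not the actual
// linera-core definitions.
#[derive(Debug)]
struct NodeError(String); // an error a remote validator can cause

#[derive(Debug)]
enum ChainClientError {
    RemoteNodeError(NodeError),
    InternalError(String), // a local failure no validator can trigger
}

// Before: every failure was coerced into a NodeError, so a local storage
// problem became indistinguishable from a misbehaving validator.
fn before(result: Result<(), ChainClientError>) -> Result<(), NodeError> {
    result.map_err(|err| NodeError(format!("{err:?}")))
}

// After: the ChainClientError is propagated unchanged, so callers can
// tell whether a single faulty validator could have caused it.
fn after(result: Result<(), ChainClientError>) -> Result<(), ChainClientError> {
    result
}
```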
I think this will be quite a large PR in itself, so I created #2875 for it.
linera-core/src/updater.rs (outdated)
// Find the missing blobs locally and retry.
let required = match certificate.value() {
    CertificateValue::ConfirmedBlock { executed_block, .. }
    | CertificateValue::ValidatedBlock { executed_block, .. } => {
        executed_block.required_blob_ids()
    }
    CertificateValue::Timeout { .. } => HashSet::new(),
};
for blob_id in blob_ids {
    if !required.contains(blob_id) {
        warn!(
            "validator requested blob {:?} but it is not required",
            blob_id
        );
        return Err(NodeError::UnexpectedEntriesInBlobsNotFound.into());
    }
}
let unique_missing_blob_ids = blob_ids.iter().cloned().collect::<HashSet<_>>();
if blob_ids.len() > unique_missing_blob_ids.len() {
    warn!("blobs requested by validator contain duplicates");
    return Err(NodeError::DuplicatesInBlobsNotFound.into());
}
This part isn't mentioned in the PR description ;)
Right, this was moved out of `LocalNode::find_missing_blobs`, because these ones are the fault of the remote node.
linera-core/src/updater.rs (outdated)
    .await;

match &result {
    Err(original_err @ NodeError::BlobsNotFound(blob_ids)) => {
        // Find the missing blobs locally and retry.
I know this isn't done in this PR, but I'm not sure I understand the flow here: we try `handle_certificate`, but if it returns an error about missing blobs, we look for them locally (?) and retry. If they are found locally, then why would the first call fail with the error at all?
We make the remote node handle the certificate. If the remote node is missing blobs, we try to find them in the local node and send them along.
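A rough sketch of that flow; all types and method names here are simplified stand-ins, not the real linera-core APIs:

```rust
// Hypothetical sketch of the retry flow discussed above.
type BlobId = u64;
type Blob = Vec<u8>;

#[derive(Clone)]
struct Certificate;

enum NodeError {
    BlobsNotFound(Vec<BlobId>),
    Other(String),
}

struct RemoteNode;
struct LocalNode;

impl RemoteNode {
    // Ask the validator to process the certificate, optionally shipping
    // blobs it reported missing on an earlier attempt.
    fn handle_certificate(&self, _cert: Certificate, _blobs: Vec<Blob>) -> Result<(), NodeError> {
        Ok(())
    }
}

impl LocalNode {
    // Look up blobs in our own storage.
    fn read_blobs(&self, ids: &[BlobId]) -> Result<Vec<Blob>, String> {
        Ok(ids.iter().map(|_| Vec::new()).collect())
    }
}

fn send_certificate(
    remote: &RemoteNode,
    local: &LocalNode,
    cert: Certificate,
) -> Result<(), NodeError> {
    match remote.handle_certificate(cert.clone(), Vec::new()) {
        // The remote node lacks blobs the certificate depends on: fetch
        // them from the local node and retry once, sending them along.
        Err(NodeError::BlobsNotFound(ids)) => {
            let blobs = local.read_blobs(&ids).map_err(NodeError::Other)?;
            remote.handle_certificate(cert, blobs)
        }
        other => other,
    }
}
```

If the retry still fails, the failure can legitimately be attributed to the remote node.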
Force-pushed from 2ca4ad8 to 6834046 ("Handle `ChainClientError`s in `communicate_with_quorum`.").
Force-pushed from 6834046 to 9ad2d2c.
Closing this for now, since it's partly done and otherwise has lots of merge conflicts; will revisit later.
Motivation

`communicate_with_quorum` currently takes an (async) call operator that returns a `Result<_, NodeError>`. However, we pass in tasks that use the local node, too, so not all potential errors are `NodeError`s.

Proposal

Make `communicate_with_quorum` deal with `ChainClientError`s instead. Internally, only `NodeError`s are counted by vote weight; other errors cause the call to fail immediately.

Also don't convert every `ChainClientError` into a `NodeError` anymore. Instead, distinguish whether this could possibly be the fault of the remote node (then it must be a `NodeError`) or not.

Finally, I'm moving `handle_optimized_certificate` to `RemoteNode` and `next_block_heights` to `LocalNode`, because each of them only uses that component. And the checks for the remote node's blob errors are moved out of `LocalNode::find_missing_blobs`.
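To make the fail-fast behavior concrete, here is a minimal sketch of the proposed logic. The types, weights, and quorum rule below are simplified stand-ins for illustration, not the actual linera-core implementation:

```rust
// Minimal sketch of the proposed error handling; not the real code.
#[derive(Debug)]
enum NodeError {
    InvalidVote,
}

#[derive(Debug)]
enum ChainClientError {
    RemoteNodeError(NodeError), // possibly a faulty validator's doing
    InternalError(String),      // a local failure; fail fast on these
}

/// Collects responses until a weighted quorum is reached. `NodeError`s are
/// tolerated and counted by the responding validator's weight; any other
/// `ChainClientError` aborts the whole call immediately.
fn communicate_with_quorum(
    responses: Vec<(u64, Result<u64, ChainClientError>)>, // (weight, result)
) -> Result<Vec<u64>, ChainClientError> {
    let total: u64 = responses.iter().map(|(weight, _)| weight).sum();
    let quorum = total * 2 / 3 + 1;
    let (mut ok_weight, mut faulty_weight) = (0u64, 0u64);
    let mut values = Vec::new();
    for (weight, result) in responses {
        match result {
            Ok(value) => {
                ok_weight += weight;
                values.push(value);
                if ok_weight >= quorum {
                    return Ok(values);
                }
            }
            Err(ChainClientError::RemoteNodeError(err)) => {
                // Possibly the remote node's fault: tolerate it until so
                // much weight has failed that a quorum is impossible.
                faulty_weight += weight;
                if faulty_weight > total - quorum {
                    return Err(ChainClientError::RemoteNodeError(err));
                }
            }
            // A local error: don't blame validators for it; abort at once.
            Err(err) => return Err(err),
        }
    }
    Err(ChainClientError::InternalError("no quorum reached".into()))
}
```

The key point is the last match arm: an `InternalError` surfaces to the caller untouched instead of being miscounted as validator misbehavior.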
Test Plan
CI should catch regressions.
Release Plan
Links
`communicate_with_quorum` #2857