Handle `ChainClientError`s in `communicate_with_quorum`. #2871

afck · 2024-11-12T15:43:26Z

Motivation

communicate_with_quorum currently takes a call operator that returns a Result<_, NodeError> (async). However, we pass in tasks that use the local node, too, so not all potential errors are NodeErrors.

Proposal

Make communicate_with_quorum deal with ChainClientErrors instead. Internally, only NodeErrors are counted by vote weight; other errors cause the call to fail immediately.

Also don't convert every ChainClientError into a NodeError anymore. Instead, distinguish whether this could possibly be the fault of the remote node (then it must be a NodeError) or not.

Finally, I'm moving handle_optimized_certificate to RemoteNode and next_block_heights to LocalNode, because they only use that, respectively. And the checks for the remote node's blob errors are moved out of LocalNode::find_missing_blobs.

Test Plan

CI should catch regressions.

Release Plan

Nothing to do / These changes follow the usual release cycle.

Links

deuszx · 2024-11-12T16:58:50Z

linera-core/src/client/mod.rs

+                        .update_local_node_with_blobs_from(blob_ids, remote_node)
+                        .await
+                    {
+                        warn!("Error updating local node with blobs: {error}");


So we don't propagate this error anymore? Why is that safe?

Before, we were converting all kinds of errors (including local ones) that would happen in update_local_node_with_blobs_from into NodeErrors.

I agree this here isn't ideal either. I can look into this more closely tomorrow and see if there's a way to clearly distinguish local vs. remote errors here.

Wait, weren't we converting errors from update_local_node_with_blobs_from to ChainClientErrors? Wasn't what we were doing fine already (propagating the error)? 🤔

But then in synchronize_chain_state we converted them all to NodeErrors, regardless of whether they were actually the remote node's fault or our own. Now that we don't do that anymore, we need to be very confident that any error other than NodeErrors returned here can't be triggered by a single faulty validator.

I think this will be quite a large PR in itself, so I created #2875 for it.

deuszx · 2024-11-12T17:02:00Z

linera-core/src/updater.rs

+                // Find the missing blobs locally and retry.
+                let required = match certificate.value() {
+                    CertificateValue::ConfirmedBlock { executed_block, .. }
+                    | CertificateValue::ValidatedBlock { executed_block, .. } => {
+                        executed_block.required_blob_ids()
+                    }
+                    CertificateValue::Timeout { .. } => HashSet::new(),
+                };
+                for blob_id in blob_ids {
+                    if !required.contains(blob_id) {
+                        warn!(
+                            "validator requested blob {:?} but it is not required",
+                            blob_id
+                        );
+                        return Err(NodeError::UnexpectedEntriesInBlobsNotFound.into());
+                    }
+                }
+                let unique_missing_blob_ids = blob_ids.iter().cloned().collect::<HashSet<_>>();
+                if blob_ids.len() > unique_missing_blob_ids.len() {
+                    warn!("blobs requested by validator contain duplicates");
+                    return Err(NodeError::DuplicatesInBlobsNotFound.into());
+                }


This part isn't mentioned in the PR description ;)

Right, this was moved out of LocalNode::find_missing_blobs, because these ones are the fault of the remote node.

deuszx · 2024-11-12T17:03:48Z

linera-core/src/updater.rs

            .await;

        match &result {
            Err(original_err @ NodeError::BlobsNotFound(blob_ids)) => {
+                // Find the missing blobs locally and retry.


I know this isn't done in this PR but I'm not sure I understand the flow here – we try to handle_certificate but if it returns an error about missing blobs, we look for them locally (?) and retry. If they are found locally, then why would the first call fail with the error at all?

We make the remote node handle the certificate. If the remote node is missing blocks, we try to find them in the local node and send them along.

afck · 2024-12-04T16:47:04Z

Closing this for now, since it's partly done and otherwise has lots of merge conflicts; will revisit later.

afck requested review from ma2bd, ndr-ds, deuszx and bart-linera November 12, 2024 16:05

afck marked this pull request as ready for review November 12, 2024 16:05

deuszx reviewed Nov 12, 2024

View reviewed changes

afck mentioned this pull request Nov 13, 2024

Get rid of get_blobs_not_found and unify blob errors #2649

Merged

afck force-pushed the communicate-errors branch 2 times, most recently from 2ca4ad8 to 6834046 Compare November 13, 2024 10:32

afck changed the title ~~Communicate errors~~ Handle ChainClientErrors in communicate_with_quorum. Nov 13, 2024

afck added 3 commits November 14, 2024 12:56

Return ChainClientError from communicate_with_quorum.

2f43ef5

Accept ChainClientError in communicate_with_quorum tasks.

f8df6b6

Add TODO comment.

9ad2d2c

afck force-pushed the communicate-errors branch from 6834046 to 9ad2d2c Compare November 14, 2024 12:09

Move missing blobs check into remote_node.

c546eca

afck mentioned this pull request Nov 14, 2024

Move more functions to the right module in linera-core. #2895

Merged

afck closed this Dec 4, 2024

afck deleted the communicate-errors branch December 4, 2024 16:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle `ChainClientError`s in `communicate_with_quorum`. #2871

Handle `ChainClientError`s in `communicate_with_quorum`. #2871

afck commented Nov 12, 2024 •

edited

Loading

deuszx Nov 12, 2024

afck Nov 12, 2024

ndr-ds Nov 13, 2024

afck Nov 13, 2024

afck Nov 13, 2024

deuszx Nov 12, 2024

afck Nov 12, 2024

deuszx Nov 12, 2024

afck Nov 12, 2024

afck commented Dec 4, 2024

Handle ChainClientErrors in communicate_with_quorum. #2871

Handle ChainClientErrors in communicate_with_quorum. #2871

Conversation

afck commented Nov 12, 2024 • edited Loading

Motivation

Proposal

Test Plan

Release Plan

Links

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

afck commented Dec 4, 2024

Handle `ChainClientError`s in `communicate_with_quorum`. #2871

Handle `ChainClientError`s in `communicate_with_quorum`. #2871

afck commented Nov 12, 2024 •

edited

Loading