
Welcome node receives wrong notification from validation nodes #1665


Open
Neylix opened this issue Mar 17, 2025 · 3 comments
Labels
bug Something isn't working

Comments

@Neylix
Member

Neylix commented Mar 17, 2025

Describe the problem you discovered

In certain edge cases, a validation node might become isolated (e.g., due to network issues) and fail to communicate with other validation nodes. If so, the isolated node may trigger a validation timeout due to missing updates from the coordinator or cross-validation nodes.

When the timeout occurs, the isolated node reports a transaction timeout to the Welcome Node, which notifies the client of a timeout error. However, if other validation nodes successfully process the transaction, the transaction may still be finalized on the network. This creates a discrepancy: the client receives a false timeout error despite the transaction ultimately succeeding.

Describe the solution you'd like

We should find a way to handle an isolated node reporting a wrong error to the welcome node.
A possible solution could be the following:
When the welcome node sends the transaction to the validation nodes, it counts the number of positive responses from them. If it then receives a validation error while more than one validation node responded positively, the welcome node should wait until it has received at least two errors before reporting a failure.

We should also handle this behavior when a transaction is forwarded by the welcome node.
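
A minimal sketch of the counting logic above, assuming hypothetical reply shapes (`{:ok, node_key}` for a positive acknowledgement, `{:error, node_key, reason}` for a reported validation error) and an error quorum of 2; the real messages exchanged with validation nodes differ:

```elixir
defmodule WelcomeNode.ValidationTracker do
  @moduledoc """
  Hypothetical sketch: tally replies from validation nodes and only surface
  an error to the client once enough of them agree.
  """

  # Require at least 2 errors when more than one node acknowledged the transaction.
  @error_quorum 2

  defstruct positives: 0, errors: []

  def handle_reply(%__MODULE__{} = state, {:ok, _node_key}) do
    {:continue, %{state | positives: state.positives + 1}}
  end

  def handle_reply(%__MODULE__{} = state, {:error, node_key, reason}) do
    state = %{state | errors: [{node_key, reason} | state.errors]}

    cond do
      # With at most one positive acknowledgement, a single error is conclusive.
      state.positives <= 1 ->
        {:notify_client, {:error, reason}, state}

      # Otherwise wait for a quorum of errors before surfacing a failure.
      length(state.errors) >= @error_quorum ->
        {:notify_client, {:error, reason}, state}

      true ->
        {:continue, state}
    end
  end
end
```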

Epic

No response

Neylix added the bug label on Mar 17, 2025
@Bantarus

Okay, let's break down this issue and outline a solution based on the analysis of the codebase structure.

Problem Recap:

The core issue is that the Welcome Node (WN), which communicates with the client, can receive premature and potentially incorrect timeout errors from individual Validation Nodes (VNs) that become isolated during the DistributedWorkflow managed by the Mining component. Even if the transaction succeeds because a quorum of other VNs completes the process, the client might be told it timed out.

Analysis & Relevant Components:

  • Mining Component: Specifically, the lib/archethic/mining/distributed_workflow.ex module is central. It orchestrates the multi-node validation process involving a Coordinator node and multiple VNs. It handles communication, context sharing, cross-validation, and timeout logic.
  • P2P Component: Provides the underlying communication infrastructure. Network issues handled by P2P can lead to the isolation scenario.
  • Welcome Node (Implicit Role): The node that initially receives the transaction from the client and likely kicks off the Mining.start/5 process. It's responsible for reporting the final outcome back to the client.

Proposed Solution: Coordinator-Centric Status Reporting

The most robust way to handle this is to make the Coordinator node within the DistributedWorkflow the sole authority for determining and reporting the final transaction status back to the Welcome Node. Individual VNs should report their status (success, timeout, or error) only to the Coordinator.

Here's how it would work:

  1. VN Reporting to Coordinator:

    • If a VN successfully validates its part, it sends its CrossValidationStamp to the Coordinator (as it likely does now).
    • If a VN encounters an internal timeout (e.g., waiting for context from the Coordinator or messages from other VNs), it sends a specific {:vn_timeout, vn_pub_key} message to the Coordinator.
    • If a VN definitively fails validation for a reason other than timeout, it sends {:vn_validation_error, vn_pub_key, reason} to the Coordinator.
    • Crucially: VNs do not directly report timeouts or intermediate failures back to the Welcome Node.
  2. Coordinator Aggregation and Decision Logic:

    • The Coordinator collects responses from all assigned VNs.
    • It waits until either:
      • It receives responses from a predefined quorum (e.g., > 2/3) of VNs.
      • A global timeout for the coordination process is reached.
    • Decision (a code sketch follows this list):
      • Success: If the Coordinator receives a quorum of valid CrossValidationStamps, it aggregates them into the final ValidationStamp, distributes the stamp as necessary, and reports :success (with the final stamp/result) to the Welcome Node.
      • Validation Failure: If the Coordinator receives a quorum of {:vn_validation_error, ...} messages, it aggregates the reasons and reports :validation_failure to the Welcome Node.
      • Timeout Failure: If the Coordinator itself times out waiting for a quorum of any kind of response (success, error, or VN timeout), only then does it report a definitive :timeout_failure to the Welcome Node. This signifies that the overall process could not conclude.
  3. Welcome Node Reporting to Client:

    • The Welcome Node (or the process that initiated the mining workflow) waits for only one definitive status report (:success, :validation_failure, or :timeout_failure) from the Coordinator process.
    • It relays this final, aggregated status to the client.
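
A minimal Elixir sketch of the Coordinator tallying described in step 2, treating the message shapes (`{:cross_validation_stamp, ...}`, `{:vn_timeout, ...}`, `{:vn_validation_error, ...}`) and the > 2/3 quorum as assumptions rather than the node's current protocol:

```elixir
defmodule Coordinator.StatusAggregator do
  @moduledoc """
  Hypothetical sketch of the quorum decision: collect one reply per
  validation node, then derive a single final status for the Welcome Node.
  """

  defstruct expected: 0, stamps: [], timeouts: [], errors: []

  def record(%__MODULE__{} = state, {:cross_validation_stamp, stamp}),
    do: decide(%{state | stamps: [stamp | state.stamps]})

  def record(%__MODULE__{} = state, {:vn_timeout, node_key}),
    do: decide(%{state | timeouts: [node_key | state.timeouts]})

  def record(%__MODULE__{} = state, {:vn_validation_error, node_key, reason}),
    do: decide(%{state | errors: [{node_key, reason} | state.errors]})

  # :timeout_failure is only emitted when the Coordinator's own timer fires.
  def global_timeout(%__MODULE__{} = state), do: {:final, :timeout_failure, state}

  defp decide(%__MODULE__{expected: expected} = state) do
    quorum = div(expected * 2, 3) + 1

    cond do
      length(state.stamps) >= quorum -> {:final, {:success, state.stamps}, state}
      length(state.errors) >= quorum -> {:final, {:validation_failure, state.errors}, state}
      true -> {:waiting, state}
    end
  end
end
```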

Benefits:

  • Accuracy: The client receives the actual outcome determined by the quorum, not a potentially misleading error from an isolated minority node.
  • Centralized Logic: The decision logic is consolidated within the Coordinator's role in the DistributedWorkflow, making it easier to manage and reason about.
  • Handles Isolation: An isolated VN reporting a timeout to the Coordinator will simply be one data point; it won't prematurely cause failure unless a quorum of nodes times out or fails.

Implementation Steps (Conceptual):

  1. Modify lib/archethic/mining/distributed_workflow.ex:
    • Adjust the VN's state machine/logic to send specific vn_timeout or vn_validation_error messages to the Coordinator upon encountering such issues. Remove any logic that reports these intermediate failures directly back to the original requester (Welcome Node).
    • Enhance the Coordinator's state machine/logic to:
      • Tally the different types of responses received from VNs (CrossValidationStamp, vn_timeout, vn_validation_error).
      • Implement the quorum-based decision logic described above.
      • Introduce a Coordinator-level timeout for receiving a quorum of responses.
      • Send a single, final status message back to the process that initiated the workflow (likely the Welcome Node's process).
  2. Modify lib/archethic/mining.ex (Potentially):
    • Ensure the function that starts the workflow (start/5 or similar) is set up to receive and correctly interpret the single, final status message from the DistributedWorkflow's Coordinator process.
  3. Adapt Welcome Node Logic:
    • Ensure the code path handling the client request waits for the final status from the Mining component before replying to the client.

This approach directly addresses the discrepancy by ensuring the final status reported reflects the consensus of the validation group, not the potentially erroneous state of an isolated participant. It should apply correctly whether the transaction was submitted directly or forwarded.
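
As an illustration of the first implementation step, a hypothetical VN-side handler that routes an internal timeout to the Coordinator instead of the Welcome Node; the message shape and the `notify/2` helper are placeholders, not the module's current API:

```elixir
# Hypothetical sketch of the VN-side change: an internal timeout becomes one
# data point in the Coordinator's tally instead of an error sent to the client.
defmodule ValidationNode.TimeoutHandler do
  def handle_internal_timeout(coordinator_node, vn_public_key) do
    # Before: the timeout was reported straight to the Welcome Node.
    # After: only the Coordinator is informed.
    notify(coordinator_node, {:vn_timeout, vn_public_key})
    :ok
  end

  # Placeholder for the P2P send primitive the workflow actually uses.
  defp notify(_node, _message), do: :ok
end
```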

(Analysis made with gemini-2.5-pro-exp-03-25)

@samuelmanzanera
Member

This seems interesting; however, it introduces a point of centralization, since the coordinator alone determines the validity or timeout status of the transaction. I think @Neylix's intention was a bit more decentralized: it's up to the welcome node to decide whether the tx should be timed out.
If we want to leverage the coordinator synchronization, we could take the timeout signatures and aggregate them, which would give the welcome node more to trust. It would then be able to accept a coordinator timeout aggregation based on the number of validator signatures.
Another pitfall of synchronizing timeouts through the coordinator is the coordinator's availability: if that node goes down, it becomes a single point of failure, whereas @Neylix's approach provides more fault tolerance.
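
A rough sketch of that signature-aggregation idea, assuming a hypothetical `{:vn_timeout, tx_address}` payload signed by each validator and a placeholder `valid_signature?/3` standing in for the node's real crypto primitives:

```elixir
defmodule WelcomeNode.TimeoutProof do
  @moduledoc """
  Hypothetical sketch: the Coordinator forwards the individual timeout
  signatures it collected, and the Welcome Node accepts the aggregated
  timeout only if enough distinct authorized validators signed it.
  """

  def accept_timeout?(tx_address, signed_timeouts, authorized_keys, quorum) do
    payload = :erlang.term_to_binary({:vn_timeout, tx_address})

    valid_signers =
      signed_timeouts
      |> Enum.filter(fn {validator_key, signature} ->
        validator_key in authorized_keys and valid_signature?(payload, signature, validator_key)
      end)
      |> Enum.uniq_by(fn {validator_key, _signature} -> validator_key end)
      |> length()

    valid_signers >= quorum
  end

  # Placeholder for the node's real signature verification.
  defp valid_signature?(_payload, _signature, _public_key), do: true
end
```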

@Bantarus

Hi @samuelmanzanera,
First, I hope you are doing well (not Gemini this time x) ).
If the coordinator sync is already happening, we could combine both approaches:

  • Nominal scenario: the WN waits for the Coordinator's final (or intermediary) consensus status response, making it easier for the WN to follow a consolidated overall status.
  • Fallback: if the WN detects a timeout from the Coordinator, or the Coordinator becomes unavailable for any reason, the WN still waits a reasonable time for responses from the VNs (Storage Nodes?). If a quorum of VNs validates the transaction, the status is success; otherwise the WN considers either that the transaction failed (consensus could not be reached) or that the Coordinator became unavailable before it could build the Validation Stamp and compute the replication tree for the cross-validation nodes.

Otherwise, if changing the coordinator sync responsibilities is not desired, I can propose an implementation of the original solution proposed by @Neylix.

Workflow summary:

  1. Coordinator Role (Primary Signal):

    • The Coordinator performs its existing duties (Steps 4, 5), aggregates VN responses (CrossValidationStamp, internal vn_timeout, vn_validation_error), and determines a primary status based on quorum rules (:success, :validation_failure, or :timeout_failure).
    • If a decision is reached, the Coordinator sends this single, definitive status to the WN.
  2. VN Role (Fallback Confirmation):

    • Following Step 7 (Atomic Commitment check via peer CrossValidationStamp exchange), if a VN locally determines that consensus has been reached, it sends a confirmation message (e.g., {:consensus_reached, tx_hash}) directly to the original WN.
  3. WN Decision Logic (a code sketch follows this list):

    • The WN starts an overall timer (T_total) for the client request.
    • It listens for the first definitive signal to report back:
      • Receives Coordinator :success -> Report Success.
      • Receives Coordinator :validation_failure -> Report Failure.
      • Receives Quorum of VN {:consensus_reached} messages -> Report Success. (This acts as the fallback).
      • Receives Coordinator :timeout_failure -> Record it, but continue waiting until T_total. This signal alone isn't final proof of failure.
    • If T_total expires:
      • If a definitive signal was already received, the status is already reported.
      • If no definitive signal received, but Coordinator :timeout_failure was recorded -> Report Timeout (Coordinator timed out, and VN fallback confirmation didn't arrive either).
      • If no definitive signal received, and Coordinator :timeout_failure was not recorded -> Report Timeout/Unknown (General timeout, WN never heard back definitively).
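
A condensed sketch of the WN decision table above, with assumed signal shapes (`{:coordinator, status}`, `{:vn_consensus, node_key}`), a placeholder VN quorum, and the T_total expiry modeled as a `:t_total_expired` signal:

```elixir
defmodule WelcomeNode.DecisionLogic do
  @moduledoc """
  Hypothetical sketch of the combined approach: trust the Coordinator's
  definitive statuses, fall back to VN consensus confirmations, and only
  report a timeout once T_total expires.
  """

  defstruct vn_quorum: 3, vn_confirmations: MapSet.new(), coordinator_timeout?: false

  def handle_signal(state, {:coordinator, :success}), do: {:report, :success, state}
  def handle_signal(state, {:coordinator, :validation_failure}), do: {:report, :failure, state}

  # A Coordinator timeout alone is not proof of failure; just record it.
  def handle_signal(state, {:coordinator, :timeout_failure}),
    do: {:keep_waiting, %{state | coordinator_timeout?: true}}

  # Fallback: enough VNs confirmed that atomic commitment was reached.
  def handle_signal(state, {:vn_consensus, node_key}) do
    state = %{state | vn_confirmations: MapSet.put(state.vn_confirmations, node_key)}

    if MapSet.size(state.vn_confirmations) >= state.vn_quorum do
      {:report, :success, state}
    else
      {:keep_waiting, state}
    end
  end

  # T_total expired without any definitive signal.
  def handle_signal(state, :t_total_expired) do
    if state.coordinator_timeout? do
      {:report, :timeout, state}
    else
      {:report, :timeout_unknown, state}
    end
  end
end
```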

Note: I'm using this issue to experimentally check Gemini's awareness of the archethic-node context, thanks to the knowledge I feed it, in an attempt to integrate an AI workflow into the resolution of archethic-node GitHub issues. If you consider this experiment unwelcome or ineffective, let me know and I will not go further.
