This repository was archived by the owner on Jun 26, 2023. It is now read-only.

fix: allow pubsub rpc to be processed concurrently #106

Merged 6 commits on Sep 20, 2021

Conversation

wemeetagain
Member

#103 subtly changed the pubsub rpc/message processing pipeline from fully concurrent to fully sequential. The change is easy to miss because the only part of processing that may be asynchronous is message validation. So in many cases, when message validation is not heavy, or when few messages are received, no delay in rpc/message processing is detectable.

This PR restores the original behavior of allowing rpc/messages to be processed concurrently, while taking care that errors are handled and will not propagate as a top-level unhandled exception.
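The idea can be sketched roughly like this (illustrative names, not the PR's actual code): processing is kicked off without being awaited, and a catch handler keeps a failed message from ever surfacing as an unhandled rejection.

```javascript
// Hedged sketch, not the PR's actual code: processRpc and onError are
// illustrative. The returned handler starts processing without awaiting
// it, so the read loop stays responsive, while the .catch() ensures a
// failed message never propagates as a top-level unhandled rejection.
function processConcurrently (processRpc, onError) {
  return (peerId, rpc) => {
    processRpc(peerId, rpc).catch((err) => onError(err, peerId))
  }
}
```

Because the returned handler never rejects, the caller's read loop keeps pulling messages while earlier ones are still being validated.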


@achingbrain achingbrain left a comment


LGTM, but it reintroduces unbounded concurrency, which means a remote user could swamp a process that has slow/expensive message handling. Please can you update this PR to use it-parallel (and make the parameter configurable) to let the user tune this behaviour, and then we can get this in.

@wemeetagain
Member Author

How can we use it-parallel to effectively curb the concurrency?
Just wrapping the stream in parallel doesn't seem like it will do anything, at least while each _processRpc is not awaited?

To allow processing pubsub messages that have steps that are slow but async, process the messages in a queue.

Makes the concurrency configurable, with a default of 10 messages processed concurrently.
@achingbrain
Member

@wemeetagain I think you're right: wrapping _processRpcMessage invocations in it-parallel won't help if _processRpc gets invoked repeatedly. I've opened ChainSafe#1 against the branch that makes up this PR; it uses a queue with configurable concurrency to process incoming messages across multiple RPC messages. Please let me know what you think.
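The shape of that approach can be sketched with a minimal hand-rolled queue (the actual PR uses a proper queue implementation; the names and the default of 10 here are illustrative):

```javascript
// Hedged sketch of a queue with configurable concurrency, not the PR's
// actual code. Tasks from any incoming RPC share the same bounded pool:
// at most `concurrency` tasks run at once, the rest wait their turn.
function createQueue (concurrency = 10) {
  const pending = []
  let active = 0

  const next = () => {
    if (active >= concurrency || pending.length === 0) {
      return
    }
    active++
    const { task, resolve, reject } = pending.shift()
    // Run the task, settle its caller's promise, then pull the next one
    task().then(resolve, reject).finally(() => {
      active--
      next()
    })
  }

  return {
    add (task) {
      return new Promise((resolve, reject) => {
        pending.push({ task, resolve, reject })
        next()
      })
    }
  }
}
```

Every message handler is pushed through `add()`, so slow validations queue up instead of all running at once.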

fix: allow pubsub rpc to be processed concurrently

@vasco-santos vasco-santos left a comment


this looks good to me!
@achingbrain any other input you would like to give or can I get this shipped?

@achingbrain
Member

All good! :shipit:

@vasco-santos
Member

thank you all for figuring this out!

@vasco-santos vasco-santos merged commit 52f96b3 into libp2p:master Sep 20, 2021
@dapplion

dapplion commented Dec 2, 2021

Sorry for the late comment but I have some concerns with this:

  • Is this queue bounded in capacity? If validation time is infinite (a theoretical example), will this queue cause an OOM?
  • In Lodestar we have our own queues, so this is a queue on top of a queue, adding more complexity and making this harder to reason about on our end.
  • Has this queue been benchmarked? What is the performance overhead of this queue?

Also, this introduces yet another dependency, with its associated npm security risks, performance costs, etc. I would love to see libp2p one day be minimally simple, with only dependencies maintained by the libp2p team plus the required crypto libs.

@wemeetagain
Member Author

Here's the implementation: https://github.com/sindresorhus/p-queue/blob/main/source/index.ts

  • It appears the capacity is not bounded.
  • Somewhat agree, but do you think there should be some ability to throttle message processing/validation? The original code, before #103 (fix: make tests more reliable), had no concurrency control at all and just processed every message. (And it would still OOM if validation time were infinite.)
  • I haven't seen a benchmark (and I'm not quite sure what the baseline comparison would be), though there is a benchmark here: https://github.com/sindresorhus/p-queue/blob/main/bench.ts

@achingbrain
Member

achingbrain commented Dec 2, 2021

validate is an async method. It has to be async because, as part of the flow, we verify the signature of the message against the sending key. This is not slow, so historically there's been no attempt to guard against the process getting overwhelmed.

The problem, I think, is that we allow the configurable message validation function to be async (perfectly reasonable on paper, since it's executed in the same flow as signature verification - we're already async at that point to verify the signature, so why not).

Correct me if I'm wrong, but as I understand it, in Lodestar the message validation function verifies that all blocks in the message are available to the application, which can take on the order of minutes or more.

I think the intention of message validation is just to say "this message is structurally correct", not to execute involved business logic that can then become a performance concern. If that's the requirement, it should happen after validation, in a suitable manner managed by the application.

I'd suggest:

  1. Make the configurable validation functions synchronous - this sends a clear signal to the developer that they should be fast
  2. Remove configurable validation altogether - again, any further message pre-processing that is required can be done by the application

We can then remove the queue introduced here and the extra complexity it introduces.

@dapplion

dapplion commented Dec 2, 2021

@achingbrain Some context of Lodestar spec requirements:

The eth2.0 spec forces us to have async validation functions that can potentially run for a very long time. For example, if you get a block that references a parent that's known, but whose associated state is not currently in the memory cache, Lodestar must read blocks from disk and replay the state transition function, which takes between 50-200ms per block. The whole process can be very long and must not be synchronous, since it would freeze the node, preventing it from attesting and performing other mandatory tasks.

The eth2 gossip spec requires us to do the above before deciding if this message is an ACCEPT, IGNORE or REJECT.

I think the intention of message validation is just to say "this message is structurally correct", not to execute involved business logic

This is not how eth2.0 gossipsub works, unfortunately.

@dapplion

dapplion commented Dec 2, 2021

From @wemeetagain message

Somewhat agree, but do you think there should be some ability to throttle message processing/validation? The original code, before #103 (fix: make tests more reliable), had no concurrency control at all and just processed every message. (And it would still OOM if validation time were infinite.)

Not really, because if the validation time is infinite, Lodestar's queues will fill up and drop incoming messages, preventing an OOM.

The problem with this queue at the libp2p level is that it makes subjective decisions with widely different trade-offs depending on what the app layer is.
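The drop-on-full behaviour described here can be sketched like this (illustrative names and defaults, not Lodestar's actual code):

```javascript
// Hedged sketch, not Lodestar's code: a queue that caps how much work can
// pile up. Once maxPending tasks are waiting, add() drops the message and
// returns false, so an arbitrarily slow validator bounds memory use
// instead of growing the queue until the process OOMs.
function createDroppingQueue ({ concurrency = 2, maxPending = 100 } = {}) {
  const pending = []
  let active = 0

  const next = () => {
    if (active >= concurrency || pending.length === 0) {
      return
    }
    active++
    const task = pending.shift()
    // Errors are swallowed here; a real implementation would report them
    task().catch(() => {}).finally(() => {
      active--
      next()
    })
  }

  return {
    add (task) {
      if (pending.length >= maxPending) {
        return false // queue full: drop the message rather than buffer it
      }
      pending.push(task)
      next()
      return true
    }
  }
}
```

The trade-off is exactly the subjective decision being debated: dropping messages is fine for Lodestar's workload, but another application might prefer backpressure or unbounded buffering.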

@wemeetagain
Member Author

The eth2 gossip spec requires us to do the above before deciding if this message is an ACCEPT, IGNORE or REJECT.

Yeah, it's required for the eth protocol, and imo an async validation function is a really nice feature to have in general. Any sort of crypto in the application's validation might be async.

The problem with this queue at the libp2p level is that it makes subjective decisions with widely different trade-offs depending on what the app layer is.

I think, ideally, whatever machinery is included in libp2p would be configurable enough (and performant enough) to suit most applications.
For some context, here's a look at what go-libp2p does (and why this PR doesn't seem so problematic, just incomplete):
https://github.com/libp2p/go-libp2p-pubsub/blob/master/validation.go - validation functions are goroutines that execute concurrently in a queue managed by a pool of workers, with configurable queue capacity, worker count, etc.

Also related: ChainSafe/js-libp2p-gossipsub#86: the go-libp2p approach signals to gossipsub when messages are throttled, for appropriate scoring.

But if a validation queue embedded in libp2p isn't workable, could we do something else that is less opinionated? Clearly there needs to be some management of incoming rpc/message/topic validation processing, or the node can easily be DoSed or killed. Can we expose that pipeline to applications somehow? Maybe use streams/async iterators, allowing users to wrap or replace parts of the pipeline?
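One hedged sketch of that last idea (illustrative only, not an actual libp2p API): express validation as an async-iterator transform over the message stream, so the application can wrap that stage with its own throttling, or replace it entirely.

```javascript
// Hedged sketch of the async-iterator idea, not an actual libp2p API.
// Validation becomes a transform over a stream of messages: messages the
// validator rejects are dropped, the rest flow through. An application
// could compose its own queueing or throttling around this stage.
async function * validateStream (source, validator) {
  for await (const msg of source) {
    try {
      await validator(msg)
      yield msg
    } catch {
      // validator rejected the message: drop it
    }
  }
}
```

Because it is just an async iterable, libp2p would make no queueing decisions at all; the consumer controls the pull rate.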
