feat(gossipsub): Add MessageBatch #607

Open: MarcoPolo wants to merge 9 commits into master from marco/batch-publishing

MarcoPolo
Contributor

to support batch publishing messages

Replaces #602.

Batch publishing lets the system know there are multiple related messages to be published, so it can prioritize sending different messages before sending additional copies of the same message. For example, with the default API, when you publish two messages A and B, under the hood A gets sent to D=8 peers before B gets sent at all. With this MessageBatch API we can send one copy of A and then one copy of B before sending further copies of either.

When a node has bandwidth constraints relative to the messages it is publishing this improves dissemination time.
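
To make the ordering difference concrete, here is a minimal, self-contained sketch. It is not the code in this PR: the `send` type and both functions are illustrative names, and the real router works on RPCs and per-peer queues rather than plain strings. It only shows how the send order changes between publishing messages one by one and publishing them as a batch.

```go
package main

import "fmt"

// send is one (message, peer) transmission. Both fields are illustrative labels.
type send struct {
	Msg  string
	Peer string
}

// sequentialOrder mimics publishing each message on its own: every copy of
// the first message is queued before any copy of the second.
func sequentialOrder(msgs, peers []string) []send {
	var out []send
	for _, m := range msgs {
		for _, p := range peers {
			out = append(out, send{Msg: m, Peer: p})
		}
	}
	return out
}

// batchedOrder queues one copy of every message before queuing any second
// copy. The peer index is offset by the message index so that the first
// copies of different messages go to different peers.
func batchedOrder(msgs, peers []string) []send {
	var out []send
	for round := 0; round < len(peers); round++ {
		for i, m := range msgs {
			p := peers[(round+i)%len(peers)]
			out = append(out, send{Msg: m, Peer: p})
		}
	}
	return out
}

func main() {
	msgs := []string{"A", "B"}
	peers := []string{"p1", "p2", "p3"}
	fmt.Println("sequential:", sequentialOrder(msgs, peers))
	fmt.Println("batched:   ", batchedOrder(msgs, peers))
}
```

In the batched order, a bandwidth-limited publisher that only manages to get two sends out has put both A and B on the network, whereas the sequential order would have sent two copies of A and none of B.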

For more context see this post: https://ethresear.ch/t/improving-das-performance-with-gossipsub-batch-publishing/21713

@MarcoPolo MarcoPolo force-pushed the marco/batch-publishing branch from 8e804a5 to 4059300 Compare April 25, 2025 03:13
Member

@raulk raulk left a comment

There's quite a bit of indirection here.

  1. There's a top-level NewMessageBatch API that accepts a gossipsub router. This seems inverted. Intuitively, I'd expect to initiate a message batch from a router (similar to badger's NewWriteBatch). It seems more idiomatic to me.
  2. The MessageBatch keeps track of pending RPCs, which are added by the GossipSubRouter. To propagate itself through the call stack, it sets an anonymous PublishOption wrapping itself.
  3. There's special casing in several spots in the GossipSubRouter to identify if this is a batch, unwrap the object, and do special things on it like queue the RPCs instead of actually sending them. This results in the paradox that a call to the router's gossipsub Publish doesn't actually publish anything under this mode.
  4. The fact that a Message now embeds a messageBatch is semantically counterintuitive.

All in all, I think this code is (a) hard to maintain, and (b) it exposes a confusing public API. I'm curious what alternative APIs you considered. I'd imagine this dance is a last resort to make it work under the current design/implementation constraints.

Did you consider:

  1. Extending the PubSub, PubSubRouter, GossipSubRouter, etc. type hierarchy with NewBatch and PublishBatch methods?
  2. Reorganizing the reusable code under GossipSubRouter#Publish so the relevant parts can be reused from PublishBatch?

I think this would simplify the whole thing.

For what it's worth, in @cskiraly's EthResearch post, he hinted at a distinction between message batches and queuing/diffusion disciplines. I think that distinction was lost a bit later, but I'd like to recover it here.

In my head:

  • A message batch is no more than an organizational unit to bundle a set of messages that the router learns about at once (without atomicity guarantees, hence not a Transaction). It does not imply a concrete queuing/diffusion/scheduling discipline.
  • The user should provide a queuing/diffusion/scheduling discipline when calling PublishBatch, in the form of a function or an interface implementation. This discipline encapsulates the conversion to RPCs and handles their subsequent dispatch. In other words, the code that currently lives under MessageBatch#Publish should be modular in itself.
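
A rough sketch of the separation described in the two bullets above might look as follows. Every name here (`Scheduler`, `SchedulerFunc`, `batch`, `router`, `PublishBatch`, `fifo`, `byPeer`) is hypothetical and chosen for illustration only; none of it is the existing go-libp2p-pubsub API, and the toy router just fans each message out to a fixed peer list.

```go
package main

import (
	"fmt"
	"sort"
)

// rpc is a stand-in for one outbound transmission; a real RPC type is richer.
type rpc struct {
	MsgID string
	Peer  string
}

// Scheduler is a hypothetical queuing discipline: it receives the RPCs
// planned for a batch and returns them in dispatch order.
type Scheduler interface {
	Order(planned []rpc) []rpc
}

// SchedulerFunc lets a plain function satisfy Scheduler.
type SchedulerFunc func([]rpc) []rpc

func (f SchedulerFunc) Order(planned []rpc) []rpc { return f(planned) }

// batch only bundles message IDs; it carries no scheduling policy.
type batch struct {
	msgIDs []string
}

func (b *batch) Add(id string) { b.msgIDs = append(b.msgIDs, id) }

// router is a toy router with a fixed peer list.
type router struct {
	peers []string
}

// PublishBatch plans one RPC per (message, peer) pair, lets the scheduler
// decide the order, and hands each RPC to the send callback.
func (r *router) PublishBatch(b *batch, s Scheduler, send func(rpc)) {
	var planned []rpc
	for _, id := range b.msgIDs {
		for _, p := range r.peers {
			planned = append(planned, rpc{MsgID: id, Peer: p})
		}
	}
	for _, out := range s.Order(planned) {
		send(out)
	}
}

// fifo keeps the planning order: all copies of the first message go first.
var fifo = SchedulerFunc(func(in []rpc) []rpc { return in })

// byPeer is a second toy discipline: a stable sort by peer, so the first
// peer receives every message before the second peer receives anything.
var byPeer = SchedulerFunc(func(in []rpc) []rpc {
	out := make([]rpc, len(in))
	copy(out, in)
	sort.SliceStable(out, func(i, j int) bool { return out[i].Peer < out[j].Peer })
	return out
})

func main() {
	r := &router{peers: []string{"p1", "p2"}}
	b := &batch{}
	b.Add("A")
	b.Add("B")
	r.PublishBatch(b, byPeer, func(out rpc) {
		fmt.Println(out.MsgID, "->", out.Peer)
	})
}
```

The point of the sketch is only that the batch stays a dumb container while planning, ordering, and dispatch all happen inside one call; swapping `byPeer` for `fifo` changes the dispatch order without touching the batch or the router.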

@MarcoPolo
Contributor Author

All in all, I think this code is (a) hard to maintain, and (b) it exposes a confusing public API. I'm curious what alternative APIs you considered. I'd imagine this dance is a last resort to make it work under the current design/implementation constraints.

On (a), you're the reviewer so I'll defer to you. I agree there are subtleties here that may introduce footguns in the future.

On (b), I disagree about this being a confusing public API. To be clear, this is the API users use:

batch, err := NewMessageBatch(pubsub)
// Handle err

for _, msg := range msgs {
  err := batch.Add(ctx, topic, msg)
  // Handle err
}
batch.Publish()

Whether this is NewMessageBatch(pubsub) or pubsub.NewMessageBatch() is a fairly minor point. Happy to change.

Whether this is batch.Publish or pubsub.PublishBatch also seems like a fairly minor point, but I'd prefer the former. It's also consistent with your example of WriteBatch.

Did you consider:

  1. Extending the PubSub, PubSubRouter, GossipSubRouter, etc. type hierarchy with NewBatch and PublishBatch methods?
  2. Reorganizing the reusable code under GossipSubRouter#Publish so the relevant parts can be reused from PublishBatch?

I did, but I feared it would end up as a larger refactor. It may be worth it anyways as it may generally improve the codebase and remove future footguns. I'll open a new PR along these lines.


For what it's worth, in @cskiraly's EthResearch post, he hinted at a distinction between message batches and queuing/diffusion disciplines. I think that distinction was lost a bit later, but I'd like to recover it here.

In my head:

  • A message batch is no more than an organizational unit to bundle a set of messages that the router learns about at once (without atomicity guarantees, hence not a Transaction). It does not imply a concrete queuing/diffusion/scheduling discipline.
  • The user should provide a queuing/diffusion/scheduling discipline when calling PublishBatch, in the form of a function or an interface implementation. This discipline encapsulates the conversion to RPCs and handles their subsequent dispatch. In other words, the code that currently lives under MessageBatch#Publish should be modular in itself.

An earlier draft of this PR allowed users to define the publish strategy. I tested via simulation the rarest-message-first strategy and the "shuffle" strategy from #602. Rarest-first performed better (which intuitively makes sense). To that end, I chose to keep the API simpler by only using the rarest-first strategy. We can always make this configurable later, but I think it would be a mistake to prematurely add this extension point now. This is not to say that the current rarest-first implementation is optimal, just that the inputs to the optimal solution may be non-obvious, and defining an extension point when we only have n=1 options is premature.
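
For readers unfamiliar with the idea, here is a minimal sketch of one way a rarest-first order could be computed. It assumes "rarest" means "the message with the fewest copies sent so far", which is an interpretation and not necessarily what this PR implements; the `pending` and `send` types and the `rarestFirst` function are illustrative names only.

```go
package main

import "fmt"

// pending tracks one message in the batch (illustrative type).
type pending struct {
	ID    string
	Peers []string // peers that still need a copy
	Sent  int      // copies already sent
}

// send is one (message, peer) transmission.
type send struct {
	Msg  string
	Peer string
}

// rarestFirst repeatedly picks the message with the fewest copies sent so
// far (ties go to the earliest message in the batch) and sends it to that
// message's next waiting peer, until no message has peers left.
func rarestFirst(msgs []pending) []send {
	// Work on a copy so the caller's counters are left untouched.
	queue := make([]pending, len(msgs))
	copy(queue, msgs)

	var out []send
	for {
		best := -1
		for i := range queue {
			if len(queue[i].Peers) == 0 {
				continue // nothing left to send for this message
			}
			if best == -1 || queue[i].Sent < queue[best].Sent {
				best = i
			}
		}
		if best == -1 {
			return out // every message has been sent to all of its peers
		}
		m := &queue[best]
		out = append(out, send{Msg: m.ID, Peer: m.Peers[0]})
		m.Peers = m.Peers[1:]
		m.Sent++
	}
}

func main() {
	order := rarestFirst([]pending{
		{ID: "A", Peers: []string{"p1", "p2"}},
		{ID: "B", Peers: []string{"p1", "p2"}},
		{ID: "C", Peers: []string{"p1"}, Sent: 1}, // already has one copy out
	})
	for _, s := range order {
		fmt.Println(s.Msg, "->", s.Peer)
	}
}
```

A linear scan is enough for a sketch; a real implementation with many messages would more likely use a priority queue, and might weigh other inputs than the send count.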

@vyzo
Collaborator

vyzo commented Apr 29, 2025

Please hold your horses on the larger refactor.
This is something that is being actively discussed as part of the v2.0 initiative, and we shouldn't rush it.

@MarcoPolo
Contributor Author

Consider my horses held. I'll explore a small refactor that hopes to remove some indirection here.

@raulk
Member

raulk commented Apr 29, 2025

@MarcoPolo I'd love to see a version of this PR with the refactor. Happy to pair on it if you'd like!

Re: batch.Publish() vs. pubsub.PublishBatch(), I suspect the latter can reduce the "spooky action at a distance" effect. It accomplishes that by encapsulating the RPC planning, scheduling, and dispatch all in a single place vs. spread across the indirection. Makes it easier to follow; and I'm all for reducing complexity ;-) But hard to tell without taking a stab.

Re: queuing/dispatch disciplines, even if we only support the "prioritize rarest" one at the moment, it still makes sense to introduce the abstraction as long as we have some confidence that it can withstand future disciplines going forward. Deliberate API design signals intentionality and makes a difference in shaping how APIs evolve. However, if you feel strongly against this, I can live without it.

I agree with @vyzo that we don't want a major refactor here, but introducing this feature cleanly (in an already complex and organic codebase) is a win.

@MarcoPolo MarcoPolo marked this pull request as draft April 30, 2025 18:41
@MarcoPolo MarcoPolo force-pushed the marco/batch-publishing branch from c0c96e6 to 7cc3ef8 Compare April 30, 2025 22:59
@MarcoPolo
Contributor Author

I made significant changes to the design. Thanks @sukunrt and @raulk for the feedback. I think this approach is clearer. I'd recommend initially reviewing with whitespace changes hidden (?w=1).

Care was taken to ensure batched messages and normal messages go through the same code flows as much as possible.

Some refactors along the way worth highlighting:

  • Introduce a new validation.ValidateLocal method that does not send a message. This lets us validate a message when adding to a batch without also running the send message logic.
  • ⚠️ Breaking change: validation.validate returns a ValidationError{Reason: RejectValidationIgnoredDuplicate} on duplicate instead of a nil error. The only time you would get this error is if you are publishing two duplicate messages, and it's probably better that you get this error instead of silently doing nothing.
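
As a hedged illustration of what the breaking change could mean for callers, the sketch below uses local stand-ins: `ValidationError`, the `RejectValidationIgnoredDuplicate` string value, the `validator` type, and this `ValidateLocal` signature are all defined here for the example and are assumptions, not the library's definitions. It shows validation that records a message without sending it, and a caller that distinguishes a duplicate from other failures with `errors.As`.

```go
package main

import (
	"errors"
	"fmt"
)

// ValidationError is a local stand-in for the library's error type.
type ValidationError struct {
	Reason string
}

func (e ValidationError) Error() string { return e.Reason }

// Placeholder value; the real constant's text may differ.
const RejectValidationIgnoredDuplicate = "validation ignored: duplicate"

// validator is a toy that only tracks which message IDs it has seen.
type validator struct {
	seen map[string]bool
}

// ValidateLocal (hypothetical signature) validates a message without
// sending it. A repeated ID now yields an error instead of a nil result.
func (v *validator) ValidateLocal(msgID string) error {
	if v.seen[msgID] {
		return ValidationError{Reason: RejectValidationIgnoredDuplicate}
	}
	v.seen[msgID] = true
	return nil
}

// isDuplicate reports whether err is a duplicate-message validation error.
func isDuplicate(err error) bool {
	var ve ValidationError
	return errors.As(err, &ve) && ve.Reason == RejectValidationIgnoredDuplicate
}

func main() {
	v := &validator{seen: map[string]bool{}}
	for _, id := range []string{"m1", "m2", "m1"} {
		err := v.ValidateLocal(id)
		switch {
		case err == nil:
			fmt.Println(id, "accepted")
		case isDuplicate(err):
			fmt.Println(id, "skipped: duplicate")
		default:
			fmt.Println(id, "rejected:", err)
		}
	}
}
```

Callers that previously treated a nil error as "published" would, under this change, need a branch like `isDuplicate` if they want to keep ignoring duplicates.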

@MarcoPolo MarcoPolo requested a review from raulk April 30, 2025 23:11
@MarcoPolo MarcoPolo marked this pull request as ready for review May 1, 2025 00:10
3 participants