
RFC(sharder): Scaling from the threads to the cores and beyond #8084


Preface: this RFC (Request For Comments) is meant to gather feedback and opinions on the feature set and some implementation details for @discordjs/sharder! While a lot of what is mentioned in this RFC is opinionated on my side (like what I'd like to see implemented in the module), not everything may make it into the initial version, and some things may be added later as extras.

This RFC is split into several parts, described below. As an important note, throughout this specification "shard" is not used to describe a gateway shard; those are handled by @discordjs/ws (see #8083). "Shard" is used to describe the client that the manager instantiates. Those could be (but are not limited to): workers, processes, or servers.

Before any work is done on this, I would kindly ask that we not take much (if any) inspiration from the current shard manager, and that this implementation be as agnostic as possible so it can be used by bots made with discord.js's components (or even with older versions of discord.js).

Yes, I'm also aware this is over-engineered, but everything is designed with composability in mind and draws on the experience gained from implementing this in #7204. Truth be told, without some of those components, certain things would be significantly harder to implement.


ShardManager

Alternatively: Sharder, ProcessManager, ClientManager

This is the entry point for using this module. Its job is to handle the creation of shards, and it is the central place where shards can intercommunicate. This manager also sets several parameters via environment variables (if available) to configure the clients.

As a result, a ShardManager has access to one or more ShardClients, and has a channel for each one of them.

API

// Error handling: TBD
// Lifecycle handlers: TBD

class ShardManager {
	public constructor(options: ShardManagerOptions);
}

interface ShardManagerOptions {
	strategy?: ChannelHandler | string;
}

Note: Shards will be configured in the strategy's options. The reasoning for this is simple: not all strategies have to be single-layer. There are also proxies, which act as a second layer.
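
For illustration, a minimal sketch of what constructing a manager could look like. The 'fork' strategy name is an assumption based on the list below, and where shard counts live is strategy-specific per the note above:

import { ShardManager } from '@discordjs/sharder';

// Hypothetical: resolve a built-in strategy by name. Strategy-specific
// options (e.g. how many shards to spawn) would be configured as part
// of the strategy rather than on the manager itself.
const manager = new ShardManager({
	strategy: 'fork',
});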

Strategies

This class also instantiates shards in one of the following four ways:

  • Worker: using worker_threads, it has the benefit of faster IPC than process-based strategies.
  • Fork: using child_process.fork, it's how the current shard manager creates the processes, and has the benefit that you can separate the manager from the worker.

    Note: If you open ports in the processes, you will have to use a different port for each process, otherwise binding will error.

  • Cluster: using cluster.fork, unlike the fork mode, ports are shared and load-balanced.

    Note: You will have to handle the cluster-worker logic in the same file.

  • Network: this is meant for cross-server sharding, and requires ShardManagerProxy. More information below.

Custom strategies will also be supported (although if you see something useful, please let us know and we'll consider it!).

Default: Fork?

Lifecycle Handlers

This class will emit events (not necessarily using EventEmitter) when any of the following occur (a sketch of possible handler registration follows the list):

  • A shard has been created.
  • A shard has been restarted.
  • A shard has been destroyed.
  • A shard has signalled a new status.
  • A shard has signalled a "ready" status.
  • A shard has signalled an "exit" status.
  • A shard has signalled a "restarting" status.
  • A shard has sent an invalid message. If unset, this will print the error to the console.
  • A shard has sent a ping.
  • A shard has not sent a ping in the maximum time-frame specified. If unset, it will restart the shard.
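
Building on the construction sketch above and assuming, purely for illustration, an EventEmitter-style API with these (hypothetical) event names:

manager.on('shardCreate', (shard) => console.log(`Shard ${shard.id} created`));
manager.on('shardReady', (shard) => console.log(`Shard ${shard.id} is ready`));
manager.on('shardPing', (shard) => console.log(`Shard ${shard.id} pinged`));
// Default behaviour if unset: restart the shard that missed its ping window.
manager.on('shardPingTimeout', (shard) => shard.restart());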

Life checks

All of the manager's requests must automatically time out within the maximum ping time: if the shards are configured to ping every 45 seconds, with a maximum time of 60 seconds, then the request timeout must be 60 seconds. This timeout can also be configured globally (defaults to the maximum ping time) and per-request (defaults to the global value).

Similarly, if a shard hasn't sent a ping within the maximum time, the lifecycle handler is called.
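
In option form, this could look something like the following sketch (lifeCheck, requestTimeout, and request are all hypothetical names):

const manager = new ShardManager({
	strategy: 'fork',
	lifeCheck: {
		interval: 45_000, // shards ping every 45 seconds
		maxWait: 60_000, // a shard silent for 60 seconds triggers the lifecycle handler
	},
	requestTimeout: 60_000, // global value, defaults to lifeCheck.maxWait
});

// Per-request override, defaulting to the global value:
await manager.request(0, { op: 'stats' }, { timeout: 5_000 });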


ShardClient

Alternatively: ShardWorker, Worker, Client

This is a class that instantiates the channel with its respective ShardManager (or ShardManagerProxy) using the correct stream, and boils down to a wrapper around a duplex stream with extra management.

API

// Signals: TBD
// Error handling: TBD

class ShardClient {
	public constructor(options: ShardClientOptions);

	public static registerMessageHandler(name: string, op: () => MessageHandler): typeof ShardClient;
	public static registerMessageTransformer(name: string, op: () => MessageTransformer): typeof ShardClient;
}

interface ShardClientOptions {
	messageHandler?: MessageHandler | string;
	messageTransformer?: MessageTransformer | string;
}
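
Registration and resolution by name could then look like this, with JsonMessageHandler and GzipMessageTransformer being the hypothetical implementations sketched in the sections below:

ShardClient.registerMessageHandler('json', () => new JsonMessageHandler())
	.registerMessageTransformer('gzip', () => new GzipMessageTransformer());

// The options can then reference the registered strategies by name:
const client = new ShardClient({ messageHandler: 'json', messageTransformer: 'gzip' });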

Signals

  • Starting: the shard is starting up and is not yet ready to respond to the manager's requests.
  • Ready: the shard is ready and fully operative.
  • Exit: the shard is shutting down and should not be restarted.
  • Restarting: the shard is shutting down and should be restarted.
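
These could map to a simple enum, for example (names hypothetical):

enum ShardClientSignal {
	Starting = 'starting',
	Ready = 'ready',
	Exit = 'exit',
	Restarting = 'restarting',
}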

MessageHandler

Alternatively: MessageFormat, MessageBroker, MessageSerder

This is a class that defines how messages are serialized and deserialized.

API

type ChannelData = string | Buffer;

class MessageHandler {
	public serialize(message: unknown, channel: unknown): Awaitable<ChannelData>;
	public deserialize(message: ChannelData, channel: unknown): Awaitable<unknown>;
}

Strategies

Custom strategies will also be supported (custom format, ETF, YAML, TOML, XML, you name it).

Default: JSON?
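
A minimal JSON-based handler could look like this sketch (JsonMessageHandler is a hypothetical name, and the Awaitable helper is inlined for self-containment):

type Awaitable<T> = T | Promise<T>;
type ChannelData = string | Buffer;

class JsonMessageHandler {
	public serialize(message: unknown, channel: unknown): Awaitable<ChannelData> {
		return JSON.stringify(message);
	}

	public deserialize(message: ChannelData, channel: unknown): Awaitable<unknown> {
		// Channels may hand us either strings or binary data.
		return JSON.parse(typeof message === 'string' ? message : message.toString('utf8'));
	}
}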


MessageTransformer

Alternatively: MessageMiddleware, EdgeMiddleware, ChannelMiddleware

This is a class that defines what to do when reading a message, and what to do when writing one:

  • When reading:
    graph LR
      A[Channel]
      A -->|Raw| B(MessageTransformer)
      B -->|Transformed| C(MessageHandler)
      B -->|Rejected| D[Channel]
      C -->|Deserialized| E[Application]
    
  • When writing:
    graph LR
      A[Application]
      A -->|Raw output| B(MessageHandler)
      B -->|Serialized| C(MessageTransformer)
      C -->|Raw| D[Channel]
    

Note: This is optional but also composable (e.g. you can have two transformers), executing in insertion order when writing (and in inverse order when reading). Transformers may be used to compress, encrypt, decorate, or otherwise transform the data in any way the developer desires.

[Gzip, Aes256]

When writing, the data will go through Gzip and then Aes256, while reading goes in the opposite direction (Aes256, then Gzip).

API

type ChannelData = string | Buffer;

class MessageTransformer {
	public read(message: ChannelData, channel: unknown): Awaitable<ChannelData>;
	public write(message: ChannelData, channel: unknown): Awaitable<ChannelData>;
}
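
The composition described above boils down to folding the data through the transformer list: insertion order when writing, reverse order when reading. A sketch of that folding, assuming an array of transformers and an opaque channel reference:

async function writeThrough(data: ChannelData, transformers: MessageTransformer[], channel: unknown): Promise<ChannelData> {
	// [Gzip, Aes256]: gzip first, then encrypt.
	for (const transformer of transformers) data = await transformer.write(data, channel);
	return data;
}

async function readThrough(data: ChannelData, transformers: MessageTransformer[], channel: unknown): Promise<ChannelData> {
	// [Gzip, Aes256]: decrypt first, then gunzip.
	for (const transformer of [...transformers].reverse()) data = await transformer.read(data, channel);
	return data;
}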

Strategies

It is unknown if we will ship any built-in strategies for encryption or transformation, since they're very application-defined, but we will at least provide two based on Node.js's built-in modules.

Warning: We may change the functions used if we find a way to work with streams for better efficiency, but the algorithms may not change.

Custom strategies will also be supported (although if you see something useful, please let us know and we'll consider it!).

Default: []
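
As an illustration of what one of those could look like, here is a compression transformer sketch built on node:zlib, reusing the ChannelData and Awaitable types from above (the class name is hypothetical, and the synchronous calls would likely be replaced per the warning above):

import { gzipSync, gunzipSync } from 'node:zlib';

class GzipMessageTransformer {
	// Compress when writing to the channel...
	public write(message: ChannelData, channel: unknown): Awaitable<ChannelData> {
		return gzipSync(message);
	}

	// ...and decompress when reading from it.
	public read(message: ChannelData, channel: unknown): Awaitable<ChannelData> {
		return gunzipSync(message);
	}
}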


ShardManagerProxy

This is a special class that operates as a shard for ShardManager and communicates with it via HTTPS, HTTP2, or HTTP3/QUIC, depending on the networking strategy used. Custom ones (such as gRPC) may be added. Encryption is recommended, so plain HTTP might not be available. There might also be a need to support SSH tunnels to bypass firewalls for greater security. A ShardManagerProxy is configured in a way similar to ShardManager, supporting all of its features, but it also adds the logic to communicate with the manager itself.

The proxy may also use a strategy whereby, if a shard needs to send a message to another shard that is available within the same proxy, no request is made to the parent; otherwise, the message is sent to the parent, which will try to find the shard among the different proxies (as sketched below).
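
A rough sketch of that routing decision (all names are hypothetical):

interface ShardMessage {
	targetShardId: number;
	payload: unknown;
}

class ProxyRouter {
	private readonly localShards = new Map<number, ShardClient>();

	// Deliver locally when the target shard lives in this proxy; otherwise
	// forward to the parent manager, which searches the other proxies.
	public async route(message: ShardMessage): Promise<void> {
		const local = this.localShards.get(message.targetShardId);
		if (local) await this.deliverLocally(local, message);
		else await this.forwardToManager(message);
	}

	private async deliverLocally(shard: ShardClient, message: ShardMessage): Promise<void> { /* ... */ }
	private async forwardToManager(message: ShardMessage): Promise<void> { /* ... */ }
}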


Questions to answer

  • Should we keep the names for the classes, or are any of the suggested alternative names better?
  • Should we use Result<T, E> classes to avoid try/catch and enhance performance?
  • Should we support async in MessageTransformer? Require streams only?
  • Should we support a raw data approach, or a command approach? If the latter, should we have eval?
    Raw data approach; commands can be implemented at the bot level. Similarly to how discord.js just emits events, people are free to build frameworks on top of @discordjs/sharder, so there's no need for us to force a command format, which would pose several limitations on how users can structure their payloads.
  • Should messages be "repliable" (requires tracking as well as message tagging, etc)?
    • Always? Opt-in? Opt-out?
      Opt-in, not all broadcasts or requests need to be replied to.
    • Should we have timeouts?
      Yes, and tasks should be aborted and de-queued once the timeout is done.
    • Should we be able to cancel a request before/without timeout?
      I personally think this could be beneficial, but the implementation remains to be seen.
  • What should be done when a message is sent from shard A to shard B, but B hasn't signalled "ready" (and is not "exit")? Should it wait for it to signal ready, or abort early? If the former, should we allow an option to "hint" how long a shard takes to ready?
  • Is there a need for an error signal, or do we use exit for it?
  • If a shard exits with process.exit() without signalling the manager (or proxy), should it try to restart indefinitely? Have a limit that resets on "restart"?
  • If ShardManager dies, what should the shards do?
  • If a ShardManagerProxy goes offline, what should its shards do? What should the manager do? Load-balance to other proxies?
  • If a ShardManagerProxy is offline at spawn, what should the manager do? Load-balance to other proxies?
  • If a ShardManagerProxy goes back online, should the manager load-balance its shards back to it?
  • NEW: Should the HTTP-based channel strategies be pooling or open on demand?
    The WebSocket strategy obviously needs to be open at all times. Keeping a connection open makes it easier for messages to be sent both ways and reduces latency, since fewer network hops are needed (no handshake on every request), but it will also require a more advanced system to properly read messages and will make aborting requests harder.
  • NEW: Should different ShardManagerProxys be able to communicate directly to each other, similar to P2P? Or should it be centralized, requiring the ShardManager?
    I personally think there isn't really much of a need for this; a centralized cache database (such as Redis) could store the information, and the same goes for RabbitMQ or alternatives. Is there a need for each proxy to know each other? One pro of having this is that it makes proxies more resilient to outages from the manager, but one downside is that it would need to send a lot of data between proxies and the manager to keep all the information synchronized. I believe it's best to leave this job to Redis and let the user decide how it should be used. After all, proxies are very configurable.
  • NEW: Should we allow ShardManagerProxy to have multiple managers, or stick to a single one for simplicity?
    This way, managers can be mirrored, so if the main one is down, the proxies connect to a secondary one. How the data is shared between managers is up to the user; we will try to provide the hooks required for the state to be easy to store and persist.
  • NEW: What should ShardManagerProxy do if all the managers are offline? Should they just stay up?
    Their capabilities may be limited, since they can't load-balance by themselves, and they may not be able to communicate with each other without P2P communication or a central message broker system, but most of the operations should, in theory, be able to continue. I believe we can make the behaviour customizable by the developer, with the default being to keep working and try to reconnect once the managers are back up.

Anything else missing?

If you think this RFC is missing any critical/crucial bits of information, let me know. I will add it to the main issue body with a bolded EDIT tag to ensure everyone sees it (in case they don't want to check the edit history).

Changelog

  • Added answers to questions 4 and 5.
  • Added four more questions (13, 14, 15, 16), all related to ShardManagerProxy.
