
Task System for Bevy #318

Closed
@aclysma

Description


Task System for Bevy

Currently bevy depends on rayon for multithreaded dispatching of systems. @lachlansneff and @aclysma have iterated on a prototype with feedback from @kabergstrom and @cart. This issue is meant to:

  • Capture some of the discussion from the #rendering channel on Discord - some of it was about the prototype and some was longer-term
  • Invite further feedback on the near-term and long-term plans

Why Replace Rayon?

  • Rayon has long-standing performance issues with smaller workloads on machines with more than a few cores (see #111: cpu usage)
  • Rayon is not async-friendly
  • Rayon is somewhat of a closed box: it owns its threads, and since it is a general-purpose library it would be more difficult to upstream changes tuned for games
  • Rayon has a lot more in it than we really need. Our alternative has fewer dependencies, compiles faster, and is substantially less code

What Would the Alternative Be?

@lachlansneff and @aclysma implemented a prototype using multitask. It’s a small executor based on async-task, which is used within async-std. The dependencies are:

├── multitask
│   ├── async-task
│   ├── concurrent-queue
│   │   └── cache-padded
│   └── fastrand
├── num_cpus
│   └── libc
├── parking
└── pollster

The API does three things, sketched in the example after this list:

  • Fire-and-forget `'static` tasks
  • Fork-join for non-static tasks
  • Chunked parallel iteration of slices
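
Here is a rough sketch of those three operations. `TaskPool`, `spawn`, `detach`, and `scope` follow the shape of the prototype, but treat every name here as provisional rather than the final API:

```rust
use bevy_tasks::TaskPool;

fn demo(pool: &TaskPool) {
    // 1. Fire-and-forget: the future must be 'static because it may outlive
    //    the caller; `detach` lets it keep running in the background.
    pool.spawn(async {
        println!("background work");
    })
    .detach();

    // 2. Fork-join: `scope` blocks until every spawned task completes, so
    //    tasks may borrow non-'static local data like `data`.
    let data = vec![1, 2, 3, 4];
    let doubled: Vec<i32> = pool.scope(|scope| {
        for &x in &data {
            scope.spawn(async move { x * 2 });
        }
    });
    assert_eq!(doubled.iter().sum::<i32>(), 20);

    // 3. Chunked parallel iteration, expressed here via `scope` over
    //    `chunks`; a dedicated helper would wrap this pattern.
    let sums: Vec<i32> = pool.scope(|scope| {
        for chunk in data.chunks(2) {
            scope.spawn(async move { chunk.iter().sum() });
        }
    });
    assert_eq!(sums.iter().sum::<i32>(), 10);
}
```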

We have a prototype of ParallelExecutor that uses this instead of rayon, allowing us to remove rayon as a dependency from bevy. The prototype currently lives in its own repo; we intend to add it as a module directly within bevy (bevy_tasks). This will allow us to do more sophisticated instrumentation and profiling than we could with an externally managed thread pool.

Advantages/Disadvantages

Advantages:

  • Less code, fewer dependencies
  • More async-friendly
  • Allows us to control and customize it more for bevy
  • Solves #111 (cpu usage)

Disadvantages:

  • Rayon has more features

Tradeoffs:

  • Rayon has a large community of maintainers, users, and downstream crates. This can be both good and bad

Short Term Plan

Finish a PR to add bevy_tasks as a module, remove rayon as a dependency, and update ParallelExecutor to work with bevy_tasks. In principle these steps are already done, but we may want to polish a bit first.

We have a feature branch for this underway here: https://github.com/lachlansneff/bevy/tree/bevy-tasks

Long Term Plan

Thread management is clearly a key problem that needs to be solved in a scalable game engine. In my opinion there are three main uses of threads in a modern, ECS-based game engine, from high-level to low-level:

  1. Large systems that are relatively long running that own their own thread, pipelined with the “main” simulation thread. (See https://github.com/aclysma/renderer_prototype/blob/master/docs/pipelining.png)
  2. Dispatching individual ECS systems
  3. Jobs being triggered by the systems
    a. Some systems might wait for tasks to complete before returning
    b. Other systems might start jobs to be finished later in the frame or in a future frame

We plan to apply this solution to #2 now and, longer term, expose the same underlying system to solve #3. (#1 is out of scope for now, but it might also be able to use this system.)
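
To make the two job patterns in point 3 concrete, here is a rough sketch; the `Task` handle, the `Pathfinding` type, and the polling comment are illustrative assumptions, not settled design:

```rust
use bevy_tasks::{Task, TaskPool};

// 3a: fork-join inside a system; the system waits for the tasks it spawned.
fn same_frame_system(pool: &TaskPool) {
    let partial_sums: Vec<u32> = pool.scope(|scope| {
        for chunk in [[1u32, 2], [3, 4]] {
            scope.spawn(async move { chunk.iter().sum() });
        }
    });
    assert_eq!(partial_sums.iter().sum::<u32>(), 10);
}

// 3b: a job started now and collected on a later frame via a stored handle.
struct Pathfinding {
    pending: Option<Task<Vec<(i32, i32)>>>,
}

fn pathfinding_system(pool: &TaskPool, state: &mut Pathfinding) {
    if state.pending.is_none() {
        // Kick off the job; it may complete this frame or several frames later.
        state.pending = Some(pool.spawn(async { vec![(0, 0), (1, 1)] }));
    }
    // Each frame, a real system would poll `state.pending` without blocking
    // and take the result once the task has completed.
}
```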

We discussed #3 in the #rendering channel on Discord:

  • @lachlansneff suggested creating a single global threadpool as an atomic static so that it’s always easy and fast to access
  • @aclysma suggested adding it as a resource as this is more consistent with the rest of the engine
  • @cart: “my default answer is "lets use a resource", but I’m flexible if the context dictates that something static is better”
  • @kabergstrom suggested separating IO tasks from compute tasks, the rationale being that IO tasks do not occupy a thread for long, as they generally have short wake/sleep cycles. Additionally, IO sometimes has latency requirements
  • @aclysma suggested binning tasks as:
    • IO: High priority tasks that are expected to spend very little time “awake” (example: feeding an audio buffer)
    • Compute: Tasks that must be completed to render the current frame (example: visibility)
    • Async Compute: Tasks that may span multiple frames and don’t have latency requirements (example: pathfinding)
  • @kabergstrom suggested that generally we would partition threads to process specific bins, not oversubscribing threads
    • Example: A system with 16 logical cores might give 4 to IO, 4 to async compute, and 8 to (same-frame) compute
    • There was general agreement that this needs to be tunable as games will have different requirements (i.e. games that are streaming in the environment vs. games that can load everything up front)
  • @aclysma suggested that when we do have binned thread pools, we will want to set affinities to physical cores
  • Consensus was that for now we will put thread pools in resources, since we can newtype each specific pool as a separate resource (sketched below)
  • Having an atomic static method of accessing the thread pool may prove to be a valuable option in the future, but if we add it we will probably want to retain the ability to have separate buckets for different types of tasks (i.e. it would be more like an atomic static global table of task pools rather than a single task pool - or maybe 3 atomic static task pools)
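
A minimal sketch of the newtype-per-bin approach; the pool names mirror the bins above, and the `Deref` impl plus the `Res` usage in the comment are illustrative assumptions:

```rust
use std::ops::Deref;
use bevy_tasks::TaskPool;

// One newtype per bin, each registered as its own resource.
pub struct IoTaskPool(pub TaskPool);
pub struct ComputeTaskPool(pub TaskPool);
pub struct AsyncComputeTaskPool(pub TaskPool);

// Deref lets systems call TaskPool methods on the newtype directly.
impl Deref for ComputeTaskPool {
    type Target = TaskPool;
    fn deref(&self) -> &TaskPool {
        &self.0
    }
}

// A system would then request exactly the bin it needs, e.g.:
//
//     fn visibility_system(pool: Res<ComputeTaskPool>) {
//         pool.scope(|scope| { /* spawn same-frame work */ });
//     }
```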

Configuration of Task Buckets

@lachlansneff and @aclysma also discussed the need to assign threads to the proposed buckets (IO, async compute, and compute). We considered several approaches:

  • Some sort of callback that puts the problem completely on the end user. We would probably want this to be optional, which would mean having a reasonable default if no callback is specified
  • Some sort of explicit configuration that can be targeted at particular hardware devices (i.e. a config file that explicitly lists devices (e.g. iphone12, pixel4) and an exact distribution of threads to use)
    • In general, we may need a solution for other systems in the future to tune performance based on the device (examples: LOD distances, pool sizes, limits on number of things spawned, disabling/scaling rendering features, etc.). If we add something like this in the future, a policy on task/thread distribution would be a good addition
  • Could provide a rough policy
    • Stupid simple default: 1 IO thread, 1 async compute thread, max(1, N-2) compute threads
    • Slightly more advanced: %/min/max IO threads, %/min/max async compute threads, min compute threads
    • Example:
      • N = number of logical cores
      • NumIOCores = Clamp(N * 0.1, 1, 4)
        • 10%, at least 1, no more than 4
      • NumAsyncComputeCores = Clamp(N * 0.1, 1, 4)
        • 10%, at least 1, no more than 4
      • NumComputeCores = Max(1, N - NumIOCores - NumAsyncComputeCores)
        • Implicitly 80% of cores, at least 1
  • If threads get oversubscribed because there are <=2 cores, we will rely on the OS to fairly timeslice them.
    • We actually don't want to do this ourselves, because the OS can preempt a task even if it is long-running - this approach should be more resilient against any of the pools being starved
  • @lachlansneff and @aclysma agreed that we should go with the “slightly more advanced” policy now (sketched below), with the option to implement a callback for custom configuration later. We will implement it in a way that allows both methods to coexist.
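
A sketch of how the percent/min/max policy might be expressed; all names here are illustrative, and the numbers are just the defaults from the example above:

```rust
struct TaskPoolPolicy {
    percent: f32,
    min: usize,
    max: usize,
}

impl TaskPoolPolicy {
    fn threads(&self, logical_cores: usize) -> usize {
        // Round the percentage to a whole thread count, then enforce bounds.
        let desired = (logical_cores as f32 * self.percent).round() as usize;
        desired.clamp(self.min, self.max)
    }
}

fn distribute(logical_cores: usize) -> (usize, usize, usize) {
    let io = TaskPoolPolicy { percent: 0.1, min: 1, max: 4 }.threads(logical_cores);
    let async_compute = TaskPoolPolicy { percent: 0.1, min: 1, max: 4 }.threads(logical_cores);
    // Compute implicitly gets the remainder, but always at least one thread;
    // on <=2 core machines this oversubscribes and we lean on the OS scheduler.
    let compute = logical_cores.saturating_sub(io + async_compute).max(1);
    (io, async_compute, compute)
}

fn main() {
    // On 16 logical cores: 2 IO, 2 async compute, 12 compute.
    assert_eq!(distribute(16), (2, 2, 12));
}
```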

Potential Future Improvements to bevy_tasks

  • Better ergonomics with Tasks/TaskPools/Scopes in multithreaded code
    • A few things take &mut where we might not want &mut - Scope::spawn, for example. It would be nice if Scope were cloneable and could be passed into futures being spawned.
    • It might also be nice for TaskPool to wrap its internals in an Arc, which would make it easier to pass around too.
  • More doc comments
  • Improved Panic Handling: If a task panics in scope() or spawn(), we want well-thought-out ways of surfacing the panic to the caller (see the sketch below). We may be getting this somewhat for free by using multitask, but it’s worth having an intentional design around it (i.e. awaiting a panicked task will panic)
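
For the panic-handling point, a sketch of the semantics we would want, assuming the async-task behavior of resuming the unwind when a panicked task is awaited:

```rust
use bevy_tasks::TaskPool;

fn panic_demo(pool: &TaskPool) {
    let task = pool.spawn(async {
        panic!("task failed");
    });
    // Desired behavior: the panic is surfaced where the task is awaited
    // rather than silently killing a worker thread. With async-task-based
    // executors, awaiting `task` resumes the unwind in the awaiting context:
    //
    //     pollster::block_on(task); // would panic here with "task failed"
    //
    // Note that just dropping `task` would instead cancel it.
    drop(task);
}
```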

Next Steps

  • Gather feedback for the short term plan and find consensus on if we will proceed (PR bevy_tasks and use it to replace rayon)
  • Gather additional feedback for the longer term plan (a general approach to multithreaded tasks and how bevy_tasks might help us with it)

Labels

A-Rendering: Drawing game state to the screen
C-Feature: A new feature, making something new possible
