[exploration] Use crossbeam MPMC channel instead of std::sync::mpsc #486

Closed
wants to merge 1 commit into from

kimsnj
Contributor

@kimsnj kimsnj commented Sep 19, 2019

For issue #440, I've played a bit with crossbeam and fd.

As expected, the implementation of the exec method becomes simpler, since it no longer needs a Mutex or an Arc.
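For context, here is a minimal sketch of the general pattern that the Mutex and Arc are needed for when several worker threads consume from one `std::sync::mpsc` channel. It is not fd's actual code: `run_workers` and its parameters are hypothetical names chosen for illustration. With crossbeam-channel, the `Receiver` implements `Clone`, so each worker could hold its own handle and the wrapper would disappear.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical helper: distributes `items` to `workers` threads through one
/// std mpsc channel and returns how many items each worker handled.
fn run_workers(items: Vec<String>, workers: usize) -> Vec<usize> {
    let (tx, rx) = mpsc::channel::<String>();

    // std's Receiver cannot be cloned, so it is shared behind Arc<Mutex<..>>.
    // A crossbeam_channel::Receiver is Clone, which removes this wrapper.
    let shared_rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&shared_rx);
            thread::spawn(move || {
                let mut handled = 0;
                loop {
                    // The lock guard is a temporary, so it is released at the
                    // end of this statement, before the item is processed.
                    let msg = rx.lock().unwrap().recv();
                    match msg {
                        Ok(_path) => handled += 1, // a real worker would run the command here
                        Err(_) => break,           // all senders dropped: channel is closed
                    }
                }
                handled
            })
        })
        .collect();

    for item in items {
        tx.send(item).unwrap();
    }
    drop(tx); // close the channel so the workers terminate

    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `run_workers(paths, 4)` returns one count per worker; the counts always sum to the number of items sent, but how they are split between workers depends on scheduling.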

In terms of binary size in bytes (release mode, musl target), the difference is quite small:

fd_crossbeam_bounded:   3605520
fd_crossbeam_unbounded: 3605512
fd_std_mpsc:            3611064

However, on my machine, mpsc seems to perform better without --exec:

Benchmark #1: ./fd_std_mpsc -HI '.*[0-9]\.rs$' ~
  Time (mean ± σ):      1.018 s ±  0.004 s    [User: 10.319 s, System: 1.464 s]
  Range (min … max):    1.012 s …  1.029 s    10 runs

Benchmark #2: ./fd_crossbeam_bounded -HI '.*[0-9]\.rs$' ~
  Time (mean ± σ):      1.086 s ±  0.026 s    [User: 10.510 s, System: 2.048 s]
  Range (min … max):    1.046 s …  1.135 s    10 runs

Benchmark #3: ./fd_crossbeam_unbounded -HI '.*[0-9]\.rs$' ~
  Time (mean ± σ):      1.088 s ±  0.019 s    [User: 10.593 s, System: 1.994 s]
  Range (min … max):    1.066 s …  1.126 s    10 runs

Summary
  './fd_std_mpsc -HI '.*[0-9]\.rs$' ~' ran
    1.07 ± 0.03 times faster than './fd_crossbeam_bounded -HI '.*[0-9]\.rs$' ~'
    1.07 ± 0.02 times faster than './fd_crossbeam_unbounded -HI '.*[0-9]\.rs$' ~'
With --exec, it depends on the number of results. With a search that yields more than 5,000 results, the bounded crossbeam variant is a bit faster on average:

Benchmark #1: ./fd_std_mpsc -HI '.*[0-9]\.rs$' ~ -x echo {}
  Time (mean ± σ):      7.374 s ±  1.901 s    [User: 18.191 s, System: 23.729 s]
  Range (min … max):    5.439 s … 10.620 s    10 runs

Benchmark #2: ./fd_crossbeam_bounded -HI '.*[0-9]\.rs$' ~ -x echo {}
  Time (mean ± σ):      7.005 s ±  1.426 s    [User: 18.130 s, System: 22.579 s]
  Range (min … max):    5.841 s … 10.711 s    10 runs

Benchmark #3: ./fd_crossbeam_unbounded -HI '.*[0-9]\.rs$' ~ -x echo {}
  Time (mean ± σ):      7.838 s ±  1.366 s    [User: 18.465 s, System: 26.202 s]
  Range (min … max):    5.794 s … 10.574 s    10 runs

Summary
  './fd_crossbeam_bounded -HI '.*[0-9]\.rs$' ~ -x echo {}' ran
    1.05 ± 0.35 times faster than './fd_std_mpsc -HI '.*[0-9]\.rs$' ~ -x echo {}'
    1.12 ± 0.30 times faster than './fd_crossbeam_unbounded -HI '.*[0-9]\.rs$' ~ -x echo {}'
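To judge how much weight the Summary figures above can carry, the ratio and its uncertainty can be recomputed from the reported means and standard deviations. The sketch below assumes standard first-order error propagation for a quotient; that this is how the benchmark tool derives its "±" is an assumption, although it reproduces the printed values. `relative_speed` is a hypothetical helper name.

```rust
/// Ratio of two mean runtimes with first-order (Gaussian) error propagation.
/// Returns (ratio, uncertainty of the ratio).
fn relative_speed(mean: f64, sd: f64, ref_mean: f64, ref_sd: f64) -> (f64, f64) {
    let ratio = mean / ref_mean;
    // Relative errors of numerator and denominator add in quadrature.
    let rel_err = ((sd / mean).powi(2) + (ref_sd / ref_mean).powi(2)).sqrt();
    (ratio, ratio * rel_err)
}
```

With the numbers above, `relative_speed(7.374, 1.901, 7.005, 1.426)` gives roughly 1.05 ± 0.35 and `relative_speed(7.838, 1.366, 7.005, 1.426)` gives roughly 1.12 ± 0.30. Both intervals include 1.0, so this run does not clearly separate the three variants.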

With a search that matches only ~150 results, crossbeam is no longer faster:

Benchmark #1: ./fd_std_mpsc -HI '.*[0-9]\.jpg$' ~ -x echo {}
  Time (mean ± σ):      1.210 s ±  0.020 s    [User: 11.063 s, System: 2.122 s]
  Range (min … max):    1.187 s …  1.252 s    10 runs

Benchmark #2: ./fd_crossbeam_bounded -HI '.*[0-9]\.jpg$' ~ -x echo {}
  Time (mean ± σ):      1.229 s ±  0.011 s    [User: 11.269 s, System: 2.123 s]
  Range (min … max):    1.207 s …  1.244 s    10 runs

Benchmark #3: ./fd_crossbeam_unbounded -HI '.*[0-9]\.jpg$' ~ -x echo {}
  Time (mean ± σ):      1.226 s ±  0.020 s    [User: 11.134 s, System: 2.256 s]
  Range (min … max):    1.196 s …  1.250 s    10 runs

Summary
  './fd_std_mpsc -HI '.*[0-9]\.jpg$' ~ -x echo {}' ran
    1.01 ± 0.02 times faster than './fd_crossbeam_unbounded -HI '.*[0-9]\.jpg$' ~ -x echo {}'
    1.02 ± 0.02 times faster than './fd_crossbeam_bounded -HI '.*[0-9]\.jpg$' ~ -x echo {}'

I can do more experiments if you'd like.

Cheers,

@sharkdp
Owner

sharkdp commented Sep 21, 2019

Very cool, thank you for looking into this!

I ran a few benchmarks on my own (for non-exec commands) and also found that the crossbeam version is mostly slower (by a significant amount, except for one of the benchmarks, where the unbounded version is slightly faster).

bounded-100 refers to a version with a bounded(100) crossbeam channel.

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master '.*[0-9]\.jpg$' '/home/shark'` | 200.8 ± 1.5 | 198.4 | 203.2 | 1.00 |
| `./fd-crossbeam-bounded-100 '.*[0-9]\.jpg$' '/home/shark'` | 260.8 ± 11.8 | 247.0 | 283.0 | 1.30 |
| `./fd-crossbeam-unbounded '.*[0-9]\.jpg$' '/home/shark'` | 261.7 ± 9.0 | 243.4 | 276.0 | 1.30 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -HI '.*[0-9]\.jpg$' '/home/shark'` | 524.1 ± 3.6 | 520.7 | 532.2 | 1.00 |
| `./fd-crossbeam-bounded-100 -HI '.*[0-9]\.jpg$' '/home/shark'` | 626.3 ± 12.2 | 607.2 | 642.9 | 1.19 |
| `./fd-crossbeam-unbounded -HI '.*[0-9]\.jpg$' '/home/shark'` | 619.5 ± 12.0 | 607.8 | 643.9 | 1.18 |

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master --hidden --no-ignore '' '/home/shark'` | 1.111 ± 0.027 | 1.073 | 1.162 | 1.08 |
| `./fd-crossbeam-bounded-100 --hidden --no-ignore '' '/home/shark'` | 1.713 ± 0.122 | 1.516 | 1.950 | 1.67 |
| `./fd-crossbeam-unbounded --hidden --no-ignore '' '/home/shark'` | 1.027 ± 0.023 | 1.000 | 1.075 | 1.00 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -HI --extension jpg '' '/home/shark'` | 539.3 ± 3.5 | 535.0 | 544.0 | 1.00 |
| `./fd-crossbeam-bounded-100 -HI --extension jpg '' '/home/shark'` | 674.9 ± 21.0 | 642.3 | 709.1 | 1.25 |
| `./fd-crossbeam-unbounded -HI --extension jpg '' '/home/shark'` | 683.3 ± 16.0 | 665.9 | 715.0 | 1.27 |

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./fd-master -HI --type l '' '/home/shark'` | 520.8 ± 3.7 | 514.8 | 526.7 | 1.00 |
| `./fd-crossbeam-bounded-100 -HI --type l '' '/home/shark'` | 624.4 ± 7.5 | 616.6 | 638.6 | 1.20 |
| `./fd-crossbeam-unbounded -HI --type l '' '/home/shark'` | 621.2 ± 14.2 | 602.0 | 643.3 | 1.19 |

I'm a bit surprised, as crossbeam claims to be faster than std::mpsc, even in the MPSC case (https://github.com/crossbeam-rs/crossbeam-channel/tree/master/benchmarks#results).

A few thoughts:

  • The bounded(…) channel experiment is definitely interesting. However, I don't think that the channel size is necessarily related to MAX_BUFFER_SIZE, like in your example. Where could we profit from a bounded channel? Could this potentially help us with memory issues like #471 (Excessive memory usage)?
  • I ran a short experiment where I printed the current size of the channel in the unbounded case. For cases with many search results (fd .), the channel size grows very rapidly to sizes of 100,000 and more! I guess this means that we are limited by the speed of the receiver thread, which cannot print results fast enough (even in the non-colored case, which is much faster). This could be an interesting route for optimization, which was also discussed in the past (see the comment in #304, "fd without pattern is much slower than find").
  • Regarding the last point, another thing that could be interesting would be to "render" the output line in the senders (remove the ./ prefix, colorize the path, etc). This way, we could move even more work into the senders, profiting from parallelization.
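
Two of the ideas above, a bounded channel that applies backpressure and rendering the output line on the sender side, can be sketched together. This is only an illustration under stated assumptions: it uses `std::sync::mpsc::sync_channel` as a stand-in for crossbeam's `bounded(…)`, and `render` and `search` are hypothetical names, not functions from fd.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical rendering step: strip a leading "./" from the path.
/// Colorizing the path would also happen here.
fn render(path: &str) -> String {
    path.strip_prefix("./").unwrap_or(path).to_string()
}

/// Sends `paths` from up to `senders` threads through a bounded channel and
/// returns the rendered lines in the order the receiver saw them.
fn search(paths: Vec<String>, senders: usize, capacity: usize) -> Vec<String> {
    // sync_channel blocks a sender once `capacity` messages are queued, so
    // the queue cannot grow without bound when the receiver is slow.
    let (tx, rx) = mpsc::sync_channel::<String>(capacity);

    let senders = senders.max(1);
    let chunk_size = ((paths.len() + senders - 1) / senders).max(1);
    let handles: Vec<_> = paths
        .chunks(chunk_size)
        .map(|chunk| {
            let chunk = chunk.to_vec();
            let tx = tx.clone();
            thread::spawn(move || {
                for path in chunk {
                    // Rendering runs in parallel on the sender side.
                    tx.send(render(&path)).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // only the clones held by the sender threads keep the channel open

    // The receiver only has to collect (or print) finished lines.
    let lines: Vec<String> = rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    lines
}
```

The capacity trades memory for sender stalls: a small value caps the queue tightly but makes senders wait on a slow receiver more often, which may be related to the slowdown of the bounded(100) variant in the `--hidden --no-ignore` benchmark above.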

@sharkdp
Owner

sharkdp commented Apr 3, 2020

Going to close this for now. Thank you very much for this experiment!

@sharkdp sharkdp closed this Apr 3, 2020