
Vectorize remove_copy for 4 and 8 byte elements #5062


Closed
wants to merge 15 commits into from

Conversation

AlexGuteniev
Contributor

Follow up on #4987

For now, this covers only 4- and 8-byte elements and only AVX2, so that AVX2 masked stores can be used.

This may also be doable for 1- and 2-byte elements, but storing a partial vector would require a different approach. It may also turn out to be infeasible for 1- and 2-byte elements, if every such approach ends up slower than scalar code. Either way, I'm not attempting it now, to avoid making this PR too big.

AVX2 masked stores are slower than regular stores, so this approach is not used uniformly.
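The per-vector step can be modeled in scalar code. The sketch below is not the PR's implementation, just a portable illustration of what one 8-lane AVX2 iteration does for 4-byte elements: compare the lanes against the value, compact the survivors to the front, and write only the surviving prefix (the job done by a lane permute plus `_mm256_maskstore_epi32` in the actual vector code; `compress_step` is a hypothetical name).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of one 8-lane iteration of vectorized remove_copy:
// keep every element that differs from val, compacted to the front of dst.
// Returns how many elements were kept, i.e. how far dst advances
// (the popcount of the keep-mask in the vector code).
std::size_t compress_step(const std::array<std::uint32_t, 8>& src,
                          std::uint32_t val, std::uint32_t* dst) {
    std::size_t kept = 0;
    for (const std::uint32_t e : src) {
        if (e != val) {
            dst[kept++] = e; // vector code: permute survivors, masked store
        }
    }
    return kept;
}
```

Because the number of survivors varies per iteration, the store covers only a prefix of the vector, which is exactly what the masked store provides without writing past the destination.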

⏱️ Benchmark results

| Benchmark | `main` | this PR |
| --- | --- | --- |
| `r<alg_type::std_fn, std::uint8_t>` | 301 ns | 270 ns |
| `r<alg_type::std_fn, std::uint16_t>` | 276 ns | 275 ns |
| `r<alg_type::std_fn, std::uint32_t>` | 336 ns | 333 ns |
| `r<alg_type::std_fn, std::uint64_t>` | 778 ns | 761 ns |
| `r<alg_type::rng, std::uint8_t>` | 278 ns | 284 ns |
| `r<alg_type::rng, std::uint16_t>` | 301 ns | 281 ns |
| `r<alg_type::rng, std::uint32_t>` | 338 ns | 331 ns |
| `r<alg_type::rng, std::uint64_t>` | 779 ns | 768 ns |
| `rc<alg_type::std_fn, std::uint32_t>` | 1445 ns | 475 ns |
| `rc<alg_type::std_fn, std::uint64_t>` | 2187 ns | 1101 ns |
| `rc<alg_type::rng, std::uint32_t>` | 897 ns | 472 ns |
| `rc<alg_type::rng, std::uint64_t>` | 1918 ns | 1110 ns |

Expectedly, vectorized remove_copy is faster than non-vectorized.

Expectedly, vectorized remove_copy does not reach the performance of vectorized remove.

As usual, there are some minor variations in the unchanged vectorized remove.

⚠️ AMD benchmark wanted ⚠️

I'm worried about the vmaskmov* timings.
They seem to be slow enough on AMD to turn this change into a pessimization.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner November 1, 2024 19:30
@CaseyCarter CaseyCarter added the performance Must go faster label Nov 1, 2024
@StephanTLavavej StephanTLavavej self-assigned this Nov 1, 2024
@muellerj2
Contributor

Benchmark results on a Ryzen 7840HS notebook:

| Benchmark | main branch | this PR |
| --- | --- | --- |
| `r<alg_type::std_fn, std::uint8_t>` | 220 ns | 220 ns |
| `r<alg_type::std_fn, std::uint16_t>` | 250 ns | 251 ns |
| `r<alg_type::std_fn, std::uint32_t>` | 330 ns | 330 ns |
| `r<alg_type::std_fn, std::uint64_t>` | 781 ns | 806 ns |
| `r<alg_type::rng, std::uint8_t>` | 220 ns | 214 ns |
| `r<alg_type::rng, std::uint16_t>` | 261 ns | 243 ns |
| `r<alg_type::rng, std::uint32_t>` | 332 ns | 336 ns |
| `r<alg_type::rng, std::uint64_t>` | 872 ns | 816 ns |
| `rc<alg_type::std_fn, std::uint32_t>` | 1365 ns | 1413 ns |
| `rc<alg_type::std_fn, std::uint64_t>` | 942 ns | 1650 ns |
| `rc<alg_type::rng, std::uint32_t>` | 921 ns | 1413 ns |
| `rc<alg_type::rng, std::uint64_t>` | 903 ns | 1674 ns |

@AlexGuteniev
Contributor Author

Benchmark results on a Ryzen 7840HS notebook:

Thanks. Apparently this is not the way to go 😿

@StephanTLavavej
Member

Thanks @muellerj2 - I also confirm that this is a pessimization on my desktop 5950X (Zen 3):

| Benchmark | Before | After | "Speedup" (less than 1.0 is slower) |
| --- | --- | --- | --- |
| `r<alg_type::std_fn, std::uint8_t>` | 355 ns | 366 ns | 0.97 |
| `r<alg_type::std_fn, std::uint16_t>` | 354 ns | 335 ns | 1.06 |
| `r<alg_type::std_fn, std::uint32_t>` | 394 ns | 401 ns | 0.98 |
| `r<alg_type::std_fn, std::uint64_t>` | 1097 ns | 1122 ns | 0.98 |
| `r<alg_type::rng, std::uint8_t>` | 352 ns | 359 ns | 0.98 |
| `r<alg_type::rng, std::uint16_t>` | 335 ns | 336 ns | 1.00 |
| `r<alg_type::rng, std::uint32_t>` | 394 ns | 401 ns | 0.98 |
| `r<alg_type::rng, std::uint64_t>` | 1098 ns | 1108 ns | 0.99 |
| `rc<alg_type::std_fn, std::uint32_t>` | 1186 ns | 1679 ns | 0.71 |
| `rc<alg_type::std_fn, std::uint64_t>` | 1223 ns | 2056 ns | 0.59 |
| `rc<alg_type::rng, std::uint32_t>` | 1177 ns | 1688 ns | 0.70 |
| `rc<alg_type::rng, std::uint64_t>` | 1768 ns | 2077 ns | 0.85 |

@AlexGuteniev Do you want to rework or abandon this strategy?

@AlexGuteniev
Contributor Author

I see no rework possible.

We have the following options, besides abandoning:

  • Accept it as is, in the hope that future CPUs will get better, and appealing to the fact that the Intel speedup is greater than the AMD slowdown
  • Implement vendor detection. This requires keeping more state than just __isa_available and doing at least some of the detection outside VCRuntime, and it also adds to the number of configurations to support.

I expect that neither of these is acceptable, but I'm leaving the final decision to you.
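To illustrate the second option: vendor detection would amount to an extra policy bit alongside __isa_available, derived from the CPUID vendor identification string (leaf 0). The function below is purely hypothetical (nothing like it exists in this PR or in VCRuntime); it shows the kind of vendor-specific dispatch logic under discussion, with `presume_fast_masked_stores` as an assumed name.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical policy flag that vendor detection would add on top of
// __isa_available: whether AVX2 masked stores (vmaskmovps / vpmaskmovd)
// should be presumed fast. The 12-byte vendor string comes from CPUID leaf 0.
bool presume_fast_masked_stores(std::string_view cpuid_vendor) {
    // The benchmarks above showed a speedup on Intel but a slowdown
    // relative to the scalar fallback on Zen 3 / Zen 4 AMD parts.
    return cpuid_vendor == "GenuineIntel";
}
```

Even this one-bit policy would double the configurations that need benchmarking and test coverage, which is the maintenance cost weighed below.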

@StephanTLavavej StephanTLavavej added the decision needed We need to choose something before working on this label Nov 13, 2024
@StephanTLavavej
Member

We talked about this at the weekly maintainer meeting, and although we always appreciate the vast amount of effort you've put into vectorizing the STL, and we're always sad to reject a PR, vendor-specific detection logic indeed seems to us to be a step too far. The STL has never exhibited vendor-specific behavior in the past, and signing up to monitor performance indefinitely and retuning the logic (in addition to complicating test coverage) doesn't appear to be worth the potential benefits here.

@StephanTLavavej StephanTLavavej removed the decision needed We need to choose something before working on this label Nov 13, 2024
@AlexGuteniev AlexGuteniev deleted the remove_copy branch November 14, 2024 05:27
@AlexGuteniev AlexGuteniev restored the remove_copy branch November 15, 2024 16:11
@AlexGuteniev AlexGuteniev deleted the remove_copy branch November 15, 2024 16:16