
Vectorize remove_copy for 4 and 8 byte elements #5062


Closed
wants to merge 15 commits into from

Conversation

AlexGuteniev
Contributor

Follow up on #4987

For now, this covers only 4- and 8-byte elements and only AVX2, so that AVX2 masked stores can be used.

This may also be doable for 1- and 2-byte elements, but storing a partial vector would require a different approach. It may also turn out to be infeasible for 1- and 2-byte elements, if every such approach ends up slower than scalar code. Either way, I'm not attempting it now, to avoid making this PR too big.

AVX2 masked stores are slower than regular stores, so this approach is not used uniformly.
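The per-vector step can be modeled in scalar code. The sketch below is not the PR's implementation, just a portable illustration of what one 8-lane AVX2 iteration does for 4-byte elements: compare the lanes against the value, compact the survivors to the front, and write only the surviving prefix (the job done by a lane permute plus `_mm256_maskstore_epi32` in the actual vector code; `compress_step` is a hypothetical name).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of one 8-lane iteration of vectorized remove_copy:
// keep every element that differs from val, compacted to the front of dst.
// Returns how many elements were kept, i.e. how far dst advances
// (the popcount of the keep-mask in the vector code).
std::size_t compress_step(const std::array<std::uint32_t, 8>& src,
                          std::uint32_t val, std::uint32_t* dst) {
    std::size_t kept = 0;
    for (const std::uint32_t e : src) {
        if (e != val) {
            dst[kept++] = e; // vector code: permute survivors, masked store
        }
    }
    return kept;
}
```

Because the number of survivors varies per iteration, the store covers only a prefix of the vector, which is exactly what the masked store provides without writing past the destination.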

⏱️ Benchmark results

| Benchmark | `main` | this PR |
| --- | --- | --- |
| `r<alg_type::std_fn, std::uint8_t>` | 301 ns | 270 ns |
| `r<alg_type::std_fn, std::uint16_t>` | 276 ns | 275 ns |
| `r<alg_type::std_fn, std::uint32_t>` | 336 ns | 333 ns |
| `r<alg_type::std_fn, std::uint64_t>` | 778 ns | 761 ns |
| `r<alg_type::rng, std::uint8_t>` | 278 ns | 284 ns |
| `r<alg_type::rng, std::uint16_t>` | 301 ns | 281 ns |
| `r<alg_type::rng, std::uint32_t>` | 338 ns | 331 ns |
| `r<alg_type::rng, std::uint64_t>` | 779 ns | 768 ns |
| `rc<alg_type::std_fn, std::uint32_t>` | 1445 ns | 475 ns |
| `rc<alg_type::std_fn, std::uint64_t>` | 2187 ns | 1101 ns |
| `rc<alg_type::rng, std::uint32_t>` | 897 ns | 472 ns |
| `rc<alg_type::rng, std::uint64_t>` | 1918 ns | 1110 ns |

Expectedly, vectorized remove_copy is faster than non-vectorized.

Expectedly, vectorized remove_copy does not reach the performance of vectorized remove.

As usual, there are some minor variations in the unchanged vectorized remove.

⚠️ AMD benchmark wanted ⚠️

I'm worried about the vmaskmov* timings.
They seem to be slow enough on AMD to turn this change into a pessimization.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner November 1, 2024 19:30
@CaseyCarter CaseyCarter added the performance Must go faster label Nov 1, 2024
@StephanTLavavej StephanTLavavej self-assigned this Nov 1, 2024
@muellerj2
Contributor

Benchmark results on a Ryzen 7840HS notebook:

| Benchmark | main branch | this PR |
| --- | --- | --- |
| `r<alg_type::std_fn, std::uint8_t>` | 220 ns | 220 ns |
| `r<alg_type::std_fn, std::uint16_t>` | 250 ns | 251 ns |
| `r<alg_type::std_fn, std::uint32_t>` | 330 ns | 330 ns |
| `r<alg_type::std_fn, std::uint64_t>` | 781 ns | 806 ns |
| `r<alg_type::rng, std::uint8_t>` | 220 ns | 214 ns |
| `r<alg_type::rng, std::uint16_t>` | 261 ns | 243 ns |
| `r<alg_type::rng, std::uint32_t>` | 332 ns | 336 ns |
| `r<alg_type::rng, std::uint64_t>` | 872 ns | 816 ns |
| `rc<alg_type::std_fn, std::uint32_t>` | 1365 ns | 1413 ns |
| `rc<alg_type::std_fn, std::uint64_t>` | 942 ns | 1650 ns |
| `rc<alg_type::rng, std::uint32_t>` | 921 ns | 1413 ns |
| `rc<alg_type::rng, std::uint64_t>` | 903 ns | 1674 ns |

@AlexGuteniev
Contributor Author

Benchmark results on a Ryzen 7840HS notebook:

Thanks. Apparently this is not the way to go 😿

@StephanTLavavej
Member

Thanks @muellerj2 - I also confirm that this is a pessimization on my desktop 5950X (Zen 3):

| Benchmark | Before | After | "Speedup" (less than 1.0 is slower) |
| --- | --- | --- | --- |
| `r<alg_type::std_fn, std::uint8_t>` | 355 ns | 366 ns | 0.97 |
| `r<alg_type::std_fn, std::uint16_t>` | 354 ns | 335 ns | 1.06 |
| `r<alg_type::std_fn, std::uint32_t>` | 394 ns | 401 ns | 0.98 |
| `r<alg_type::std_fn, std::uint64_t>` | 1097 ns | 1122 ns | 0.98 |
| `r<alg_type::rng, std::uint8_t>` | 352 ns | 359 ns | 0.98 |
| `r<alg_type::rng, std::uint16_t>` | 335 ns | 336 ns | 1.00 |
| `r<alg_type::rng, std::uint32_t>` | 394 ns | 401 ns | 0.98 |
| `r<alg_type::rng, std::uint64_t>` | 1098 ns | 1108 ns | 0.99 |
| `rc<alg_type::std_fn, std::uint32_t>` | 1186 ns | 1679 ns | 0.71 |
| `rc<alg_type::std_fn, std::uint64_t>` | 1223 ns | 2056 ns | 0.59 |
| `rc<alg_type::rng, std::uint32_t>` | 1177 ns | 1688 ns | 0.70 |
| `rc<alg_type::rng, std::uint64_t>` | 1768 ns | 2077 ns | 0.85 |

@AlexGuteniev Do you want to rework or abandon this strategy?

@AlexGuteniev
Contributor Author

I see no rework possible.

We have the following options, besides abandoning:

  • Accept it as is, in the hope that future CPUs will get better, and appealing to the fact that the Intel speedup is greater than the AMD slowdown
  • Implement vendor detection. This requires keeping more state than just __isa_available and doing at least some of the detection outside VCRuntime, and it also adds to the number of configurations to support.

I expect that neither of these is acceptable, but I'm leaving the final decision to you.
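To illustrate the second option: vendor detection would amount to an extra policy bit alongside __isa_available, derived from the CPUID vendor identification string (leaf 0). The function below is purely hypothetical (nothing like it exists in this PR or in VCRuntime); it shows the kind of vendor-specific dispatch logic under discussion, with `presume_fast_masked_stores` as an assumed name.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical policy flag that vendor detection would add on top of
// __isa_available: whether AVX2 masked stores (vmaskmovps / vpmaskmovd)
// should be presumed fast. The 12-byte vendor string comes from CPUID leaf 0.
bool presume_fast_masked_stores(std::string_view cpuid_vendor) {
    // The benchmarks above showed a speedup on Intel but a slowdown
    // relative to the scalar fallback on Zen 3 / Zen 4 AMD parts.
    return cpuid_vendor == "GenuineIntel";
}
```

Even this one-bit policy would double the configurations that need benchmarking and test coverage, which is the maintenance cost weighed below.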

@StephanTLavavej StephanTLavavej added the decision needed We need to choose something before working on this label Nov 13, 2024
@StephanTLavavej
Member

We talked about this at the weekly maintainer meeting, and although we always appreciate the vast amount of effort you've put into vectorizing the STL, and we're always sad to reject a PR, vendor-specific detection logic indeed seems to us to be a step too far. The STL has never exhibited vendor-specific behavior in the past, and signing up to monitor performance indefinitely and retuning the logic (in addition to complicating test coverage) doesn't appear to be worth the potential benefits here.

@StephanTLavavej StephanTLavavej removed the decision needed We need to choose something before working on this label Nov 13, 2024
@AlexGuteniev AlexGuteniev deleted the remove_copy branch November 14, 2024 05:27
@AlexGuteniev AlexGuteniev restored the remove_copy branch November 15, 2024 16:11
@AlexGuteniev AlexGuteniev deleted the remove_copy branch November 15, 2024 16:16