Skip to content

Vectorize remove_copy and unique_copy #5355

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 41 commits into from
Apr 22, 2025

Conversation

AlexGuteniev
Copy link
Contributor

⚙️ The optimization

remove_copy and unique_copy are different from their non-_copy counterparts in that they don't have room they are allowed to overwrite. This means we can't directly store results from vector registers.

The previous attempt #5062 tried to use masked stores to bypass that limitation. Unfortunately, this doesn't perform well for some CPUs. Also the minimum granularity of AVX2 masked store is 32 bits, so it would not work for smaller elements.

This time, temporary storage comes to rescue. The algorithms already use some additional memory (the tables), so why wouldn't it use a bit more. I arbitrarily picked 512 bytes, should be not too much. Each time the temporary buffer is full, it can be copied to the destination with memcpy, it should be fast enough for this buffer size.

🚫 No find before remove_copy

In #4987, it was explained that doing find before remove is good for both correctness and performance. Originally it was in vectorization code, but during the review @StephanTLavavej observed that it is done in the headers already (#4987 (comment)).

For remove_copy it is not necessary for correctness, and may be harmful for performance. find would need copy in addition, this will be double pass on the input, which can make the performance worse for large input and memry-bound situation.

We may have special handling of the range before the first match in vectorization code, this is another story, and it would not be harmful, but I'm not doing this in the current PR. Maybe later.

So, as we have not called find, and so have not checked if value type can even match iterator value type, we need this _Could_compare_equal_to_value_type check here.

✅ Test coverage

Shared with non-_copy counterparts to save total tests run time and some lines of code, at the expense with otherwise unnecessary coupling.

We check both modified and unmodified destination parts, to make sure unmodified indeed didn't modify.

⏱️ Benchmark results

Benchmark Before After Speedup
rc<alg_type::std_fn. std::uint8_t> 908 ns 349 ns 2.60
rc<alg_type::std_fn. std::uint16_t> 1850 ns 462 ns 4.00
rc<alg_type::std_fn. std::uint32_t> 901 ns 532 ns 1.69
rc<alg_type::std_fn. std::uint64_t> 1876 ns 1018 ns 1.84
rc<alg_type::rng. std::uint8_t> 1344 ns 349 ns 3.85
rc<alg_type::rng. std::uint16_t> 2094 ns 465 ns 4.50
rc<alg_type::rng. std::uint32_t> 884 ns 460 ns 1.92
rc<alg_type::rng. std::uint64_t> 1884 ns 1079 ns 1.75
uc<alg_type::std_fn. std::uint8_t> 3329 ns 263 ns 12.66 🤡
uc<alg_type::std_fn. std::uint16_t> 1145 ns 342 ns 3.35
uc<alg_type::std_fn. std::uint32_t> 1144 ns 388 ns 2.95
uc<alg_type::std_fn. std::uint64_t> 1128 ns 754 ns 1.50
uc<alg_type::rng. std::uint8_t> 1111 ns 252 ns 4.41
uc<alg_type::rng. std::uint16_t> 1328 ns 331 ns 4.01
uc<alg_type::rng. std::uint32_t> 1313 ns 386 ns 3.40
uc<alg_type::rng. std::uint64_t> 1146 ns 758 ns 1.51

🥇 Results interpretation

Good improvement!

Not as good as for non-_copy counterparts though, as memcpy takes some noticeable time.

The usual codegen gremlins that cause results variation are observed for non-vectorized tight loops. I've marked the most notorious one with clown. I can't explain that anomality.

@AlexGuteniev

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej added the performance Must go faster label Mar 23, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Work In Progress in STL Code Reviews Mar 23, 2025
@AlexGuteniev

This comment was marked as off-topic.

@AlexGuteniev AlexGuteniev marked this pull request as ready for review March 25, 2025 05:28
@AlexGuteniev AlexGuteniev requested a review from a team as a code owner March 25, 2025 05:28
# Conflicts:
#	benchmarks/src/unique.cpp
#	stl/inc/algorithm
#	stl/src/vector_algorithms.cpp
@StephanTLavavej
Copy link
Member

StephanTLavavej commented Apr 22, 2025

Thanks! 😻 I pushed fixes for significant bugs in the "can compare equal to value type" codepaths, and expanded the test coverage (auditing all existing codepaths, and verifying that the new tests caught the bugs). Please double-check.

5950X results:
Benchmark Before After Speedup
rc<alg_type::std_fn, std::uint8_t> 1256 ns 373 ns 3.37
rc<alg_type::std_fn, std::uint16_t> 1820 ns 440 ns 4.14
rc<alg_type::std_fn, std::uint32_t> 1221 ns 630 ns 1.94
rc<alg_type::std_fn, std::uint64_t> 1263 ns 1303 ns 0.97
rc<alg_type::rng, std::uint8_t> 1866 ns 376 ns 4.96
rc<alg_type::rng, std::uint16_t> 1248 ns 440 ns 2.84
rc<alg_type::rng, std::uint32_t> 1238 ns 625 ns 1.98
rc<alg_type::rng, std::uint64_t> 1251 ns 1338 ns 0.93
uc<alg_type::std_fn, std::uint8_t> 1193 ns 407 ns 2.93
uc<alg_type::std_fn, std::uint16_t> 1174 ns 554 ns 2.12
uc<alg_type::std_fn, std::uint32_t> 1599 ns 586 ns 2.73
uc<alg_type::std_fn, std::uint64_t> 1320 ns 1101 ns 1.20
uc<alg_type::rng, std::uint8_t> 1213 ns 407 ns 2.98
uc<alg_type::rng, std::uint16_t> 1723 ns 439 ns 3.92
uc<alg_type::rng, std::uint32_t> 1239 ns 543 ns 2.28
uc<alg_type::rng, std::uint64_t> 1242 ns 1101 ns 1.13

@StephanTLavavej StephanTLavavej removed their assignment Apr 22, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 22, 2025
@AlexGuteniev
Copy link
Contributor Author

Please double-check.

All looks good.

@StephanTLavavej
Copy link
Member

/azp run STL-ASan-CI

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 22, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej
Copy link
Member

I resolved a trivial adjacent-add conflict with #5352 in <algorithm>.

@StephanTLavavej StephanTLavavej merged commit 38d0c04 into microsoft:main Apr 22, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 22, 2025
@StephanTLavavej
Copy link
Member

Thanks for these unique speedups, removing all that execution time! 😹 🚀 🤪

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants