Vectorize remove_copy for 4 and 8 byte elements #5062
Conversation
Benchmark results on a Ryzen 7840HS notebook:
Thanks. Apparently this is not the way to go 😿
Thanks @muellerj2 - I also confirm that this is a pessimization on my desktop 5950X (Zen 3):
@AlexGuteniev Do you want to rework or abandon this strategy?
I see no rework possible. We have the following options, besides abandoning:
I expect that none of these are acceptable, but I'm leaving the final decision to you.
We talked about this at the weekly maintainer meeting, and although we always appreciate the vast amount of effort you've put into vectorizing the STL, and we're always sad to reject a PR, vendor-specific detection logic indeed seems to us to be a step too far. The STL has never exhibited vendor-specific behavior in the past, and signing up to monitor performance indefinitely and retuning the logic (in addition to complicating test coverage) doesn't appear to be worth the potential benefits here.
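For context, the vendor-specific detection discussed (and rejected) above would boil down to something like the following CPUID vendor-string check. This is only an illustration of the concept, not code that was proposed in the PR; the function name is hypothetical.

```cpp
#include <intrin.h> // __cpuid (MSVC intrinsic)
#include <cstring>

// Illustrative only: reads the CPUID vendor string and checks for AMD.
bool is_amd_cpu() noexcept {
    int regs[4];
    __cpuid(regs, 0);
    char vendor[12];
    std::memcpy(vendor + 0, &regs[1], 4); // EBX
    std::memcpy(vendor + 4, &regs[3], 4); // EDX
    std::memcpy(vendor + 8, &regs[2], 4); // ECX
    return std::memcmp(vendor, "AuthenticAMD", 12) == 0;
}
```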
Follow up on #4987
For now only 4 and 8 byte elements and only AVX2, so that AVX2 masks can work.
This may be doable for 1 and 2 byte elements, but it will require a different approach for storing the partial vector. Or it may not be doable for 1 and 2 byte elements at all, if every approach turns out to be slower than scalar. In any case, I'm not attempting it right now, to avoid making this PR too big.
AVX2 mask stores are slower than the usual stores, so this approach is not used uniformly.
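As an illustration, here is a minimal sketch of how a masked-store remove_copy could look for 4-byte elements, assuming AVX2. The function name, the per-iteration permutation computation, and the overall structure are simplifying assumptions for this sketch, not the code in this PR.

```cpp
#include <immintrin.h>
#include <intrin.h> // __popcnt (MSVC); other compilers would use a different popcount
#include <cstddef>

// Hypothetical helper, not the function in this PR.
int* remove_copy_avx2_i32(const int* first, const int* last, int* dest, const int val) {
    const __m256i match           = _mm256_set1_epi32(val);
    const std::ptrdiff_t vec_size = (last - first) & ~std::ptrdiff_t{7};
    std::ptrdiff_t i              = 0;
    for (; i != vec_size; i += 8) {
        const __m256i data = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(first + i));
        // One bit per lane: set where the element equals val and must be dropped.
        const unsigned drop = static_cast<unsigned>(
            _mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(data, match))));
        const unsigned keep = ~drop & 0xFFu;
        const int kept      = static_cast<int>(__popcnt(keep));
        // Build a permutation that packs the kept lanes to the front.
        // (A real implementation would use a precomputed 256-entry table instead.)
        alignas(32) int idx[8] = {};
        for (int lane = 0, out = 0; lane < 8; ++lane) {
            if (keep & (1u << lane)) {
                idx[out++] = lane;
            }
        }
        const __m256i packed = _mm256_permutevar8x32_epi32(
            data, _mm256_load_si256(reinterpret_cast<const __m256i*>(idx)));
        // vpmaskmovd: store only the first `kept` lanes, so nothing is written
        // past the end of the output. This is the instruction whose timings are
        // the concern below.
        const __m256i store_mask = _mm256_cmpgt_epi32(
            _mm256_set1_epi32(kept), _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
        _mm256_maskstore_epi32(dest, store_mask, packed);
        dest += kept;
    }
    for (; i != last - first; ++i) { // scalar tail for the remaining < 8 elements
        if (first[i] != val) {
            *dest++ = first[i];
        }
    }
    return dest;
}
```

Presumably the masked store is needed because remove_copy cannot safely write a whole vector past the end of its compressed output, while in-place remove can overwrite elements of its own source range.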
⏱️ Benchmark results
Expectedly, remove_copy vectorized is better than non-vectorized.
Expectedly, remove_copy vectorized does not reach the remove vectorized performance.
As usual, some minor variations in unchanged remove vectorized.
I'm worried about the vmaskmov* timings. They seem to be bad enough for AMD to turn this into a pessimization.
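To put numbers behind the vmaskmov* concern, a throwaway micro-benchmark along the following lines could contrast a masked store against a plain unaligned store. The buffer size, repetition count, and the 5-of-8 mask are arbitrary assumptions; this is not the benchmark used for the results in this PR.

```cpp
#include <immintrin.h>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> dst(1 << 16, 0);
    const __m256i data = _mm256_set1_epi32(42);
    const __m256i mask = _mm256_setr_epi32(-1, -1, -1, -1, -1, 0, 0, 0); // store 5 of 8 lanes

    // Time a store routine over the whole buffer, repeated many times.
    auto time_it = [&](auto&& store_one) {
        const auto start = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 10000; ++rep) {
            for (std::size_t i = 0; i + 8 <= dst.size(); i += 8) {
                store_one(dst.data() + i);
            }
        }
        return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - start)
            .count();
    };

    const double masked = time_it([&](int* p) { _mm256_maskstore_epi32(p, mask, data); });
    const double plain =
        time_it([&](int* p) { _mm256_storeu_si256(reinterpret_cast<__m256i*>(p), data); });
    std::printf("maskstore: %.2f ms, plain store: %.2f ms (dst[0] = %d)\n", masked, plain, dst[0]);
}
```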