Parallel sort algorithm performance on Threadripper #1238
Did you try it? Implementation of std::sort with the parallel executor: Line 2749 in 392fb6d
Function that determines the number of threads: STL/stl/src/parallel_algorithms.cpp, Line 26 in 392fb6d
Can you provide a complete test case demonstrating the performance issue? (With something that generates the data, of course; we don't care about the data, or even about what the predicate does, but we need an example of both to understand how they influence the implementation.)
Hi @StephanTLavavej, That would be a lot of extra effort. Thanks,
I could run some tests in my local environment to gather statistics, if that helps.
Can't speak for STL, but fixing a performance problem without a representative test case, along with compilation flags and such, can be extremely difficult unless something really obvious is going on.
I agree with what @MikeGitb said. We ran performance tests when we first implemented these algorithms, to tune their implementations and to verify that they were worth parallelizing at all; if we had observed what you're observing, we would have addressed the issue. So yes, we do need a test case.
Could you run the code with your build options:
and could you run your code with
Ok, I can do that. I'll let you know.
std::deque of 1513002 272-byte records
Wait, wrong data; that was the half-minute run. I'll find the right sample. The measurement is for the release configuration.
Here: std::deque of 1417301 272-byte records. sort/sort-par = 489360300 / 107548100 = 4.55, which is quite far from 64, and it should be near 128.
Ok, I did that:
std::deque of 1417301 272-byte records
Now the large data...
std::deque of 33587200 272-byte records
That is all. I hope this helps.
The current implementation is quicksort, and we effectively "fork" to recurse into each partition, so our maximum fanout is proportional to lg n (assuming we get good pivots). Sort is not an "embarrassingly parallel" problem; one should not expect a 64x or 128x speedup. Libraries that report huge speedups (e.g. Thrust) tend to report using integers, for which they implement a radix sort, which is more parallelizable. Your algorithm statically partitions the input, plain-sorts each partition, then merges bottom-up. I monospaced your results and lined up the least significant digits to make them easier to compare:
Did I misread this? It says that our serial std::sort beats the above parallel algorithm by ~2x and
This still has sort(par) winning by ~5x.
@BillyONeal : I think you made a mistake when counting the number of digits. The proposed quick merge is about 77ms (not 770) and parallel std::sort is about 109 ms:
@MikeGitb Ah, yes. With the 1417301-record run I'm off by a digit, so it's a bit faster instead of 1/8th as fast, but with the 33587200-record run I'm not.

The proposed alternate strategy is interesting in that it would let us get rid of the work-stealing infrastructure, as sort is the only algorithm that uses it. I'm still not convinced that trading additional data movement for better fanout, as this does, is worthwhile, particularly because it will lose badly when you have ~4 cores instead of 60.

Thinking about this again, I think a bigger problem with the proposed algorithm is that it makes the sort nondeterministic: equivalent keys will end up in different orders when sorted on machines with different numbers of cores. It might be worth accepting that if we got a huge speedup in exchange, but IMO there isn't enough evidence of that here. I suppose the decision ultimately lands on the current maintainers, most probably @mnatsuhara
(I also note that the proposed algorithm could be made faster by applying the better constant factors of the way our parallel stable_sort works; for example, we might cut element movement in half if we productized it. So the comparison of our fairly optimized current implementation against this 'demonstrate another way' implementation might not be entirely fair.)
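The static-partition strategy discussed above (plain-sort fixed chunks in parallel, then merge bottom-up) can be sketched in portable C++. This is a minimal illustration with a fixed fanout of four tasks over `int` elements; it is neither the reporter's code nor the STL's implementation:

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Sketch of a static-partition parallel sort: split the input into
// four equal chunks, sort each chunk in its own task, then perform
// a bottom-up merge. The merge phase runs serially here, which is
// the fanout bottleneck discussed in this thread.
void static_partition_sort(std::vector<int>& v) {
    const std::size_t parts = 4; // fixed fanout for the sketch
    const std::size_t chunk = v.size() / parts;
    if (chunk == 0) { // too small to bother splitting
        std::sort(v.begin(), v.end());
        return;
    }

    // Phase 1: plain-sort each static partition concurrently.
    std::vector<std::future<void>> tasks;
    for (std::size_t p = 0; p < parts; ++p) {
        auto first = v.begin() + p * chunk;
        auto last  = (p + 1 == parts) ? v.end() : first + chunk;
        tasks.push_back(std::async(std::launch::async,
            [first, last] { std::sort(first, last); }));
    }
    for (auto& t : tasks) {
        t.get();
    }

    // Phase 2: bottom-up merge of the four sorted runs.
    std::inplace_merge(v.begin(), v.begin() + chunk, v.begin() + 2 * chunk);
    std::inplace_merge(v.begin() + 2 * chunk, v.begin() + 3 * chunk, v.end());
    std::inplace_merge(v.begin(), v.begin() + 2 * chunk, v.end());
}
```

Note that `std::inplace_merge` may allocate a temporary buffer, which is part of the extra data movement being weighed in the comments above.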
@BillyONeal Thanks for the background and analysis! Given that the benchmarks show a modest performance improvement from the alternate approach in one case, and that our current implementation still wins in the case with bigger data, I'm not inclined to make changes to the current approach. Any change carries some level of risk and overhead, and the bar for making changes for improved performance (rather than addressing a correctness issue) is high. Thank you @ohhmm for bringing this up and building up the test cases!
Thank you Everyone, The most important thing to me in this context is that we have 64 threads working instead of the 128 hardware threads available. Thanks,
According to my understanding, the 64-thread limitation is due to the Windows threadpool that we're using - special Windows APIs are required to access groups of more than 64 threads at a time. |
Note that this 64-thread limitation is also the limitation of all our components, including
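The group limit described above can be made concrete with a small portable sketch. The helper name and the hard-coded 64 are illustrative assumptions; the real cap comes from the Windows threadpool scheduling onto a single processor group:

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: the number of workers the parallel algorithms
// could use, assuming they schedule onto one Windows processor group
// (at most 64 logical processors), per the discussion above.
unsigned effective_parallelism() {
    const unsigned hw = std::thread::hardware_concurrency(); // e.g. 128 on a 3990X
    return std::min(hw != 0 ? hw : 1u, 64u); // capped at one group's worth
}
```

On the reporter's 128-thread machine, this illustrative cap is what keeps the parallel algorithms at 64 workers.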
Hi @StephanTLavavej & @BillyONeal How can one sort with this STL using the par executor on >64 cores? Thanks,
Thanks @BillyONeal I watched the video, and I agree that std::async definitely has to use the OS thread pool in its Windows implementation. Thanks,
There is no way to do that at the present time. Even if there were a standard mechanism to ask for it, we don't have a sort algorithm which can effectively scale to that. Our current algorithm's fanout is limited by the number of partitions generated by quicksort. Your proposed algorithm generates more partitions, but its running time becomes dominated by the merge steps, for which a truly parallel solution is not known.
My key notes:
Regards,
Unless I'm mistaken, the OS thread pool is not processor-group aware either, for backwards-compatibility reasons. GPU offloading is a bit too complex for the STL, considering the vast range of hardware it must support.
@sylveon,
I think you missed the point that your quick_merge_sort was no faster than the implementation we are shipping. Sort is not an "embarrassingly parallel" problem, and you should not expect perfect scaling with cores.
We don't get to assume that the system has a programmable GPU, and no implementation we presently support targets a GPU.
The OS thread pool is similarly limited to one processor group (this is in fact why the parallel algorithms library is limited to 64 logical CPUs: we use the OS thread pool).
We don't target any implementations of OpenCL.
These are still trivial tasks.
Parallel sort is not "trivial", and your implementation is not faster than ours despite using more threads.
If someone supplies a parallel sort algorithm which can usably scale to 128 threads, we might have to cross that line. At this time, no such algorithm exists.
I believe "the algorithm you're asking for does not exist" is plenty of reason to "ignore" this fact.
Hi @BillyONeal, First of all, I think I owe you an apology. This thread is not about a sorting implementation. The question is whether there is any good reason not to use all hardware threads for sorting. I provided a merge sort algorithm because every CS course tells us that merge sort is the most parallelizable, and its asymptotic complexity is matched only by heap sort, which does not parallelize as well. Please let me tell you about my expectations from the first post in this thread:
Should I show the list of what I got instead? I do not want to embarrass anyone; I would only note that it is hard to get even the trivial part of the SDP done. I think your corporate culture should encourage you to rebuild communications in a more efficient and collaborative way. Meanwhile, your sorting is used to make microarray chips for people who are struggling for their lives. Regards,
Your std::async based solution is limited to 64 threads; changing process affinity does not change that. And when I said your algorithm was not better, I used your numbers.
Using hardware threads is not a good thing; in fact it is a bad thing. Using more threads to get the result in about the same speed is a worse algorithm. Getting the result faster is the goal.
And yet our quicksort-with-work-stealing implementation wins, by your own numbers, using fewer hardware resources. Merge sort indeed gets you perfect splitting of the work items, but the problem is that the merge step is not parallelizable, which means your maximum fanout is lg N threads, and towards the end of the algorithm you aren't parallel anymore. This is no different than our quicksort implementation, where at the beginning of the algorithm fanout is bad because the first partition op isn't parallelizable. They both have a best case of using lg N threads.
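A minimal sketch of the quicksort "fork into each partition" pattern being described, using std::async for clarity (the STL's actual implementation uses a work-stealing threadpool, not std::async):

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Three-way partition, then recurse into the two outer partitions in
// parallel. The first partition step is serial; only afterwards can
// two tasks run, then four, and so on, so fanout is bounded by the
// depth of the partition tree (~lg n with good pivots).
void parallel_quicksort(std::vector<int>::iterator first,
                        std::vector<int>::iterator last, int depth) {
    if (last - first < 2) {
        return;
    }
    const int pivot = *(first + (last - first) / 2);
    auto mid1 = std::partition(first, last, [pivot](int x) { return x < pivot; });
    auto mid2 = std::partition(mid1, last, [pivot](int x) { return x == pivot; });
    if (depth > 0) {
        // "Fork": sort the left partition in another task while this
        // thread handles the right one.
        auto left = std::async(std::launch::async,
                               parallel_quicksort, first, mid1, depth - 1);
        parallel_quicksort(mid2, last, depth - 1);
        left.get();
    } else {
        std::sort(first, mid1);
        std::sort(mid2, last);
    }
}
```

With `depth` levels of forking, at most 2^depth tasks exist at once, which is the limited-fanout behavior both comments describe.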
Our parallel stable_sort is already a merge sort, if you want that.
This cannot happen; it would require a change to the operating system, and nothing the STL can do can change that. You need to talk to Windows if that's something you care about.
The STL team is not going to push for GPU offloading; there are people who sell GPUs who actually care about that. GPU offloading does not buy you anything in this case anyway. Thrust achieves higher sort performance, but only for types where they can engage a radix sort.
Merry Christmas @BillyONeal, And thank you for your reply.
The single result that won came with two extra things:
The thing is that academic CS tells us that parallel merge sort is the fastest.
I guess that with fewer resources it would win in all the cases, but the idea was for you to check it. Would you try both with all hardware threads involved, or is it hard to get the equipment? I could run some tests if you provide me with a one-liner.
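Not quite a one-liner, but a self-contained harness of the kind being requested might look like this. The 272-byte record layout and the integer key are assumptions based on the sizes quoted in this thread, not the reporter's actual data or predicate:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Assumed record shape: an 8-byte key plus padding to reach the
// 272 bytes per record mentioned in this thread.
struct Record {
    long long key;
    char payload[264];
};

// Generates n randomly keyed records and returns the milliseconds
// taken to std::sort them with a key-comparing predicate.
long long time_sort(std::size_t n) {
    std::mt19937_64 rng(42); // fixed seed for reproducible runs
    std::vector<Record> data(n);
    for (auto& r : data) {
        r.key = static_cast<long long>(rng());
    }

    const auto t0 = std::chrono::steady_clock::now();
    std::sort(data.begin(), data.end(),
              [](const Record& a, const Record& b) { return a.key < b.key; });
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
}
```

Swapping `std::sort(...)` for `std::sort(std::execution::par, ...)` in the timed region would give the parallel figure to compare against.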
What can I say:
Thanks; as far as I remember, it was slower than std::sort in all scenarios.
What if they could do it on the basis of my observation of sorting at 100% CPU load on a 128-thread CPU?
Actually, I meant a slightly different feature. Thanks again, Bill. Good luck,
Hi,
I tried to use the parallel implementation, std::sort(std::execution::par, ...), with a custom predicate on gigabytes of data. Unfortunately, it gave little benefit compared to single-threaded std::sort. A simple custom implementation of merge sort gave a much bigger benefit:
This code still has room to improve CPU thread utilization, but it gained much more benefit from parallelization.
How many threads are used to sort with the parallel executor?