
Parallel sort algorithm performance on Threadripper #1238


Closed
ohhmm opened this issue Aug 27, 2020 · 35 comments
Labels
performance (Must go faster), wontfix (This will not be worked on)

Comments

@ohhmm

ohhmm commented Aug 27, 2020

Hi,

I tried the parallel std::sort(std::execution::par, ...) implementation with a custom predicate on gigabytes of data.
Unfortunately, it gave little benefit compared to single-threaded std::sort.
A simple custom implementation of merge sort performed much better:

// requires <algorithm>, <cassert>, <deque>, <future>, <thread>
auto TotalThreads = std::thread::hardware_concurrency();

template<class Iter, class Pred>
void quick_merge_sort(Iter first, Iter last, Pred pred)
{
    auto length = last - first;
    auto chunkSize = length / TotalThreads;
    auto chunks = length / chunkSize;
    if (chunkSize > 1)
    {
        auto tail = length - (chunkSize * chunks);
        auto hasTail = tail > 0;
        auto tailStart = first + chunks * chunkSize;
        {
            std::deque<std::future<void>> tasks;

            // Sort each chunk on its own thread. Lambdas are used instead of
            // &std::sort<Iter, Pred>: taking the address of a standard library
            // function is not portable.
            for (auto thread = 0; thread < chunks; ++thread) {
                auto chunkStart = first + thread * chunkSize;
                auto chunkEnd = chunkStart + chunkSize;
                tasks.emplace_back(std::async(std::launch::async,
                    [=] { std::sort(chunkStart, chunkEnd, pred); }));
            }

            if (hasTail)
                std::sort(tailStart, last, pred);

            // Bottom-up merge: each pass merges pairs of adjacent sorted chunks.
            while (chunkSize < length)
            {
                auto bigChunkStart = first;
                auto bigChunkMiddle = bigChunkStart;
                auto bigChunkEnd = bigChunkStart;
                for (auto thread = 0; thread < chunks; ++thread) {
                    tasks.pop_front(); // destroying an async future waits for it
                    bigChunkEnd = bigChunkMiddle + chunkSize;
                    if (thread % 2) {
                        tasks.emplace_back(std::async(std::launch::async,
                            [=] { std::inplace_merge(bigChunkStart, bigChunkMiddle, bigChunkEnd, pred); }));
                        bigChunkStart = bigChunkEnd;
                    }
                    bigChunkMiddle += chunkSize;
                }

                chunkSize *= 2;
                chunks = length / chunkSize;
            }
        } // tasks' destructor waits for all outstanding futures

        if (hasTail) {
            std::inplace_merge(first, tailStart, last, pred);
        }
    }
    else if (length > 1) {
        std::sort(first, last, pred);
    }

    assert(std::is_sorted(first, last, pred));
}

This code still has plenty of room to improve CPU thread utilization, but it already benefits much more from parallelization.
How many threads are used to sort with the parallel execution policy?

@ohhmm ohhmm added the question Further information is requested label Aug 27, 2020
@cbezault
Contributor

@mnatsuhara

@fsb4000
Contributor

fsb4000 commented Aug 28, 2020

Simple custom implementation of merge-sort gave much better benefit:

Did you try std::stable_sort with execution::par?

Implementation of std::sort with the parallel execution policy:

template <class _ExPo, class _RanIt, class _Pr, _Enable_if_execution_policy_t<_ExPo> /* = 0 */>

Function that determines the number of threads:

_NODISCARD unsigned int __stdcall __std_parallel_algorithms_hw_threads() noexcept {

@StephanTLavavej StephanTLavavej added the performance Must go faster label Aug 28, 2020
@StephanTLavavej StephanTLavavej changed the title Parallel sort algoritm at Threadripper Parallel sort algorithm performance on Threadripper Aug 28, 2020
@StephanTLavavej
Member

Can you provide a complete test case demonstrating the performance issue? (With something that generates the data, of course; we don't care about the data or even about what the predicate does, but we need to have an example of both of them to understand how they influence the implementation.)

@StephanTLavavej StephanTLavavej added the info needed We need more info before working on this label Aug 28, 2020
@ohhmm
Author

ohhmm commented Aug 28, 2020

Hi @StephanTLavavej,

That would take considerable extra effort.
Do you really need it?
I was hoping you could see the difference using your own tests.

Thanks,
Serg

@ohhmm
Author

ohhmm commented Aug 28, 2020

I could run some tests in my local environment to gather statistics, if that helps.

@MikeGitb

Can't speak for the STL maintainers, but fixing a performance problem without a representative test case, along with compilation flags and the rest of the setup, can be extremely difficult unless something really obvious is going on.

@StephanTLavavej
Member

I agree with what @MikeGitb said. We ran performance tests when we first implemented these algorithms, to tune their implementations and to verify that they were worth parallelizing at all - if we had observed what you're observing, we would have addressed the issue. So yes, we do need a test case.

@fsb4000
Contributor

fsb4000 commented Aug 29, 2020

I could run some tests in my local environment to gather statistics, if that helps.

Could you run this code with your build options:

#include <iostream>
#include <algorithm>
#include <execution>

int main()
{
  std::cout << __std_parallel_algorithms_hw_threads() << '\n';
}

and could you run your code with std::stable_sort with execution::par instead of std::sort?
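
For reference, a minimal timing harness along these lines would do (the Record struct and key-based predicate below are placeholder assumptions; the original 272-byte records and predicate weren't shared):

#include <algorithm>
#include <chrono>
#include <execution>
#include <iostream>
#include <random>
#include <vector>

struct Record { // hypothetical stand-in for the 272-byte records
    unsigned key;
    char payload[268];
};

int main() {
    std::vector<Record> data(1'000'000);
    std::mt19937 gen{42};
    for (auto& r : data) r.key = gen(); // fill with reproducible pseudo-random keys

    auto pred = [](const Record& a, const Record& b) { return a.key < b.key; };

    const auto t0 = std::chrono::steady_clock::now();
    std::sort(std::execution::par, data.begin(), data.end(), pred);
    const auto t1 = std::chrono::steady_clock::now();
    std::cout << "std::sort(par): "
              << std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count()
              << " ns\n";
}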

@ohhmm
Author

ohhmm commented Aug 29, 2020

Ok, I can do that. I'll let you know.

@ohhmm
Author

ohhmm commented Aug 29, 2020

std::deque of 1513002 272-byte records
__std_parallel_algorithms_hw_threads: 64
std::stable_sort
3131872300 ns
quick_merge_sort
726697900 ns
std::sort
3355967900 ns
std::stable_sort(std::execution::par, ...
598296600 ns
std::sort(std::execution::par, ...
277424400 ns

@ohhmm
Author

ohhmm commented Aug 29, 2020

Wait, that's the wrong data; that run took half a minute. I'll find the right sample. The measurements are for the release configuration.

@ohhmm
Author

ohhmm commented Aug 29, 2020

Here.

std::deque of 1417301 272-byte records
__std_parallel_algorithms_hw_threads: 64
std::stable_sort
2051602100 ns
quick_merge_sort
109611500 ns
std::sort
489360300 ns
std::stable_sort(std::execution::par, ...
508801200 ns
std::sort(std::execution::par, ...
107548100 ns

sort / sort-par = 489360300 / 107548100 = 4.55, which is a long way from 64, and it should ideally be near 128.
Let me try to improve utilization by parallelizing the merging of the ranges.
I'll let you know.

@ohhmm
Author

ohhmm commented Aug 30, 2020

Ok, I did that:

// requires <algorithm>, <cassert>, <chrono>, <future>, <list>, <thread>, <utility>
auto TotalThreads = std::thread::hardware_concurrency();

template<class Iter, class Pred>
void quick_merge_sort(Iter first, Iter last, Pred pred)
{
    auto length = last - first;
    auto chunkSize = length / TotalThreads;
    auto chunks = length / chunkSize;
    if (chunkSize > 1)
    {
        auto tail = length - (chunkSize * chunks);
        auto hasTail = tail > 0;
        auto tailStart = first + chunks * chunkSize;
        {
            // Each task pairs the future of a sort/merge with the range it covers.
            std::list<
                std::pair<
                    std::future<void>,
                    std::pair<
                        Iter,  // task chunk start
                        Iter>  // task chunk end
                    >
                > tasks;

            for (auto thread = 0; thread < chunks; ++thread) {
                auto chunkStart = first + thread * chunkSize;
                auto chunkEnd = chunkStart + chunkSize;
                tasks.emplace_back(
                    std::async(std::launch::async, [=] { std::sort(chunkStart, chunkEnd, pred); }),
                    std::make_pair(chunkStart, chunkEnd));
            }

            if (hasTail)
                tasks.emplace_back(
                    std::async(std::launch::async, [=] { std::sort(tailStart, last, pred); }),
                    std::make_pair(tailStart, last));

            // Greedily merge any two adjacent ranges as soon as both are ready.
            while (tasks.size() > 1)
            {
                bool ready = false;
                auto it = tasks.begin(), prev = it;
                auto e = tasks.end();
                auto noneReadyYet = true;
                while (it != e) {
                    auto thisTaskReady = it->first.valid()
                        && it->first.wait_for(std::chrono::nanoseconds()) == std::future_status::ready;
                    if (thisTaskReady && ready) {
                        auto mergeStart = prev->second.first;
                        auto mergeMiddle = prev->second.second;
                        auto mergeEnd = it->second.second;
                        prev->first = std::async(std::launch::async,
                            [=] { std::inplace_merge(mergeStart, mergeMiddle, mergeEnd, pred); });
                        prev->second.second = mergeEnd;
                        it = tasks.erase(it);
                        noneReadyYet = ready = false;
                    }
                    else {
                        prev = it++;
                        ready = thisTaskReady;
                    }
                }
                if (noneReadyYet && tasks.size() > 3) {
                    std::this_thread::yield();
                }
            }
        } // tasks' destructor waits for the final merge to finish
    }
    else if (length > 1) {
        std::sort(first, last, pred);
    }

    assert(std::is_sorted(first, last, pred));
}

std::deque of 1417301 272-byte records
__std_parallel_algorithms_hw_threads: 64
quick_merge_sort
76938400 ns
std::stable_sort
2033208700 ns
std::sort
485405800 ns
std::stable_sort(std::execution::par, ...
510451700 ns
std::sort(std::execution::par, ...
109773900 ns

Now large data...

@ohhmm
Author

ohhmm commented Aug 30, 2020

std::deque of 33587200 272-byte records
__std_parallel_algorithms_hw_threads: 64
quick_merge_sort
55331423400 ns
std::stable_sort(std::execution::par, ...
145887763600 ns
std::sort(std::execution::par, ...
18138303100 ns

@ohhmm
Author

ohhmm commented Aug 30, 2020

That is all. I hope this helps.

@cbezault cbezault removed info needed We need more info before working on this question Further information is requested labels Sep 2, 2020
@BillyONeal
Member

sort / sort-par = 489360300 / 107548100 = 4.55, which is a long way from 64, and it should ideally be near 128

The current implementation is quicksort, and we effectively "fork" to recurse into each partition, so our maximum fanout is proportional to lg n (assuming we get good pivots). Sort is not an "embarrassingly parallel" problem; one should not expect a 64x or 128x speedup. Libraries that report huge speedups (e.g. Thrust) tend to report results for integers, where they can implement a radix sort, which is more parallelizable.
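
As a rough sketch of that partition-then-fork shape (this uses std::async for brevity; the shipping implementation uses a work-stealing thread pool and a real pivot-selection strategy, so this is an illustration, not the STL's code):

#include <algorithm>
#include <cstddef>
#include <future>

template <class Iter>
void parallel_quicksort(Iter first, Iter last, std::ptrdiff_t grain = 10'000) {
    if (last - first <= grain) {
        std::sort(first, last);
        return;
    }
    auto pivot = *(first + (last - first) / 2);
    Iter mid = std::partition(first, last,
                              [&](const auto& x) { return x < pivot; });
    if (mid == first) { // degenerate pivot; a real implementation handles this better
        std::sort(first, last);
        return;
    }
    // "Fork" into the left partition while this thread keeps the right one.
    // Usable parallelism doubles per recursion level, so fanout is
    // proportional to lg(n / grain) with good pivots.
    auto left = std::async(std::launch::async,
                           [=] { parallel_quicksort(first, mid, grain); });
    parallel_quicksort(mid, last, grain);
    left.get();
}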

Your algorithm statically partitions the input, plain-sorts each partition, then bottom-up inplace_merges the partitions together. In that sense it is similar to how our stable_sort works, except that our stable_sort always starts with partitions of 16 elements that we insertion sort, while your algorithm starts with partitions of N/T elements that are initially std::sorted.
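
That shared shape, in a serial sketch (assuming random-access iterators; the real stable_sort uses insertion sort for the 16-element blocks, merges through a temporary buffer rather than std::inplace_merge, and parallelizes the passes):

#include <algorithm>
#include <cstddef>

template <class Iter, class Pred>
void blocked_merge_sort(Iter first, Iter last, Pred pred) {
    const std::ptrdiff_t n = last - first;
    const std::ptrdiff_t block = 16;
    for (std::ptrdiff_t i = 0; i < n; i += block) // sort each 16-element block
        std::sort(first + i, first + std::min(i + block, n), pred); // stand-in for insertion sort
    for (std::ptrdiff_t width = block; width < n; width *= 2) // bottom-up merge passes
        for (std::ptrdiff_t i = 0; i + width < n; i += 2 * width)
            std::inplace_merge(first + i, first + i + width,
                               first + std::min(i + 2 * width, n), pred);
}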

I monospaced your results and lined up the least significant digits to make it easier to compare:

std::deque of 1417301 272-byte records
__std_parallel_algorithms_hw_threads: 64
quick_merge_sort
  76938400 ns
std::stable_sort
2033208700 ns
std::sort
 485405800 ns
std::stable_sort(std::execution::par, ...
 510451700 ns
std::sort(std::execution::par, ...
 109773900 ns

Did I misread this? It says that our serial std::sort beats the above parallel algorithm by ~2x and sort(par) wins by around 8x. That would suggest that our current partition-and-fork strategy is better than the static-partition-and-bottom-up-merge strategy proposed.

std::deque of 33587200 272-byte records
__std_parallel_algorithms_hw_threads: 64
quick_merge_sort
 55331423400 ns
std::stable_sort(std::execution::par, ...
145887763600 ns
std::sort(std::execution::par, ...
 18138303100 ns

This still has sort(par) winning by ~5x.

@MikeGitb

MikeGitb commented Sep 6, 2020

@BillyONeal: I think you made a mistake when counting the digits. The proposed quick_merge_sort is about 77 ms (not 770 ms) and parallel std::sort is about 109 ms:

quick_merge_sort        76938400 ns
std::stable_sort      2033208700 ns
std::sort              485405800 ns
std::stable_sort(par)  510451700 ns
std::sort(par)         109773900 ns

@BillyONeal
Member

BillyONeal commented Sep 6, 2020

@MikeGitb Ah, yes. With the 1417301-record run I'm off by a digit, so it's a bit faster rather than 1/8th as fast, but with the 33587200-record run I'm not.

The proposed alternate strategy is interesting in that it would let us get rid of the work-stealing infrastructure, as sort is the only algorithm that uses it.

I'm still not super convinced that trading additional data movement for better fanout, as this does, is worthwhile, particularly because it will lose badly when you have ~4 cores instead of 60.

Thinking about this again, I think a bigger problem with the proposed algorithm is that it makes the sort nondeterministic: equivalent keys will end up in different orders if you sort them on machines with different numbers of cores. It might be worth doing that if we got a huge speedup in exchange, but IMO we don't have enough evidence of that here.

I suppose the decision here ultimately lands on the current maintainers, most probably @mnatsuhara

@BillyONeal
Member

(I also note that the proposed algorithm could be faster if we applied the better constant factors from the way our parallel stable_sort works; for example, we might cut element movement in half, so it would be better if we productized it. So the comparison of our current, fairly optimized implementation against this 'demonstrate another way' implementation might not be entirely fair.)

@mnatsuhara
Contributor

@BillyONeal Thanks for the background and analysis!

Given that the benchmarks show a modest performance improvement from the alternate approach in one case and our current implementation still wins in the case with bigger data, I'm not inclined to make changes to the current approach. Any change carries some level of risk and overhead, and the bar for making changes for improved performance (rather than addressing a correctness issue) is high.

Thank you @ohhmm for bringing this up and building up the test cases!

@ohhmm
Author

ohhmm commented Sep 20, 2020

Thank you Everyone,

The most important thing to me in this context is that we have 64 threads working instead of the 128 hardware threads available.

Thanks,
Serg

@StephanTLavavej StephanTLavavej added the wontfix This will not be worked on label Sep 21, 2020
@StephanTLavavej
Member

According to my understanding, the 64-thread limitation is due to the Windows threadpool that we're using - special Windows APIs are required to access groups of more than 64 threads at a time.

@BillyONeal
Member

Note that this 64-thread limitation also applies to all of our components, including std::thread (which models CreateThread) and std::async (which models TP_WORK); these all create threads only in the process's default processor group. This is an operating system limitation; targeting more than 64 hardware threads requires explicitly creating threads according to the hardware topology in the thread groups you desire with CreateRemoteThreadEx (and those threads are scheduled only in their respective groups). We want to use the system thread pool TP_WORK in the parallel algorithms because we value the thread pool's special deal with the kernel over the ability to target more than 64 hardware threads, particularly for an algorithm like sort that typically only achieves a fanout of ~6.
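
To make the topology point concrete, here is a rough sketch (x64 assumed; an illustration of the Win32 API shape, not STL code) of enumerating processor groups and moving a normally created thread into each group with SetThreadGroupAffinity:

#include <windows.h>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const WORD groups = GetActiveProcessorGroupCount();
    std::vector<std::thread> workers;
    for (WORD g = 0; g < groups; ++g) {
        const DWORD cpusInGroup = GetActiveProcessorCount(g);
        std::cout << "group " << g << ": " << cpusInGroup << " logical processors\n";
        workers.emplace_back([g, cpusInGroup] {
            GROUP_AFFINITY affinity{};
            affinity.Group = g;
            // Build a mask covering every logical processor in the group.
            affinity.Mask = cpusInGroup >= 64 ? ~0ull : ((1ull << cpusInGroup) - 1);
            // Move this thread into processor group g; by default it would
            // stay in the process's default group.
            SetThreadGroupAffinity(GetCurrentThread(), &affinity, nullptr);
            // ... per-group work would go here ...
        });
    }
    for (auto& t : workers) t.join();
}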

@ohhmm
Author

ohhmm commented Sep 21, 2020

Hi @StephanTLavavej & @BillyONeal

How can one sort with this STL using the par execution policy with >64 cores?

Thanks,
Serg

@ohhmm
Author

ohhmm commented Sep 21, 2020

Thanks @BillyONeal

I watched the video, and I agree that std::async definitely had to use the OS thread pool in the Windows implementation.

Thanks,
Serg

@BillyONeal
Member

How can one sort with this STL using the par execution policy with >64 cores?

There is no way to do that at the present time. Even if there were a standard mechanism to ask for it, we don't have a sort algorithm that can effectively scale to that. Our current algorithm's fanout is limited by the number of partitions generated by quicksort. Your proposed algorithm generates more partitions, but its running time becomes dominated by the merge steps, for which a truly parallel solution is not known.

@ohhmm
Author

ohhmm commented Dec 4, 2020

My key notes:

  • I used only 64 threads in quick_merge_sort too, which is only half of the hardware, @mnatsuhara
  • The merge steps are easily parallelizable with a GPU, @BillyONeal
  • std::async could be reimplemented to use all 128 threads via the OS thread pool or a DPC queue, @StephanTLavavej

Regards,
Serg

@sylveon
Contributor

sylveon commented Dec 4, 2020

Unless I'm mistaken, the OS thread pool is not processor-group aware either, for backwards compatibility reasons.

GPU offloading is a bit too complex for the STL considering the vast range of hardware it must support.

@ohhmm
Author

ohhmm commented Dec 4, 2020

@sylveon,
It is trivial to use 128 threads and OpenCL.

@BillyONeal
Member

I used only 64 threads in quick_merge_sort too, which is only half of the hardware,

I think you missed the point that your quick_merge_sort was no faster than the implementation we are shipping. Sort is not an "embarrassingly parallel" problem and you should not expect perfect scaling with cores.

The merge steps are easily parallelizable with a GPU, @BillyONeal

We don't get to assume that the system has a programmable GPU, and no implementation we presently target targets a GPU.

std::async could be reimplemented to use all 128 threads via the OS thread pool or a DPC queue, @StephanTLavavej

The OS thread pool is similarly limited to one CPU group (this is in fact why the parallel algorithms library is limited to 64 logical CPUs, because we use the OS thread pool).

@sylveon,
It is trivial to use 128 threads and OpenCL.

We don't target any implementations of OpenCL.

@ohhmm
Author

ohhmm commented Dec 15, 2020

These are still trivial tasks.
It is trivial to use 128 hardware threads.
There's no good reason to ignore this fact.

@BillyONeal
Member

These are still trivial tasks.

Parallel sort is not "trivial", and your implementation is not faster than ours despite using more threads.

It is trivial to use 128 hardware threads.

If someone supplies a parallel sort algorithm that can usably scale to 128 threads, we might have to cross that line. At this time, no such algorithm exists.

There's no good reason to ignore this fact.

I believe "the algorithm you're asking for does not exist" is plenty reason to "ignore" this fact.

@ohhmm
Author

ohhmm commented Dec 15, 2020

Hi @BillyONeal,

First of all, I think I owe you an apology.
Also, my implementation used the same thread count as yours. It performs better only when I use all CPU groups by setting the process affinity.

This thread is not about the sorting implementation. The question is whether there is any good reason not to use all hardware threads for sorting.
I bet there is no good reason to use only half or fewer of the threads of an 8096-thread CPU for sorting.

I provided a merge sort algorithm because every CS course tells us that merge sort is the most parallelizable, and yet its asymptotic complexity is matched only by heap sort, which does not parallelize as well.

Please let me tell you about my expert expectations from the first post in this thread:

  • Schedule a meeting to decide on adding merge_sort to the next proposal.
  • Optimize the implementation to find cases with asymptotically guaranteed better performance.
  • Create tickets for the proper team to allow std::async and its basis std::thread to utilize the CPU fully, as Linux does, in beneficial scenarios.
  • Schedule a meeting to decide on the future of GPU offloading in the standard. Decide on offloadable comparators for compatible data types.
  • Do all the necessary breakdowns in some of the next milestones.
  • Thank me in this thread and list all the activities that happened thanks to my efforts.

Should I show the list of what I got instead? I do not want to embarrass anyone. I will only note that it seems hard to get even the trivial part of the SDP done.

I think your corporate culture should encourage you to rebuild communications in a more efficient and collaborative way.

Meanwhile, your sorting is used to make microarray chips for people who are struggling for their lives.

Regards,
Serg

@BillyONeal
Member

It performs better only when I use all CPU groups by setting the process affinity.

Your std::async-based solution is limited to 64 threads; changing the process affinity does not change that. And when I said your algorithm was not better, I used your numbers.

The question is whether there is any good reason not to use all hardware threads for sorting.

Using hardware threads is not a good thing in itself; in fact it can be a bad thing. Using more threads to get the result in about the same time is a worse algorithm. Getting the result faster is the goal.

every CS course tells us that merge sort is the most parallelizable

And yet our quicksort-with-work-stealing implementation wins, by your own numbers, using fewer hardware resources.

Merge sort indeed gives you perfect splitting of the work items, but the problem is that the merge step is not parallelizable, which means your maximum fanout is lg N threads, and towards the end of the algorithm you aren't parallel anymore. This is no different from our quicksort implementation, where at the beginning of the algorithm fanout is bad because the first partition op isn't parallelizable. Both have a best case of using lg N threads.

Schedule a meeting to decide on adding merge_sort to the next proposal.

Our parallel stable_sort is already a merge sort, if you want that.

Create tickets for the proper team to allow std::async and its basis std::thread to utilize the CPU fully, as Linux does, in beneficial scenarios.

This cannot happen; it is a change to the operating system, and nothing the STL can do can change that. You need to talk to Windows if that's something you care about.

Schedule a meeting to decide on the future of GPU offloading in the standard.

The STL team is not going to push for GPU offloading; there are people who sell GPUs who actually care about that. GPU offloading does not buy you anything in this case anyway. Thrust achieves higher sort performance, but only for types where it can engage a radix sort.
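
(For context on why integer keys parallelize so well: a least-significant-digit radix sort is built from counting passes whose histogram and scatter steps split naturally across threads. A minimal serial sketch, not Thrust's implementation:)

#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

void radix_sort_u32(std::vector<std::uint32_t>& v) {
    std::vector<std::uint32_t> buf(v.size());
    for (int shift = 0; shift < 32; shift += 8) { // four 8-bit digit passes
        std::array<std::size_t, 257> count{};
        for (std::uint32_t x : v) ++count[((x >> shift) & 0xFF) + 1]; // histogram
        for (int i = 0; i < 256; ++i) count[i + 1] += count[i];       // prefix sums
        for (std::uint32_t x : v) buf[count[(x >> shift) & 0xFF]++] = x; // stable scatter
        v.swap(buf);
    }
}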

@ohhmm
Author

ohhmm commented Dec 27, 2020

Merry Christmas @BillyONeal,

And thank you for your reply.
My comments are inlined:

It performs better only when I use all CPU groups by setting the process affinity.

Your std::async-based solution is limited to 64 threads; changing the process affinity does not change that. And when I said your algorithm was not better, I used your numbers.

That single result that beat yours involved two extra things:

  • it had 128 threads hard-coded
  • I enabled all process group cores using Task Manager
    Only this way did I obtain 100% CPU load.

The question is whether there is any good reason not to use all hardware threads for sorting.

Using hardware threads is not a good thing in itself; in fact it can be a bad thing. Using more threads to get the result in about the same time is a worse algorithm. Getting the result faster is the goal.

The thing is that CS academic knowledge tells us that parallel merge sort is the fastest.
Please consider this question:
Were all process groups' cores enabled, with all hardware threads on?

every CS course tells us that merge sort is the most parallelizable

And yet our quicksort-with-work-stealing implementation wins, by your own numbers, using fewer hardware resources.

I guess that with fewer resources it would win in all cases. But the idea was for you folks to check it. Would you try both with all hardware threads involved, or is it hard to get the equipment? I could run some tests if you provide me with a one-liner.

Merge sort indeed gives you perfect splitting of the work items, but the problem is that the merge step is not parallelizable, which means your maximum fanout is lg N threads, and towards the end of the algorithm you aren't parallel anymore. This is no different from our quicksort implementation, where at the beginning of the algorithm fanout is bad because the first partition op isn't parallelizable. Both have a best case of using lg N threads.

What can I say:
Here is a good comparison table; these algorithms are at the top.
Merge sort has a better worst case, it is memory-stable, and it is still the best parallelizable after all.

Schedule a meeting to decide on adding merge_sort to the next proposal.

Our parallel stable_sort is already a merge sort, if you want that.

Thanks; as far as I remember, it was slower than std::sort in all scenarios.

Create tickets for the proper team to allow std::async and its basis std::thread to utilize the CPU fully, as Linux does, in beneficial scenarios.

This cannot happen; it is a change to the operating system, and nothing the STL can do can change that. You need to talk to Windows if that's something you care about.

What if they could do it, based on my observation of sorting with 100% CPU load on a 128-thread CPU?
I observed nothing wrong during this experiment.
There should be some person responsible for this decision.

Schedule a meeting to decide on the future of GPU offloading in the standard.

The STL team is not going to push for GPU offloading; there are people who sell GPUs who actually care about that. GPU offloading does not buy you anything in this case anyway. Thrust achieves higher sort performance, but only for types where it can engage a radix sort.

Actually, I meant a slightly different feature.
Let's call it built-in serialization of default-generated comparators into OpenCL kernel functions.
I'm not strongly interested in pushing these things either.
It is just feedback.
I see things this way.

Thanks again Bill,
Do not worry much about these minor issues.
Have a nice vacation, everyone.

Good luck,
Serg
