-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Enable thread::hardware_concurrency()
to handle more than 64 processors
#5459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[`GetActiveProcessorCount`](https://learn.microsoft.com/zh-cn/windows/win32/api/winbase/nf-winbase-getactiveprocessorcount) [Raymond Chen's claim](https://devblogs.microsoft.com/oldnewthing/20200824-00/?p=104116)
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
thread::hardware_concurrency()
to handle more than 64 processors
This clearly establishes a loop invariant that `buffer_ptr` is non-null, so we don't need to assert it (whereas in the lambda, the assertion was more useful). Now we only need the type `PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX` twice, so we don't need to introduce an alias `buffer_ptr_t`. This eliminates the (confusing to me) difference between `DWORD buffer_size` outside and `PDWORD buffer_size` inside the lambda. We can now scope `unsigned int count = 0;` to the successful case where we need it. In my opinion, this makes the possible function exits clearer, because they're tested in a flat sequence: we succeed, we experience a true failure, or we can't allocate.
Thank you! 😻 I had to push significant changes to handle multi-socket machines but I think this will work. (We really ought to find a machine to validate this on.) I manually checked that we still handle a plain machine correctly for x64 and x86. |
Here's a self-contained test program to compare the implementations:
// Copyright (c) Microsoft Corporation.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
#include <bit>
#include <cassert>
#include <memory>
#include <new>
#include <print>
#include <thread>
#include <Windows.h>
using namespace std;
unsigned int intermediate_version() noexcept { // return number of processors
// Most devices have only one processor group and thus have the same buffer_size
#if defined(_M_X64) || defined(_M_ARM64) // 16 bytes per group
constexpr int stack_buffer_size = 48;
#elif defined(_M_IX86) || defined(_M_ARM) // 12 bytes per group
constexpr int stack_buffer_size = 44;
#else
#error Unknown architecture
#endif
alignas(SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX) unsigned char buffer[stack_buffer_size];
using buffer_ptr_t = PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX;
DWORD buffer_size = stack_buffer_size;
const auto try_query = [](buffer_ptr_t buffer_ptr, PDWORD buffer_size) noexcept {
unsigned int count = 0;
assert(buffer_ptr != nullptr);
if (GetLogicalProcessorInformationEx(RelationProcessorPackage, buffer_ptr, buffer_size) == TRUE) {
for (WORD i = 0; i != buffer_ptr->Processor.GroupCount; ++i) {
// Mask is 8 bytes on ARM64 and X64, and 4 bytes on X86 and ARM32
count += std::popcount(buffer_ptr->Processor.GroupMask[i].Mask);
}
}
return count;
};
if (auto count = try_query(reinterpret_cast<buffer_ptr_t>(&buffer), &buffer_size); count != 0) {
return count;
}
if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
std::unique_ptr<unsigned char[]> new_buffer(::new (std::nothrow) unsigned char[buffer_size]);
if (new_buffer != nullptr) {
return try_query(reinterpret_cast<buffer_ptr_t>(new_buffer.get()), &buffer_size);
}
}
return 0;
}
unsigned int final_version() noexcept { // return number of processors
// Most devices have only one processor group and thus have the same buffer_size.
#ifdef _WIN64
constexpr int stack_buffer_size = 48; // 16 bytes per group
#else // ^^^ 64-bit / 32-bit vvv
constexpr int stack_buffer_size = 44; // 12 bytes per group
#endif // ^^^ 32-bit ^^^
alignas(SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX) unsigned char stack_buffer[stack_buffer_size];
unsigned char* buffer_ptr = stack_buffer;
DWORD buffer_size = stack_buffer_size;
std::unique_ptr<unsigned char[]> new_buffer;
// https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformationex
// The buffer "receives a sequence of variable-sized SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX structures".
for (;;) {
if (GetLogicalProcessorInformationEx(RelationProcessorPackage,
reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer_ptr), &buffer_size)) {
unsigned int logical_processors = 0;
while (buffer_size > 0) {
// Each structure in the buffer describes a processor package (aka socket)...
const auto structure_ptr = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer_ptr);
const auto structure_size = structure_ptr->Size;
// ... which contains one or more processor groups.
for (WORD i = 0; i != structure_ptr->Processor.GroupCount; ++i) {
logical_processors += std::popcount(structure_ptr->Processor.GroupMask[i].Mask);
}
// Step forward to the next structure in the buffer.
buffer_ptr += structure_size;
buffer_size -= structure_size;
}
return logical_processors;
}
if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
return 0; // API failure
}
new_buffer.reset(::new (std::nothrow) unsigned char[buffer_size]);
if (!new_buffer) {
return 0; // allocation failure
}
buffer_ptr = new_buffer.get();
}
}
int main() {
println("System: {}-bit", sizeof(void*) * 8);
println("hardware_concurrency(): {}", thread::hardware_concurrency());
println("intermediate_version(): {}", intermediate_version());
println(" final_version(): {}", final_version());
}
|
@JoostHouben Can you check the above self-contained test program on your Threadripper? I am also trying to find a multi-socket machine, but having high-core-count validation would help. |
Tested on a 1-socket, 112-core, 112-logical processor ARM64 machine (i.e. no simultaneous multithreading):
Which confirms that @YexuanXiao's changes and my following changes both handle the >64 logical processor scenario. Now we just need multi-socket... |
I was able to prepare a 4-socket VM, which I believe distributed the 112 logical processors evenly, so 28 logical processors per socket. The test case printed:
This is what I expected: We're good to go! |
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
I had to push another commit to fix the ARM64EC internal build. The use of |
🖥️ 🦖 📈 |
Microsoft Learn
GetLogicalProcessorInformationEx
Fix #5453.