Skip to content

Enable thread::hardware_concurrency() to handle more than 64 processors #5459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
May 10, 2025

Conversation

YexuanXiao
Copy link
Contributor

@YexuanXiao YexuanXiao commented May 2, 2025

@YexuanXiao YexuanXiao requested a review from a team as a code owner May 2, 2025 01:14
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews May 2, 2025
@AlexGuteniev

This comment was marked as resolved.

@YexuanXiao YexuanXiao changed the title Fix #5453 to enable handling of over 64 processors Enable hardware_concurrency to handle more than 64 processors May 2, 2025
@github-project-automation github-project-automation bot moved this from Initial Review to Work In Progress in STL Code Reviews May 2, 2025
@StephanTLavavej StephanTLavavej added the bug Something isn't working label May 2, 2025
@YexuanXiao

This comment was marked as resolved.

@AlexGuteniev

This comment was marked as resolved.

@StephanTLavavej StephanTLavavej self-assigned this May 3, 2025
@StephanTLavavej StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews May 3, 2025
@StephanTLavavej StephanTLavavej changed the title Enable hardware_concurrency to handle more than 64 processors Enable thread::hardware_concurrency() to handle more than 64 processors May 4, 2025
This clearly establishes a loop invariant that `buffer_ptr` is non-null, so we don't need to assert it (whereas in the lambda, the assertion was more useful).

Now we only need the type `PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX` twice, so we don't need to introduce an alias `buffer_ptr_t`.

This eliminates the (confusing to me) difference between `DWORD buffer_size` outside and `PDWORD buffer_size` inside the lambda.

We can now scope `unsigned int count = 0;` to the successful case where we need it.

In my opinion, this makes the possible function exits clearer, because they're tested in a flat sequence: we succeed, we experience a true failure, or we can't allocate.
@StephanTLavavej StephanTLavavej removed their assignment May 8, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Final Review in STL Code Reviews May 8, 2025
@StephanTLavavej
Copy link
Member

Thank you! 😻 I had to push significant changes to handle multi-socket machines but I think this will work. (We really ought to find a machine to validate this on.) I manually checked that we still handle a plain machine correctly for x64 and x86.

@StephanTLavavej
Copy link
Member

Here's a self-contained test program to compare the implementations:

C:\Temp>type meow.cpp
// Copyright (c) Microsoft Corporation.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

#include <bit>
#include <cassert>
#include <memory>
#include <new>
#include <print>
#include <thread>

#include <Windows.h>
using namespace std;

unsigned int intermediate_version() noexcept { // return number of processors
    // Most devices have only one processor group and thus have the same buffer_size
#if defined(_M_X64) || defined(_M_ARM64) // 16 bytes per group
    constexpr int stack_buffer_size = 48;
#elif defined(_M_IX86) || defined(_M_ARM) // 12 bytes per group
    constexpr int stack_buffer_size = 44;
#else
#error Unknown architecture
#endif
    alignas(SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX) unsigned char buffer[stack_buffer_size];
    using buffer_ptr_t = PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX;
    DWORD buffer_size  = stack_buffer_size;

    const auto try_query = [](buffer_ptr_t buffer_ptr, PDWORD buffer_size) noexcept {
        unsigned int count = 0;
        assert(buffer_ptr != nullptr);

        if (GetLogicalProcessorInformationEx(RelationProcessorPackage, buffer_ptr, buffer_size) == TRUE) {
            for (WORD i = 0; i != buffer_ptr->Processor.GroupCount; ++i) {
                // Mask is 8 bytes on ARM64 and X64, and 4 bytes on X86 and ARM32
                count += std::popcount(buffer_ptr->Processor.GroupMask[i].Mask);
            }
        }

        return count;
    };

    if (auto count = try_query(reinterpret_cast<buffer_ptr_t>(&buffer), &buffer_size); count != 0) {
        return count;
    }

    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        std::unique_ptr<unsigned char[]> new_buffer(::new (std::nothrow) unsigned char[buffer_size]);

        if (new_buffer != nullptr) {
            return try_query(reinterpret_cast<buffer_ptr_t>(new_buffer.get()), &buffer_size);
        }
    }

    return 0;
}

unsigned int final_version() noexcept { // return number of processors
    // Most devices have only one processor group and thus have the same buffer_size.
#ifdef _WIN64
    constexpr int stack_buffer_size = 48; // 16 bytes per group
#else // ^^^ 64-bit / 32-bit vvv
    constexpr int stack_buffer_size = 44; // 12 bytes per group
#endif // ^^^ 32-bit ^^^

    alignas(SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX) unsigned char stack_buffer[stack_buffer_size];
    unsigned char* buffer_ptr = stack_buffer;
    DWORD buffer_size         = stack_buffer_size;
    std::unique_ptr<unsigned char[]> new_buffer;

    // https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformationex
    // The buffer "receives a sequence of variable-sized SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX structures".
    for (;;) {
        if (GetLogicalProcessorInformationEx(RelationProcessorPackage,
                reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer_ptr), &buffer_size)) {
            unsigned int logical_processors = 0;

            while (buffer_size > 0) {
                // Each structure in the buffer describes a processor package (aka socket)...
                const auto structure_ptr  = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer_ptr);
                const auto structure_size = structure_ptr->Size;

                // ... which contains one or more processor groups.
                for (WORD i = 0; i != structure_ptr->Processor.GroupCount; ++i) {
                    logical_processors += std::popcount(structure_ptr->Processor.GroupMask[i].Mask);
                }

                // Step forward to the next structure in the buffer.
                buffer_ptr += structure_size;
                buffer_size -= structure_size;
            }

            return logical_processors;
        }

        if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
            return 0; // API failure
        }

        new_buffer.reset(::new (std::nothrow) unsigned char[buffer_size]);

        if (!new_buffer) {
            return 0; // allocation failure
        }

        buffer_ptr = new_buffer.get();
    }
}

int main() {
    println("System: {}-bit", sizeof(void*) * 8);
    println("hardware_concurrency(): {}", thread::hardware_concurrency());
    println("intermediate_version(): {}", intermediate_version());
    println("       final_version(): {}", final_version());
}
C:\Temp>cl /EHsc /nologo /W4 /std:c++latest /MTd /Od meow.cpp
meow.cpp

C:\Temp>meow
System: 64-bit
hardware_concurrency(): 16
intermediate_version(): 16
       final_version(): 16

@StephanTLavavej
Copy link
Member

@JoostHouben Can you check the above self-contained test program on your Threadripper? I am also trying to find a multi-socket machine, but having high-core-count validation would help.

@StephanTLavavej
Copy link
Member

Tested on a 1-socket, 112-core, 112-logical processor ARM64 machine (i.e. no simultaneous multithreading):

PS D:\Users\stl\Downloads> .\arm64_hw_concur.exe
System: 64-bit
hardware_concurrency(): 64
intermediate_version(): 112
       final_version(): 112

Which confirms that @YexuanXiao's changes and my following changes both handle the >64 logical processor scenario.

Now we just need multi-socket...

@StephanTLavavej
Copy link
Member

I was able to prepare a 4-socket VM, which I believe distributed the 112 logical processors evenly, so 28 logical processors per socket. The test case printed:

C:\Temp>meow
System: 64-bit
hardware_concurrency(): 56
intermediate_version(): 28
       final_version(): 112

This is what I expected: intermediate_version() inspected only the first socket, while final_version() gets the correct answer.

We're good to go!

@StephanTLavavej StephanTLavavej moved this from Final Review to Ready To Merge in STL Code Reviews May 8, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 9, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej added a commit to StephanTLavavej/STL that referenced this pull request May 9, 2025
@StephanTLavavej
Copy link
Member

I had to push another commit to fix the ARM64EC internal build. The use of popcount() in the STL's DLL revealed that (unlike VCRuntime) we weren't linking against softintrin.lib, which is necessary to be able to use the __popcnt64 intrinsic. (Most people don't have to do anything special for this, but because the STL is a defaultlib, we link with /nodefaultlib and so we have to do everything manually.)

@StephanTLavavej StephanTLavavej merged commit 254972a into microsoft:main May 10, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews May 10, 2025
@StephanTLavavej
Copy link
Member

🖥️ 🦖 📈

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

<thread>: std::thread::hardware_concurrency limited to 64, even on Windows 11
4 participants