Enable `thread::hardware_concurrency()` to handle more than 64 processors #5459

YexuanXiao · 2025-05-02T01:14:12Z

Microsoft Learn GetLogicalProcessorInformationEx
Fix #5453.

[`GetActiveProcessorCount`](https://learn.microsoft.com/zh-cn/windows/win32/api/winbase/nf-winbase-getactiveprocessorcount) [Raymond Chen's claim](https://devblogs.microsoft.com/oldnewthing/20200824-00/?p=104116)

stl/src/cthread.cpp

This clearly establishes a loop invariant that `buffer_ptr` is non-null, so we don't need to assert it (whereas in the lambda, the assertion was more useful). Now we only need the type `PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX` twice, so we don't need to introduce an alias `buffer_ptr_t`. This eliminates the (confusing to me) difference between `DWORD buffer_size` outside and `PDWORD buffer_size` inside the lambda. We can now scope `unsigned int count = 0;` to the successful case where we need it. In my opinion, this makes the possible function exits clearer, because they're tested in a flat sequence: we succeed, we experience a true failure, or we can't allocate.

stl/src/cthread.cpp

StephanTLavavej · 2025-05-08T12:16:03Z

Thank you! 😻 I had to push significant changes to handle multi-socket machines but I think this will work. (We really ought to find a machine to validate this on.) I manually checked that we still handle a plain machine correctly for x64 and x86.

StephanTLavavej · 2025-05-08T12:28:35Z

Here's a self-contained test program to compare the implementations:

C:\Temp>type meow.cpp

// Copyright (c) Microsoft Corporation.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

#include <bit>
#include <cassert>
#include <memory>
#include <new>
#include <print>
#include <thread>

#include <Windows.h>
using namespace std;

unsigned int intermediate_version() noexcept { // return number of processors
    // Most devices have only one processor group and thus have the same buffer_size
#if defined(_M_X64) || defined(_M_ARM64) // 16 bytes per group
    constexpr int stack_buffer_size = 48;
#elif defined(_M_IX86) || defined(_M_ARM) // 12 bytes per group
    constexpr int stack_buffer_size = 44;
#else
#error Unknown architecture
#endif
    alignas(SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX) unsigned char buffer[stack_buffer_size];
    using buffer_ptr_t = PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX;
    DWORD buffer_size  = stack_buffer_size;

    const auto try_query = [](buffer_ptr_t buffer_ptr, PDWORD buffer_size) noexcept {
        unsigned int count = 0;
        assert(buffer_ptr != nullptr);

        if (GetLogicalProcessorInformationEx(RelationProcessorPackage, buffer_ptr, buffer_size) == TRUE) {
            for (WORD i = 0; i != buffer_ptr->Processor.GroupCount; ++i) {
                // Mask is 8 bytes on ARM64 and X64, and 4 bytes on X86 and ARM32
                count += std::popcount(buffer_ptr->Processor.GroupMask[i].Mask);
            }
        }

        return count;
    };

    if (auto count = try_query(reinterpret_cast<buffer_ptr_t>(&buffer), &buffer_size); count != 0) {
        return count;
    }

    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        std::unique_ptr<unsigned char[]> new_buffer(::new (std::nothrow) unsigned char[buffer_size]);

        if (new_buffer != nullptr) {
            return try_query(reinterpret_cast<buffer_ptr_t>(new_buffer.get()), &buffer_size);
        }
    }

    return 0;
}

unsigned int final_version() noexcept { // return number of processors
    // Most devices have only one processor group and thus have the same buffer_size.
#ifdef _WIN64
    constexpr int stack_buffer_size = 48; // 16 bytes per group
#else // ^^^ 64-bit / 32-bit vvv
    constexpr int stack_buffer_size = 44; // 12 bytes per group
#endif // ^^^ 32-bit ^^^

    alignas(SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX) unsigned char stack_buffer[stack_buffer_size];
    unsigned char* buffer_ptr = stack_buffer;
    DWORD buffer_size         = stack_buffer_size;
    std::unique_ptr<unsigned char[]> new_buffer;

    // https://learn.microsoft.com/en-us/windows/win32/api/sysinfoapi/nf-sysinfoapi-getlogicalprocessorinformationex
    // The buffer "receives a sequence of variable-sized SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX structures".
    for (;;) {
        if (GetLogicalProcessorInformationEx(RelationProcessorPackage,
                reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer_ptr), &buffer_size)) {
            unsigned int logical_processors = 0;

            while (buffer_size > 0) {
                // Each structure in the buffer describes a processor package (aka socket)...
                const auto structure_ptr  = reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(buffer_ptr);
                const auto structure_size = structure_ptr->Size;

                // ... which contains one or more processor groups.
                for (WORD i = 0; i != structure_ptr->Processor.GroupCount; ++i) {
                    logical_processors += std::popcount(structure_ptr->Processor.GroupMask[i].Mask);
                }

                // Step forward to the next structure in the buffer.
                buffer_ptr += structure_size;
                buffer_size -= structure_size;
            }

            return logical_processors;
        }

        if (GetLastError() != ERROR_INSUFFICIENT_BUFFER) {
            return 0; // API failure
        }

        new_buffer.reset(::new (std::nothrow) unsigned char[buffer_size]);

        if (!new_buffer) {
            return 0; // allocation failure
        }

        buffer_ptr = new_buffer.get();
    }
}

int main() {
    println("System: {}-bit", sizeof(void*) * 8);
    println("hardware_concurrency(): {}", thread::hardware_concurrency());
    println("intermediate_version(): {}", intermediate_version());
    println("       final_version(): {}", final_version());
}

C:\Temp>cl /EHsc /nologo /W4 /std:c++latest /MTd /Od meow.cpp
meow.cpp

C:\Temp>meow
System: 64-bit
hardware_concurrency(): 16
intermediate_version(): 16
       final_version(): 16

StephanTLavavej · 2025-05-08T12:50:26Z

@JoostHouben Can you check the above self-contained test program on your Threadripper? I am also trying to find a multi-socket machine, but having high-core-count validation would help.

StephanTLavavej · 2025-05-08T20:18:09Z

Tested on a 1-socket, 112-core, 112-logical processor ARM64 machine (i.e. no simultaneous multithreading):

PS D:\Users\stl\Downloads> .\arm64_hw_concur.exe
System: 64-bit
hardware_concurrency(): 64
intermediate_version(): 112
       final_version(): 112

Which confirms that @YexuanXiao's changes and my following changes both handle the >64 logical processor scenario.

Now we just need multi-socket...

StephanTLavavej · 2025-05-08T22:28:50Z

I was able to prepare a 4-socket VM, which I believe distributed the 112 logical processors evenly, so 28 logical processors per socket. The test case printed:

C:\Temp>meow
System: 64-bit
hardware_concurrency(): 56
intermediate_version(): 28
       final_version(): 112

This is what I expected: intermediate_version() inspected only the first socket, while final_version() gets the correct answer.

We're good to go!

StephanTLavavej · 2025-05-09T15:50:38Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2025-05-09T21:52:28Z

I had to push another commit to fix the ARM64EC internal build. The use of popcount() in the STL's DLL revealed that (unlike VCRuntime) we weren't linking against softintrin.lib, which is necessary to be able to use the __popcnt64 intrinsic. (Most people don't have to do anything special for this, but because the STL is a defaultlib, we link with /nodefaultlib and so we have to do everything manually.)

StephanTLavavej · 2025-05-10T11:56:17Z

🖥️ 🦖 📈

Fix microsoft#5453 to enable handling of over 64 processors

f6a5f91

[`GetActiveProcessorCount`](https://learn.microsoft.com/zh-cn/windows/win32/api/winbase/nf-winbase-getactiveprocessorcount) [Raymond Chen's claim](https://devblogs.microsoft.com/oldnewthing/20200824-00/?p=104116)

YexuanXiao requested a review from a team as a code owner May 2, 2025 01:14

github-project-automation bot added this to STL Code Reviews May 2, 2025

github-project-automation bot moved this to Initial Review in STL Code Reviews May 2, 2025

This comment was marked as resolved.

Sign in to view

YexuanXiao changed the title ~~Fix #5453 to enable handling of over 64 processors~~ Enable hardware_concurrency to handle more than 64 processors May 2, 2025

AlexGuteniev approved these changes May 2, 2025

View reviewed changes

StephanTLavavej requested changes May 2, 2025

View reviewed changes

stl/src/cthread.cpp Outdated Show resolved Hide resolved

github-project-automation bot moved this from Initial Review to Work In Progress in STL Code Reviews May 2, 2025

StephanTLavavej added the bug Something isn't working label May 2, 2025

Use GetLogicalProcessorInformationEx

968f5d1

YexuanXiao requested review from AlexGuteniev and StephanTLavavej May 3, 2025 15:58

YexuanXiao added 2 commits May 4, 2025 00:02

Add missing header files

c8f3af4

Align the stack buffer and add noexcept to the lambda

66016db

This comment was marked as resolved.

Sign in to view

AlexGuteniev reviewed May 3, 2025

View reviewed changes

stl/src/cthread.cpp Outdated Show resolved Hide resolved

stl/src/cthread.cpp Outdated Show resolved Hide resolved

stl/src/cthread.cpp Outdated Show resolved Hide resolved

This comment was marked as resolved.

Sign in to view

Use the #error directive to report unknown architectures

27fc44e

StephanTLavavej self-assigned this May 3, 2025

StephanTLavavej moved this from Work In Progress to Initial Review in STL Code Reviews May 3, 2025

StephanTLavavej changed the title ~~Enable hardware_concurrency to handle more than 64 processors~~ Enable thread::hardware_concurrency() to handle more than 64 processors May 4, 2025

StephanTLavavej added 7 commits May 8, 2025 02:03

Consistency: std:: => _STD

924ea74

Directly test the BOOL returned by GetLogicalProcessorInformationEx.

3494728

Drop comment about the size of Mask; popcount handles this.

dad053a

Detect 64-bit / 32-bit variation with _WIN64.

0cf6aa2

buffer => stack_buffer

2f05336

count => logical_processors

da4724c

frederick-vs-ja reviewed May 8, 2025

View reviewed changes

stl/src/cthread.cpp Outdated Show resolved Hide resolved

stl/src/cthread.cpp Outdated Show resolved Hide resolved

StephanTLavavej added 2 commits May 8, 2025 04:02

Handle multiple sockets.

06b7eae

Add comments.

29bb4fe

StephanTLavavej reviewed May 8, 2025

View reviewed changes

StephanTLavavej removed their assignment May 8, 2025

StephanTLavavej approved these changes May 8, 2025

View reviewed changes

StephanTLavavej moved this from Initial Review to Final Review in STL Code Reviews May 8, 2025

StephanTLavavej moved this from Final Review to Ready To Merge in STL Code Reviews May 8, 2025

StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews May 9, 2025

StephanTLavavej added a commit to StephanTLavavej/STL that referenced this pull request May 9, 2025

microsoftGH-5459

5926b38

Fix the ARM64EC internal build.

c4c52f0

StephanTLavavej approved these changes May 9, 2025

View reviewed changes

StephanTLavavej merged commit 254972a into microsoft:main May 10, 2025
39 checks passed

github-project-automation bot moved this from Merging to Done in STL Code Reviews May 10, 2025

StephanTLavavej mentioned this pull request May 10, 2025

<thread>: (lack of) Support for >64 cores #2099

Closed

cpplearner mentioned this pull request May 11, 2025

stl/CMakeLists.txt: Link against softintrin.lib when targeting ARM64/ARM64EC #5491

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enable `thread::hardware_concurrency()` to handle more than 64 processors #5459

Enable `thread::hardware_concurrency()` to handle more than 64 processors #5459

YexuanXiao commented May 2, 2025 •

edited

Loading

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

StephanTLavavej commented May 8, 2025

StephanTLavavej commented May 8, 2025

StephanTLavavej commented May 8, 2025

StephanTLavavej commented May 8, 2025

StephanTLavavej commented May 8, 2025

StephanTLavavej commented May 9, 2025

StephanTLavavej commented May 9, 2025

StephanTLavavej commented May 10, 2025

Enable thread::hardware_concurrency() to handle more than 64 processors #5459

Enable thread::hardware_concurrency() to handle more than 64 processors #5459

Conversation

YexuanXiao commented May 2, 2025 • edited Loading

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

StephanTLavavej commented May 8, 2025

StephanTLavavej commented May 8, 2025

StephanTLavavej commented May 8, 2025

StephanTLavavej commented May 8, 2025

StephanTLavavej commented May 8, 2025

StephanTLavavej commented May 9, 2025

StephanTLavavej commented May 9, 2025

StephanTLavavej commented May 10, 2025

🖥️ 🦖 📈

Enable `thread::hardware_concurrency()` to handle more than 64 processors #5459

Enable `thread::hardware_concurrency()` to handle more than 64 processors #5459

YexuanXiao commented May 2, 2025 •

edited

Loading