
Support multi-QP for normal kernels #130


Merged · merged 9 commits into main on Apr 22, 2025

Conversation

@LyricZhao (Collaborator) commented Apr 22, 2025

This PR was authored by the Tencent Network Platform Department. Thanks for the contribution! The normal kernels now see a huge speedup:

| Type | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
| --- | --- | --- | --- | --- |
| Internode | 32 | 44 👉🏻 58 GB/s (RDMA) | 32 | 47 👉🏻 57 GB/s (RDMA) |
| Internode | 64 | 46 👉🏻 51 GB/s (RDMA) | 64 | 45 👉🏻 50 GB/s (RDMA) |

Through in-depth optimization, the following enhancements have been implemented in the internode normal kernels.

  • Replacing IBRC with IBGDA
  • Utilizing distinct QPs (Queue Pairs) per channel for parallel data transmission

These improvements not only enhance the robustness of the internode normal kernels in scenarios involving dual-port NICs and RoCE networks but also further elevate communication performance.
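
As a rough illustration of the second point, the sketch below shows how a dispatch kernel could map each channel to its own QP toward a destination rank. The helper names (`qp_for_channel`, `dispatch_sketch`, the IBGDA put mentioned in the comment) are illustrative stand-ins, not DeepEP's actual identifiers.

```cuda
// Illustrative sketch only: bind each (destination rank, channel) pair to its
// own QP so transfers on different channels can proceed in parallel on the NIC,
// instead of serializing all channels through a single QP per GPU pair.
__device__ __forceinline__ int qp_for_channel(int dst_rank, int channel_id,
                                              int num_qps_per_rank) {
    // Wrap around if the kernel uses more channels than QPs were created.
    return dst_rank * num_qps_per_rank + (channel_id % num_qps_per_rank);
}

__global__ void dispatch_sketch(int dst_rank, int num_qps_per_rank) {
    const int channel_id = blockIdx.x;  // assume one channel per thread block
    const int qp_id = qp_for_channel(dst_rank, channel_id, num_qps_per_rank);
    // A device-initiated IBGDA put would be issued here on `qp_id`, e.g. a
    // hypothetical ibgda_put_nbi(dst_ptr, src_ptr, bytes, qp_id); the real
    // primitive lives in the NVSHMEM/IBGDA layer and is not shown in this PR page.
    (void)qp_id;
}
```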

NOTES: the bandwidth in the table is algorithmic bandwidth; the physical RDMA bandwidth is obtained by scaling by the fraction of traffic that actually goes over RDMA, e.g. 58 GB/s × 3/4 = 43.5 GB/s for EP32 dispatch.

moningchen and others added 7 commits April 21, 2025 15:50
… transmission, a single QP is used for data transfer between two GPUs, which limits kernel performance with dual-port NICs and in RoCE network scenarios.

In our optimized Internode Normal Kernel, we implemented multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. Additionally, we changed the transmission method from IBRC to IBGDA.

Through these optimizations, the Internode Normal Kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance nearly reaching the physical network limit. Using the current default bandwidth-reporting method, RDMA performance can reach 60+ GB/s in 4-node H800 and H20 environments.
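
For context on the IBRC → IBGDA switch: in an NVSHMEM-based deployment, IBGDA is typically selected via environment variables before `nvshmem_init`. The sketch below is an assumption about such a setup (variable names per NVSHMEM's IBGDA documentation, QP count purely illustrative), not something taken from this PR.

```cuda
// Host-side sketch: select the GPU-initiated IBGDA transport instead of the
// CPU-proxied IBRC path, and provision several RC QPs per PE so each channel
// can own one. Check your NVSHMEM version's documentation for the exact knobs.
#include <cstdlib>
#include <nvshmem.h>

int main() {
    setenv("NVSHMEM_IB_ENABLE_IBGDA", "1", /*overwrite=*/1);     // assumed knob
    setenv("NVSHMEM_IBGDA_NUM_RC_PER_PE", "8", /*overwrite=*/1); // illustrative count

    nvshmem_init();
    // ... allocate symmetric buffers and launch dispatch/combine kernels ...
    nvshmem_finalize();
    return 0;
}
```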
@LyricZhao merged commit 007fcfc into main on Apr 22, 2025
@ethnzhng

Hi, thanks for the awesome open-source work. For the new normal internode kernels, is there insight into why the bandwidth decreases from EP32 → EP64, in contrast to the increase from EP16 → EP32? Or was the bandwidth for EP16 simply not updated in the table?

@sphish (Collaborator) commented Apr 23, 2025

@ethnzhng

  1. For the EP16 case: the performance is almost the same on our H800 server. However, the test results from the Tencent team on the H20 server demonstrated a significant performance improvement. I think it is because the NVLink on H800 has become the bottleneck in this test case.
  2. For the bandwidth decreasing from EP32 -> 64: the test results shown here represent algorithmic bandwidth. In reality, for the EP32 case, 1/4 of the data doesn't go through the RDMA network, so the physical bandwidth should be multiplied by 3/4. Similarly, for EP64, the physical bandwidth should be multiplied by 7/8. If you compare using physical bandwidth, you'll find they are essentially the same.
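
To make that conversion concrete (assuming 8 GPUs per node, so EP32 spans 4 nodes and EP64 spans 8), the physical RDMA bandwidth is the algorithmic number scaled by the fraction of data that actually leaves the node:

$$
B_{\text{phys}} = B_{\text{algo}} \cdot \frac{n-1}{n}, \qquad n = \text{number of nodes}
$$

For dispatch, EP32 gives 58 × 3/4 = 43.5 GB/s and EP64 gives 51 × 7/8 ≈ 44.6 GB/s, so the physical rates are indeed essentially the same.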

@ethnzhng commented Apr 23, 2025

@sphish I see, thanks for the explanation! Maybe it would be helpful to clarify in the README table that the Bottleneck bandwidth includes both NVLink + RDMA even though just RDMA is the bottleneck for the higher EPs.

@18018681653

I found that although the low-latency kernel has multiple QPs, the amount of data sent by each QP is still imbalanced because of the load imbalance inherent to MoE itself. As a result, performance with dual-port NICs is worse than with a single port. Do you have a solution?
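
One direction sometimes suggested for this kind of imbalance (not something proposed in this PR, and its interaction with DeepEP's channel and ordering assumptions is not discussed in this thread) is to spread tokens across QPs round-robin rather than binding a whole channel to one QP:

```cuda
// Hypothetical mitigation sketch: assign QPs per token instead of per channel,
// so uneven expert loads don't concentrate traffic on one QP (and hence one port).
__device__ __forceinline__ int qp_for_token(int dst_rank, int token_idx,
                                            int num_qps_per_rank) {
    return dst_rank * num_qps_per_rank + (token_idx % num_qps_per_rank);
}
```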

@whybeyoung

LGTM

@sphish (Collaborator) commented Apr 29, 2025

@18018681653 Since we don't use dual-port NICs ourselves, we currently have no plans to add dual-port support. The good news is that I asked the Tencent team (@moningchen), and they may open-source their dual-port solution in the future. You can keep an eye on them!

@GHGmc2 commented Apr 29, 2025

May I ask why we can get bandwidth higher than 50 GB/s, given that a CX7 IB NIC has only 400 Gbps (50 GB/s)?

@sphish (Collaborator) commented Apr 29, 2025

@GHGmc2 As mentioned above, what we report is the algorithmic bandwidth, not the physical bandwidth.
