
Support multi-QP for normal kernels #130


Merged · merged 9 commits into main on Apr 22, 2025

Conversation

@LyricZhao (Collaborator) commented Apr 22, 2025

This PR was authored by the Tencent Network Platform Department. Thanks for the contribution! The normal kernels now see a huge speedup:

| Type | Dispatch #EP | Bottleneck bandwidth | Combine #EP | Bottleneck bandwidth |
| --- | --- | --- | --- | --- |
| Internode | 32 | 44 👉🏻 58 GB/s (RDMA) | 32 | 47 👉🏻 57 GB/s (RDMA) |
| Internode | 64 | 46 👉🏻 51 GB/s (RDMA) | 64 | 45 👉🏻 50 GB/s (RDMA) |

Through in-depth optimization, the following enhancements have been implemented in the internode normal kernels.

  • Replacing IBRC with IBGDA
  • Utilizing distinct QPs (Queue Pairs) per channel for parallel data transmission

These improvements not only enhance the robustness of the internode normal kernels in scenarios involving dual-port NICs and RoCE networks but also further elevate communication performance.
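
As a rough illustration of the second point, the sketch below shows how a dispatch kernel could map each channel to its own QP toward a destination rank. The helper names (`qp_for_channel`, `dispatch_sketch`, the IBGDA put mentioned in the comment) are illustrative stand-ins, not DeepEP's actual identifiers.

```cuda
// Illustrative sketch only: bind each (destination rank, channel) pair to its
// own QP so transfers on different channels can proceed in parallel on the NIC,
// instead of serializing all channels through a single QP per GPU pair.
__device__ __forceinline__ int qp_for_channel(int dst_rank, int channel_id,
                                              int num_qps_per_rank) {
    // Wrap around if the kernel uses more channels than QPs were created.
    return dst_rank * num_qps_per_rank + (channel_id % num_qps_per_rank);
}

__global__ void dispatch_sketch(int dst_rank, int num_qps_per_rank) {
    const int channel_id = blockIdx.x;  // assume one channel per thread block
    const int qp_id = qp_for_channel(dst_rank, channel_id, num_qps_per_rank);
    // A device-initiated IBGDA put would be issued here on `qp_id`, e.g. a
    // hypothetical ibgda_put_nbi(dst_ptr, src_ptr, bytes, qp_id); the real
    // primitive lives in the NVSHMEM/IBGDA layer and is not shown in this PR page.
    (void)qp_id;
}
```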

NOTES: the bandwidth in the table is algorithmic bandwidth; the physical RDMA bandwidth is obtained by scaling by the fraction of traffic that actually goes over RDMA, e.g. 58 GB/s × 3/4 = 43.5 GB/s for EP32 dispatch.

moningchen and others added 7 commits April 21, 2025 15:50
… transmission, a single QP is used for data transfer between two GPUs, which limits kernel performance with dual-port NICs and in RoCE network scenarios.

In our optimized Internode Normal Kernel, we implemented multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. Additionally, we changed the transmission method from IBRC to IBGDA.

Through these optimizations, the Internode Normal Kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance nearly reaching the physical network limit. Using the current default bandwidth-reporting method, RDMA performance can reach 60+ GB/s in 4-node H800 and H20 environments.
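
For context on the IBRC → IBGDA switch: in an NVSHMEM-based deployment, IBGDA is typically selected via environment variables before `nvshmem_init`. The sketch below is an assumption about such a setup (variable names per NVSHMEM's IBGDA documentation, QP count purely illustrative), not something taken from this PR.

```cuda
// Host-side sketch: select the GPU-initiated IBGDA transport instead of the
// CPU-proxied IBRC path, and provision several RC QPs per PE so each channel
// can own one. Check your NVSHMEM version's documentation for the exact knobs.
#include <cstdlib>
#include <nvshmem.h>

int main() {
    setenv("NVSHMEM_IB_ENABLE_IBGDA", "1", /*overwrite=*/1);     // assumed knob
    setenv("NVSHMEM_IBGDA_NUM_RC_PER_PE", "8", /*overwrite=*/1); // illustrative count

    nvshmem_init();
    // ... allocate symmetric buffers and launch dispatch/combine kernels ...
    nvshmem_finalize();
    return 0;
}
```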
@LyricZhao merged commit 007fcfc into main on Apr 22, 2025
@ethnzhng

Hi, thanks for the awesome open-source work. For the new normal internode kernels, is there insight into why the bandwidth decreases from EP32 → EP64, in contrast to the increase from EP16 → EP32? Or was the bandwidth for EP16 simply not updated in the table?

@sphish (Collaborator) commented Apr 23, 2025

@ethnzhng

  1. For the EP16 case: the performance is almost the same on our H800 server. However, the test results from the Tencent team on the H20 server demonstrated a significant performance improvement. I think it is because the NVLink on H800 has become the bottleneck in this test case.
  2. For the bandwidth decreasing from EP32 -> 64: the test results shown here represent algorithmic bandwidth. In reality, for the EP32 case, 1/4 of the data doesn't go through the RDMA network, so the physical bandwidth should be multiplied by 3/4. Similarly, for EP64, the physical bandwidth should be multiplied by 7/8. If you compare using physical bandwidth, you'll find they are essentially the same.
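
To make that conversion concrete (assuming 8 GPUs per node, so EP32 spans 4 nodes and EP64 spans 8), the physical RDMA bandwidth is the algorithmic number scaled by the fraction of data that actually leaves the node:

$$
B_{\text{phys}} = B_{\text{algo}} \cdot \frac{n-1}{n}, \qquad n = \text{number of nodes}
$$

For dispatch, EP32 gives 58 × 3/4 = 43.5 GB/s and EP64 gives 51 × 7/8 ≈ 44.6 GB/s, so the physical rates are indeed essentially the same.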

@ethnzhng commented Apr 23, 2025

@sphish I see, thanks for the explanation! Maybe it would be helpful to clarify in the README table that the Bottleneck bandwidth includes both NVLink + RDMA even though just RDMA is the bottleneck for the higher EPs.

@18018681653

I found that although the low-latency kernel has multiple QPs, the amount of data sent by each QP is still imbalanced because of the load imbalance inherent to MoE itself. As a result, performance with dual-port NICs is worse than with a single port. Do you have a solution?
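
One direction sometimes suggested for this kind of imbalance (not something proposed in this PR, and its interaction with DeepEP's channel and ordering assumptions is not discussed in this thread) is to spread tokens across QPs round-robin rather than binding a whole channel to one QP:

```cuda
// Hypothetical mitigation sketch: assign QPs per token instead of per channel,
// so uneven expert loads don't concentrate traffic on one QP (and hence one port).
__device__ __forceinline__ int qp_for_token(int dst_rank, int token_idx,
                                            int num_qps_per_rank) {
    return dst_rank * num_qps_per_rank + (token_idx % num_qps_per_rank);
}
```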

@whybeyoung

LGTM

@sphish (Collaborator) commented Apr 29, 2025

@18018681653 Since we don't use dual-port NICs ourselves, we currently have no plans to add dual-port support. The good news is that I asked the Tencent team (@moningchen), and they may open-source their dual-port solution in the future. You can keep an eye on them!

@GHGmc2 commented Apr 29, 2025

May I ask why we can get bandwidth higher than 50 GB/s, given that a CX7 IB NIC has only 400 Gbps (50 GB/s)?

@sphish (Collaborator) commented Apr 29, 2025

@GHGmc2 As mentioned above, what we report is the algorithmic bandwidth, not the physical bandwidth.
