Support multi-QP for normal kernels #130
Conversation
… transmission, a single QP is used for data transfer between two GPUs, which limits kernel performance with dual-port NICs and in RoCE networks. In our optimized Internode Normal Kernel, we use multiple QPs for data transmission between two GPUs, assigning a different QP to each channel. Additionally, we changed the transport from IBRC to IBGDA. With these optimizations, the Internode Normal Kernel achieves optimal performance in both H800 and H20 environments, with RDMA transmission performance nearly reaching the physical network limit. Using the current default statistical method, in 4-node H800 and H20 environments, RDMA performance can reach 60+ GB/s.
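A minimal sketch of the per-channel QP idea described above (the names `select_qp`, `channel_id`, and `num_qps_per_peer` are illustrative, not identifiers from this PR): each communication channel between a GPU pair is pinned to its own queue pair, so RDMA traffic is spread across QPs instead of being serialized on a single one. With IBGDA the GPU issues the RDMA work requests itself, which is what makes posting from per-channel warps to per-channel QPs practical.

```cuda
// Hypothetical sketch, not the code from this PR: map each communication
// channel between a pair of GPUs to its own queue pair, so RDMA writes
// from different channels travel over different QPs.
__device__ __forceinline__ int select_qp(int channel_id, int num_qps_per_peer) {
    // One QP per channel; wrap around when channels outnumber QPs.
    return channel_id % num_qps_per_peer;
}
```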
Hi, thanks for the awesome open-source work. For the new normal internode kernels, do we have insight into why the bandwidth decreases from EP32 to EP64, in contrast to the increase from EP16 to EP32? Or was the bandwidth for EP16 not updated in the table?
@sphish I see, thanks for the explanation! Maybe it would be helpful to clarify in the README table that the reported numbers are algorithm bandwidth rather than physical bandwidth.
I found that although the low-latency kernel has multiple QPs, the load imbalance of MoE itself makes the amount of data sent over each QP imbalanced as well. As a result, performance with dual-port is worse than with single-port. Do you have a solution?
LGTM
@18018681653 Since we don't use dual-port NICs ourselves, we currently have no plans to add dual-port support. The good news is that I asked the Tencent team @moningchen, and they may open-source their dual-port solution in the future. You can keep an eye on them!
May I know why we can get bandwidth higher than 50 GB/s, since a CX7 IB NIC has only 400 Gbps (50 GB/s)?
@GHGmc2 As mentioned above, what we report is the algorithmic bandwidth, not the physical bandwidth.
This PR is authored by the Tencent Network Platform Department. Thanks for the contribution! The normal kernels now have a huge speedup:
Through in-depth optimization, several enhancements have been implemented in the internode normal kernels (multiple QPs per channel and the switch from IBRC to IBGDA, as described above).
These improvements not only enhance the robustness of the internode normal kernels in scenarios involving dual-port NICs and RoCE networks but also further elevate communication performance.
NOTES: the bandwidth in the table is the algorithm bandwidth; to estimate the physical bandwidth, scale by the fraction of traffic that actually crosses RDMA, e.g. 58 * 3/4 = 43.5 GB/s in the 4-node setting.
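A minimal arithmetic sketch of that conversion (assuming the 3/4 factor comes from the 4-node setup, where only traffic bound for the other three nodes traverses RDMA; the variable names are illustrative):

```cuda
#include <cstdio>

int main() {
    // Rough sketch: derive an estimated physical RDMA bandwidth from the
    // reported algorithm bandwidth, assuming only (num_nodes - 1) / num_nodes
    // of the traffic actually crosses the network (4-node example above).
    const double algo_bw = 58.0;   // algorithm bandwidth from the table, GB/s
    const int num_nodes = 4;       // assumed 4-node setup
    const double phys_bw = algo_bw * (num_nodes - 1) / num_nodes;
    std::printf("estimated physical bandwidth: %.1f GB/s\n", phys_bw);  // 43.5
    return 0;
}
```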