test_low_latency failed #55
Comments
What is your network hardware configuration? Could you please run |
I'm seeing a similar issue:
|
@BigValen It appears that nvshmem cannot initialize ibrc transport, which is typically related to network configuration issues. However, the |
@sphish Same issue. Any help? |
@sphish emmm, some features are not supported on my machine, I will try to fix it. Thank you a lot~~ |
@sphish Hi, output of
|
@liusy58 You need to load the nvidia_peermem kernel module. |
Thank you~ |
After running the command
|
I suspect this is related to the CUDA driver version. |
@sphish
|
@koanho Can you check if the nvidia-peermem module is correctly installed and loaded? |
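One way to run the check asked for above is to look for the module in `/proc/modules`, where Linux lists loaded kernel modules with underscores in their names. The following is a minimal sketch, not part of DeepEP or NVSHMEM; the helper names `loaded_modules` and `is_module_loaded` are hypothetical, and it only reports whether the module is loaded, not whether it is configured correctly.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names from /proc/modules-style text.

    Each non-empty line starts with the module name, followed by
    whitespace-separated fields (size, use count, ...).
    """
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_module_loaded(name, proc_modules_text):
    """Check for a module, accepting either 'nvidia-peermem' or 'nvidia_peermem'."""
    # The kernel reports module names with underscores, so normalize hyphens.
    return name.replace("-", "_") in loaded_modules(proc_modules_text)


if __name__ == "__main__":
    # Assumption: a Linux host where /proc/modules is readable.
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    if is_module_loaded("nvidia-peermem", text):
        print("nvidia_peermem is loaded")
    else:
        print("nvidia_peermem is NOT loaded")
```

If the script reports that the module is not loaded, it usually has to be loaded with root privileges before the test is rerun.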
Thank you for the reply, @sphish.
|
@koanho Have you modified the driver config? https://github.com/deepseek-ai/DeepEP/tree/main/third-party#4-configure-nvidia-driver |
Thank you @sphish. |
@koanho If you want to use low-latency mode, yes. If you only want the normal mode for training, you can use an older version of DeepEP, which uses the IBRC transport. |
I am experiencing an issue with NVSHMEM failing to initialize due to transport errors. The error message indicates that NVSHMEM is unable to detect the system topology and cannot initialize any transport layers. However, test_intranode.py passed successfully...
I would like to know how to resolve this problem.
System Information
GPU Model: H100 (8 GPUs, single node)
OS: Ubuntu 22.04
CUDA Version: 12.5
NVSHMEM Version: 3.2.5
Error Log