Running low_latency test on RoCE gets IBGDA error #139
Comments
I remember that it was configured before; I will double-check later as I currently don't have administrator privileges. However, these two options are used to enable IBGDA, and since IBGDA is an IB feature, is it also necessary in a RoCE environment?
I'm not sure whether CX6 supports IBGDA.
The name IBGDA might be a bit confusing. It actually applies to both IB and RoCE. In the latest version of DeepEP, this is necessary for both.
Thank you, I will check both.
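For reference, the two nvidia driver options in question are the ones listed in the NVSHMEM IBGDA prerequisites; below is only a minimal sketch of the /etc/modprobe.d entry and a quick way to check it (the file name and the reboot step are illustrative, not specific to this machine):

# Illustrative /etc/modprobe.d entry for the IBGDA prerequisites (file name is arbitrary):
#   options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"
# After editing, regenerate the initramfs and reboot so the options take effect:
$ sudo update-initramfs -u && sudo reboot
# Check that the options are active on the running driver (exact output format may vary):
$ grep -E "EnableStreamMemOPs|RegistryDwords" /proc/driver/nvidia/params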
I checked that CX-6 supports IBGDA. Here is an article comparing CX6 performance with IBGDA and IBRC. I also checked that the nvidia option is configured by running
I'm sorry that what I was running before was not the latest version, but rather a version from before #130 was merged. After updating to the new version, my error message changed. Now I am getting the error:

$ MASTER_ADDR=172.19.33.83 MASTER_PORT=8362 WORLD_SIZE=2 RANK=0 python tests/test_low_latency.py 2>&1 |tee r_lowlatency2.log
Allocating buffer size: 2115.111296 MB ...
[/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch
[/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch
[/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch
[/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch
[/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch
[/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch
[/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch
[/nvshmem_src/src/host/stream/coll/barrier/barrier.cu:36] cuda failed with too many blocks in cooperative launch
W0429 20:14:08.665000 145833 /miniconda3/envs/tcu12/lib/python3.11/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 145915 via signal SIGTERM
W0429 20:14:08.665000 145833 /miniconda3/envs/tcu12/lib/python3.11/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 145917 via signal SIGTERM
Traceback (most recent call last):
File "/repo/DeepEP/tests/test_low_latency.py", line 172, in <module>
torch.multiprocessing.spawn(test_loop, args=(num_processes,), nprocs=num_processes)
File "/miniconda3/envs/tcu12/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/miniconda3/envs/tcu12/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
while not context.join():
^^^^^^^^^^^^^^
File "/miniconda3/envs/tcu12/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 215, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 4 terminated with the following error:
Traceback (most recent call last):
File "/miniconda3/envs/tcu12/lib/python3.11/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/repo/DeepEP/tests/test_low_latency.py", line 158, in test_loop
test_main(num_tokens, hidden, num_experts, num_topk, rank, num_ranks, group, buffer, seed=1)
File "/repo/DeepEP/tests/test_low_latency.py", line 40, in test_main
buffer.low_latency_dispatch(x, topk_idx, num_tokens, num_experts, use_fp8=dispatch_use_fp8,
File "/repo/DeepEP/deep_ep/buffer.py", line 488, in low_latency_dispatch
self.runtime.low_latency_dispatch(x, topk_idx,
RuntimeError: Failed: CUDA error /repo/DeepEP/csrc/kernels/internode_ll.cu:341 'too many blocks in cooperative launch'

My NVSHMEM version is 3.2.5-1, compiled with the latest patch. Do you have any suggestions?
I can run test_internode.py and test_intranode.py correctly, but I cannot run the test_low_latency.py script. The error reported is mainly related to IBGDA (more complete output is at the end):

/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
I found the related issue #38, but it doesn't seem relevant to my error.
How should I run low_latency on RoCE? Are any extra settings required? Any help you can provide would be greatly appreciated.
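I am also not sure whether extra run-time settings are expected on RoCE; below is only a sketch of what I understand is typically set (the GID index value and the HCA name are placeholders, the real values would come from ibv_devinfo on this cluster, and NVSHMEM_HCA_LIST can be omitted if the default NIC selection is already correct):

$ NVSHMEM_IB_GID_INDEX=3 \
  NVSHMEM_HCA_LIST=mlx5_0:1 \
  MASTER_ADDR=172.19.33.83 MASTER_PORT=8362 WORLD_SIZE=2 RANK=0 \
  python tests/test_low_latency.py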
Environment
I have followed the NVSHMEM install guide, manually compiled and loaded the gdrdrv module, and set the nvidia driver IBGDA-related options in /etc/modprobe.d (even though I use RoCE, not IB). My command to compile nvshmem is as follows:

CUDA_HOME=/opt/lib/cuda-12.4.1_normal/ \
GDRCOPY_HOME=/repo/gdrcopy-2.4.4 \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/repo/nvshmem_src/install
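As a sanity check that IBGDA support actually made it into this build, something like the following should work (assuming the nvshmem-info utility was installed under the prefix above; grepping the CMake cache is an alternative if it was not):

$ /repo/nvshmem_src/install/bin/nvshmem-info -a | grep -i ibgda
$ grep -i IBGDA build/CMakeCache.txt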
ibv_devinfo output
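The relevant part of the ibv_devinfo output here is mainly the link_layer field (Ethernet for RoCE, InfiniBand for IB); a quick way to pull it out, with the device names being whatever the node actually has:

$ ibv_devinfo | grep -E "hca_id|link_layer"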
Log