Description
I can run test_internode.py and test_intranode.py correctly, but I cannot run the test_low_latency.py script. The reported error is mainly related to IBGDA (more complete output at the end):
```
/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
```
I found the related issue #38, but it doesn't seem relevant to my error.
How should I run test_low_latency.py on RoCE? Are any extra settings required? Any help you can provide would be greatly appreciated.
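For reference, these are the RoCE-related NVSHMEM variables I found while searching the docs; I am not sure which of them (or which values) test_low_latency.py actually needs, so please treat this as a guess rather than a working configuration:

```shell
# Guesses based on the NVSHMEM environment-variable docs; values unverified.
export NVSHMEM_IB_ENABLE_IBGDA=1      # force the IBGDA transport
export NVSHMEM_IBGDA_NIC_HANDLER=gpu  # or "cpu" for the CPU fallback path?
export NVSHMEM_IB_GID_INDEX=3         # RoCE v2 GID index; site-specific
```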
Environment
- Two nodes, each with 8 H20 GPUs, connected by CX6 (ConnectX-6) NICs (200 Gb/s, ~25 GB/s) in RoCE mode.
- Using the latest DeepEP code.
- nvshmem: 3.2.5-1
- gdrcopy: 2.4.4
I have followed the NVSHMEM install guide: manually compiled and loaded the gdrdrv module, and set the NVIDIA driver's IBGDA-related options in /etc/modprobe.d (even though I use RoCE, not IB; a sketch of those options follows the build command below). My command to compile NVSHMEM is as follows:
```shell
CUDA_HOME=/opt/lib/cuda-12.4.1_normal/ \
GDRCOPY_HOME=/repo/gdrcopy-2.4.4 \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/repo/nvshmem_src/install
```
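The IBGDA-related driver options I mentioned above are along these lines (the file name is my own choice; the two options are the ones the IBGDA documentation calls out):

```shell
# /etc/modprobe.d/nvidia-ibgda.conf  (file name is my choice)
# IBGDA options from the NVSHMEM docs; the nvidia module must be reloaded
# (or the machine rebooted) for them to take effect.
options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"
```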
ibv_devinfo output
```
hca_id: mlx5_0
        transport:              InfiniBand (0)
        fw_ver:                 20.35.2000
        node_guid:              946d:ae03:009c:f3cc
        sys_image_guid:         946d:ae03:009c:f3cc
        vendor_id:              0x02c9
        vendor_part_id:         4123
        hw_ver:                 0x0
        board_id:               MT_0000000223
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet

hca_id: mlx5_1
        transport:              InfiniBand (0)
        fw_ver:                 20.35.2000
        node_guid:              946d:ae03:009c:f454
        sys_image_guid:         946d:ae03:009c:f454
        vendor_id:              0x02c9
        vendor_part_id:         4123
        hw_ver:                 0x0
        board_id:               MT_0000000223
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        4096 (5)
                        active_mtu:     4096 (5)
                        sm_lid:         0
                        port_lid:       0
                        port_lmc:       0x00
                        link_layer:     Ethernet
...
```
(mlx5_2 through mlx5_9 omitted)
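Since link_layer is Ethernet on every port, I assume the GID index matters for RoCE. The standard mlx5 sysfs paths show which indices are RoCE v2 (mlx5_0 / port 1 used as an example):

```shell
# Print the GID type of each populated index; "RoCE v2" entries are the ones
# usable for NVSHMEM_IB_GID_INDEX (unset GIDs error out, hence 2>/dev/null).
for i in /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/*; do
  printf '%s: %s\n' "$(basename "$i")" "$(cat "$i" 2>/dev/null)"
done
```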
Log
```
$ MASTER_ADDR=xxx MASTER_PORT=8362 WORLD_SIZE=2 RANK=0 python tests/test_low_latency.py
Allocating buffer size: 2116.290944 MB ...
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.
WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.
WARN: GPU cannot map UAR of device mlx5_0. Skipping...
...
WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.
WARN: GPU cannot map UAR of device mlx5_3. Skipping...
WARN: GPU cannot map UAR of device mlx5_8. Skipping...
WARN: GPU cannot map UAR of device mlx5_1. Skipping...
WARN: GPU cannot map UAR of device mlx5_7. Skipping...
WARN: GPU cannot map UAR of device mlx5_6. Skipping...
WARN: GPU cannot map UAR of device mlx5_bond_0. Skipping...
/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
WARN: GPU cannot map UAR of device mlx5_bond_0. Skipping...
/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.
WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
...
/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/repo/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 16 bytes instead of 8
/repo/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3
/repo/nvshmem_src/src/host/mem/mem_heap.cpp:222: non-zero status: -3 allgather of heap base for all PE failed
/repo/nvshmem_src/src/host/mem/mem_heap.cpp:588: non-zero status: 7 Failed to allgather PEs peer_base values
/repo/nvshmem_src/src/host/init/init.cu:1011: non-zero status: 7 nvshmem register static heaps failed
/repo/nvshmem_src/src/host/team/team.cu:nvshmem_team_split_strided:63: NVSHMEM API called before NVSHMEM initialization has completed
```
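One thing I am not sure about is whether the driver options actually took effect; the "GPU cannot map UAR" warnings look like they could come from a missing PeerMappingOverride. My understanding is that this can be sanity-checked like so (standard nvidia driver / gdrcopy paths):

```shell
# Confirm the nvidia module picked up the IBGDA-related options.
grep -E 'RegistryDwords|EnableStreamMemOPs' /proc/driver/nvidia/params
# Confirm the gdrdrv module is loaded.
lsmod | grep gdrdrv
```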