Running low_latency test on RoCE get IBGDA error

I can run `test_internode.py` and `test_intranode.py` correctly, but I cannot run the `test_low_latency .py` script. The error reported is mainly related to IBGDA(more complete output is at the end):
```shell
/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
```

I found related [issues 38](https://github.com/deepseek-ai/DeepEP/issues/38), but it doesn't seem relevant to my error.

How should I run low_latency on RoCE? Are any extra settings required? Any help you can provide would be greatly appreciated.

### Environment

- Two nodes, each with 8 H20 GPUs, connected by CX6 network (200Gb, ~25GB/s) cards using RoCE mode.
- Using the latest DeepEP code.
- nvshmem: 3.2.5-1
- gdrcopy: 2.4.4

I have followed the [NVSHMEM install guide](https://github.com/deepseek-ai/DeepEP/blob/main/third-party/README.md), manually compiled and loaded the `gdrdrv` module, set the nvidia driver IBGDA-related options in `/etc/modprobe.d` (Even though I use RoCE not IB), and my command to compile nvshmem is as follows:
```shell
CUDA_HOME=/opt/lib/cuda-12.4.1_normal/ \
GDRCOPY_HOME=/repo/gdrcopy-2.4.4 \
NVSHMEM_SHMEM_SUPPORT=0 \
NVSHMEM_UCX_SUPPORT=0 \
NVSHMEM_USE_NCCL=0 \
NVSHMEM_MPI_SUPPORT=0 \
NVSHMEM_IBGDA_SUPPORT=1 \
NVSHMEM_PMIX_SUPPORT=0 \
NVSHMEM_TIMEOUT_DEVICE_POLLING=0 \
NVSHMEM_USE_GDRCOPY=1 \
cmake -S . -B build/ -DCMAKE_INSTALL_PREFIX=/repo/nvshmem_src/install
```

ibv_devinfo output
```
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         20.35.2000
        node_guid:                      946d:ae03:009c:f3cc
        sys_image_guid:                 946d:ae03:009c:f3cc
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000223
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         20.35.2000
        node_guid:                      946d:ae03:009c:f454
        sys_image_guid:                 946d:ae03:009c:f454
        vendor_id:                      0x02c9
        vendor_part_id:                 4123
        hw_ver:                         0x0
        board_id:                       MT_0000000223
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

...
(omit mlx5_2 - mlx5_9)
```

### Log

```
$ MASTER_ADDR=xxx MASTER_PORT=8362 WORLD_SIZE=2 RANK=0 python tests/test_low_latency.py

Allocating buffer size: 2116.290944 MB ...
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_0. Skipping...

...


WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.
WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_3. Skipping...

WARN: GPU cannot map UAR of device mlx5_8. Skipping...

WARN: GPU cannot map UAR of device mlx5_1. Skipping...

WARN: GPU cannot map UAR of device mlx5_7. Skipping...

WARN: GPU cannot map UAR of device mlx5_6. Skipping...

WARN: GPU cannot map UAR of device mlx5_bond_0. Skipping...

/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
WARN: GPU cannot map UAR of device mlx5_bond_0. Skipping...

/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.



WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.


...

/repo/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/repo/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:bootstrap_net_recv:99: Message truncated : received 16 bytes instead of 8

/repo/nvshmem_src/src/modules/bootstrap/uid/bootstrap_uid.cpp:499: non-zero status: -3 /repo/nvshmem_src/src/host/mem/mem_heap.cpp:222: non-zero status: -3 allgather of heap base for all PE failed

/repo/nvshmem_src/src/host/mem/mem_heap.cpp:588: non-zero status: 7 Failed to allgather PEs peer_base values

/repo/nvshmem_src/src/host/init/init.cu:1011: non-zero status: 7 nvshmem register static heaps failed

/repo/nvshmem_src/src/host/team/team.cu:nvshmem_team_split_strided:63: NVSHMEM API called before NVSHMEM initialization has completed
```


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Running low_latency test on RoCE get IBGDA error #139

Environment

Log

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Running low_latency test on RoCE get IBGDA error #139

Description

Environment

Log

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions