test_low_latency failed #55

Open
hyesung84 opened this issue Mar 7, 2025 · 17 comments

@hyesung84

I am experiencing an issue with NVSHMEM failing to initialize due to transport errors. The error message indicates that NVSHMEM is unable to detect the system topology and cannot initialize any transport layers. However, test_intranode.py passed successfully...
I would like to know how to resolve this problem.

System Information
GPU Model: H100 (8 GPUs, single node)
OS: Ubuntu 22.04
CUDA Version: 12.5
NVSHMEM Version: 3.2.5

Error Log

(The messages below were printed by each of the eight ranks and interleaved; shown here deduplicated.)

WARN: init failed for remote transport: ibrc
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
/workspace/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:287: Unable to initialize any transports. returning error.
/workspace/nvshmem_src/src/host/init/init.cu:989: non-zero status: 7 nvshmem detect topo failed
/workspace/nvshmem_src/src/host/init/init.cu:nvshmemi_check_state_and_init:1074: nvshmem initialization failed, exiting

W0307 07:36:56.817000 22906 torch/multiprocessing/spawn.py:169] Terminating process 22985 via signal SIGTERM
W0307 07:36:56.817000 22906 torch/multiprocessing/spawn.py:169] Terminating process 22987 via signal SIGTERM
@sphish
Collaborator

sphish commented Mar 9, 2025

What is your network hardware configuration? Could you please run nvidia-smi topo -mp and ibv_devinfo and share the results?

@BigValen

I'm seeing a similar issue:

root@22f186c3783d:/workspace#
root@22f186c3783d:/workspace# nvidia-smi topo -mp
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    CPU Affinity    NUMA Affinity        GPU NUMA ID
GPU0     X      PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-87    0           N/A
GPU1    PHB      X      PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-87    0           N/A
GPU2    PHB     PHB      X      PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-87    0           N/A
GPU3    PHB     PHB     PHB      X      SYS     SYS     SYS     SYS     NODE    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     0-87    0           N/A
GPU4    SYS     SYS     SYS     SYS      X      PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     88-175  1           N/A
GPU5    SYS     SYS     SYS     SYS     PHB      X      PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     88-175  1           N/A
GPU6    SYS     SYS     SYS     SYS     PHB     PHB      X      PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     88-175  1           N/A
GPU7    SYS     SYS     SYS     SYS     PHB     PHB     PHB      X      SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     88-175  1           N/A
NIC0    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS
NIC1    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE     X      PHB     PHB     PHB     SYS     SYS     SYS     SYS
NIC2    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB      X      PHB     PHB     SYS     SYS     SYS     SYS
NIC3    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB      X      PHB     SYS     SYS     SYS     SYS
NIC4    PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     NODE    PHB     PHB     PHB      X      SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS      X      PHB     PHB     PHB
NIC6    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB      X      PHB     PHB
NIC7    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB      X      PHB
NIC8    SYS     SYS     SYS     SYS     PHB     PHB     PHB     PHB     SYS     SYS     SYS     SYS     SYS     PHB     PHB     PHB      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8

root@22f186c3783d:/workspace# ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         28.38.1002
        node_guid:                      3eea:72ff:fe24:32af
        sys_image_guid:                 58a2:e103:0048:66de
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000001108
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      d4fb:b330:a54f:0277
        sys_image_guid:                 946d:ae03:00f0:0b4e
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1689
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_2
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      a879:2436:7090:e75b
        sys_image_guid:                 946d:ae03:00f0:063e
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1691
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_3
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      2dc3:190f:3d85:1cb6
        sys_image_guid:                 946d:ae03:00f0:0b6a
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1690
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_4
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      e70f:f6b9:f338:c9b6
        sys_image_guid:                 946d:ae03:00f0:0302
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1692
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_5
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      4ea0:6489:d37a:7cf7
        sys_image_guid:                 946d:ae03:00fc:eaf6
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1693
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_6
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      ac9a:fa6f:97fa:a093
        sys_image_guid:                 946d:ae03:00fc:ec8c
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1694
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_7
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      fef9:7fce:e85c:939f
        sys_image_guid:                 946d:ae03:00f0:0b68
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1695
                        port_lmc:               0x00
                        link_layer:             InfiniBand

hca_id: mlx5_8
        transport:                      InfiniBand (0)
        fw_ver:                         28.37.1700
        node_guid:                      ae8f:1005:af4b:5ea7
        sys_image_guid:                 946d:ae03:00f0:0b46
        vendor_id:                      0x02c9
        vendor_part_id:                 4126
        hw_ver:                         0x0
        board_id:                       MT_0000000970
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1696
                        port_lmc:               0x00
                        link_layer:             InfiniBand

@sphish
Collaborator

sphish commented Mar 25, 2025

(quoting @BigValen's nvidia-smi topo -mp and ibv_devinfo output above)

@BigValen It appears that NVSHMEM cannot initialize the ibrc transport, which is typically related to network configuration issues. However, the ibv_devinfo and nvidia-smi outputs you provided look normal. Could you try running ib_write_bw and NVSHMEM's shmem_put_bw to see whether they work properly? This will help us determine whether the issue is specific to NVSHMEM or a more general RDMA connectivity problem.
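For reference, these checks might look roughly like the following on a single node; the HCA name (mlx5_1) and the launcher are placeholders rather than details confirmed in this thread:

# RDMA sanity check with perftest: server in the background, client connecting over localhost
ib_write_bw -d mlx5_1 &
ib_write_bw -d mlx5_1 localhost

# The NVSHMEM put-bandwidth test needs exactly two PEs; with an MPI-bootstrap build:
mpirun -np 2 /opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw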

@liusy58

liusy58 commented Mar 31, 2025

@sphish Same issue. Any help?

@sphish
Collaborator

sphish commented Apr 1, 2025

@sphish Same issue. Any help?

@liusy58 Can you run NVSHMEM's shmem_put_bw test and check whether you hit the same issue?

@liusy58

liusy58 commented Apr 1, 2025

@sphish Hmm, some features are not supported on my machine; I will try to fix that. Thank you a lot!

@liusy58

liusy58 commented Apr 2, 2025

@sphish Hi, output of shmem_put_bw is shown below. I cannot resolve this, could you please give me some guidance?

/opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw 
Runtime options after parsing command line arguments 
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H20 bus id: 8 
/home/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:nvshmemt_init:1851: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

/home/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp:nvshmemt_init:3626: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

/home/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
This test requires exactly two processes 
Segmentation fault (core dumped)
nvidia-smi topo -mp
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    CPU Affinity NUMA Affinity   GPU NUMA ID
GPU0     X      NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     0-47,96-143  0               N/A
GPU1    NODE     X      NODE    NODE    SYS     SYS     SYS     SYS     PIX     NODE    SYS     SYS     0-47,96-143  0               N/A
GPU2    NODE    NODE     X      NODE    SYS     SYS     SYS     SYS     NODE    NODE    SYS     SYS     0-47,96-143  0               N/A
GPU3    NODE    NODE    NODE     X      SYS     SYS     SYS     SYS     NODE    PIX     SYS     SYS     0-47,96-143  0               N/A
GPU4    SYS     SYS     SYS     SYS      X      NODE    NODE    NODE    SYS     SYS     PIX     NODE    48-95,144-191        1               N/A
GPU5    SYS     SYS     SYS     SYS     NODE     X      NODE    NODE    SYS     SYS     NODE    NODE    48-95,144-191        1               N/A
GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      NODE    SYS     SYS     NODE    PIX     48-95,144-191        1               N/A
GPU7    SYS     SYS     SYS     SYS     NODE    NODE    NODE     X      SYS     SYS     NODE    NODE    48-95,144-191        1               N/A
NIC0    NODE    PIX     NODE    NODE    SYS     SYS     SYS     SYS      X      NODE    SYS     SYS
NIC1    NODE    NODE    NODE    PIX     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS
NIC2    SYS     SYS     SYS     SYS     PIX     NODE    NODE    NODE    SYS     SYS      X      NODE
NIC3    SYS     SYS     SYS     SYS     NODE    NODE    PIX     NODE    SYS     SYS     NODE     X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge

NIC Legend:

  NIC0: mlx5_bond_0
  NIC1: mlx5_bond_1
  NIC2: mlx5_bond_2
  NIC3: mlx5_bond_3
ibv_devinfo
hca_id: mlx5_bond_0
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3804
        node_guid:                      5c25:7303:00f0:052a
        sys_image_guid:                 5c25:7303:00f0:052a
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_1
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3804
        node_guid:                      5c25:7303:00f0:07ea
        sys_image_guid:                 5c25:7303:00f0:07ea
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_2
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3804
        node_guid:                      5c25:7303:00f0:0800
        sys_image_guid:                 5c25:7303:00f0:0800
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_bond_3
        transport:                      InfiniBand (0)
        fw_ver:                         32.39.3804
        node_guid:                      5c25:7303:00f0:0556
        sys_image_guid:                 5c25:7303:00f0:0556
        vendor_id:                      0x02c9
        vendor_part_id:                 41692
        hw_ver:                         0x1
        board_id:                       MT_0000001093
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

@sphish
Collaborator

sphish commented Apr 2, 2025

@sphish Hi, output of shmem_put_bw is shown below. I cannot resolve this, could you please give me some guidance?

/opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw 
Runtime options after parsing command line arguments 
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H20 bus id: 8 
/home/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:nvshmemt_init:1851: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

/home/nvshmem_src/src/modules/transport/ibgda/ibgda.cpp:nvshmemt_init:3626: neither nv_peer_mem, or nvidia_peermem detected. Skipping transport.

/home/nvshmem_src/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA
This test requires exactly two processes 
Segmentation fault (core dumped)

@liusy58 You need to load the nvidia_peermem kernel module.
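For reference, loading and checking the module typically looks like this (requires root on the host; inside a container the module must be loaded on the host kernel, and the package that provides it depends on how the driver was installed):

sudo modprobe nvidia_peermem
lsmod | grep nvidia_peermem    # should print an nvidia_peermem line once the module is loaded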

@liusy58

liusy58 commented Apr 3, 2025

Thank you~

@Cydia2018

@sphish Same issue. Any help?

@liusy58 Can you run NVSHMEM's shmem_put_bw test and check whether you hit the same issue?

After running the command shmem_put_bw, I encountered the following error. Could you give me further guidance? Thanks a lot.

/opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw 
Runtime options after parsing command line arguments 
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H800 bus id: 25 
This test requires exactly two processes 
[/xxx/nvshmem_src/perftest/common/utils.cu:408] cuda failed with invalid argument

@sphish
Collaborator

sphish commented Apr 25, 2025

@sphish Same issue. Any help?

@liusy58 Can you run NVSHMEM's shmem_put_bw test and check whether you hit the same issue?

After running the command shmem_put_bw, I encountered the following error. Could you give me further guidance? Thanks a lot.

/opt/nvshmem/bin/perftest/device/pt-to-pt/shmem_put_bw 
Runtime options after parsing command line arguments 
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H800 bus id: 25 
This test requires exactly two processes 
[/xxx/nvshmem_src/perftest/common/utils.cu:408] cuda failed with invalid argument

I suspect this is related to the CUDA driver version.
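A quick way to confirm which driver and CUDA versions are actually in play (standard nvidia-smi, nvcc, and PyTorch queries, nothing specific to this thread):

nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version
python -c "import torch; print(torch.version.cuda)"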

@koanho

koanho commented Apr 27, 2025

@sphish
Hi, I got a similar issue. When testing ./shmem_put_bw, I got the errors below.

Runtime options after parsing command line arguments 
min_size: 4, max_size: 4194304, step_factor: 2, iterations: 10, warmup iterations: 5, number of ctas: 32, threads per cta: 256 stride: 1, datatype: int, reduce_op: sum, threadgroup_scope: all_scopes, atomic_op: inc, dir: write, report_msgrate: 0, bidirectional: 0, putget_issue :on_stream
Note: Above is full list of options, any given test will use only a subset of these variables.
mype: 0 mype_node: 0 device name: NVIDIA H100 80GB HBM3 bus id: 10 
WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_0. Skipping...

WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_1. Skipping...

WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_2. Skipping...

WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_3. Skipping...

WARN: cudaHostRegister with IoMemory failed with error=800. We may need to use a fallback path.

WARN: ibgda_nic_mem_gpu_map failed. We may need to use the CPU fallback path.

WARN: ibgda_alloc_and_map_qp_uar with GPU as handler failed. We may need to enter the CPU fallback path.

WARN: GPU cannot map UAR of device mlx5_4. Skipping...

/home/dpsk_a2a/deepep-nvshmem/src/host/transport/transport.cpp:nvshmemi_transport_init:275: init failed for transport: IBGDA

@sphish
Collaborator

sphish commented Apr 27, 2025

@koanho Can you check if the nvidia-peermem module is correctly installed and loaded?

@koanho

koanho commented Apr 27, 2025

Thank you for the reply, @sphish.
I think nvidia-peermem is correctly installed and loaded.

Singularity> modinfo nvidia-peermem
filename:       /lib/modules/5.14.0-284.11.1.el9_2.x86_64/extra/nvidia-peermem.ko
version:        550.54.15
license:        Dual BSD/GPL
description:    NVIDIA GPU memory plug-in
author:         Yishai Hadas
rhelversion:    9.2
srcversion:     B13C9DFD8CD4E8BE2B5D362
depends:        nvidia,ib_core
retpoline:      Y
name:           nvidia_peermem
vermagic:       5.14.0-284.11.1.el9_2.x86_64 SMP preempt mod_unload modversions 
parm:           peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)
Singularity> lsmod | grep nvidia_peermem
nvidia_peermem         20480  0
ib_core               491520  25 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
nvidia               8626176  1106 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset

@sphish
Collaborator

sphish commented Apr 27, 2025

@koanho

koanho commented Apr 27, 2025

Thank you @sphish.
I couldn't modify the driver configuration because I don't have root permissions on my training cluster 😞
It seems the error may have occurred because IBGDA is not properly enabled.
IBGDA is necessary to use DeepEP, right?

@sphish
Collaborator

sphish commented Apr 27, 2025

IBGDA is necessary to use DeepEP, right?

@koanho If you want to use low-latency mode, yes. If you only want to use the normal mode for training, you can use an older version of DeepEP, which uses the IBRC transport.
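For what it's worth, whether NVSHMEM even attempts the IBGDA transport is controlled through its environment; the variables below are NVSHMEM's documented switches, while the test path is only assumed from DeepEP's repository layout, so treat this as a sketch rather than a prescribed invocation:

export NVSHMEM_IB_ENABLE_IBGDA=1      # ask NVSHMEM to enable the GPU-initiated (IBGDA) transport
export NVSHMEM_DEBUG=INFO             # more verbose NVSHMEM init logging while debugging transport setup
python tests/test_low_latency.py      # path assumed; use your actual low-latency test entry point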
