
OpenMPI combined with RDMA. #12999

Open
@xiaojiesi

Description


Please submit all the information below so that we can understand the working environment that is the context for your question.

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

```
Package: Open MPI root@sharp-ci-02 Distribution
Open MPI: 4.1.5rc2
Open MPI repo revision: v4.1.5rc1-16-g5980bac633
```

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Installed from a git clone.

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Please describe the system on which you are running

  • Operating system/version:
  • Computer hardware:
  • Network type:

    ```
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Byte Order:          Little Endian
    CPU(s):              36
    On-line CPU(s) list: 0-35
    Thread(s) per core:  1
    Core(s) per socket:  18
    Socket(s):           2
    NUMA node(s):        2
    Vendor ID:           GenuineIntel
    CPU family:          6
    Model:               85
    Model name:          Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
    Stepping:            7
    CPU MHz:             999.914
    CPU max MHz:         3900.0000
    CPU min MHz:         1000.0000
    BogoMIPS:            5200.00
    Virtualization:      VT-x
    L1d cache:           32K
    L1i cache:           32K
    L2 cache:            1024K
    L3 cache:            25344K
    NUMA node0 CPU(s):   0-17
    NUMA node1 CPU(s):   18-35
    ```

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

I have a compute cluster and want to use Open MPI with RDMA for communication. I built Open MPI with UCX support, restricted the UCX transports to RC (Reliable Connection) and UD (Unreliable Datagram), and set UCX_NET_DEVICES=mlx5_0. Single-node RDMA communication works. However, when I add a hostfile and try to run across nodes with Open MPI over RDMA, the run fails with error messages.
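For reference, the UCX settings described above would typically be expressed through environment variables along these lines (a sketch; I am assuming the transports were selected via `UCX_TLS`, and the exact variable values on the failing nodes are not shown in this report):

```shell
# Restrict UCX to the RC and UD transports, as described above
export UCX_TLS=rc,ud
# RDMA NIC to use (device name taken from the report)
export UCX_NET_DEVICES=mlx5_0
```

Note that these variables must be propagated to the remote nodes as well, e.g. via `mpirun -x UCX_TLS -x UCX_NET_DEVICES ...`; a mismatch between the local and remote environments is a common cause of cross-node failures.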

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

```shell
shell$ mpirun -np 4 --mca pml ucx --hostfile host ./mpi_rdma_test
```
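Since local runs succeed and only cross-node runs fail, it may help to confirm on every node that UCX actually detects the RDMA device and that the port is active. A minimal check using standard UCX and InfiniBand diagnostics (these tool names are standard, but whether they are installed on the cluster is an assumption):

```shell
# List the transports/devices UCX detects on this node
ucx_info -d | grep -i mlx5_0
# Check the InfiniBand port state (should report "Active")
ibstat mlx5_0
```

If `mlx5_0` is missing on any node, or a port is down, the cross-node UCX wireup will fail even though single-node runs work.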
