
mpirun job failure depends on the order of the hosts #4516

Closed
@karasevb

Description


Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.0.x
master

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

./configure --prefix=`pwd`/install --enable-orterun-prefix-by-default --with-slurm --with-pmi --with-ucx
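
For completeness, a build from a git clone typically looks roughly like the sketch below; the repository URL, branch checkout, and make step are assumptions based on the standard Open MPI build procedure, with the configure line copied from above.

git clone https://github.com/open-mpi/ompi.git   # assumed upstream repository
cd ompi
git checkout v3.0.x                              # or: git checkout master
./autogen.pl                                     # needed for git checkouts, not for release tarballs
./configure --prefix=`pwd`/install --enable-orterun-prefix-by-default --with-slurm --with-pmi --with-ucx
make -j install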

Please describe the system on which you are running

  • Operating system/version:
    RedHat 7.2
  • Computer hardware:
    Intel dual socket Broadwell
  • Network type:
    IB

Details of the problem

Running on nodes node1,node2 works well, but changing the order of the nodes to node2,node1 results in the following failure:

ssh node1
mpirun --bind-to core --map-by node -H node2,node1 -np 2  $HPCX_MPI_DIR/tests/osu-micro-benchmarks-5.3.2/osu_allreduce
--------------------------------------------------------------------------
[node2:13941] Error: pml_yalla.c:95 - recv_ep_address() Failed to receive EP address
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Not found" (-13) instead of "Success" (0)
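
For comparison, the same command with the hosts listed in the original order completes successfully; this is the working case described above, with only the order of the -H argument changed:

ssh node1
mpirun --bind-to core --map-by node -H node1,node2 -np 2  $HPCX_MPI_DIR/tests/osu-micro-benchmarks-5.3.2/osu_allreduce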
