libnbc nonuniform type failures in ibcast, iallgather(v) #2256

Closed
@jjhursey

When testing coll/libnbc with nonuniform data types, three of the algorithms failed. I'll post a link to the test case in the comments. The unit test only works with -np 5; some of the tests pass fine with -np 4.

  • ibcast: falls into an infinite error loop in libnbc.
  • iallgather and iallgatherv: produce wrong answers. These might share the same underlying problem.

ibcast Failure

Note: This test will enter an infinite loop displaying MPI Error in MPI_Testall() (18) until PR #2245 is resolved.

shell$ mpirun -np 5 --mca coll ^hcoll  ./test-nbc-dt 0 
 0 /  5) Running MPI_Ibcast...
 1 /  5) Running MPI_Ibcast...
 2 /  5) Running MPI_Ibcast...
 3 /  5) Running MPI_Ibcast...
 4 /  5) Running MPI_Ibcast...
MPI Error in MPI_Testall() (18)
MPI Error in MPI_Testall() (18)
...

iallgather Failure

shell$ mpirun -np 5 --mca coll ^hcoll  ./test-nbc-dt 1
 0 /  5) Running MPI_Iallgather...
 1 /  5) Running MPI_Iallgather...
 2 /  5) Running MPI_Iallgather...
 3 /  5) Running MPI_Iallgather...
 4 /  5) Running MPI_Iallgather...
 3 /  5) buf[1] : actual 2 [2], expected 1 [1]. from 0
 3 /  5) Failed in chkbuf. err = -1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode -2.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
 2 /  5) buf[1] : actual 2 [2], expected 1 [1]. from 0
 2 /  5) Failed in chkbuf. err = -6
 1 /  5) buf[2] : actual 2 [2], expected 1 [1]. from 0
 1 /  5) Failed in chkbuf. err = -2
 4 /  5) buf[1] : actual 2 [2], expected 1 [1]. from 0
 4 /  5) Failed in chkbuf. err = -6

iallgatherv Failure

shell$ mpirun -np 5 --mca coll ^hcoll  ./test-nbc-dt 2
 0 /  5) Running MPI_Iallgatherv...
 1 /  5) Running MPI_Iallgatherv...
 2 /  5) Running MPI_Iallgatherv...
 3 /  5) Running MPI_Iallgatherv...
 4 /  5) Running MPI_Iallgatherv...
 0 /  5) buf[1] : actual 3 [3], expected 2 [2]. from 1
 0 /  5) Failed in chkbuf. err = -6
 3 /  5) buf[1] : actual 3 [3], expected 2 [2]. from 1
 3 /  5) Failed in chkbuf. err = -6
 4 /  5) buf[1] : actual 3 [3], expected 2 [2]. from 1
 4 /  5) Failed in chkbuf. err = -6
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode -2.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
 2 /  5) buf[1] : actual 3 [3], expected 2 [2]. from 1
 2 /  5) Failed in chkbuf. err = -6
