Description
When testing coll/libnbc with nonuniform data types, three of the algorithms failed. I'll post a link to the test case in the comments. The unit test only works with -np 5; some of the tests pass fine with -np 4.
ibcast falls into an infinite error loop in libnbc.
iallgather and iallgatherv produce wrong answers. This might be the same underlying problem.
ibcast Failure
Note: This test will enter an infinite loop displaying "MPI Error in MPI_Testall() (18)" until PR #2245 is resolved.
shell$ mpirun -np 5 --mca coll ^hcoll ./test-nbc-dt 0
0 / 5) Running MPI_Ibcast...
1 / 5) Running MPI_Ibcast...
2 / 5) Running MPI_Ibcast...
3 / 5) Running MPI_Ibcast...
4 / 5) Running MPI_Ibcast...
MPI Error in MPI_Testall() (18)
MPI Error in MPI_Testall() (18)
...
iallgather Failure
shell$ mpirun -np 5 --mca coll ^hcoll ./test-nbc-dt 1
0 / 5) Running MPI_Iallgather...
1 / 5) Running MPI_Iallgather...
2 / 5) Running MPI_Iallgather...
3 / 5) Running MPI_Iallgather...
4 / 5) Running MPI_Iallgather...
3 / 5) buf[1] : actual 2 [2], expected 1 [1]. from 0
3 / 5) Failed in chkbuf. err = -1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode -2.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
2 / 5) buf[1] : actual 2 [2], expected 1 [1]. from 0
2 / 5) Failed in chkbuf. err = -6
1 / 5) buf[2] : actual 2 [2], expected 1 [1]. from 0
1 / 5) Failed in chkbuf. err = -2
4 / 5) buf[1] : actual 2 [2], expected 1 [1]. from 0
4 / 5) Failed in chkbuf. err = -6
iallgatherv Failure
shell$ mpirun -np 5 --mca coll ^hcoll ./test-nbc-dt 2
0 / 5) Running MPI_Iallgatherv...
1 / 5) Running MPI_Iallgatherv...
2 / 5) Running MPI_Iallgatherv...
3 / 5) Running MPI_Iallgatherv...
4 / 5) Running MPI_Iallgatherv...
0 / 5) buf[1] : actual 3 [3], expected 2 [2]. from 1
0 / 5) Failed in chkbuf. err = -6
3 / 5) buf[1] : actual 3 [3], expected 2 [2]. from 1
3 / 5) Failed in chkbuf. err = -6
4 / 5) buf[1] : actual 3 [3], expected 2 [2]. from 1
4 / 5) Failed in chkbuf. err = -6
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode -2.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
2 / 5) buf[1] : actual 3 [3], expected 2 [2]. from 1
2 / 5) Failed in chkbuf. err = -6
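
To make it easier to compare the iallgather and iallgatherv failures, here is a minimal sketch of a script that tabulates the mismatch lines from the logs above. It is not part of the test case. The function name `summarize` is hypothetical, the regular expression only assumes the line format shown in this report, and reading "from N" as the rank the data came from is my assumption. The bracketed values are matched but not interpreted.

```python
import re
from collections import defaultdict

# Matches mismatch lines in the format seen in this report, e.g.
#   "3 / 5) buf[1] : actual 2 [2], expected 1 [1]. from 0"
# The field meanings (rank / size, buffer index, source) are assumptions.
MISMATCH = re.compile(
    r"^\s*(?P<rank>\d+) / (?P<size>\d+)\) buf\[(?P<index>\d+)\] : "
    r"actual (?P<actual>-?\d+) \[[^\]]*\], expected (?P<expected>-?\d+) \[[^\]]*\]\. "
    r"from (?P<source>\d+)\s*$"
)


def summarize(log_text):
    """Group mismatch lines by the (assumed) source rank.

    Returns {source: sorted list of (rank, index, actual, expected)}.
    Lines that do not match the pattern are ignored.
    """
    by_source = defaultdict(list)
    for line in log_text.splitlines():
        m = MISMATCH.match(line)
        if m:
            by_source[int(m["source"])].append(
                (int(m["rank"]), int(m["index"]), int(m["actual"]), int(m["expected"]))
            )
    return {src: sorted(rows) for src, rows in by_source.items()}


if __name__ == "__main__":
    # A few lines copied from the iallgather log above.
    sample = """\
3 / 5) buf[1] : actual 2 [2], expected 1 [1]. from 0
3 / 5) Failed in chkbuf. err = -1
2 / 5) buf[1] : actual 2 [2], expected 1 [1]. from 0
1 / 5) buf[2] : actual 2 [2], expected 1 [1]. from 0
4 / 5) buf[1] : actual 2 [2], expected 1 [1]. from 0
"""
    for source, rows in summarize(sample).items():
        for rank, index, actual, expected in rows:
            print(f"from {source}: rank {rank} buf[{index}] "
                  f"actual={actual} expected={expected}")
```

Piping the saved output of each run through this gives one table per source rank, which may help show whether the two collectives fail with the same pattern.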