Hangs on failures on master #1379

Closed

Description

@jsquyres

In the Cisco MTT cluster, we're seeing a large number of hangs in tests that are supposed to fail (e.g., they call MPI_ABORT). Specifically, the test MPI processes do not die, even after their HNP and local orted are gone. The MPI processes keep spinning and consuming CPU cycles.

I'm seeing this across a variety of configure command line options. I.e., it doesn't seem to be specific to a single problematic configure option.

It looks like the hangs are of two flavors:

  1. An MPI process is stuck in an MPI collective that never completes
  2. An MPI process is stuck in a PMIX collective

The Intel test MPI_Abort_c is an example of case 1. In this test, MPI_COMM_WORLD rank 0 calls MPI_ABORT, and everyone else calls an MPI_ALLREDUCE.

It looks like the MCW rank 0 process is gone/dead, and all the others are stuck in the MPI_ALLREDUCE. The HNP and local orted are gone, too. I.e., the RTE thread in the MPI processes somehow didn't kill those processes, either when they got the abort signal or when the HNP / local orted went away.
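
For reference, the pattern boils down to something like this (a minimal sketch of the pattern, not the actual Intel test source):

/* Minimal sketch (not the actual MPI_Abort_c source): MCW rank 0 aborts
 * while everyone else blocks in a collective that cannot complete
 * without rank 0. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, in = 1, out = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        /* Rank 0 aborts; the RTE is expected to kill all other procs */
        MPI_Abort(MPI_COMM_WORLD, 1);
    } else {
        /* Everyone else blocks in a collective that includes rank 0 */
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}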

I see the same pattern in the IBM test environment/abort: MCW 0 calls abort, and everyone else calls sleep. In this case, MCW 0, the HNP, and the local orted are all gone, but all the other processes are stuck looping in sleep().
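
The shape of that test is roughly this (again, a sketch, not the actual IBM test source):

/* Minimal sketch of the environment/abort pattern: MCW rank 0 calls
 * abort(3) and everyone else loops in sleep(), relying entirely on the
 * RTE to clean them up. */
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        abort();          /* rank 0 dies with SIGABRT */
    }
    while (1) {
        sleep(1);         /* everyone else waits to be killed by the RTE */
    }

    /* not reached */
    MPI_Finalize();
    return 0;
}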


The Intel test MPI_Errhandler_fatal_f is an example of case 2. In this test, processes don't seem to get past MPI_INIT:

#0  0x0000003ca8caccdd in nanosleep () from /lib64/libc.so.6  
#1  0x0000003ca8ce1e54 in usleep () from /lib64/libc.so.6 
#2  0x00002aaaac3ec99e in OPAL_PMIX_PMIX112_PMIx_Fence ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libopen-pal.so.0
#3  0x00002aaaac3cccee in pmix1_fence ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libopen-pal.so.0 
#4  0x00002aaaab4f1ab6 in ompi_mpi_init ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libmpi.so.0 
#5  0x00002aaaab527167 in PMPI_Init ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libmpi.so.0  
#6  0x00002aaaab25b602 in pmpi_init__ ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libmpi_mpifh.so.0 
#7  0x0000000000401744 in MAIN__ ()

I see a bunch of tests like this (hung in MPI_INIT) -- not just Fortran tests, and not just tests that are supposed to fail. In these cases, it looks like the server gets overloaded with CPU load and things start slowing down, and then even tests that are supposed to pass get stuck in the PMIX fence in MPI_INIT.


I've also seen similar stack traces where PMIX is stuck in a fence, but in MPI_FINALIZE rather than MPI_INIT. E.g., in the t_winerror test:

(gdb) bt
#0  0x0000003ca8caccdd in nanosleep () from /lib64/libc.so.6
#1  0x0000003ca8ce1e54 in usleep () from /lib64/libc.so.6
#2  0x00002aaaab60988e in OPAL_PMIX_PMIX112_PMIx_Fence ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libopen-pal.so.0
#3  0x00002aaaab5e9bde in pmix1_fence ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libopen-pal.so.0
#4  0x00002aaaaab306c5 in ompi_mpi_finalize ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libmpi.so.0
#5  0x00002aaaaab5a1c1 in PMPI_Finalize ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libmpi.so.0
#6  0x0000000000401cc4 in main ()
