Hangs on failures on master #1379

Closed

Description

@jsquyres

In the Cisco MTT cluster, we're seeing a large number of hangs in tests that are supposed to fail (e.g., they call MPI_ABORT). Specifically, the test MPI processes do not die, even after their HNP and local orted are gone. The MPI processes keep spinning and consuming CPU cycles.

I'm seeing this across a variety of configure command line options. I.e., it doesn't seem to be specific to a single problematic configure option.

It looks like the hangs are of two flavors:

  1. An MPI process is stuck in an MPI collective that never completes
  2. An MPI process is stuck in a PMIX collective

The Intel test MPI_Abort_c is an example of case 1. In this test, MPI_COMM_WORLD rank 0 calls MPI_ABORT, and everyone else calls an MPI_ALLREDUCE.

It looks like the MCW rank 0 process is gone/dead, and all the others are stuck in the MPI_ALLREDUCE. The HNP and local orted are gone, too. I.e., the RTE thread in the MPI processes somehow didn't kill those processes, either when they got the abort signal or when the HNP / local orted went away.
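
For reference, the pattern boils down to something like this (a minimal sketch of the pattern, not the actual Intel test source):

/* Minimal sketch (not the actual MPI_Abort_c source): MCW rank 0 aborts
 * while everyone else blocks in a collective that cannot complete
 * without rank 0. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, in = 1, out = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        /* Rank 0 aborts; the RTE is expected to kill all other procs */
        MPI_Abort(MPI_COMM_WORLD, 1);
    } else {
        /* Everyone else blocks in a collective that includes rank 0 */
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}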

I see the same pattern in the IBM test environment/abort: MCW 0 calls abort, and everyone else calls sleep. In this case, MCW 0, the HNP, and the local orted are all gone, but all the other processes are stuck looping in sleep().
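
The shape of that test is roughly this (again, a sketch, not the actual IBM test source):

/* Minimal sketch of the environment/abort pattern: MCW rank 0 calls
 * abort(3) and everyone else loops in sleep(), relying entirely on the
 * RTE to clean them up. */
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        abort();          /* rank 0 dies with SIGABRT */
    }
    while (1) {
        sleep(1);         /* everyone else waits to be killed by the RTE */
    }

    /* not reached */
    MPI_Finalize();
    return 0;
}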


The Intel test MPI_Errhandler_fatal_f is an example of case 2. In this test, processes don't seem to get past MPI_INIT:

#0  0x0000003ca8caccdd in nanosleep () from /lib64/libc.so.6  
#1  0x0000003ca8ce1e54 in usleep () from /lib64/libc.so.6 
#2  0x00002aaaac3ec99e in OPAL_PMIX_PMIX112_PMIx_Fence ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libopen-pal.so.0
#3  0x00002aaaac3cccee in pmix1_fence ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libopen-pal.so.0 
#4  0x00002aaaab4f1ab6 in ompi_mpi_init ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libmpi.so.0 
#5  0x00002aaaab527167 in PMPI_Init ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libmpi.so.0  
#6  0x00002aaaab25b602 in pmpi_init__ ()
   from /home/mpiteam/scratches/community/2016-02-17cron/owk4/installs/QD8g/install/lib/libmpi_mpifh.so.0 
#7  0x0000000000401744 in MAIN__ ()

I see a bunch of tests like this (hung in MPI_INIT) -- not just Fortran tests, and not just tests that are supposed to fail. In these cases, it looks like the server gets overloaded with CPU load and things start slowing down, and then even tests that are supposed to pass get stuck in the PMIX fence in MPI_INIT.


I've also seen similar stack traces where PMIX is stuck in a fence, but in MPI_FINALIZE rather than MPI_INIT. E.g., in the t_winerror test:

(gdb) bt
#0  0x0000003ca8caccdd in nanosleep () from /lib64/libc.so.6
#1  0x0000003ca8ce1e54 in usleep () from /lib64/libc.so.6
#2  0x00002aaaab60988e in OPAL_PMIX_PMIX112_PMIx_Fence ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libopen-pal.so.0
#3  0x00002aaaab5e9bde in pmix1_fence ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libopen-pal.so.0
#4  0x00002aaaaab306c5 in ompi_mpi_finalize ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libmpi.so.0
#5  0x00002aaaaab5a1c1 in PMPI_Finalize ()
   from /home/mpiteam/scratches/usnic/2016-02-17cron/6reD/installs/4xnK/install/lib/libmpi.so.0
#6  0x0000000000401cc4 in main ()
