Hangs in mca_btl_vader_component_progress on multiple archs #5638

Closed
@amckinstry

Background information

Open MPI 3.1.2
PMIx 3.0.1

Installed on Debian sid.

Testing across our suite of MPI programs, we're seeing hangs in some applications. Offhand, the common factor looks like 32-bit systems: i386 and mipsel.

Debian bugs:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905418
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907267
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907407

This is mostly on simple 2-core systems. I have a straightforward, reproducible case with ARPACK in an i386 VM here.
The backtraces from the two hung processes look like this:

#0  0xb7f43d09 in __kernel_vsyscall ()
#1  0xb7f43986 in __vdso_clock_gettime ()
#2  0xb7111af1 in __GI___clock_gettime (clock_id=1, tp=0xbffc7f54) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6e9edd7 in ?? () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#4  0xb6e4d3f8 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#5  0xb76474a5 in ompi_request_default_wait () from /usr/lib/i386-linux-gnu/libmpi.so.40
#6  0xb769d0a5 in ompi_coll_base_sendrecv_actual () from /usr/lib/i386-linux-gnu/libmpi.so.40
#7  0xb769d4a9 in ompi_coll_base_allreduce_intra_recursivedoubling () from /usr/lib/i386-linux-gnu/libmpi.so.40
#8  0xb4a39860 in ompi_coll_tuned_allreduce_intra_dec_fixed () from /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so
#9  0xb765cf9b in PMPI_Allreduce () from /usr/lib/i386-linux-gnu/libmpi.so.40
#10 0xb774de78 in pmpi_allreduce__ () from /usr/lib/i386-linux-gnu/libmpi_mpifh.so.40
#11 0xb7f0acc3 in pdnorm2 (comm=0, n=50, x=..., inc=1) at pdnorm2.f:79
#12 0xb7f0614d in pdsaitr (comm=0, ido=1, bmat=..., n=50, k=0, np=4, mode=1, resid=..., rnorm=1.7112505600240149, v=..., ldv=256, h=..., ldh=20, ipntr=...,
    workd=..., workl=..., info=0, _bmat=1) at pdsaitr.f:776
#13 0xb7f0749d in pdsaup2 (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, np=16, tol=1.1102230246251565e-16, resid=..., mode=1, iupd=1, ishift=1,
    mxiter=300, v=..., ldv=256, h=..., ldh=20, ritz=..., bounds=..., q=..., ldq=20, workl=..., ipntr=..., workd=..., info=0, _bmat=1, _which=2)
    at pdsaup2.f:391
#14 0xb7f0813e in pdsaupd (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, tol=1.1102230246251565e-16, resid=..., ncv=20, v=..., ldv=256, iparam=...,
    ipntr=..., workd=..., workl=..., lworkl=560, info=0, _bmat=1, _which=2) at pdsaupd.f:630
#15 0x004f6977 in parnoldi (comm=0) at issue46.f:200
#16 0x004f758f in issue46 () at issue46.f:21
#17 0x004f61ec in main (argc=1, argv=0xbffce86f) at issue46.f:23
#18 0xb70259a1 in __libc_start_main (main=0x4f61b0 <main>, argc=1, argv=0xbffcd7d4, init=0x4f75d0 <__libc_csu_init>, fini=0x4f7630 <__libc_csu_fini>,
    rtld_fini=0xb7f54f60 <_dl_fini>, stack_end=0xbffcd7cc) at ../csu/libc-start.c:310
#19 0x004f622c in _start ()

and


#0  0xb7fced09 in __kernel_vsyscall ()
#1  0xb7fce986 in __vdso_clock_gettime ()
#2  0xb719caf1 in __GI___clock_gettime (clock_id=1, tp=0xbf82ef84) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6f29dd7 in ?? () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#4  0xb6ed83f8 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#5  0xb4ae5b1d in mca_pml_ob1_recv () from /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so
#6  0xb77095a8 in PMPI_Recv () from /usr/lib/i386-linux-gnu/libmpi.so.40
#7  0xb77e20a5 in pmpi_recv__ () from /usr/lib/i386-linux-gnu/libmpi_mpifh.so.40
#8  0x0048e6a1 in av (comm=0, nloc=50, nx=10, mv_buf=..., v=..., w=...) at issue46.f:436
#9  0x0048e92a in parnoldi (comm=0) at issue46.f:215
#10 0x0048f58f in issue46 () at issue46.f:21
#11 0x0048e1ec in main (argc=1, argv=0xbf83586f) at issue46.f:23
#12 0xb70b09a1 in __libc_start_main (main=0x48e1b0 <main>, argc=1, argv=0xbf834264, init=0x48f5d0 <__libc_csu_init>, fini=0x48f630 <__libc_csu_fini>,
    rtld_fini=0xb7fdff60 <_dl_fini>, stack_end=0xbf83425c) at ../csu/libc-start.c:310
#13 0x0048e22c in _start ()
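
The two backtraces are easier to compare once each is reduced to the MPI entry point its process is blocked in. The following is a minimal sketch, assuming gdb-style frame lines like those above. The helper names `frame_functions` and `first_mpi_call` are hypothetical and are not part of Open MPI or gdb, and the sample strings are short excerpts of the frames already shown.

```python
import re
from typing import List, Optional

# Matches one gdb frame line, e.g.
#   "#4  0xb6e4d3f8 in opal_progress () from /usr/lib/..."
# Continuation lines (wrapped argument lists) do not start with "#N" and are skipped.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")


def frame_functions(backtrace: str) -> List[str]:
    """Return the function name of every frame, innermost first."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(2))
    return names


def first_mpi_call(backtrace: str) -> Optional[str]:
    """Return the innermost PMPI_* frame, or None if there is none."""
    for name in frame_functions(backtrace):
        if name.startswith("PMPI_"):
            return name
    return None


# Excerpts copied from the two backtraces in this report.
FIRST_PROCESS = """\
#4  0xb6e4d3f8 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#5  0xb76474a5 in ompi_request_default_wait () from /usr/lib/i386-linux-gnu/libmpi.so.40
#9  0xb765cf9b in PMPI_Allreduce () from /usr/lib/i386-linux-gnu/libmpi.so.40
#10 0xb774de78 in pmpi_allreduce__ () from /usr/lib/i386-linux-gnu/libmpi_mpifh.so.40
"""

SECOND_PROCESS = """\
#4  0xb6ed83f8 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#5  0xb4ae5b1d in mca_pml_ob1_recv () from /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so
#6  0xb77095a8 in PMPI_Recv () from /usr/lib/i386-linux-gnu/libmpi.so.40
"""

if __name__ == "__main__":
    print(first_mpi_call(FIRST_PROCESS))
    print(first_mpi_call(SECOND_PROCESS))
```

For the excerpts above, the first process is waiting in `PMPI_Allreduce` (reached from issue46.f:200) and the second in `PMPI_Recv` (reached from issue46.f:215), so the two processes are blocked in different MPI calls, both spinning in `opal_progress`.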
