osu_latency tests with CUDA segfault after mpi_memory_alloc_kinds is introduced #13096

Closed
@jiaxiyan

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

main branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

build main branch from source

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

08e41ed 3rd-party/openpmix (v1.1.3-4067-g08e41ed5)
30cadc6746ebddd69ea42ca78b964398f782e4e3 3rd-party/prrte (psrvr-v2.0.0rc1-4839-g30cadc6746)
dfff67569fb72dbf8d73a1dcf74d091dad93f71b config/oac (dfff675)

Please describe the system on which you are running

  • Operating system/version: Amazon Linux2
  • Computer hardware: p4d.24xlarge
  • Network type: Elastic Fabric Adapter

Details of the problem

The osu-micro-benchmarks CUDA tests have been failing with a segfault since #13055 was merged.

mpirun --wdir . -n 2 --hostfile hostfile --map-by ppr:2:node --timeout 1800 -x LD_LIBRARY_PATH=/opt/amazon/efa/lib64 -x PATH  /home/osu-micro-benchmarks/mpi/pt2pt/osu_latency  --buffer-num multiple -d cuda H D

2025-02-12 18:03:27,068 - INFO - utils - mpirun output:
# OSU MPI-CUDA Latency Test
# Send Buffer on HOST (H) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       0.65
[ip-172-31-17-116:33408] *** Process received signal ***
[ip-172-31-17-116:33408] Signal: Segmentation fault (11)
[ip-172-31-17-116:33408] Signal code: Invalid permissions (2)
[ip-172-31-17-116:33408] Failing at address: 0x7f1303600000
[ip-172-31-17-116:33408] [ 0] /lib64/libpthread.so.0(+0x118e0)[0x7f133b5258e0]
[ip-172-31-17-116:33408] [ 1] /lib64/libc.so.6(+0x14dbeb)[0x7f133b2b4beb]
[ip-172-31-17-116:33408] [ 2] /opt/amazon/efa/lib64/libfabric.so.1(+0x1f672)[0x7f12e78cc672]
[ip-172-31-17-116:33408] [ 3] /opt/amazon/efa/lib64/libfabric.so.1(+0x1f627)[0x7f12e78cc627]
....
[ip-172-31-17-116:33408] *** End of error message ***
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------

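The backtrace shows that the segfault comes from a memcpy attempting to copy 1 byte from an inaccessible memory address. The buffer passed to MPI_Send (buf=0x7f9173200000) reaches the libfabric shm provider with iface=FI_HMEM_SYSTEM and mr=0x0, so it is copied with a plain host memmove.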

(gdb) bt
#0  0x00007f91a9139be8 in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#1  0x00007f915561432f in ofi_memcpy (device=0, dest=0x7f9145412da0, src=0x7f9173200000, size=1)
    at ./include/ofi_hmem.h:263
#2  0x00007f91556144eb in ofi_copy_from_hmem (iface=FI_HMEM_SYSTEM, device=0, dest=0x7f9145412da0, src=0x7f9173200000,
    size=1) at ./include/ofi_hmem.h:405
#3  0x00007f9155614eb6 in ofi_copy_mr_iov (mr=0x0, iov=0x7ffd8b06c5f0, iov_count=1, offset=0, buf=0x7f9145412da0,
    size=191, dir=0) at src/hmem.c:458
#4  0x00007f9155614f53 in ofi_copy_from_mr_iov (dest=0x7f9145412da0, size=192, mr=0x0, iov=0x7ffd8b06c5f0, iov_count=1,
    iov_offset=0) at src/hmem.c:473
#5  0x00007f9155731e03 in smr_format_inline (cmd=0x7f9145412d60, mr=0x0, iov=0x7ffd8b06c5f0, count=1)
    at prov/shm/src/smr_ep.c:277
#6  0x00007f9155732e20 in smr_do_inline (ep=0x20ce8f20, peer_smr=0x7f9145396000, id=1, peer_id=0, op=1, tag=1, data=0,
    op_flags=131072, desc=0x0, iov=0x7ffd8b06c5f0, iov_count=1, total_len=1, context=0x0, cmd=0x7f9145412d60)
    at prov/shm/src/smr_ep.c:647
#7  0x00007f915572b559 in smr_generic_inject (ep_fid=0x20ce8f20, buf=0x7f9173200000, len=1, dest_addr=1, tag=1, data=0,
    op=1, op_flags=131072) at prov/shm/src/smr_msg.c:214
#8  0x00007f915572bb75 in smr_tinjectdata (ep_fid=0x20ce8f20, buf=0x7f9173200000, len=1, data=0, dest_addr=1, tag=1)
    at prov/shm/src/smr_msg.c:394
#9  0x00007f91556c47fa in fi_tinjectdata (ep=0x20ce8f20, buf=0x7f9173200000, len=1, data=0, dest_addr=1, tag=1)
    at ./include/rdma/fi_tagged.h:149
#10 0x00007f91556c6c0d in efa_rdm_msg_tinjectdata (ep_fid=0x20ce83c0, buf=0x7f9173200000, len=1, data=0, dest_addr=1,
    tag=1) at prov/efa/src/rdm/efa_rdm_msg.c:594
#11 0x00007f9154103d5a in fi_tinjectdata (ep=0x20ce83c0, buf=0x7f9173200000, len=1, data=0, dest_addr=1, tag=1)
    at /home/ec2-user/PortaFiducia/build/libraries/libfabric/v1.22.x/install/libfabric/include/rdma/fi_tagged.h:149
#12 0x00007f915410c12e in ompi_mtl_ofi_send_generic (ofi_cq_data=true, mode=MCA_PML_BASE_SEND_STANDARD,
    convertor=0x7ffd8b06df60, tag=1, dest=1, comm=0x62e960 <ompi_mpi_comm_world>, mtl=0x7f9154335260 <ompi_mtl_ofi>)
    at mtl_ofi.h:937
#13 ompi_mtl_ofi_send_true (mtl=0x7f9154335260 <ompi_mtl_ofi>, comm=0x62e960 <ompi_mpi_comm_world>, dest=1, tag=1,
    convertor=0x7ffd8b06df60, mode=MCA_PML_BASE_SEND_STANDARD) at mtl_ofi_send_opt.c:38
#14 0x00007f9154985256 in mca_pml_cm_send (buf=0x7f9173200000, count=1, datatype=0x62df60 <ompi_mpi_char>, dst=1, tag=1,
    sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x62e960 <ompi_mpi_comm_world>) at pml_cm.h:347
#15 0x00007f91a98cecbd in PMPI_Send (buf=0x7f9173200000, count=1, type=0x62df60 <ompi_mpi_char>, dest=1, tag=1,
    comm=0x62e960 <ompi_mpi_comm_world>) at send.c:93
#16 0x00000000004029bc in main (argc=<optimized out>, argv=<optimized out>) at osu_latency.c:168
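
As a rough illustration of the signal details above: on Linux, "Signal code: Invalid permissions (2)" corresponds to `SEGV_ACCERR`, meaning the faulting address is mapped but the CPU may not access it (an unmapped address would report `SEGV_MAPERR` instead). The sketch below reproduces that fault class without CUDA or MPI by reading one byte from a `PROT_NONE` page. It assumes, without confirmation from this report, that the CUDA device buffer looks like such a reserved-but-inaccessible range to the host. The helper names `reserve_inaccessible_page` and `probe_host_read` are hypothetical and are not part of Open MPI or libfabric.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf probe_env;
static volatile sig_atomic_t probe_si_code;

/* SIGSEGV handler: remember why the access failed, then unwind. */
static void probe_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    probe_si_code = info->si_code;
    siglongjmp(probe_env, 1);
}

/* Hypothetical stand-in for a device allocation (assumption): one page
 * that is mapped into the address space but has no access permissions. */
void *reserve_inaccessible_page(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);
    return p;
}

/* Try to read one byte from src the way a host-side copy would.
 * Returns 0 on success, or the SIGSEGV si_code if the read faulted. */
int probe_host_read(const void *src)
{
    struct sigaction sa, old;
    volatile int result = 0;
    volatile unsigned char sink;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = probe_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    if (sigsetjmp(probe_env, 1) == 0) {
        /* Volatile read so the compiler cannot drop the 1-byte access. */
        sink = *(const volatile unsigned char *)src;
        (void)sink;
    } else {
        result = probe_si_code;
    }

    sigaction(SIGSEGV, &old, NULL); /* restore the previous handler */
    return result;
}
```

Calling `probe_host_read(reserve_inaccessible_page())` should return `SEGV_ACCERR`, while calling it on an ordinary stack or heap byte should return 0. Under the stated assumption, this is the same outcome frame #0 hits when the device pointer is handed to a host memmove.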
