
MTL OFI: Fix Deadlock in fi_cancel given completion during cancel #5499


Merged: 1 commit, Aug 7, 2018


nrspruit
Contributor

  • If a message for a recv that is being cancelled completes after the
    call to fi_cancel, the OFI MTL deadlocks waiting for
    ofi_req->super.ompi_req->req_status._cancelled to be set, which
    never happens because the recv finished successfully.

  • To resolve this issue, the OFI MTL now checks ofi_req->req_started
    inside the loop that waits for the cancel event. If the request is
    being completed, the loop is broken and the cancel path exits,
    setting ofi_req->super.ompi_req->req_status._cancelled = false.

Signed-off-by: Spruit, Neil R [email protected]
(cherry picked from commit 767135c)
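
The race described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the code from this PR: `mock_request`, `mock_engine`, `mock_progress`, `wait_for_cancel`, and `simulate` are hypothetical names, the `cancelled` field stands in for `req_status._cancelled`, and the progress engine is replaced by a counter that delivers either a cancel event or a completion after a few polls.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the MTL request; only the two flags named in
 * the description are modeled. */
typedef struct {
    bool req_started;  /* set once a message has matched the posted recv */
    bool cancelled;    /* stands in for req_status._cancelled */
} mock_request;

/* Which event the simulated progress engine will eventually deliver. */
typedef enum { EVENT_CANCELLED, EVENT_COMPLETED } mock_event;

typedef struct {
    mock_request *req;
    mock_event    pending;  /* event delivered once delay reaches zero */
    int           delay;    /* number of polls before the event arrives */
} mock_engine;

/* One poll of the simulated progress engine (assumption: the real engine
 * would be driven by completion-queue events instead of a counter). */
void mock_progress(mock_engine *eng)
{
    if (eng->delay > 0) {
        eng->delay--;
        return;
    }
    if (eng->pending == EVENT_CANCELLED) {
        eng->req->cancelled = true;
    } else {
        eng->req->req_started = true;
    }
}

/* Wait for an already-issued cancel to resolve. Without the req_started
 * check, a completion that wins the race would leave this loop spinning
 * forever, because cancelled would never become true. */
bool wait_for_cancel(mock_engine *eng)
{
    mock_request *req = eng->req;

    while (!req->cancelled) {
        mock_progress(eng);
        if (req->req_started) {
            /* The recv is completing: report "not cancelled" and stop. */
            req->cancelled = false;
            break;
        }
    }
    return req->cancelled;
}

/* Run one scenario and report whether the request ended up cancelled. */
bool simulate(mock_event outcome, int delay)
{
    mock_request req = { false, false };
    mock_engine  eng = { &req, outcome, delay };
    return wait_for_cancel(&eng);
}
```

In this sketch, `simulate(EVENT_CANCELLED, 3)` returns true, while `simulate(EVENT_COMPLETED, 3)` returns false instead of hanging, which is the behavior the second bullet describes.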

@ompiteam-bot

Can one of the admins verify this patch?

@rhc54
Contributor

rhc54 commented Jul 31, 2018

ok to test

@aravindksg
Contributor

@hppritcha: Could you please take a look? This is a cherry-pick of PR #5476.

@rhc54 rhc54 requested a review from hppritcha August 1, 2018 13:44
@rhc54 rhc54 added this to the v4.0.0 milestone Aug 1, 2018
@hppritcha hppritcha merged commit 9a6f6e6 into open-mpi:v4.0.x Aug 7, 2018