Skip to content

Kick vsock during VM resume #4796

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Sep 12, 2024

Conversation

ShadowCurse
Copy link
Contributor

@ShadowCurse ShadowCurse commented Sep 11, 2024

Changes

Kick vsock when resuming VM. This ensures that if VM is restored from a snapshot, guest gets a notification to read a queue with TRANSPORT_RESET_EVENT inside. This will trigger guest to close all vsock connections. This is expected behavior as vsock devices does not support bringing connections over the snapshot boundary.

Reason

Fixes #4736

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • If a specific issue led to this PR, this PR closes the issue.
  • The description of changes is clear and encompassing.
  • Any required documentation changes (code and docs) are included in this
    PR.
  • API changes follow the Runbook for Firecracker API changes.
  • User-facing changes are mentioned in CHANGELOG.md.
  • All added/changed functionality is tested.
  • New TODOs link to an issue.
  • Commits meet
    contribution quality standards.

  • This functionality cannot be added in rust-vmm.

@ShadowCurse ShadowCurse added the Type: Fix Indicates a fix to existing code label Sep 11, 2024
@ShadowCurse ShadowCurse self-assigned this Sep 11, 2024
@ShadowCurse ShadowCurse force-pushed the kick_vsock branch 3 times, most recently from 661ae3b to 99a51f1 Compare September 11, 2024 17:31
Copy link

codecov bot commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.

Project coverage is 84.32%. Comparing base (2914d5a) to head (a51d09a).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/vmm/src/device_manager/mmio.rs 0.00% 8 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4796      +/-   ##
==========================================
- Coverage   84.32%   84.32%   -0.01%     
==========================================
  Files         249      249              
  Lines       27519    27521       +2     
==========================================
  Hits        23206    23206              
- Misses       4313     4315       +2     
Flag Coverage Δ
5.10-c5n.metal 84.54% <0.00%> (-0.01%) ⬇️
5.10-m5n.metal 84.52% <0.00%> (-0.02%) ⬇️
5.10-m6a.metal 83.81% <0.00%> (-0.01%) ⬇️
5.10-m6g.metal 80.89% <0.00%> (-0.01%) ⬇️
5.10-m6i.metal 84.52% <0.00%> (+<0.01%) ⬆️
5.10-m7g.metal 80.89% <0.00%> (-0.01%) ⬇️
6.1-c5n.metal 84.53% <0.00%> (-0.02%) ⬇️
6.1-m5n.metal 84.52% <0.00%> (-0.02%) ⬇️
6.1-m6a.metal 83.81% <0.00%> (-0.01%) ⬇️
6.1-m6g.metal 80.89% <0.00%> (-0.01%) ⬇️
6.1-m6i.metal 84.51% <0.00%> (-0.02%) ⬇️
6.1-m7g.metal 80.89% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ShadowCurse ShadowCurse marked this pull request as ready for review September 12, 2024 08:50
@ShadowCurse ShadowCurse added Type: Bug Indicates an unexpected problem or unintended behavior Status: Awaiting review Indicates that a pull request is ready to be reviewed labels Sep 12, 2024
@ShadowCurse ShadowCurse requested a review from pb8o September 12, 2024 09:12
@ShadowCurse ShadowCurse changed the title Kick vsock Kick vsock during VM resume Sep 12, 2024
@ShadowCurse ShadowCurse force-pushed the kick_vsock branch 3 times, most recently from c89c074 to b0bd4cf Compare September 12, 2024 09:28
We need to kick vsock queue during resume
in order for a VM to process `TRANSPORT_RESET_EVENT`
we sent during snapshot creation. Otherwise it will
wait for it forever.

Signed-off-by: Egor Lazarchuk <[email protected]>
The test starts `socat` server on the host and `socat`
client in the guest. The test then validates that `socat`
client stops after snapshot is taken.

Signed-off-by: Egor Lazarchuk <[email protected]>
roypat
roypat previously approved these changes Sep 12, 2024
kalyazin
kalyazin previously approved these changes Sep 12, 2024
roypat
roypat previously approved these changes Sep 12, 2024
Add entry for vsock fix into CHANGELOG.

Signed-off-by: Egor Lazarchuk <[email protected]>
@ShadowCurse ShadowCurse merged commit d096817 into firecracker-microvm:main Sep 12, 2024
5 of 7 checks passed
@ShadowCurse ShadowCurse deleted the kick_vsock branch September 12, 2024 12:24
ShadowCurse added a commit to ShadowCurse/firecracker that referenced this pull request Sep 18, 2024
This is a fix for a fix introduced in firecracker-microvm#4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
ShadowCurse added a commit to ShadowCurse/firecracker that referenced this pull request Sep 18, 2024
This is a fix for a fix introduced in firecracker-microvm#4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
ShadowCurse added a commit to ShadowCurse/firecracker that referenced this pull request Sep 18, 2024
This is a fix for a fix introduced in firecracker-microvm#4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
ShadowCurse added a commit that referenced this pull request Sep 18, 2024
This is a fix for a fix introduced in #4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
RiverPhillips pushed a commit to RiverPhillips/firecracker that referenced this pull request Sep 20, 2024
This is a fix for a fix introduced in firecracker-microvm#4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
Signed-off-by: River Phillips <[email protected]>
kalyazin pushed a commit to kalyazin/firecracker that referenced this pull request Oct 8, 2024
This is a fix for a fix introduced in firecracker-microvm#4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
bchalios pushed a commit that referenced this pull request Oct 8, 2024
This is a fix for a fix introduced in #4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
ShadowCurse added a commit to ShadowCurse/firecracker that referenced this pull request Nov 5, 2024
This is a fix for a fix introduced in firecracker-microvm#4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
ShadowCurse added a commit to ShadowCurse/firecracker that referenced this pull request Nov 5, 2024
This is a fix for a fix introduced in firecracker-microvm#4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
ShadowCurse added a commit to ShadowCurse/firecracker that referenced this pull request Nov 5, 2024
This is a fix for a fix introduced in firecracker-microvm#4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
ShadowCurse added a commit to ShadowCurse/firecracker that referenced this pull request Nov 5, 2024
This is a fix for a fix introduced in firecracker-microvm#4796
The issue was in vsock device hanging after snapshot
restoration due to the guest not being notified about
the termination packet. But there was bug in the fix, maily
we saved the vsock state before the notification was sent,
thus discarding all modifications made to sent the notification.

The reason original fix worked, is because we were only testing
with 1 iteration of snap/restore. This way even though we lost
synchronization with the guest in the event queue state, it worked
fine once. But doing more iterations causes vsock to hang
as before.

This commit fixes the issue by storing vsock state after the
notification is sent and modifies the vsock test to run
multiple iterations of snap/restore.

Signed-off-by: Egor Lazarchuk <[email protected]>
CompuIves added a commit to codesandbox/firecracker that referenced this pull request Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Awaiting review Indicates that a pull request is ready to be reviewed Type: Bug Indicates an unexpected problem or unintended behavior Type: Fix Indicates a fix to existing code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Ongoing read Syscalls on VSocks Don't Get Interrupted After a Snapshot Resume
4 participants