-
Notifications
You must be signed in to change notification settings - Fork 1.9k
zio: lock parent zios when updating wait counts on reexecute #17016
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
allanjude
approved these changes
Feb 1, 2025
amotin
approved these changes
Feb 3, 2025
arter97
pushed a commit
to arter97/zfs
that referenced
this pull request
Feb 4, 2025
As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17016
13 tasks
ixhamza
pushed a commit
to truenas/zfs
that referenced
this pull request
Feb 19, 2025
As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17016
ixhamza
pushed a commit
to ixhamza/zfs
that referenced
this pull request
Feb 25, 2025
As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17016
ixhamza
pushed a commit
to truenas/zfs
that referenced
this pull request
Feb 25, 2025
As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17016
ixhamza
pushed a commit
to ixhamza/zfs
that referenced
this pull request
Feb 27, 2025
As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17016
pcd1193182
pushed a commit
to KlaraSystems/zfs
that referenced
this pull request
Mar 24, 2025
As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17016
13 tasks
lundman
pushed a commit
to openzfsonosx/openzfs-fork
that referenced
this pull request
Jun 2, 2025
As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17016
lundman
pushed a commit
to openzfsonosx/openzfs-fork
that referenced
this pull request
Jun 2, 2025
As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17016
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
While stress-testing some new code related to chaining writes and flushes (coming soon), I found that on resuming the pool after suspend, I could semi-regularly trip the counter assertion in
zio_notify_parent()
:After weeks of trying to track it down, I eventually concluded that this was not a bug in my own code, but rather, an existing increment race. This commit fixes it.
Description
As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents.
It's possible for those parents to have active children passing through READY or DONE, which then end up in
zio_notify_parent()
, take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert inzio_notify_parent()
), or a parent waiting forever for a nonexistent child to complete.To protect against this, we simply take the appropriate zio locks in
zio_reexecute()
before updating the wait counts.Discussion
Unfortunately, I can't reproduce this on stock OpenZFS. With my flushing work, which (for reasons) changes the way leaf writes are waited for and so their timing, I run a heavy write stress test, forcing a pool suspend every 5-10 minutes and resuming it. On that load, I trip this on average every ~15th resume.
I don't entirely understand the shape of the ZIO tree, but it appears to be that as we recursively reexecute ZIOs, they create their own children which execute and can reach READY and
zio_notify_parent()
while the rexecute recursion is still running. If it lines up just right, thezio_notify_parent()
can be holding its parent lock and decrementing*countp
at the same time we start reexecuting another child of that parent, and so are updating the parent's counters. We don't have the lock, so we race.(ugh, I think I just restated the description).
Anyway, with this change, my work in progress ran for hours overnight (~13 hours) without a single hit, which I have never achieved before.
It's possible that it is my work causing the problem, though I don't see how - it doesn't directly touch the suspend/resume paths, and I've written it two different ways to rule out a refcounting bug in the first version. I know it's hard for you to judge this without seeing it. Still, I think this is change is intuitively correct - we are reexecuting the tree one ZIO at a time, and everywhere else we mess with the parent counts, we take the parent lock. I'm not concerned about contention, as
zio_reexecute()
is only called when resuming after suspend, or when retrying failed IO; these are both fairly rare situations where we're trying to save the furniture; performance is not a concern at this times.How Has This Been Tested?
Full ZTS run completed.
Local testing as above.
Types of changes
Checklist:
Signed-off-by
.