zio: lock parent zios when updating wait counts on reexecute #17016

robn · 2025-02-01T03:25:04Z

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

While stress-testing some new code related to chaining writes and flushes (coming soon), I found that on resuming the pool after suspend, I could semi-regularly trip the counter assertion in zio_notify_parent():

[ 3622.735193] WARNING: Pool 'pool-655033360322456518' was suspended and is being resumed. Failed I/O will be retried.
[ 3622.772781] VERIFY3(*countp > 0) failed (0 > 0)
[ 3622.773982] PANIC at zio.c:827:zio_notify_parent()

After weeks of trying to track it down, I eventually concluded that this was not a bug in my own code, but rather, an existing increment race. This commit fixes it.

Description

As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents.

It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete.

To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts.
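The locking rule can be illustrated with a toy model. This is a minimal sketch using POSIX threads, not the actual OpenZFS change: `toy_parent_t`, `toy_add_wait()`, `toy_notify_parent()` and `toy_run()` are hypothetical names invented for illustration, and a single counter stands in for the per-parent wait counts. One thread plays the reexecute path adding waits, the other plays children completing and notifying the parent; both take the parent's lock before touching the counter.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical toy model of a parent; this is NOT the OpenZFS zio_t. */
typedef struct toy_parent {
	pthread_mutex_t lock;	/* stands in for the parent zio's lock */
	uint64_t wait_count;	/* stands in for one wait counter */
} toy_parent_t;

#define	N_CHILDREN	100000

/* Models the reexecute side: add a wait on the parent, under its lock. */
static void
toy_add_wait(toy_parent_t *p)
{
	pthread_mutex_lock(&p->lock);
	p->wait_count++;
	pthread_mutex_unlock(&p->lock);
}

/* Models the notify side: a finished child drops its wait, under the lock. */
static void
toy_notify_parent(toy_parent_t *p)
{
	pthread_mutex_lock(&p->lock);
	assert(p->wait_count > 0);	/* same shape as the tripped VERIFY */
	p->wait_count--;
	pthread_mutex_unlock(&p->lock);
}

static void *
toy_reexecute_thread(void *arg)
{
	for (int i = 0; i < N_CHILDREN; i++)
		toy_add_wait(arg);
	return (NULL);
}

static void *
toy_notify_thread(void *arg)
{
	for (int i = 0; i < N_CHILDREN; i++)
		toy_notify_parent(arg);
	return (NULL);
}

/*
 * Start with N_CHILDREN active children, then run N_CHILDREN increments
 * and N_CHILDREN decrements concurrently. With both sides locked, no
 * update is lost, so the count ends exactly where it started.
 */
uint64_t
toy_run(void)
{
	toy_parent_t p;
	pthread_t a, b;

	pthread_mutex_init(&p.lock, NULL);
	p.wait_count = N_CHILDREN;
	pthread_create(&a, NULL, toy_reexecute_thread, &p);
	pthread_create(&b, NULL, toy_notify_thread, &p);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	pthread_mutex_destroy(&p.lock);
	return (p.wait_count);
}
```

Calling `toy_run()` from a `main()` and building with `-pthread` should return `N_CHILDREN`. If the lock is dropped from `toy_add_wait()` only, the unlocked read-modify-write can overwrite a concurrent decrement or be overwritten by one, so the final count may drift in either direction. In the terms of this PR, that is either a count that reaches zero too early or a parent left waiting on a child that does not exist.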

Discussion

Unfortunately, I can't reproduce this on stock OpenZFS. My flushing work (for reasons) changes the way leaf writes are waited for, and so their timing. With it applied, I run a heavy write stress test, forcing a pool suspend every 5-10 minutes and then resuming it. Under that load, I trip this on roughly every 15th resume.

I don't entirely understand the shape of the ZIO tree, but it appears that as we recursively reexecute ZIOs, they create their own children, which execute and can reach READY and zio_notify_parent() while the reexecute recursion is still running. If it lines up just right, zio_notify_parent() can be holding its parent's lock and decrementing *countp at the same time as we start reexecuting another child of that parent, and so are updating the parent's counters. We don't have the lock, so we race.

(ugh, I think I just restated the description).

Anyway, with this change, my work in progress ran for hours overnight (~13 hours) without a single hit, which I have never achieved before.

It's possible that it is my work causing the problem, though I don't see how: it doesn't directly touch the suspend/resume paths, and I've written it two different ways to rule out a refcounting bug in the first version. I know it's hard for you to judge this without seeing it. Still, I think this change is intuitively correct: we are reexecuting the tree one ZIO at a time, and everywhere else we mess with the parent counts, we take the parent lock. I'm not concerned about contention, as zio_reexecute() is only called when resuming after suspend, or when retrying failed IO. These are both fairly rare situations where we're trying to save the furniture, and performance is not a concern at these times.

How Has This Been Tested?

Full ZTS run completed.

Local testing as above.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents.

It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete.

To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17016

allanjude approved these changes Feb 1, 2025

amotin approved these changes Feb 3, 2025

amotin added the Status: Accepted Ready to integrate (reviewed, tested) label Feb 3, 2025

amotin merged commit 390f6c1 into openzfs:master Feb 4, 2025
22 of 25 checks passed

robn mentioned this pull request Feb 18, 2025

Ensure all writes are flushed to disk #17065

Closed

13 tasks

robn mentioned this pull request Apr 11, 2025

config: fix ZFS_LINUX_TEST_RESULT_SYMBOL with --enable-linux-builtin #17236

Merged

13 tasks
