Skip to content

zed: Ensure spare activation after kernel-initiated device removal #17187

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 28, 2025

Conversation

ixhamza
Copy link
Member

@ixhamza ixhamza commented Mar 27, 2025

Motivation and Context

zed fails to activate hotspare if a device is removed by the kernel.

Description

In addition to hotplug events, the kernel may also mark a failing vdev as REMOVED. This was observed in one of our customer report and reproduced by forcing the NVMe host driver to disable the device after a failed reset due to command timeout. In such cases, the spare was not activated because the device had already transitioned to a REMOVED state before zed processed the event.
To address this, explicitly attempt hot spare activation when the kernel marks a device as REMOVED.

How Has This Been Tested?

  • Explicitly making device removed by the kernel and validating that spare replaces the removed vdev.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Signed-off-by: Ameer Hamza <[email protected]>
Copy link
Contributor

@tonyhutter tonyhutter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code looks fine. I'm just waiting on the CI results

@amotin amotin added the Status: Accepted Ready to integrate (reviewed, tested) label Mar 28, 2025
@amotin amotin merged commit 30cc233 into openzfs:master Mar 28, 2025
31 of 37 checks passed
@ixhamza ixhamza deleted the NAS-135033 branch March 28, 2025 19:51
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 3, 2025
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#17187
fuporovvStack pushed a commit to fuporovvStack/zfs that referenced this pull request Apr 11, 2025
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#17187
fuporovvStack pushed a commit to fuporovvStack/zfs that referenced this pull request Apr 11, 2025
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#17187
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Apr 16, 2025
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#17187
defaziogiancarlo pushed a commit to LLNL/zfs that referenced this pull request Jun 2, 2025
In addition to hotplug events, the kernel may also mark a failing vdev
as REMOVED. This was observed in a customer report and reproduced by
forcing the NVMe host driver to disable the device after a failed reset
due to command timeout. In such cases, the spare was not activated
because the device had already transitioned to a REMOVED state before
zed processed the event.
To address this, explicitly attempt hot spare activation when the
kernel marks a device as REMOVED.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#17187
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Accepted Ready to integrate (reviewed, tested)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants