[sled-agent] Destroy orphaned datasets (PR 3/2) #8323

Open
wants to merge 4 commits into
base: main

jgallagher

Builds on #8302. This allows the config-reconciler to destroy durable datasets it believes are orphaned due to expunged Omicron zones. This is disabled by default and can be enabled on a sled-by-sled basis via a new `omdb sled-agent chicken-switch destroy-orphans enable` subcommand (with related commands to get and disable the same).

We don't want to ship automatic dataset deletion before R17 (R16 should only ship "report orphaned datasets"), but we need to be able to turn it on for upgrade testing in the meantime. All of this chicken-switch stuff should be removable after R16, once we're comfortable enabling deletion in general.
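
A minimal sketch of the chicken-switch idea described above, assuming a shared `Arc<AtomicBool>` like the one this PR threads into the reconciler. The `DestroyOrphansSwitch` wrapper, `OrphanAction`, and `plan_orphans` are hypothetical names for illustration only, not the sled-agent API; the real omdb and reconciler plumbing is more involved.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// What a reconciler pass would do with a dataset it believes is orphaned.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
enum OrphanAction {
    /// Default behavior: only report the orphan.
    ReportOnly(String),
    /// Behavior once the chicken switch has been enabled.
    Destroy(String),
}

/// Hypothetical handle to the "destroy orphans" switch. Clones share the
/// same underlying flag, so an API handler and the reconciler task can each
/// hold one.
#[derive(Clone, Default)]
struct DestroyOrphansSwitch(Arc<AtomicBool>);

impl DestroyOrphansSwitch {
    // The flag is a standalone boolean; no other data is published through
    // it, so `Ordering::Relaxed` is sufficient for both stores and loads.
    fn enable(&self) {
        self.0.store(true, Ordering::Relaxed);
    }

    fn disable(&self) {
        self.0.store(false, Ordering::Relaxed);
    }

    fn get(&self) -> bool {
        self.0.load(Ordering::Relaxed)
    }
}

/// Read the switch once per pass and decide what to do with each orphan.
fn plan_orphans(
    orphans: &[&str],
    switch: &DestroyOrphansSwitch,
) -> Vec<OrphanAction> {
    let destroy = switch.get();
    orphans
        .iter()
        .map(|name| {
            if destroy {
                OrphanAction::Destroy(name.to_string())
            } else {
                OrphanAction::ReportOnly(name.to_string())
            }
        })
        .collect()
}
```

With a freshly constructed `DestroyOrphansSwitch::default()`, `plan_orphans` yields only `ReportOnly` entries; after `enable()` is called on any clone of the handle, the next pass yields `Destroy` entries instead.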

Base automatically changed from john/sled-agent-config-reconciler-report-orphaned-datasets-inventory to main June 12, 2025 13:33
@jgallagher jgallagher force-pushed the john/sled-agent-destroy-orphaned-datasets branch from f3ac7a8 to fce9033 Compare June 12, 2025 13:36
/// control "chicken switches" (potentially-destructive sled-agent behavior
/// that can be toggled on or off via `omdb`)
#[clap(subcommand)]
ChickenSwitch(ChickenSwitchCommands),

🐔

List,
enum ChickenSwitchCommands {
/// interact with the "destroy orphaned datasets" chicken switch
DestroyOrphans(DestroyOrphansArgs),

I can't get over the name of this variant.

@@ -65,6 +66,7 @@ pub(crate) fn spawn<T: SledAgentFacilities>(
currently_managed_zpools_tx: watch::Sender<Arc<CurrentlyManagedZpools>>,
external_disks_tx: watch::Sender<HashSet<Disk>>,
raw_disks_rx: RawDisksReceiver,
destroy_orphans: Arc<AtomicBool>,

I've said it before. All you need is an AtomicBool!

.datasets_report_orphans(
datasets.clone(),
currently_managed_zpools,
self.destroy_orphans.load(Ordering::Relaxed),

Yay, proper usage of ordering!

@jgallagher

Tested this on london during an upgrade today: enabling the switch seems to have worked, and sled-agent correctly destroyed the orphans.

One case that would have failed without this is the following internal DNS dataset, which we expunged and replaced on the same zpool:

*   oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns                                                 c843fe79-82b0-4598-8286-48447b0a49a1   - in service   none      none          off
     └─                                                                                                                                                + expunged
+   oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns                                                 ea8a9f32-c8fd-45d4-a5a2-fdbc580bf4c5   in service     none      none          off
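
One way to picture why this case needs destruction: a ZFS dataset name can refer to only one dataset at a time, so the replacement (same name, new ID) cannot be created while the expunged one still occupies the name. The sketch below is an illustration under that reading, not the reconciler's actual logic; `Step` and `plan_dataset` are hypothetical names, and it assumes the on-disk `oxide:uuid` property identifies which config entry a dataset belongs to.

```rust
use std::collections::BTreeMap;

/// Hypothetical outcome for one desired, in-service dataset.
#[derive(Debug, PartialEq, Eq)]
enum Step {
    /// Nothing with this name exists on disk yet.
    Create,
    /// A dataset with this name and the wanted ID already exists.
    Keep,
    /// The name is occupied by a dataset with a different ID (an orphan),
    /// so it must be destroyed before the replacement can be created.
    DestroyThenCreate,
}

/// `on_disk` maps dataset name -> `oxide:uuid` value found on disk.
fn plan_dataset(
    on_disk: &BTreeMap<&str, &str>,
    name: &str,
    wanted_id: &str,
) -> Step {
    match on_disk.get(name) {
        None => Step::Create,
        Some(id) if *id == wanted_id => Step::Keep,
        Some(_) => Step::DestroyThenCreate,
    }
}
```

Applied to the table above, the on-disk ID `c843fe79-…` does not match the in-service ID `ea8a9f32-…` for the same dataset name, which falls into the `DestroyThenCreate` arm.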

Looking at the zpool history, we see when RSS created the initial dataset, and when sled-agent destroyed it and replaced it with the new one:

# Initial dataset created during RSS (ID c843fe79-82b0-4598-8286-48447b0a49a1 matches now-expunged dataset)
1986-12-28.00:11:04 zfs create -o zoned=on -o canmount=on -o mountpoint=/data oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns
1986-12-28.00:11:05 zfs set quota=none reservation=none compression=off oxide:uuid=c843fe79-82b0-4598-8286-48447b0a49a1 oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns

# sled-agent destroyed it
2025-06-13.15:17:28 zfs destroy -r oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns

# New dataset (ID ea8a9f32-c8fd-45d4-a5a2-fdbc580bf4c5 matches now-added dataset)
2025-06-13.15:17:32 zfs create -o zoned=on -o canmount=on -o mountpoint=/data oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns
2025-06-13.15:17:32 zfs set quota=none reservation=none compression=off oxide:uuid=ea8a9f32-c8fd-45d4-a5a2-fdbc580bf4c5 oxp_204f2891-299e-4c76-8ea0-a40bfd307c45/crypt/internal_dns
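
The correlation above (which `oxide:uuid` was set on the dataset before and after the destroy) can also be checked mechanically. The following is a rough sketch that parses history lines of the shape shown above; `HistoryEntry` and `parse_history_line` are hypothetical helpers, and the parser assumes each command line is `<timestamp> zfs <subcommand> ... <dataset>` with the dataset name as the final word.

```rust
/// One parsed `zfs` command from the pool history.
#[derive(Debug, PartialEq, Eq)]
struct HistoryEntry<'a> {
    timestamp: &'a str,
    subcommand: &'a str,
    /// Value of `oxide:uuid=...` if this command set it.
    uuid: Option<&'a str>,
    dataset: &'a str,
}

/// Parse one history line. Returns `None` for blank lines, `#` annotations,
/// and anything else that is not a `zfs` command.
fn parse_history_line(line: &str) -> Option<HistoryEntry<'_>> {
    let mut words = line.split_whitespace();
    let timestamp = words.next()?;
    if words.next()? != "zfs" {
        return None;
    }
    let subcommand = words.next()?;
    let rest: Vec<&str> = words.collect();
    // Assumption: the dataset name is the last argument on the line.
    let dataset = *rest.last()?;
    let uuid = rest.iter().find_map(|w| w.strip_prefix("oxide:uuid="));
    Some(HistoryEntry { timestamp, subcommand, uuid, dataset })
}
```

Filtering the parsed entries by dataset name and reading off the `uuid` of each `set` entry on either side of the `destroy` entry reproduces the comparison made by hand above.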
