[2.3.2] Revert "zinject: count matches and injections for each handler" #17137

robn · 2025-03-12T04:45:49Z

Motivation and Context

2.3.1 introduced a userspace/kernel ABI break that I missed. As a result, 2.3.1 userspace tools are incompatible with <2.3.1 kernel modules for many tasks.

After I updated userspace to 2.3.1, zfs send started returning EINVAL, and zfs events returned EBADF. I rebooted and didn't think much of it; my local machine is weird sometimes.

Later I noticed why: the "match counts" change adds a couple of fields to zinject_record_t. There's one of those embedded in the middle of zfs_cmd_t. zc_cleanup_fd and some of the send params both occur after that field. The kernel and userspace were running on different layouts, so couldn't find what they needed.

I don't know if you want to revert the change and cut a 2.3.2 release. If you do, this is the revert. It might be unnecessary though. depends how long before folks can reboot.

(also, an apology from me: feels like something I should have noticed. I'm gonna go put some compile time size checks on a few things to hopefully help prevent this next time).

Description

Adding fields to zinject_record_t unexpectedly extended zfs_cmd_t, preventing some things working properly with 2.3.1 userspace tools against 2.3.0 kernel module.

This reverts commit fabdd50.

How Has This Been Tested?

Compile checked only.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

Adding fields to zinject_record_t unexpectedly extended zfs_cmd_t, preventing some things working properly with 2.3.1 userspace tools against 2.3.0 kernel module. This reverts commit fabdd50. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>

vedranmiletic · 2025-03-12T09:58:51Z

Not everybody upgrades immediately after a release comes out, so it might be worth it.

amotin

If it affects things outside of zinject itself (I hoped it won't) and we will to release 2.3.2 soon enough, then it makes sense.

Anuskuss · 2025-04-02T22:43:57Z

I got bit pretty hard yesterday, and I think it was related to this.
I almost never reboot my server and v2.3.1 reached bookworm-backports a couple of days ago which my unattended-upgrades picked up. Yesterday I SSHed into my server and went into .zfs/snapshots to retrieve a file I had accidentally modified, which resulted in my SSH session locking up. I waited a few minutes in case there was a timeout which could've saved the situation but it never came. So I closed the session, unable to start a new one. I had a backup plan in case something like this happend where I could send a reboot command (which I don't like doing; every reboot bears the risk of my server not waking up) via the web server but that also didn't help. Everything else was working fine but this was certainly concerning. All out of options I had to grab my keys and drive to my server (I have no physical access to it and there were no humans around) to plug in a keyboard (blinly because I didn't have a display) and try to log in. Nothing happend though and I can only presume that it didn't let me in because i was in the midst of a reboot (/etc/nologin). So after repeatedly pressing the power button I unfortunately had to cut my losses and pull the plug 😔. I don't think I had any data loss but that still sucked majorly. I specifically chose ZFS because I want my data to be safe.

Anyway would you be able to infer from this short log snippet that this was indeed caused by fabdd50?

kernel: INFO: task mount.zfs:3522765 blocked for more than 120 seconds.
kernel:       Tainted: P     U     OE      6.1.0-28-amd64 #1 Debian 6.1.119-1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:mount.zfs       state:D stack:0     pid:3522765 ppid:3522764 flags:0x00000002
kernel: Call Trace:
kernel:  <TASK>
kernel:  __schedule+0x34d/0x9e0
kernel:  schedule+0x5a/0xd0
kernel:  schedule_timeout+0x118/0x150
kernel:  wait_for_completion_state+0x14b/0x220
kernel:  call_usermodehelper_exec+0x16a/0x1a0
kernel:  zfsctl_snapshot_mount+0x246/0x500 [zfs]
kernel:  zpl_snapdir_automount+0xc/0x20 [zfs]
kernel:  __traverse_mounts+0x8c/0x210
kernel:  step_into+0x340/0x760
kernel:  path_lookupat+0x67/0x190
kernel:  filename_lookup+0xe4/0x1f0
kernel:  ? getname_flags.part.0+0x4b/0x1c0
kernel:  user_path_at_empty+0x36/0x50
kernel:  user_statfs+0x54/0xd0
kernel:  __do_sys_statfs+0x35/0x70
kernel:  do_syscall_64+0x55/0xb0
kernel:  ? exit_to_user_mode_prepare+0x40/0x1e0
kernel:  ? syscall_exit_to_user_mode+0x1e/0x40
kernel:  ? do_syscall_64+0x61/0xb0
kernel:  ? zfs_ioc_pool_get_props+0xdc/0x180 [zfs]
kernel:  ? zfsdev_ioctl_common+0x630/0xa00 [zfs]
kernel:  ? free_unref_page_commit+0x76/0x170
kernel:  ? free_unref_page+0x15f/0x1d0
kernel:  ? zfsdev_ioctl+0x9e/0xd0 [zfs]
kernel:  ? exit_to_user_mode_prepare+0x40/0x1e0
kernel:  ? syscall_exit_to_user_mode+0x1e/0x40
kernel:  ? do_syscall_64+0x61/0xb0
kernel:  ? do_user_addr_fault+0x1b0/0x550
kernel:  ? exit_to_user_mode_prepare+0x40/0x1e0
kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
kernel: RIP: 0033:0x7f5c41068bb7
kernel: RSP: 002b:00007ffc5dc2ff78 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
kernel: RAX: ffffffffffffffda RBX: 0000557343a8ed30 RCX: 00007f5c41068bb7
kernel: RDX: 0000000000000000 RSI: 00007ffc5dc2ffa0 RDI: 00007ffc5dc333c0
kernel: RBP: 00007ffc5dc33290 R08: 0000000000000000 R09: 0000000000000000
kernel: R10: 0000000000000100 R11: 0000000000000246 R12: 00007ffc5dc30240
kernel: R13: 00007ffc5dc333c0 R14: 0000000000000000 R15: 0000557343a84300
kernel:  </TASK>

Because if it did, I would urge you to release v2.3.2 ASAP. I got screwed already but there're still people out there waiting to be saved!

robn · 2025-04-03T00:51:46Z

I can't say for certain without more study, but I think it's highly plausible. The auto-mount behaviour calls out to the userspace mount.zfs command, which then makes calls back into OpenZFS. I could totally see that getting wildly confused somewhere and melting down.

#17190 looks like the same thing.

@tonyhutter we should probably get a 2.3.2 out there as soon as we can (though maybe the horse has bolted already). If you want to, and I can help somewhere, please let me know.

robn · 2025-04-03T00:59:46Z

Actually, thinking more. If 2.3.1 is already out there, then we're just going to blow up again when 2.3.2 userspace gets out there. Should I do a patch to try to support both?

tonyhutter · 2025-04-03T16:48:12Z

Actually, thinking more. If 2.3.1 is already out there, then we're just going to blow up again when 2.3.2 userspace gets out there. Should I do a patch to try to support both?

I just put out the 2.3.2 proposed pathset: #17214. Let me know if you want to add an additional patch to support both, or bail on mismatched kernel/userspace for zfs send.

robn · 2025-04-03T23:58:48Z

@tonyhutter seems wider than that, if it can contribute to the snapshot automount request infinite looping.

In any case, I'm not sure much can be done outside of querying the kernel module version number, and I don't know if I trust that since vendors patch things in surprising ways. I had hoped I could find a way to get the kernel module to tell me something I could use to infer the size of its zfs_cmd_t so userspace could adjust appropriately, but I couldn't see a way to get it.

I'm not personally too bothered about it; like, I hate that it happened, but that ship has sailed. But if there's a quick way to detect it that might not cause further problems, I'll do something, otherwise lets just press on.

robn requested review from behlendorf and tonyhutter March 12, 2025 04:46

amotin approved these changes Mar 12, 2025

View reviewed changes

tonyhutter approved these changes Mar 12, 2025

View reviewed changes

amotin mentioned this pull request Mar 17, 2025

zed process exits after "Processing events since eid=0" #17153

Open

amotin linked an issue Mar 17, 2025 that may be closed by this pull request

zed process exits after "Processing events since eid=0" #17153

Open

tonyhutter merged commit 5f70370 into openzfs:zfs-2.3.2-staging Mar 24, 2025
31 of 37 checks passed

robn mentioned this pull request Apr 3, 2025

ls of /.zfs/snapshot subdir during zfs package upgrade spawns 3000+ mount procs and blocks #17190

Open

srithon mentioned this pull request Apr 16, 2025

PANIC: VERIFY(avl_find(tree, new_node, &where) == NULL) failed #16685

Open

amarshall mentioned this pull request Apr 23, 2025

nixos: pkgs.zfs -> config.boot.zfs.package NixOS/nixpkgs#401016

Draft

13 tasks

robn mentioned this pull request May 12, 2025

zfs send -I is broken with binary from 2.3.2 and kernel module 2.3.1 #17330

Closed

nyanloutre mentioned this pull request May 13, 2025

zfs: ZFS Event Daemon (zed) fails to start NixOS/nixpkgs#406430

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[2.3.2] Revert "zinject: count matches and injections for each handler" #17137

[2.3.2] Revert "zinject: count matches and injections for each handler" #17137

Uh oh!

robn commented Mar 12, 2025

Uh oh!

vedranmiletic commented Mar 12, 2025

Uh oh!

amotin left a comment

Uh oh!

Uh oh!

Anuskuss commented Apr 2, 2025

Uh oh!

robn commented Apr 3, 2025

Uh oh!

robn commented Apr 3, 2025

Uh oh!

tonyhutter commented Apr 3, 2025 •

edited

Loading

Uh oh!

robn commented Apr 3, 2025

Uh oh!

Uh oh!

[2.3.2] Revert "zinject: count matches and injections for each handler" #17137

[2.3.2] Revert "zinject: count matches and injections for each handler" #17137

Uh oh!

Conversation

robn commented Mar 12, 2025

Motivation and Context

Description

How Has This Been Tested?

Types of changes

Checklist:

Uh oh!

vedranmiletic commented Mar 12, 2025

Uh oh!

amotin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Anuskuss commented Apr 2, 2025

Uh oh!

robn commented Apr 3, 2025

Uh oh!

robn commented Apr 3, 2025

Uh oh!

tonyhutter commented Apr 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

robn commented Apr 3, 2025

Uh oh!

Uh oh!

tonyhutter commented Apr 3, 2025 •

edited

Loading