DLPX-83701 Make function mnt_add_count() traceable #18

don-brady · 2022-12-07T22:22:31Z

Background

Some of the kernel functions in the unmount path are not traceable, which makes it harder to debug busy unmounts.

Problem

To help diagnose busy unmounts, mnt_add_count() and do_umount() should be traceable with bpftrace probes.

Solution

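Mark these functions noinline and __noclone. noinline keeps each function as a separate symbol that a kprobe can attach to, and __noclone keeps the compiler from cloning the function (for example through the "-fipa-sra" optimization), which would otherwise change its signature.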
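As a rough illustration of the annotations described above, here is a minimal userspace sketch. It is not the kernel patch: demo_noinline and demo_noclone are stand-ins for the kernel's noinline and __noclone macros, and struct demo_mount, demo_mnt_add_count() and demo_get_put() are hypothetical names invented for this example.

```c
#include <assert.h>

/* Fallback for compilers that do not provide __has_attribute. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

/* Userspace stand-ins for the kernel's noinline and __noclone macros. */
#define demo_noinline __attribute__((__noinline__))
#if __has_attribute(__noclone__)
#define demo_noclone __attribute__((__noclone__))
#else
#define demo_noclone /* attribute not available on this compiler */
#endif

/* Hypothetical, simplified structure; not the kernel's struct mount. */
struct demo_mount {
    int mnt_count;
};

/*
 * With both attributes the compiler is asked to keep this function as a
 * single out-of-line symbol with its declared two-argument signature.
 */
static demo_noinline demo_noclone void
demo_mnt_add_count(struct demo_mount *mnt, int n)
{
    mnt->mnt_count += n;
}

/* Takes and drops one reference, then returns the resulting count. */
static int demo_get_put(struct demo_mount *mnt)
{
    demo_mnt_add_count(mnt, 1);
    demo_mnt_add_count(mnt, -1);
    return mnt->mnt_count;
}
```

Without the attributes, an optimizing compiler is free to inline a small static function like this into its callers or to emit a renamed clone, and in either case there may be no symbol with the original name left to probe.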

Testing Done

Confirm these function symbols are traceable:

delphix@ip-10-110-220-172:~$ sudo bpftrace -l '*mnt_add_count'
kprobe:mnt_add_count
delphix@ip-10-110-220-172:~$ sudo bpftrace -l '*do_umount'
kprobe:do_umount

Also used them in a bpftrace script that watches for mnt_add_count() calls that decrement the mount reference count after a busy unmount has occurred.

BugLink: https://bugs.launchpad.net/bugs/1990009

Tested on x86-64 and Ilya was also kind enough to give it a spin on s390x, both passing with probe_user:OK there. The test is using the newly added bpf_probe_read_user() to dump sockaddr from connect call into .bss BPF map and overrides the user buffer via bpf_probe_write_user():

  # ./test_progs
  [...]
  #17 pkt_md_access:OK
  #18 probe_user:OK
  #19 prog_run_xattr:OK
  [...]

Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Tested-by: Ilya Leoshkevich <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/90f449d8af25354e05080e82fc6e2d3179da30ea.1572649915.git.daniel@iogearbox.net
(cherry picked from commit fa553d9)
Signed-off-by: Tim Gardner <[email protected]>
Acked-by: Cengiz Can <[email protected]>
Acked-by: Joseph Salisbury <[email protected]>
Signed-off-by: Tim Gardner <[email protected]>

BugLink: https://bugs.launchpad.net/bugs/2002347

[ Upstream commit 5dd7caf ]

In __unregister_kprobe_top(), if the currently unregistered probe has post_handler but other child probes of the aggrprobe do not have post_handler, the post_handler of the aggrprobe is cleared. If this is a ftrace-based probe, there is a problem. In later calls to disarm_kprobe(), we will use kprobe_ftrace_ops because post_handler is NULL. But we're armed with kprobe_ipmodify_ops. This triggers a WARN in __disarm_kprobe_ftrace() and may even cause use-after-free:

  Failed to disarm kprobe-ftrace at kernel_clone+0x0/0x3c0 (error -2)
  WARNING: CPU: 5 PID: 137 at kernel/kprobes.c:1135 __disarm_kprobe_ftrace.isra.21+0xcf/0xe0
  Modules linked in: testKprobe_007(-)
  CPU: 5 PID: 137 Comm: rmmod Not tainted 6.1.0-rc4-dirty #18
  [...]
  Call Trace:
   <TASK>
   __disable_kprobe+0xcd/0xe0
   __unregister_kprobe_top+0x12/0x150
   ? mutex_lock+0xe/0x30
   unregister_kprobes.part.23+0x31/0xa0
   unregister_kprobe+0x32/0x40
   __x64_sys_delete_module+0x15e/0x260
   ? do_user_addr_fault+0x2cd/0x6b0
   do_syscall_64+0x3a/0x90
   entry_SYSCALL_64_after_hwframe+0x63/0xcd
   [...]

For the kprobe-on-ftrace case, we keep the post_handler setting to identify this aggrprobe armed with kprobe_ipmodify_ops. This way we can disarm it correctly.

Link: https://lore.kernel.org/all/[email protected]/
Fixes: 0bc11ed ("kprobes: Allow kprobes coexist with livepatch")
Reported-by: Zhao Gongyi <[email protected]>
Suggested-by: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Li Huafei <[email protected]>
Acked-by: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Masami Hiramatsu (Google) <[email protected]>
Signed-off-by: Sasha Levin <[email protected]>
Signed-off-by: Kamal Mostafa <[email protected]>
Signed-off-by: Stefan Bader <[email protected]>

…g the sock BugLink: https://bugs.launchpad.net/bugs/2003914 [ Upstream commit 3cf7203 ] There is a race condition in vxlan that when deleting a vxlan device during receiving packets, there is a possibility that the sock is released after getting vxlan_sock vs from sk_user_data. Then in later vxlan_ecn_decapsulate(), vxlan_get_sk_family() we will got NULL pointer dereference. e.g. #0 [ffffa25ec6978a38] machine_kexec at ffffffff8c669757 #1 [ffffa25ec6978a90] __crash_kexec at ffffffff8c7c0a4d #2 [ffffa25ec6978b58] crash_kexec at ffffffff8c7c1c48 #3 [ffffa25ec6978b60] oops_end at ffffffff8c627f2b #4 [ffffa25ec6978b80] page_fault_oops at ffffffff8c678fcb #5 [ffffa25ec6978bd8] exc_page_fault at ffffffff8d109542 #6 [ffffa25ec6978c00] asm_exc_page_fault at ffffffff8d200b62 [exception RIP: vxlan_ecn_decapsulate+0x3b] RIP: ffffffffc1014e7b RSP: ffffa25ec6978cb0 RFLAGS: 00010246 RAX: 0000000000000008 RBX: ffff8aa000888000 RCX: 0000000000000000 RDX: 000000000000000e RSI: ffff8a9fc7ab803e RDI: ffff8a9fd1168700 RBP: ffff8a9fc7ab803e R8: 0000000000700000 R9: 00000000000010ae R10: ffff8a9fcb748980 R11: 0000000000000000 R12: ffff8a9fd1168700 R13: ffff8aa000888000 R14: 00000000002a0000 R15: 00000000000010ae ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #7 [ffffa25ec6978ce8] vxlan_rcv at ffffffffc10189cd [vxlan] #8 [ffffa25ec6978d90] udp_queue_rcv_one_skb at ffffffff8cfb6507 #9 [ffffa25ec6978dc0] udp_unicast_rcv_skb at ffffffff8cfb6e45 #10 [ffffa25ec6978dc8] __udp4_lib_rcv at ffffffff8cfb8807 #11 [ffffa25ec6978e20] ip_protocol_deliver_rcu at ffffffff8cf76951 #12 [ffffa25ec6978e48] ip_local_deliver at ffffffff8cf76bde #13 [ffffa25ec6978ea0] __netif_receive_skb_one_core at ffffffff8cecde9b #14 [ffffa25ec6978ec8] process_backlog at ffffffff8cece139 #15 [ffffa25ec6978f00] __napi_poll at ffffffff8ceced1a #16 [ffffa25ec6978f28] net_rx_action at ffffffff8cecf1f3 #17 [ffffa25ec6978fa0] __softirqentry_text_start at ffffffff8d4000ca #18 [ffffa25ec6978ff0] do_softirq at ffffffff8c6fbdc3 
Reproducer: https://github.com/Mellanox/ovs-tests/blob/master/test-ovs-vxlan-remove-tunnel-during-traffic.sh Fix this by waiting for all sk_user_data reader to finish before releasing the sock. Reported-by: Jianlin Shi <[email protected]> Suggested-by: Jakub Sitnicki <[email protected]> Fixes: 6a93cc9 ("udp-tunnel: Add a few more UDP tunnel APIs") Signed-off-by: Hangbin Liu <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]> Signed-off-by: Kamal Mostafa <[email protected]> Signed-off-by: Stefan Bader <[email protected]>

BugLink: https://bugs.launchpad.net/bugs/2003914 [ Upstream commit b6702a9 ] syzkaller reported use-after-free with the stack trace like below [1]: [ 38.960489][ C3] ================================================================== [ 38.963216][ C3] BUG: KASAN: use-after-free in ar5523_cmd_tx_cb+0x220/0x240 [ 38.964950][ C3] Read of size 8 at addr ffff888048e03450 by task swapper/3/0 [ 38.966363][ C3] [ 38.967053][ C3] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 6.0.0-09039-ga6afa4199d3d-dirty #18 [ 38.968464][ C3] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.fc36 04/01/2014 [ 38.969959][ C3] Call Trace: [ 38.970841][ C3] <IRQ> [ 38.971663][ C3] dump_stack_lvl+0xfc/0x174 [ 38.972620][ C3] print_report.cold+0x2c3/0x752 [ 38.973626][ C3] ? ar5523_cmd_tx_cb+0x220/0x240 [ 38.974644][ C3] kasan_report+0xb1/0x1d0 [ 38.975720][ C3] ? ar5523_cmd_tx_cb+0x220/0x240 [ 38.976831][ C3] ar5523_cmd_tx_cb+0x220/0x240 [ 38.978412][ C3] __usb_hcd_giveback_urb+0x353/0x5b0 [ 38.979755][ C3] usb_hcd_giveback_urb+0x385/0x430 [ 38.981266][ C3] dummy_timer+0x140c/0x34e0 [ 38.982925][ C3] ? notifier_call_chain+0xb5/0x1e0 [ 38.984761][ C3] ? rcu_read_lock_sched_held+0xb/0x60 [ 38.986242][ C3] ? lock_release+0x51c/0x790 [ 38.987323][ C3] ? _raw_read_unlock_irqrestore+0x37/0x70 [ 38.988483][ C3] ? __wake_up_common_lock+0xde/0x130 [ 38.989621][ C3] ? reacquire_held_locks+0x4a0/0x4a0 [ 38.990777][ C3] ? lock_acquire+0x472/0x550 [ 38.991919][ C3] ? rcu_read_lock_sched_held+0xb/0x60 [ 38.993138][ C3] ? lock_acquire+0x472/0x550 [ 38.994890][ C3] ? dummy_urb_enqueue+0x860/0x860 [ 38.996266][ C3] ? do_raw_spin_unlock+0x16f/0x230 [ 38.997670][ C3] ? dummy_urb_enqueue+0x860/0x860 [ 38.999116][ C3] call_timer_fn+0x1a0/0x6a0 [ 39.000668][ C3] ? add_timer_on+0x4a0/0x4a0 [ 39.002137][ C3] ? reacquire_held_locks+0x4a0/0x4a0 [ 39.003809][ C3] ? __next_timer_interrupt+0x226/0x2a0 [ 39.005509][ C3] __run_timers.part.0+0x69a/0xac0 [ 39.007025][ C3] ? 
dummy_urb_enqueue+0x860/0x860 [ 39.008716][ C3] ? call_timer_fn+0x6a0/0x6a0 [ 39.010254][ C3] ? cpuacct_percpu_seq_show+0x10/0x10 [ 39.011795][ C3] ? kvm_sched_clock_read+0x14/0x40 [ 39.013277][ C3] ? sched_clock_cpu+0x69/0x2b0 [ 39.014724][ C3] run_timer_softirq+0xb6/0x1d0 [ 39.016196][ C3] __do_softirq+0x1d2/0x9be [ 39.017616][ C3] __irq_exit_rcu+0xeb/0x190 [ 39.019004][ C3] irq_exit_rcu+0x5/0x20 [ 39.020361][ C3] sysvec_apic_timer_interrupt+0x8f/0xb0 [ 39.021965][ C3] </IRQ> [ 39.023237][ C3] <TASK> In ar5523_probe(), ar5523_host_available() calls ar5523_cmd() as below (there are other functions which finally call ar5523_cmd()): ar5523_probe() -> ar5523_host_available() -> ar5523_cmd_read() -> ar5523_cmd() If ar5523_cmd() timed out, then ar5523_host_available() failed and ar5523_probe() freed the device structure. So, ar5523_cmd_tx_cb() might touch the freed structure. This patch fixes this issue by canceling in-flight tx cmd if submitted urb timed out. Link: https://syzkaller.appspot.com/bug?id=9e12b2d54300842b71bdd18b54971385ff0d0d3a [1] Reported-by: [email protected] Signed-off-by: Shigeru Yoshida <[email protected]> Signed-off-by: Kalle Valo <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sasha Levin <[email protected]> Signed-off-by: Kamal Mostafa <[email protected]> Signed-off-by: Stefan Bader <[email protected]>

BugLink: https://bugs.launchpad.net/bugs/2003914 [ Upstream commit 031af50 ] The inline assembly for arm64's cmpxchg_double*() implementations use a +Q constraint to hazard against other accesses to the memory location being exchanged. However, the pointer passed to the constraint is a pointer to unsigned long, and thus the hazard only applies to the first 8 bytes of the location. GCC can take advantage of this, assuming that other portions of the location are unchanged, leading to a number of potential problems. This is similar to what we fixed back in commit: fee960b ("arm64: xchg: hazard against entire exchange variable") ... but we forgot to adjust cmpxchg_double*() similarly at the same time. The same problem applies, as demonstrated with the following test: | struct big { | u64 lo, hi; | } __aligned(128); | | unsigned long foo(struct big *b) | { | u64 hi_old, hi_new; | | hi_old = b->hi; | cmpxchg_double_local(&b->lo, &b->hi, 0x12, 0x34, 0x56, 0x78); | hi_new = b->hi; | | return hi_old ^ hi_new; | } ... 
which GCC 12.1.0 compiles as: | 0000000000000000 <foo>: | 0: d503233f paciasp | 4: aa0003e4 mov x4, x0 | 8: 1400000e b 40 <foo+0x40> | c: d2800240 mov x0, #0x12 // #18 | 10: d2800681 mov x1, #0x34 // #52 | 14: aa0003e5 mov x5, x0 | 18: aa0103e6 mov x6, x1 | 1c: d2800ac2 mov x2, #0x56 // #86 | 20: d2800f03 mov x3, #0x78 // #120 | 24: 48207c82 casp x0, x1, x2, x3, [x4] | 28: ca050000 eor x0, x0, x5 | 2c: ca060021 eor x1, x1, x6 | 30: aa010000 orr x0, x0, x1 | 34: d2800000 mov x0, #0x0 // #0 <--- BANG | 38: d50323bf autiasp | 3c: d65f03c0 ret | 40: d2800240 mov x0, #0x12 // #18 | 44: d2800681 mov x1, #0x34 // #52 | 48: d2800ac2 mov x2, #0x56 // #86 | 4c: d2800f03 mov x3, #0x78 // #120 | 50: f9800091 prfm pstl1strm, [x4] | 54: c87f1885 ldxp x5, x6, [x4] | 58: ca0000a5 eor x5, x5, x0 | 5c: ca0100c6 eor x6, x6, x1 | 60: aa0600a6 orr x6, x5, x6 | 64: b5000066 cbnz x6, 70 <foo+0x70> | 68: c8250c82 stxp w5, x2, x3, [x4] | 6c: 35ffff45 cbnz w5, 54 <foo+0x54> | 70: d2800000 mov x0, #0x0 // #0 <--- BANG | 74: d50323bf autiasp | 78: d65f03c0 ret Notice that at the lines with "BANG" comments, GCC has assumed that the higher 8 bytes are unchanged by the cmpxchg_double() call, and that `hi_old ^ hi_new` can be reduced to a constant zero, for both LSE and LL/SC versions of cmpxchg_double(). This patch fixes the issue by passing a pointer to __uint128_t into the +Q constraint, ensuring that the compiler hazards against the entire 16 bytes being modified. 
With this change, GCC 12.1.0 compiles the above test as: | 0000000000000000 <foo>: | 0: f9400407 ldr x7, [x0, #8] | 4: d503233f paciasp | 8: aa0003e4 mov x4, x0 | c: 1400000f b 48 <foo+0x48> | 10: d2800240 mov x0, #0x12 // #18 | 14: d2800681 mov x1, #0x34 // #52 | 18: aa0003e5 mov x5, x0 | 1c: aa0103e6 mov x6, x1 | 20: d2800ac2 mov x2, #0x56 // #86 | 24: d2800f03 mov x3, #0x78 // #120 | 28: 48207c82 casp x0, x1, x2, x3, [x4] | 2c: ca050000 eor x0, x0, x5 | 30: ca060021 eor x1, x1, x6 | 34: aa010000 orr x0, x0, x1 | 38: f9400480 ldr x0, [x4, #8] | 3c: d50323bf autiasp | 40: ca0000e0 eor x0, x7, x0 | 44: d65f03c0 ret | 48: d2800240 mov x0, #0x12 // #18 | 4c: d2800681 mov x1, #0x34 // #52 | 50: d2800ac2 mov x2, #0x56 // #86 | 54: d2800f03 mov x3, #0x78 // #120 | 58: f9800091 prfm pstl1strm, [x4] | 5c: c87f1885 ldxp x5, x6, [x4] | 60: ca0000a5 eor x5, x5, x0 | 64: ca0100c6 eor x6, x6, x1 | 68: aa0600a6 orr x6, x5, x6 | 6c: b5000066 cbnz x6, 78 <foo+0x78> | 70: c8250c82 stxp w5, x2, x3, [x4] | 74: 35ffff45 cbnz w5, 5c <foo+0x5c> | 78: f9400480 ldr x0, [x4, #8] | 7c: d50323bf autiasp | 80: ca0000e0 eor x0, x7, x0 | 84: d65f03c0 ret ... sampling the high 8 bytes before and after the cmpxchg, and performing an EOR, as we'd expect. For backporting, I've tested this atop linux-4.9.y with GCC 5.5.0. Note that linux-4.9.y is oldest currently supported stable release, and mandates GCC 5.1+. Unfortunately I couldn't get a GCC 5.1 binary to run on my machines due to library incompatibilities. I've also used a standalone test to check that we can use a __uint128_t pointer in a +Q constraint at least as far back as GCC 4.8.5 and LLVM 3.9.1. 
Fixes: 5284e1b ("arm64: xchg: Implement cmpxchg_double") Fixes: e9a4b79 ("arm64: cmpxchg_dbl: patch in lse instructions when supported by the CPU") Reported-by: Boqun Feng <[email protected]> Link: https://lore.kernel.org/lkml/Y6DEfQXymYVgL3oJ@boqun-archlinux/ Reported-by: Peter Zijlstra <[email protected]> Link: https://lore.kernel.org/lkml/[email protected]/ Signed-off-by: Mark Rutland <[email protected]> Cc: [email protected] Cc: Arnd Bergmann <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Steve Capper <[email protected]> Cc: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> Signed-off-by: Sasha Levin <[email protected]> Signed-off-by: Kamal Mostafa <[email protected]> Signed-off-by: Stefan Bader <[email protected]>

BugLink: https://bugs.launchpad.net/bugs/2076435

commit be346c1 upstream.

The code in ocfs2_dio_end_io_write() estimates number of necessary transaction credits using ocfs2_calc_extend_credits(). This however does not take into account that the IO could be arbitrarily large and can contain arbitrary number of extents.

Extent tree manipulations do often extend the current transaction but not in all of the cases. For example if we have only single block extents in the tree, ocfs2_mark_extent_written() will end up calling ocfs2_replace_extent_rec() all the time and we will never extend the current transaction and eventually exhaust all the transaction credits if the IO contains many single block extents. Once that happens a WARN_ON(jbd2_handle_buffer_credits(handle) <= 0) is triggered in jbd2_journal_dirty_metadata() and subsequently OCFS2 aborts in response to this error. This was actually triggered by one of our customers on a heavily fragmented OCFS2 filesystem.

To fix the issue make sure the transaction always has enough credits for one extent insert before each call of ocfs2_mark_extent_written().

Heming Zhao said:

------
PANIC: "Kernel panic - not syncing: OCFS2: (device dm-1): panic forced after error"

PID: xxx  TASK: xxxx  CPU: 5  COMMAND: "SubmitThread-CA"
  #0 machine_kexec at ffffffff8c069932
  #1 __crash_kexec at ffffffff8c1338fa
  #2 panic at ffffffff8c1d69b9
  #3 ocfs2_handle_error at ffffffffc0c86c0c [ocfs2]
  #4 __ocfs2_abort at ffffffffc0c88387 [ocfs2]
  #5 ocfs2_journal_dirty at ffffffffc0c51e98 [ocfs2]
  #6 ocfs2_split_extent at ffffffffc0c27ea3 [ocfs2]
  #7 ocfs2_change_extent_flag at ffffffffc0c28053 [ocfs2]
  #8 ocfs2_mark_extent_written at ffffffffc0c28347 [ocfs2]
  #9 ocfs2_dio_end_io_write at ffffffffc0c2bef9 [ocfs2]
  #10 ocfs2_dio_end_io at ffffffffc0c2c0f5 [ocfs2]
  #11 dio_complete at ffffffff8c2b9fa7
  #12 do_blockdev_direct_IO at ffffffff8c2bc09f
  #13 ocfs2_direct_IO at ffffffffc0c2b653 [ocfs2]
  #14 generic_file_direct_write at ffffffff8c1dcf14
  #15 __generic_file_write_iter at ffffffff8c1dd07b
  #16 ocfs2_file_write_iter at ffffffffc0c49f1f [ocfs2]
  #17 aio_write at ffffffff8c2cc72e
  #18 kmem_cache_alloc at ffffffff8c248dde
  #19 do_io_submit at ffffffff8c2ccada
  #20 do_syscall_64 at ffffffff8c004984
  #21 entry_SYSCALL_64_after_hwframe at ffffffff8c8000ba

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c15471f ("ocfs2: fix sparse file & data ordering issue in direct io")
Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Joseph Qi <[email protected]>
Reviewed-by: Heming Zhao <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Portia Stephens <[email protected]>
Signed-off-by: Roxana Nicolescu <[email protected]>

BugLink: https://bugs.launchpad.net/bugs/2078289 [ Upstream commit f0c1802 ] When running BPF selftests (./test_progs -t sockmap_basic) on a Loongarch platform, the following kernel panic occurs: [...] Oops[#1]: CPU: 22 PID: 2824 Comm: test_progs Tainted: G OE 6.10.0-rc2+ #18 Hardware name: LOONGSON Dabieshan/Loongson-TC542F0, BIOS Loongson-UDK2018 ... ... ra: 90000000048bf6c0 sk_msg_recvmsg+0x120/0x560 ERA: 9000000004162774 copy_page_to_iter+0x74/0x1c0 CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE) PRMD: 0000000c (PPLV0 +PIE +PWE) EUEN: 00000007 (+FPE +SXE +ASXE -BTE) ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7) ESTAT: 00010000 [PIL] (IS= ECode=1 EsubCode=0) BADV: 0000000000000040 PRID: 0014c011 (Loongson-64bit, Loongson-3C5000) Modules linked in: bpf_testmod(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack Process test_progs (pid: 2824, threadinfo=0000000000863a31, task=...) Stack : ... Call Trace: [<9000000004162774>] copy_page_to_iter+0x74/0x1c0 [<90000000048bf6c0>] sk_msg_recvmsg+0x120/0x560 [<90000000049f2b90>] tcp_bpf_recvmsg_parser+0x170/0x4e0 [<90000000049aae34>] inet_recvmsg+0x54/0x100 [<900000000481ad5c>] sock_recvmsg+0x7c/0xe0 [<900000000481e1a8>] __sys_recvfrom+0x108/0x1c0 [<900000000481e27c>] sys_recvfrom+0x1c/0x40 [<9000000004c076ec>] do_syscall+0x8c/0xc0 [<9000000003731da4>] handle_syscall+0xc4/0x160 Code: ... ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Fatal exception Kernel relocated by 0x3510000 .text @ 0x9000000003710000 .data @ 0x9000000004d70000 .bss @ 0x9000000006469400 ---[ end Kernel panic - not syncing: Fatal exception ]--- [...] This crash happens every time when running sockmap_skb_verdict_shutdown subtest in sockmap_basic. This crash is because a NULL pointer is passed to page_address() in the sk_msg_recvmsg(). Due to the different implementations depending on the architecture, page_address(NULL) will trigger a panic on Loongarch platform but not on x86 platform. 
So this bug was hidden on x86 platform for a while, but now it is exposed on Loongarch platform. The root cause is that a zero length skb (skb->len == 0) was put on the queue. This zero length skb is a TCP FIN packet, which was sent by shutdown(), invoked in test_sockmap_skb_verdict_shutdown(): shutdown(p1, SHUT_WR); In this case, in sk_psock_skb_ingress_enqueue(), num_sge is zero, and no page is put to this sge (see sg_set_page in sg_set_page), but this empty sge is queued into ingress_msg list. And in sk_msg_recvmsg(), this empty sge is used, and a NULL page is got by sg_page(sge). Pass this NULL page to copy_page_to_iter(), which passes it to kmap_local_page() and to page_address(), then kernel panics. To solve this, we should skip this zero length skb. So in sk_msg_recvmsg(), if copy is zero, that means it's a zero length skb, skip invoking copy_page_to_iter(). We are using the EFAULT return triggered by copy_page_to_iter to check for is_fin in tcp_bpf.c. Fixes: 604326b ("bpf, sockmap: convert to generic sk_msg interface") Suggested-by: John Fastabend <[email protected]> Signed-off-by: Geliang Tang <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Reviewed-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/e3a16eacdc6740658ee02a33489b1b9d4912f378.1719992715.git.tanggeliang@kylinos.cn Signed-off-by: Sasha Levin <[email protected]> Signed-off-by: Portia Stephens <[email protected]> Signed-off-by: Stefan Bader <[email protected]>
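The "skip the copy when copy is zero" idea from the commit message above can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct demo_sge and demo_recv() are hypothetical names, memcpy() stands in for copy_page_to_iter(), and the sketch does not model the EFAULT return path the message mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scatter-gather element; page may be NULL when length is 0. */
struct demo_sge {
    const char *page;
    size_t length;
};

/*
 * Copies at most len bytes from the elements into dst and returns the
 * number of bytes copied. A zero-length element is skipped before its
 * page pointer is ever touched.
 */
static size_t demo_recv(const struct demo_sge *sge, size_t count,
                        char *dst, size_t len)
{
    size_t copied = 0;

    for (size_t i = 0; i < count && copied < len; i++) {
        size_t copy = sge[i].length;

        if (copy > len - copied)
            copy = len - copied;
        if (copy == 0)
            continue; /* nothing to copy, do not dereference page */

        memcpy(dst + copied, sge[i].page, copy);
        copied += copy;
    }
    return copied;
}
```

The point of the guard is that the empty element is never handed to the copy routine, so its NULL page is never dereferenced.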

BugLink: https://bugs.launchpad.net/bugs/2100983 commit 6e64d6b3a3c39655de56682ec83e894978d23412 upstream. In commit e4b5ccd392b9 ("drm/v3d: Ensure job pointer is set to NULL after job completion"), we introduced a change to assign the job pointer to NULL after completing a job, indicating job completion. However, this approach created a race condition between the DRM scheduler workqueue and the IRQ execution thread. As soon as the fence is signaled in the IRQ execution thread, a new job starts to be executed. This results in a race condition where the IRQ execution thread sets the job pointer to NULL simultaneously as the `run_job()` function assigns a new job to the pointer. This race condition can lead to a NULL pointer dereference if the IRQ execution thread sets the job pointer to NULL after `run_job()` assigns it to the new job. When the new job completes and the GPU emits an interrupt, `v3d_irq()` is triggered, potentially causing a crash. [ 466.310099] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000c0 [ 466.318928] Mem abort info: [ 466.321723] ESR = 0x0000000096000005 [ 466.325479] EC = 0x25: DABT (current EL), IL = 32 bits [ 466.330807] SET = 0, FnV = 0 [ 466.333864] EA = 0, S1PTW = 0 [ 466.337010] FSC = 0x05: level 1 translation fault [ 466.341900] Data abort info: [ 466.344783] ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000 [ 466.350285] CM = 0, WnR = 0, TnD = 0, TagAccess = 0 [ 466.355350] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 466.360677] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000089772000 [ 466.367140] [00000000000000c0] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000 [ 466.375875] Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP [ 466.382163] Modules linked in: rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device algif_hash algif_skcipher af_alg bnep binfmt_misc vc4 snd_soc_hdmi_codec drm_display_helper cec brcmfmac_wcc spidev rpivid_hevc(C) drm_client_lib brcmfmac hci_uart 
drm_dma_helper pisp_be btbcm brcmutil snd_soc_core aes_ce_blk v4l2_mem2mem bluetooth aes_ce_cipher snd_compress videobuf2_dma_contig ghash_ce cfg80211 gf128mul snd_pcm_dmaengine videobuf2_memops ecdh_generic sha2_ce ecc videobuf2_v4l2 snd_pcm v3d sha256_arm64 rfkill videodev snd_timer sha1_ce libaes gpu_sched snd videobuf2_common sha1_generic drm_shmem_helper mc rp1_pio drm_kms_helper raspberrypi_hwmon spi_bcm2835 gpio_keys i2c_brcmstb rp1 raspberrypi_gpiomem rp1_mailbox rp1_adc nvmem_rmem uio_pdrv_genirq uio i2c_dev drm ledtrig_pattern drm_panel_orientation_quirks backlight fuse dm_mod ip_tables x_tables ipv6 [ 466.458429] CPU: 0 UID: 1000 PID: 2008 Comm: chromium Tainted: G C 6.13.0-v8+ #18 [ 466.467336] Tainted: [C]=CRAP [ 466.470306] Hardware name: Raspberry Pi 5 Model B Rev 1.0 (DT) [ 466.476157] pstate: 404000c9 (nZcv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 466.483143] pc : v3d_irq+0x118/0x2e0 [v3d] [ 466.487258] lr : __handle_irq_event_percpu+0x60/0x228 [ 466.492327] sp : ffffffc080003ea0 [ 466.495646] x29: ffffffc080003ea0 x28: ffffff80c0c94200 x27: 0000000000000000 [ 466.502807] x26: ffffffd08dd81d7b x25: ffffff80c0c94200 x24: ffffff8003bdc200 [ 466.509969] x23: 0000000000000001 x22: 00000000000000a7 x21: 0000000000000000 [ 466.517130] x20: ffffff8041bb0000 x19: 0000000000000001 x18: 0000000000000000 [ 466.524291] x17: ffffffafadfb0000 x16: ffffffc080000000 x15: 0000000000000000 [ 466.531452] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 [ 466.538613] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffffd08c527eb0 [ 466.545777] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000 [ 466.552941] x5 : ffffffd08c4100d0 x4 : ffffffafadfb0000 x3 : ffffffc080003f70 [ 466.560102] x2 : ffffffc0829e8058 x1 : 0000000000000001 x0 : 0000000000000000 [ 466.567263] Call trace: [ 466.569711] v3d_irq+0x118/0x2e0 [v3d] (P) [ 466.573826] __handle_irq_event_percpu+0x60/0x228 [ 466.578546] handle_irq_event+0x54/0xb8 [ 466.582391] 
handle_fasteoi_irq+0xac/0x240 [ 466.586498] generic_handle_domain_irq+0x34/0x58 [ 466.591128] gic_handle_irq+0x48/0xd8 [ 466.594798] call_on_irq_stack+0x24/0x58 [ 466.598730] do_interrupt_handler+0x88/0x98 [ 466.602923] el0_interrupt+0x44/0xc0 [ 466.606508] __el0_irq_handler_common+0x18/0x28 [ 466.611050] el0t_64_irq_handler+0x10/0x20 [ 466.615156] el0t_64_irq+0x198/0x1a0 [ 466.618740] Code: 52800035 3607faf3 f9442e80 52800021 (f9406018) [ 466.624853] ---[ end trace 0000000000000000 ]--- [ 466.629483] Kernel panic - not syncing: Oops: Fatal exception in interrupt [ 466.636384] SMP: stopping secondary CPUs [ 466.640320] Kernel Offset: 0x100c400000 from 0xffffffc080000000 [ 466.646259] PHYS_OFFSET: 0x0 [ 466.649141] CPU features: 0x100,00000170,00901250,0200720b [ 466.654644] Memory Limit: none [ 466.657706] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]--- Fix the crash by assigning the job pointer to NULL before signaling the fence. This ensures that the job pointer is cleared before any new job starts execution, preventing the race condition and the NULL pointer dereference crash. Cc: [email protected] Fixes: e4b5ccd392b9 ("drm/v3d: Ensure job pointer is set to NULL after job completion") Signed-off-by: Maíra Canal <[email protected]> Reviewed-by: Jose Maria Casanova Crespo <[email protected]> Reviewed-by: Iago Toral Quiroga <[email protected]> Tested-by: Phil Elwell <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> Signed-off-by: Noah Wager <[email protected]> Signed-off-by: Koichiro Den <[email protected]>
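The ordering fix described in the commit message above can be illustrated with a single-threaded C sketch that replays one interleaving deterministically. All names here (struct demo_job, struct demo_queue, demo_signal_fence() and the two demo_complete_* functions) are hypothetical, and the scheduler starting a new job is modeled as happening synchronously inside the fence signal; this is not the v3d driver code.

```c
#include <assert.h>
#include <stddef.h>

struct demo_job {
    int id;
};

struct demo_queue {
    struct demo_job *active; /* job currently running on the hardware */
    struct demo_job *next;   /* job the scheduler will start next */
};

/* Stand-in for signaling the fence: the scheduler starts the next job. */
static void demo_signal_fence(struct demo_queue *q)
{
    if (q->next) {
        q->active = q->next;
        q->next = NULL;
    }
}

/* Order described as racy: the clear lands after the new job is assigned. */
static void demo_complete_clear_after(struct demo_queue *q)
{
    demo_signal_fence(q);
    q->active = NULL; /* wipes the pointer to the newly started job */
}

/* Order described as the fix: clear first, then signal the fence. */
static void demo_complete_clear_before(struct demo_queue *q)
{
    q->active = NULL;
    demo_signal_fence(q);
}
```

In the racy order, the queue ends up with no active job even though a new one was just started, which matches the NULL dereference the message describes when the next interrupt arrives. In the fixed order, the active pointer refers to the new job.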

DLPX-83701 Make function mnt_add_count() traceable

4821180

pcd1193182 approved these changes Dec 7, 2022

tonynguien approved these changes Dec 8, 2022

don-brady merged commit d5fef04 into delphix:6.0/stage Dec 9, 2022

don-brady deleted the dlpx-83701-azure branch December 9, 2022 16:40

delphix-devops-bot pushed a commit that referenced this pull request Dec 15, 2022

DLPX-83701 Make function mnt_add_count() traceable (#18)

ff54082

delphix-devops-bot pushed a commit that referenced this pull request Jan 11, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

bff8965

delphix-devops-bot pushed a commit that referenced this pull request Feb 2, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

346b4d2

delphix-devops-bot pushed a commit that referenced this pull request Feb 10, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

b17b225

delphix-devops-bot pushed a commit that referenced this pull request Mar 4, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

7b00e24

prakashsurya pushed a commit that referenced this pull request Mar 14, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

1dc7a6f

prakashsurya pushed a commit that referenced this pull request Mar 14, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

379fa4f

delphix-devops-bot pushed a commit that referenced this pull request Mar 30, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

4427b2e

delphix-devops-bot pushed a commit that referenced this pull request Apr 20, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

e3ec2d8

delphix-devops-bot pushed a commit that referenced this pull request Apr 28, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

67a697f

delphix-devops-bot pushed a commit that referenced this pull request May 26, 2023

DLPX-83701 Make function mnt_add_count() traceable (#18)

30cedad

delphix-devops-bot pushed a commit that referenced this pull request Mar 24, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

cbc69a2

delphix-devops-bot pushed a commit that referenced this pull request Mar 25, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

1e8e1b9

jwk404 pushed a commit to jwk404/linux-kernel-azure that referenced this pull request Mar 25, 2024

DLPX-83701 Make function mnt_add_count() traceable (delphix#18)

4464f65

delphix-devops-bot pushed a commit that referenced this pull request Mar 26, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

c3b6902

delphix-devops-bot pushed a commit that referenced this pull request Mar 27, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

d31be43

jwk404 pushed a commit that referenced this pull request Apr 10, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

5c42d3d

jwk404 pushed a commit that referenced this pull request Apr 10, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

022d480

jwk404 pushed a commit that referenced this pull request Apr 11, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

76a1774

jwk404 pushed a commit that referenced this pull request Apr 14, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

4052cb5

jwk404 pushed a commit that referenced this pull request Apr 15, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

1d75088

jwk404 pushed a commit that referenced this pull request Apr 15, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

a22d072

jwk404 pushed a commit that referenced this pull request Apr 15, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

e0a78c6

delphix-devops-bot pushed a commit that referenced this pull request Apr 20, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

6bc26a7

delphix-devops-bot pushed a commit that referenced this pull request May 9, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

f033f01

delphix-devops-bot pushed a commit that referenced this pull request May 16, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

80e6ca4

delphix-devops-bot pushed a commit that referenced this pull request Jun 30, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

832bcb0

delphix-devops-bot pushed a commit that referenced this pull request Aug 1, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

2da8be5

delphix-devops-bot pushed a commit that referenced this pull request Aug 6, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

b7cf946

pcd1193182 pushed a commit to pcd1193182/linux-kernel-azure that referenced this pull request Aug 19, 2024

DLPX-83701 Make function mnt_add_count() traceable (delphix#18)

e24687c

delphix-devops-bot pushed a commit that referenced this pull request Aug 22, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

2279326

delphix-devops-bot pushed a commit that referenced this pull request Aug 23, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

8446bb3

prakashsurya pushed a commit that referenced this pull request Sep 23, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

9017853

delphix-devops-bot pushed a commit that referenced this pull request Oct 20, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

19735c0

palash-gandhi pushed a commit that referenced this pull request Oct 24, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

e465da9

delphix-devops-bot pushed a commit that referenced this pull request Nov 23, 2024

DLPX-83701 Make function mnt_add_count() traceable (#18)

b126c22

delphix-devops-bot pushed a commit that referenced this pull request Jan 10, 2025

DLPX-83701 Make function mnt_add_count() traceable (#18)

9c66b13

delphix-devops-bot pushed a commit that referenced this pull request Feb 1, 2025

DLPX-83701 Make function mnt_add_count() traceable (#18)

8db813c
