tests/bluetooth/tester: ASSERTION FAIL due to Recursive spinlock when running bt tester on qemu-cortex-m3 #13591


Closed
ghost opened this issue Feb 20, 2019 · 19 comments
Labels: area: Bluetooth, area: QEMU, area: Tests, bug, priority: low, Stale

Comments

@ghost

ghost commented Feb 20, 2019

Describe the bug
When running the Bluetooth tester (tests/bluetooth/tester) for AutoPTS testing on qemu_cortex_m3, there is an ASSERTION FAIL due to a recursive spinlock while processing events.

To Reproduce
Steps to reproduce the behavior:

  1. Run autoptsserver on Windows
  2. Build image
    $ cd tests/bluetooth/tester
    $ mkdir build; cd build
    $ cmake -DBOARD=qemu_cortex_m3 -DCONF_FILE=qemu.conf ..
    $ make
  3. Use autopts to run the test, e.g.:
    $ ./autoptsclient-zephyr.py "c:\Users\Tester\Documents\Profile Tuning Suite\autopts\autopts.pqw6" /home/tester/working/zephyr/workspace/zephyr/tests/bluetooth/tester/build/zephyr/zephyr.elf -i 192.168.56.101 -l 192.168.56.1 -c GAP/BROB/BCST/BV-01-C --debug
  4. Test will fail with BTP_TIMEOUT
  5. Check the iut-zephyr.log

Expected behavior
No crash or assertion

Impact
Test fails

Screenshots or console output
Output from iut-zephyr.log:
[00:00:00.250,162] bt_hci_core.process_events: count 2
[00:00:00.250,165] bt_hci_core.process_events: ev->state 4
[00:00:00.250,166] bt_hci_core.send_cmd: calling net_buf_get
[00:00:00.250,228] bt_hci_core.send_cmd: calling sem_take_wait
[00:00:00.250,230] bt_hci_core.send_cmd: Sending command 0x1009 (buf 0x20005200) to driver
[00:00:00.250,245] bt_hci_core.bt_send: buf 0x20005200 len 3 type 0
[00:00:00.250,272] bt_hci_core.process_events: ev->state 0
[00:00:00.250,278] bt_conn.bt_conn_prepare_events:
[00:00:00.250,282] bt_hci_core.hci_tx_thread: Calling k_poll with 2 events
[00:00:00.251,492] bt_hci_core.bt_buf_get_cmd_complete: sent_cmd 0x20005200
[00:00:00.251,888] bt_hci_core.hci_cmd_complete: opcode 0x1009
[00:00:00.251,896] bt_hci_core.hci_cmd_done: opcode 0x1009 status 0x00 buf 0x20005200
[00:00:00.251,917] bt_hci_core.bt_hci_cmd_send_sync: opcode 0x1009 status 0x00
[00:00:00.251,938] bt_hci_core.read_bdaddr_complete: status 0
[00:00:00.252,025] bt_hci_core.bt_hci_cmd_create: opcode 0x1002 param_len 0
[00:00:00.252,032] bt_hci_core.bt_hci_cmd_create: buf 0x20005200
[00:00:00.252,036] bt_hci_core.bt_hci_cmd_send_sync: buf 0x20005200 opcode 0x1002 len 3
[00:00:00.252,085] bt_hci_core.process_events: count 2
[00:00:00.252,090] bt_hci_core.process_events: ev->state 4
[00:00:00.252,094] bt_hci_core.send_cmd: calling net_buf_get
[00:00:00.252,098] bt_hci_core.send_cmd: calling sem_take_wait
[00:00:00.252,101] bt_hci_core.send_cmd: Sending command 0x1002 (buf 0x20005200) to driver
[00:00:00.252,106] bt_hci_core.bt_send: buf 0x20005200 len 3 type 0
[00:00:00.252,162] bt_hci_core.process_events: ev->state 0
[00:00:00.252,170] bt_conn.bt_conn_prepare_events:
[00:00:00.252,193] bt_hci_core.hci_tx_thread: Calling k_poll with 2 events
ASSERTION FAIL [z_spin_lock_valid(l)] @ /home/tester/working/zephyr/workspace/zephyr/include/spinlock.h:66
Recursive spinlock
[00:00:00.257,828] bt_hci_core.bt_buf_get_cmd_complete: sent_cmd 0x20005200
***** HARD FAULT *****
Fault escalation (see below)
***** Hardware exception *****
Current thread ID = 0x20002848
Faulting instruction address = 0x3bfc
Fatal fault in essential thread! Spinning...
qemu-system-arm: terminating on signal 15 from pid 29649 (python)

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: Zephyr SDK 0.9.5
  • Commit: b75115a
@ghost ghost added the "bug" label Feb 20, 2019
@ghost ghost changed the title from "tests/bluetooth/tester: ASSERTION FAIL due to Recursive spinlock when running on qemu-cortex-m3" to "tests/bluetooth/tester: ASSERTION FAIL due to Recursive spinlock when running bt tester on qemu-cortex-m3" Feb 20, 2019
@carlescufi
Member

carlescufi commented Feb 20, 2019

This looks similar to #13411 (already fixed and closed) and #13289 (still not resolved)

@nashif nashif added the "priority: medium" label Feb 21, 2019
@ghost
Author

ghost commented Mar 8, 2019

I tried with the latest tip (b710177) and was still able to reproduce the issue.
Attached the log files.
iut-zephyr.log

@andyross
Collaborator

@Bluespring - alas I still don't have that test rig set up. Almost certainly this is a bug introduced during the spinlockification series. Can you try reverting each of those in sequence and tell me which patch broke it? You can start with that HEAD (b710177 -- but probably not current master due to the giant renaming patch that went in) and try sequentially reverting:

8a3d57b6cc kernel/userspace: Spinlockification
b29fb220b1 kernel/timer: Spinlockify
f582b55dd6 kernel/pipe: Spinlockify
d27d4e6af2 kernel/sched: Remove remaining irq_lock use
be03dbd4c7 kernel/msg_q: Spinlockify
f0933d0ded kernel/stack: Spinlockify
9eeb6b8779 kernel/mbox: Spinlockify
7df0216d1e kernel/mutex: Spinlockify
603ea42764 kernel/queue: Spinlockify
c0bdcbaaf8 kernel/mem_slab: Spinlockify
e456d0f7dd kernel/thread: Spinlockify
84b47a9290 kernel/mempool: Spinlockify
f2b1a4bb64 kernel/poll: Spinlockify
1bf9bd04b1 kernel: Add _unlocked() variant to context switch primitives
da37a53a54 kernel/k_sem: Spinlockify
ec554f44d9 kernel: Split reschdule & pend into irq/spin lock versions
fb505b3cfd spinlock: Support ztest mocking
04382b9a2a kernel/mem_domain: Spinlockify
32a29d2805 kernel/atomic_c: Spinlockify
a37a981b21 kernel/work_q: Spinlockify

@andyross
Collaborator

(I should point out: doing a git bisect will cut that work by about 2-3x, but it requires a little study and setup if you haven't done it before.)

@tedd-an
Collaborator

tedd-an commented Mar 12, 2019

@andyross Thanks for the git bisect tip. It's a very useful command.
It looks like the issue wasn't caused by any of the commits you listed.
Instead, the testing shows it was caused by this commit:

commit 7367b84f8e6e496babff0e34df53ae9a39d910f5
Author: Andy Ross <[email protected]>
Date:   Wed Jan 30 12:27:43 2019 -0800

    kernel/spinlock: Augment runtime validation

    There was an existing validation layer in the spinlock implementation,
    but it was only enabled when both SMP and CONFIG_DEBUG were enabled,
    which meant that nothing was using it.  Replace it with a more
    elaborate framework that ensures that every lock taken is not already
    taken by the current CPU and is released on the same CPU by the same
    thread.

    This catches the much more common goof of locking a spinlock
    recursively, which would "work" on uniprocessor setups but have the
    side effect of releasing the lock prematurely at the end of the inner
    lock.  We've done that in two spots already.

    Note that this patch causes k_spinlock_t to have non-zero size on
    builds with CONFIG_ASSERT, so expect a little data and code size
    increase.  Worth it IMHO.

    Signed-off-by: Andy Ross <[email protected]>

:040000 040000 9065a1a59623fd0c2b113bc04c1deeb21d4e0256 c9af6cefac23dda58cf0e9a6f70d48d172dedf30 M	include

And this is the git bisect report:

[tester@localhost zephyr]$ git bisect log
git bisect start
# good: [fe7f1e17deb0e019123ab96beba40acdb762f47e] Bluetooth: Take advantage of net_buf_pull_mem()
git bisect good fe7f1e17deb0e019123ab96beba40acdb762f47e
# bad: [31503a868b889413ed27180e8799188802fedca1] arch: xtensa: dts: Move HAS_DTS to arch level
git bisect bad 31503a868b889413ed27180e8799188802fedca1
# bad: [e526a2b80a2eaeef4aa30514ae5760d7ce005f17] host-gcc: exclude -lgcc to fix -mx32 [qemu_]x86_64 regression
git bisect bad e526a2b80a2eaeef4aa30514ae5760d7ce005f17
# good: [003bdb5fb65aa4f20f27edca88c270e4b2299b94] led_strip: lpd880x: Add Device Tree support
git bisect good 003bdb5fb65aa4f20f27edca88c270e4b2299b94
# good: [31b3941d88c4fa01704d93f0f9dfa1588fbe2b0a] ext: hal: atmel: sam: re-add missing header symbols following HAL import
git bisect good 31b3941d88c4fa01704d93f0f9dfa1588fbe2b0a
# bad: [e2b62b411be5f6c93a66c4474529f7e238858f3d] samples: openamp: Update sample to OpenAMP v2018.10 release
git bisect bad e2b62b411be5f6c93a66c4474529f7e238858f3d
# good: [5bb07d61d0163f5f8cb837a2a957c604ddc470fe] CODEOWNERS: add myself as maintainer for west commands
git bisect good 5bb07d61d0163f5f8cb837a2a957c604ddc470fe
# good: [a7342daedc945c0cc57affcb619013b6329c2482] boards: rv32m1_vega: fix the location of the YAML file
git bisect good a7342daedc945c0cc57affcb619013b6329c2482
# bad: [8b583acf23dd25ea771857c61a0723d17945f3aa] kernel/timeout: Fix another recursive spinlock()
git bisect bad 8b583acf23dd25ea771857c61a0723d17945f3aa
# bad: [7367b84f8e6e496babff0e34df53ae9a39d910f5] kernel/spinlock: Augment runtime validation
git bisect bad 7367b84f8e6e496babff0e34df53ae9a39d910f5
# good: [732cfac519ee7822b0d19eba5ef946f30649204d] samples: mesh: removed bug which is cause of divide by zero exception
git bisect good 732cfac519ee7822b0d19eba5ef946f30649204d
# first bad commit: [7367b84f8e6e496babff0e34df53ae9a39d910f5] kernel/spinlock: Augment runtime validation
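
In essence, the validation that commit introduces tags each lock with its current holder and asserts, on every acquisition, that the caller does not already hold it. A minimal sketch of the idea in C (hypothetical and simplified: current_owner(), atomic_cas(), and the sl_* names are stand-ins, and the real code in include/spinlock.h differs in detail):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed primitives standing in for the kernel's CPU/atomics API. */
extern uintptr_t current_owner(void);   /* encodes CPU id and thread id */
extern bool atomic_cas(unsigned int *t, unsigned int old_v, unsigned int new_v);

struct spinlock {
	unsigned int locked;
	uintptr_t owner;        /* who currently holds the lock, 0 if free */
};

static inline void sl_lock(struct spinlock *l)
{
	uintptr_t me = current_owner();

	/* The check the bisected commit adds: a CPU/thread must never
	 * take a lock it already holds ("Recursive spinlock").
	 */
	assert(l->owner != me);

	while (!atomic_cas(&l->locked, 0, 1)) {
		/* spin; on uniprocessor the lock degenerates to irq masking */
	}
	l->owner = me;
}

static inline void sl_unlock(struct spinlock *l)
{
	/* ...and the lock must be released by the same CPU/thread. */
	assert(l->owner == current_owner());
	l->owner = 0;
	l->locked = 0;
}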

@andyross
Collaborator

Huh, well that's interesting. So, obviously that's not the patch that introduced the bug, that's the patch that introduced the assertion that caught the bug. :) But this is actually in the tree before the spinlockification series, and before that there were vanishingly few spinlocks being held anywhere...

I'll get you a patch tomorrow to try that will give us some better idea of where the failure is happening. But barring that, I may need to bug you for instructions on duplicating this test setup.

@andyross
Collaborator

OK, try this when you get a chance. It will log the file/line location of the failed spinlock, as well as the location where it was already acquired. Just pull and cherry-pick the first commit in pull request #14286.
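
The usual technique behind such a patch is to route lock calls through a macro that captures __FILE__/__LINE__ at the call site and stashes them in the lock, so the failure message can print both locations. A hypothetical sketch of that idea (LOCK, lock_dbg, and acquire() are made up and races are ignored; this is not the actual contents of #14286):

#include <stdio.h>
#include <stdlib.h>

struct spinlock {
	unsigned int locked;
	const char *taken_file;   /* call site of the last successful lock */
	int taken_line;
};

extern void acquire(struct spinlock *l);   /* assumed underlying acquire */

/* Route every lock call through a macro so the call site is recorded. */
#define LOCK(l) lock_dbg((l), __FILE__, __LINE__)

static void lock_dbg(struct spinlock *l, const char *file, int line)
{
	/* On a recursive take, print both locations, giving output like:
	 *   Recursive spinlock @timeout.c:110 (taken at timeout.c:144)
	 */
	if (l->locked) {
		printf("Recursive spinlock @%s:%d (taken at %s:%d)\n",
		       file, line, l->taken_file, l->taken_line);
		abort();
	}
	acquire(l);
	l->taken_file = file;
	l->taken_line = line;
}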

@tedd-an
Collaborator

tedd-an commented Mar 12, 2019

Hi @andyross

This is the output with #14286

***** Booting Zephyr OS v1.14.0-rc1-1023-gaf5c44ca29ed *****
[00:00:00.000,000] <dbg> bt_hci_core.bt_hci_driver_register: Registered H:4
ASSERTION FAIL [z_spin_lock_valid(l)] @ ../../../../include/spinlock.h:86
	Recursive spinlock @/home/tester/working/zephyr/zephyr/kernel/timeout.c:110 (taken at /home/tester/working/zephyr/zephyr/kernel/timeout.c:144)
***** HARD FAULT *****
  Fault escalation (see below)
***** Hardware exception *****
Current thread ID = 0x200028cc
Faulting instruction address = 0x3c08
Fatal fault in essential thread! Spinning...
qemu-system-arm: terminating on signal 15 from pid 13147 (python)

Another similar error but different source line:

[00:00:01.540,000] <dbg> bt_hci_core.hci_cmd_status: opcode 0x2025
[00:00:01.540,000] <dbg> bt_hci_core.hci_cmd_done: opcode 0x2025 status 0x00 buf 0x200052c8
[00:00:01.540,000] <dbg> bt_hci_core.process_events: ev->state 0
[00:00:01.540,000] <dbg> bt_conn.bt_conn_prepare_events: 
[00:00:01.540,000] <dbg> bt_hci_core.hci_tx_thread: Calling k_poll with 2 events
[00:00:01.540,000] <dbg> bt_hci_core.bt_hci_cmd_send_sync: opcode 0x2025 status 0x00
[00:00:01.540,000] <dbg> bt_l2cap.bt_l2cap_le_fixed_chan_register: CID 0x0005
[00:00:01.540,000] <dbg> bt_hci_ecc.emulate_le_p256_public_key_cmd: 
[00:00:01.560,000] <dbg> bt_hci_core.bt_recv: buf 0x20005394 len 68
[00:00:01.560,000] <dbg> bt_hci_core.hci_event: event 0x3e
[00:00:01.560,000] <dbg> bt_hci_core.hci_le_meta_event: subevent 0x08
[00:00:01.560,000] <dbg> bt_hci_core.le_pkey_complete: status: 0x0
[00:00:01.560,000] <dbg> bt_smp.bt_smp_pkey_ready: 
[00:00:01.560,000] <inf> kernel: ecc stack :	unused 232	usage 868 / 1100 (78 %)
[00:00:01.560,000] <dbg> bt_l2cap.bt_l2cap_server_register: PSM 0x0080
ASSERTION FAIL [z_spin_lock_valid(l)] @ ../../../../include/spinlock.h:86
	Recursive spinlock @/home/tester/working/zephyr/zephyr/kernel/timeout.c:110 (taken at /home/tester/working/zephyr/zephyr/kernel/timeout.c:152)
***** HARD FAULT *****
  Fault escalation (see below)
***** Hardware exception *****
Current thread ID = 0x200028cc
Faulting instruction address = 0x3c08
Fatal fault in essential thread! Spinning...
qemu-system-arm: terminating on signal 15 from pid 13147 (python)

@andyross
Collaborator

OK, those have me confused. Both of those "parent" spots are quite clearly non-reentrant. You can follow the execution by reading the source and see that the process is guaranteed to exit the lock before trying to swap or otherwise being interrupted. This "can't happen", basically.

I first got scared that maybe something in the BT layer was registering a non-maskable interrupt and trying to call into Zephyr from it, but a quick check shows that's not the case (the nRF hardware has one of those for the radio, but it's carefully segregated from the kernel).

So I'm going to go with "probably stack overflow" as my best guess. Some of the stacks in that app when I build it are indeed kinda small. Can you add the following to your default.conf and see if they change the behavior:

CONFIG_ISR_STACK_SIZE=4096
CONFIG_MAIN_STACK_SIZE=4096
CONFIG_IDLE_STACK_SIZE=4096
CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=4096
CONFIG_OFFLOAD_WORKQUEUE_STACK_SIZE=4096
CONFIG_BT_HCI_TX_STACK_SIZE=4096
CONFIG_BT_RX_STACK_SIZE=4096
CONFIG_LOG_PROCESS_THREAD_STACK_SIZE=4096

Another possibility is that maybe something is wrong with the validation layer, or the code is using a corrupt spinlock struct or something. What happens if you set CONFIG_ASSERT=n? Do you get correct behavior or a crash further along?

@tedd-an
Collaborator

tedd-an commented Mar 13, 2019

Hi @andyross,

Even after increasing the stack sizes to your suggested values, the same assertion still occurred.

But when I disabled the assert with CONFIG_ASSERT=n, it ran fine (at least from what I read in the logs) and tests started to pass. I am going to run more tests overnight and update the result in the morning.

@tedd-an
Collaborator

tedd-an commented Mar 13, 2019

Overnight testing went well; no crash or exception occurred during the test.

@andyross
Collaborator

Well... that may speak to priority I guess if we believe it to be a problem in the validation layer. I'm still at a loss as to the root cause. Overnight I did realize there's an edge case where if a thread takes a spinlock and then aborts itself, it will then swap away with the spinlock held and with strange validation metadata still sitting in there for other threads to trip on. But (1) that doesn't match the line numbers above, neither of which are in a critical section that can abort and (2) would be really weird. Let me stare at the code some more.
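
For illustration, the edge case described above would look something like this (a hypothetical reconstruction using the public k_spin_lock() API; my_lock and worker() are made up):

#include <zephyr.h>

static struct k_spinlock my_lock;

void worker(void *p1, void *p2, void *p3)
{
	ARG_UNUSED(p1);
	ARG_UNUSED(p2);
	ARG_UNUSED(p3);

	k_spinlock_key_t key = k_spin_lock(&my_lock);

	/* The thread aborts itself while still holding the lock: it swaps
	 * away without reaching k_spin_unlock(), leaving validation
	 * metadata that names this (now dead) thread as the holder.  The
	 * next thread to take my_lock can then trip the assertion without
	 * ever having locked recursively itself.
	 */
	k_thread_abort(k_current_get());

	k_spin_unlock(&my_lock, key);   /* never reached */
}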

@andyross
Collaborator

If you get a chance, can you pull with #13800 and test with that? It includes one other validation patch that doesn't seem to match the issue above, but... who knows, might catch something going wrong earlier. I added the new patch to that one and rebased it, so it should be current.

@tedd-an
Collaborator

tedd-an commented Mar 13, 2019

I tested with the new patch and here is the output. I don't see any new information, I guess.

***** Booting Zephyr OS v1.14.0-rc1-1012-g82ad69dac908 *****
[00:00:00.000,000] <dbg> bt_hci_core.bt_hci_driver_register: Registered H:4
ASSERTION FAIL [z_spin_lock_valid(l)] @ ../../../../include/spinlock.h:86
	Recursive spinlock @/home/tester/working/zephyr/zephyr/kernel/timeout.c:110 (taken at /home/tester/working/zephyr/zephyr/kernel/timeout.c:144)
***** HARD FAULT *****
  Fault escalation (see below)
***** Hardware exception *****
Current thread ID = 0x200028cc
Faulting instruction address = 0x3c08
Fatal fault in essential thread! Spinning...

@rljordan-zz rljordan-zz added the "priority: low" label and removed the "priority: medium" label Mar 15, 2019
@andyross
Collaborator

Reprioritized this to low, given that it seems to be an artifact of the spinlock validation layer and doesn't produce a failure in actual code (and frankly a recursive spinlock on a uniprocessor architecture is not, in fact, a failure case -- it works; we just don't like it because the same code will deadlock in SMP).
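
To spell out that distinction, here is a hypothetical nested acquisition (nested_lock, inner(), and outer() are made up for illustration):

#include <zephyr.h>

static struct k_spinlock nested_lock;

static void inner(void)
{
	/* On a uniprocessor build the spinlock reduces to masking
	 * interrupts, so this nested take happens to "work".  On SMP
	 * this same line deadlocks: the CPU spins on a lock it already
	 * holds and can never release.
	 */
	k_spinlock_key_t key = k_spin_lock(&nested_lock);
	k_spin_unlock(&nested_lock, key);
}

void outer(void)
{
	k_spinlock_key_t key = k_spin_lock(&nested_lock);
	inner();                     /* takes the same lock again */
	k_spin_unlock(&nested_lock, key);
}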

@andyross
Collaborator

#14520 is plausibly another symptom of the same issue

@carlescufi
Member

@andyross @Bluespring is this still an issue?

@tedd-an
Collaborator

tedd-an commented Aug 16, 2019

Yes, it is still reproducible.

***** Booting Zephyr OS build v2.0.0-rc1-64-g5962aab42f4d *****
[00:00:00.000,000] <dbg> bt_hci_core.bt_hci_driver_register: Registered H:4
[00:00:00.000,000] <dbg> bt_hci_core.hci_tx_thread: Started
[00:00:00.000,000] <dbg> bt_conn.bt_conn_prepare_events:
[00:00:00.000,000] <dbg> bt_hci_core.hci_tx_thread: Calling k_poll with 2 events
[00:00:00.000,000] <dbg> bt_hci_core.bt_hci_cmd_create: opcode 0x0c03 param_len 0
[00:00:00.000,000] <dbg> bt_hci_core.bt_hci_cmd_create: buf 0x200067b8
[00:00:00.000,000] <dbg> bt_hci_core.bt_hci_cmd_send_sync: buf 0x200067b8 opcode 0x0c03 len 3
[00:00:00.000,000] <dbg> bt_hci_core.process_events: count 2
[00:00:00.000,000] <dbg> bt_hci_core.process_events: ev->state 4
[00:00:00.000,000] <dbg> bt_hci_core.send_cmd: calling net_buf_get
[00:00:00.000,000] <dbg> bt_hci_core.send_cmd: calling sem_take_wait
[00:00:00.000,000] <dbg> bt_hci_core.send_cmd: Sending command 0x0c03 (buf 0x200067b8) to driver
[00:00:00.000,000] <dbg> bt_hci_core.bt_send: buf 0x200067b8 len 3 type 0
[00:00:00.000,000] <dbg> bt_hci_core.process_events: ev->state 0
[00:00:00.000,000] <dbg> bt_conn.bt_conn_prepare_events:
[00:00:00.000,000] <dbg> bt_hci_core.hci_tx_thread: Calling k_poll with 2 events
ASSERTION FAIL [z_spin_lock_valid(l)] @ ../include/spinlock.h:78
	Recursive spinlock
FATAL: ***** HARD FAULT *****
FATAL:   Fault escalation (see below)
FATAL: r0/a1:  0x00000004  r1/a2:  0x0000004e  r2/a3:  0x90000000
FATAL: r3/a4:  0x0002b7df r12/ip:  0xaaaaaaaa r14/lr:  0x00023a0b
FATAL:  xpsr:  0x61000031
FATAL: Faulting instruction address (r15/pc): 0x00024e96
FATAL: >>> ZEPHYR FATAL ERROR 0: CPU exception
FATAL: Current thread: 0x200021c0 (idle)
[00:00:00.100,000] <dbg> bt_hci_core.bt_buf_get_cmd_complete: sent_cmd 0x200067b8
FATAL: Halting system

@carlescufi carlescufi added the "area: Tests", "area: QEMU", and "area: Bluetooth" labels Apr 30, 2020
@github-actions

This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed; otherwise this issue will automatically be closed in 14 days. Note that you can always re-open a closed issue at any time.
