tests/bluetooth/tester: ASSERTION FAIL due to Recursive spinlock when running bt tester on qemu-cortex-m3 #13591
Comments
I tried with the latest tip (b710177) and was still able to reproduce the issue.
@Bluespring - alas I still don't have that test rig set up. Almost certainly this is a bug introduced during the spinlockification series. Can you try reverting each of those in sequence and tell me which patch broke it? You can start with that HEAD (b710177 -- but probably not current master due to the giant renaming patch that went in) and try sequentially reverting:
(I should point out: doing a git bisect will cut that work by about 2-3x, but requires a little study and setup if you haven't done it before.)
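For reference, a minimal bisect session along those lines might look like the following (b710177 is the known-bad HEAD from this report; the known-good commit is whatever baseline you pick from before the suspect series):
$ git bisect start
$ git bisect bad b710177
$ git bisect good <known-good-sha>
(build and run the tester at each step, then mark the result)
$ git bisect good    # or: git bisect bad
$ git bisect reset   # when finished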
@andyross Thanks for the git bisect tip; it's a very useful command.
And this is the git bisect report:
Huh, well that's interesting. So, obviously that's not the patch that introduced the bug, that's the patch that introduced the assertion that caught the bug. :) But this is actually in the tree before the spinlockification series, and before that there were vanishingly few spinlocks being held anywhere... I'll get you a patch tomorrow to try that will give us a better idea of where the failure is happening. But barring that I may need to bug you for instructions on duplicating this test setup.
OK, try this when you get a chance. It will log the file/line location of the failed spinlock, as well as the location where it was already acquired. Just pull and cherry-pick the first commit in pull request #14286.
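(For readers following along: the sketch below is a simplified illustration of that kind of validation bookkeeping, with hypothetical names. It is not the actual code from #14286; the idea is just to record file/line on every acquisition so a recursive attempt can report where the lock is already held.)
#include <stdbool.h>
#include <stdio.h>

struct dbg_spinlock {
	bool locked;
	const char *file;   /* where the current holder acquired it */
	int line;
};

#define dbg_spin_lock(l) dbg_spin_lock_at((l), __FILE__, __LINE__)

static void dbg_spin_lock_at(struct dbg_spinlock *l, const char *file, int line)
{
	if (l->locked) {
		/* Report both the failing location and the original one,
		 * mirroring what the patched assertion prints. */
		printf("Recursive spinlock at %s:%d (already held from %s:%d)\n",
		       file, line, l->file, l->line);
		return;
	}
	l->locked = true;
	l->file = file;
	l->line = line;
}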
Hi @andyross, this is the output with #14286. Another similar error, but a different source line:
OK, those have me confused. Both of those "parent" spots are quite clearly non-reentrant. You can follow the execution by reading the source and see that the process is guaranteed to exit the lock before trying to swap or otherwise being interrupted. This "can't happen", basically. I first got scared that maybe something in the BT layer was registering a nonmaskable interrupt and trying to call into zephyr out of it, but a quick check shows that's not the case (the nRF hardware has one of those for the radio, but it's carefully segregated from the kernel). So I'm going to go with "probably stack overflow" as my best guess. Some of the stacks in that app when I build it are indeed kinda small. Can you add the following to your default.conf and see if they change behavior:
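(The specific options and values from this comment weren't captured above. As an illustration only, stack-size increases in a Zephyr .conf fragment typically look like the following; the symbols are real Kconfig options of that era, but the values here are hypothetical:)
CONFIG_MAIN_STACK_SIZE=2048
CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=2048
CONFIG_BT_RX_STACK_SIZE=2048
CONFIG_BT_HCI_TX_STACK_SIZE=2048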
Another possibility is that something is wrong with the validation layer, or that the code is using a corrupt spinlock struct. What happens if you set CONFIG_ASSERT=n? Do you get correct behavior, or a crash further along?
Hi @andyross, even after increasing the stack size to your suggested value, the same assertion still occurred. But when I disabled the assert with CONFIG_ASSERT=n, overnight testing went well and no crash or exception occurred during the test.
Well... that may speak to priority, I guess, if we believe it to be a problem in the validation layer. I'm still at a loss as to the root cause. Overnight I did realize there's an edge case where if a thread takes a spinlock and then aborts itself, it will then swap away with the spinlock held and with strange validation metadata still sitting in there for other threads to trip on. But (1) that doesn't match the line numbers above, neither of which is in a critical section that can abort, and (2) it would be really weird. Let me stare at the code some more.
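(A minimal sketch of that edge case, with hypothetical thread and lock names -- not code from the Zephyr tree:)
#include <zephyr.h>

static struct k_spinlock demo_lock;

void self_abort_with_lock(void *p1, void *p2, void *p3)
{
	k_spinlock_key_t key = k_spin_lock(&demo_lock);

	/* The thread swaps away here with the lock still held; the
	 * validation metadata keeps naming this (now dead) thread as
	 * the owner, and the unlock below is never reached. */
	k_thread_abort(k_current_get());

	k_spin_unlock(&demo_lock, key);
}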
If you get a chance, can you pull with #13800 and test with that? It includes one other validation patch that doesn't seem to match the issue above, but... who knows, might catch something going wrong earlier. I added the new patch to that one and rebased it, so it should be current.
I tested with the new patch and here is the output. I don't see any new information, I guess.
Reprioritized this to low, given that it seems to be an artifact of the spinlock validation layer and doesn't produce a failure in actual code (and frankly, a recursive spinlock on a uniprocessor architecture is not, in fact, a failure case -- it works, we just don't like it because the same code will deadlock in SMP).
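(To illustrate that last point -- a minimal sketch, with hypothetical names, of why a recursive spinlock "works" on uniprocessor but deadlocks under SMP:)
#include <zephyr.h>

static struct k_spinlock demo_lock;

static void inner(void)
{
	/* On a uniprocessor build, k_spin_lock() only locks interrupts,
	 * so this second acquisition succeeds. Under SMP the lock word
	 * is already set by this same CPU, so this call spins forever. */
	k_spinlock_key_t key = k_spin_lock(&demo_lock);
	k_spin_unlock(&demo_lock, key);
}

static void outer(void)
{
	k_spinlock_key_t key = k_spin_lock(&demo_lock);
	inner();   /* recursive acquisition: exactly what the validator flags */
	k_spin_unlock(&demo_lock, key);
}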
#14520 is plausibly another symptom of the same issue.
@andyross @Bluespring is this still an issue?
Yes, it is still reproducible.
This issue has been marked as stale because it has been open (more than) 60 days with no activity. Remove the stale label or add a comment saying that you would like to have the label removed, otherwise this issue will automatically be closed in 14 days. Note that you can always re-open a closed issue at any time.
Describe the bug
When running the Bluetooth tester (tests/bluetooth/tester) for AutoPTS testing on qemu_cortex_m3, there is an ASSERTION FAIL due to a recursive spinlock while processing events.
To Reproduce
Steps to reproduce the behavior:
$ cd tests/bluetooth/tester
$ mkdir build; cd build
$ cmake -DBOARD=qemu_cortex_m3 -DCONF_FILE=qemu.conf ..
$ make
$ ./autoptsclient-zephyr.py "c:\Users\Tester\Documents\Profile Tuning Suite\autopts\autopts.pqw6" /home/tester/working/zephyr/workspace/zephyr/tests/bluetooth/tester/build/zephyr/zephyr.elf -i 192.168.56.101 -l 192.168.56.1 -c GAP/BROB/BCST/BV-01-C --debug
Expected behavior
No crash or assertion
Impact
Test fails
Screenshots or console output
Output from iut-zephyr.log:
[00:00:00.250,162] bt_hci_core.process_events: count 2
[00:00:00.250,165] bt_hci_core.process_events: ev->state 4
[00:00:00.250,166] bt_hci_core.send_cmd: calling net_buf_get
[00:00:00.250,228] bt_hci_core.send_cmd: calling sem_take_wait
[00:00:00.250,230] bt_hci_core.send_cmd: Sending command 0x1009 (buf 0x20005200) to driver
[00:00:00.250,245] bt_hci_core.bt_send: buf 0x20005200 len 3 type 0
[00:00:00.250,272] bt_hci_core.process_events: ev->state 0
[00:00:00.250,278] bt_conn.bt_conn_prepare_events:
[00:00:00.250,282] bt_hci_core.hci_tx_thread: Calling k_poll with 2 events
[00:00:00.251,492] bt_hci_core.bt_buf_get_cmd_complete: sent_cmd 0x20005200
[00:00:00.251,888] bt_hci_core.hci_cmd_complete: opcode 0x1009
[00:00:00.251,896] bt_hci_core.hci_cmd_done: opcode 0x1009 status 0x00 buf 0x20005200
[00:00:00.251,917] bt_hci_core.bt_hci_cmd_send_sync: opcode 0x1009 status 0x00
[00:00:00.251,938] bt_hci_core.read_bdaddr_complete: status 0
[00:00:00.252,025] bt_hci_core.bt_hci_cmd_create: opcode 0x1002 param_len 0
[00:00:00.252,032] bt_hci_core.bt_hci_cmd_create: buf 0x20005200
[00:00:00.252,036] bt_hci_core.bt_hci_cmd_send_sync: buf 0x20005200 opcode 0x1002 len 3
[00:00:00.252,085] bt_hci_core.process_events: count 2
[00:00:00.252,090] bt_hci_core.process_events: ev->state 4
[00:00:00.252,094] bt_hci_core.send_cmd: calling net_buf_get
[00:00:00.252,098] bt_hci_core.send_cmd: calling sem_take_wait
[00:00:00.252,101] bt_hci_core.send_cmd: Sending command 0x1002 (buf 0x20005200) to driver
[00:00:00.252,106] bt_hci_core.bt_send: buf 0x20005200 len 3 type 0
[00:00:00.252,162] bt_hci_core.process_events: ev->state 0
[00:00:00.252,170] bt_conn.bt_conn_prepare_events:
[00:00:00.252,193] bt_hci_core.hci_tx_thread: Calling k_poll with 2 events
ASSERTION FAIL [z_spin_lock_valid(l)] @ /home/tester/working/zephyr/workspace/zephyr/include/spinlock.h:66
Recursive spinlock
[00:00:00.257,828] bt_hci_core.bt_buf_get_cmd_complete: sent_cmd 0x20005200
***** HARD FAULT *****
Fault escalation (see below)
***** Hardware exception *****
Current thread ID = 0x20002848
Faulting instruction address = 0x3bfc
Fatal fault in essential thread! Spinning...
qemu-system-arm: terminating on signal 15 from pid 29649 (python)