Description
First, I have no pretention in stating it is a bug, it may be the normal way of working of the BLE stack architecture at this moment.
In this case, it could be reclassified as an enhancement request if relevant. I may be expecting too much.
Preliminary context
I am using a classic dual-chip configuration:
- Host-side : custom board (stm32h7, similar to a nucleo-h743zi), using a host-build only
- 3 services (HID, BAS, DIS)
- HID reporting rate @133hz (7.5ms)
- Controller-side : standard hci_spi sample project flashed on nrf52840 addon board connected through SPI@1MHz
Describe the bug
From the host, using a simple bt_gatt_notify_cb to update a BLE service characteristic (let's say a 17-bytes long HID report update).
I have noticed this call is blocking (OK) for a unexpectedly long period of time (NOK) of approximately 1.5ms degrading others thread "performances" in a non-negligible way.
It's basically a 1.5ms penalty every 7.5ms that is spent in a busy-wait style fashion starving other actions in the system.
More details below on my investigations.
On the other hand, a similar system reporting through USB-HID at the same rate, do not suffer from the same deficiency and a reasonable 0.1 <-> 0.7ms is spent.
- What have you tried to diagnose or workaround this issue?
- Independently of my full system I have been able to reproduce this blocking behavior by setting up a sample test application for the host side, reporting a classic BAS notification (battery level %) periodically (every 50ms down to every 7.5ms (target)) and measuring by software the time spent for a bt_gatt_notify_cb operation.
// Sample test - pseudo code
period = 50ms;
while(1) {
time;
bt_gatt_notify(); // reporting BAS notification
elapsed = now - time;
printf("%d", elapsed); // connected:~750us <--> ~980us, not connected: ~5us
wait( period - elapsed );
}
The host-side prj.conf
, I use for the sample test is provided below.
prj.conf.txt
-
I have also tried to pimp my BLE stack configuration using the throughput maximization sample project but with no effect.
-
When debugging, I noticed that the vast majority of time was spent in "waiting" for a rescheduling operation to complete in the implementation of the BLE stack, for some reason.
In a nutshell my call stack (host):
bt_gatt_notify_cb // app waiting here
// GATT
// ATT
// ACL
// KERNEL QUEUE - k_queue_append_list
z_reschedule(&queue->lock, key); <<<< Waiting here for ~80% of the time
// HCI_SPI transfer happening at some point
// ... returns and continues normally up to the next notification
// Transfer OTA happening on controller -> at next connection interval -> OK
// callback notification sent received -> OK
- I have not tried yet but I envisioned moving to a single-chip configuration (Host + Controller using HCI RAM) running on nrf52840 and connecting my current "Host" MCU with a custom protocol over SPI to receive raw notifications.
This way I could have a fully externalized BLE stack not impacting the rest of my system (hopefully)
With the big cons of maintaining two FWs instead of a single one....
Do you think it's a valid way of working around the issue?
- Is this a regression? If yes, have you been able to "git bisect" it to a specific commit?
I don't think it's a regression, probably the way it has always worked
Expected behavior
From a system architecture point of view, I was expecting the dual-chip configuration to work in tandem:
- Host-side -> serializing and publishing notification very fast over HCI -> achieving a minimal time waiting
- Controller-side -> offloading the full BLE-OTA transfer
- Tuned with sufficiently large HCI events queues according to the report rate/size/types
It's a pity to have two powerful MCUs (STM32H7 on host + nrf52840 on controller) and having such a poor behavior.
Impact
As stated this "poor notification performances" are very annoying as it considerably degrades the full system performance, and sometimes delays time-critical operations (acquisition threads...).
Environment (please complete the following information):
- Building on Windows 10, connecting my device (BLE) on the same laptop.
- Toolchain
- Zephyr SDK 0.14.1
- Zephyr v3.0 > Similar behavior on Zephyr v3.1