CONFIG_BT_HCI_TX_STACK_SIZE is too small #13585

pdunaj · 2019-02-20T16:05:46Z

Describe the bug
I have pca10059 configured as central (running ncs nrf_desktop). The device crashes during connection attempt with a peripheral.
CONFIG_BT_HCI_TX_STACK_SIZE is set to 640B in Zephyr. If extend it to 1024B program works (I have not tried lower values yet).

We have not seen it in the past. The thing is that due to recent bugs we have MPU stack guard enabled by default now.

To Reproduce
Steps to reproduce the behavior:
Build nrf_desktop for pca10059 and try to connect to peripheral (DK working as mouse from the same project should work).
Try to connect and observe crash.

Expected behavior
Should not crash.

Impact
Showstopper.

Screenshots or console output


[00000204] <inf> ble_scan: Scan started
--
[00000213] <inf> event_manager: e: usb_state_event id:0x200000a9 state:POWERED
[00008655] <inf> ble_scan: Filters matched. fd:a0:61:de:7d:07 (random) connectable
[00008679] <inf> ble_scan: Connecting done
[00009066] <inf> usb_state: USB HID enabled
[00009069] <inf> event_manager: e: usb_state_event id:0x200000a9 state:ACTIVE
[00009072] <inf> event_manager: e: hid_report_subscription_event MOUSE report notification enabled by 0x200000a9
[00009944] <wrn> usb_hid: len 143 doesn't match Report Descriptor size
[00012038] <inf> ble_state: Connected to fd:a0:61:de:7d:07 (random)
[00012044] <wrn> bt_ctlr_hci: handle: 0x0000, CSA: 1.
[00039955] <inf> ble_state: Security with fd:a0:61:de:7d:07 (random) level 2
[00039959] <inf> event_manager: e: ble_peer_event id=0x2000227c SECURED
***** MPU FAULT *****
Stacking error
Data Access Violation
MMFAR Address: 0x20006e98
***** Hardware exception *****
Current thread ID = 0x200021f0
Faulting instruction address = 0x2aa7c
Fatal fault in thread 0x200021f0! Aborting.
[ASSERTION FAIL [out_ctx->control_block->offset <= out_ctx->size] @ /home/pdunaj/work/ncs/zephyr/subsys/logging/log_output.c:101

Environment (please complete the following information):
rev: a76f833 (ncs) from ec424b7 (zephyr)

Additional context
N/A

The text was updated successfully, but these errors were encountered:

pdunaj · 2019-02-20T16:17:25Z

I am wondering if callbacks informing about connection/filter status/etc. are called from this thread.

jhedberg · 2019-02-20T16:20:11Z

The definition is actually as follows:

config BT_HCI_TX_STACK_SIZE                                                     
        int                                                                     
        default 512 if BT_H4                                                    
        default 512 if BT_H5                                                    
        default 416 if BT_SPI                                                   
        default 940 if BT_CTLR && NO_OPTIMIZATIONS                              
        default 640 if BT_CTLR                                                  
        default 512 if BT_USERCHAN                                              
        # Even if no driver is selected the following default is still          
        # needed e.g. for unit tests.                                           
        default 512

So I suppose it's the if BT_CTLR line(s) that you want to modify

carlescufi · 2019-02-20T16:27:36Z

@pdunaj can you please send a PR so that you set a reasonable value based on your measurements?

pdunaj · 2019-02-20T16:30:59Z

Hi @carlescufi , sure. Will look for something sensible tomorrow and check if any callback can affect it.

Hi @jhedberg, I actually prefer to increase it only when BT_CENTRAL is enabled. We don't seem to see any problem on the peripheral side.

jhedberg · 2019-02-20T16:31:17Z

Btw, if the BT_CTLR && NO_OPTIMIZATIONS value ends up being close to the BT_CTLR one I'd just merge these to a single one (like it used to be in the past)

jhedberg · 2019-02-20T16:33:11Z

Hi @jhedberg, I actually prefer to increase it only when BT_CENTRAL is enabled. We don't seem to see any problem on the peripheral side.

@pdunaj It'd be good to understand the root cause first. The reason this Kconfig option is not settable by the application is that the assumption has been that the app can't influence this stack's usage. Now it sounds like this might not be true, in which case we might want to revisit making this settable by the application.

jhedberg · 2019-02-20T16:37:40Z

@pdunaj or do you mean that simply enabling BT_CENTRAL increases it (i.e. doesn't depend on the app doing central-related actions at runtime)? Meaning, something in the controller-side TX code-path changes when BT_CENTRAL is set? In that case it'd be justified to put the necessary logic in Kconfig (not that I'm particularly happy about the level of complexity this is growing to)

pdunaj · 2019-02-20T16:43:53Z

@jhedberg I mean adding a line default xyz if BT_CTLR && BT_CENTRAL to the Kconfig. I prefer not to extend RAM usage as we are tight on it. Soon I will look for 2-3kB to be freed in BLE host btw so increasing it anywhere is not a good idea.

First I plan to check if any callback can be called from this function. I guess not as I expect rx side to handle it. Still I don't know this stack so I will have to have a look.

pdunaj · 2019-02-21T11:26:42Z

0x20002D94 BT RX      (real size 2016): unused 1096     usage 920 / 2016 (45 %)
0x20002D0C BT RX pri  (real size 608):  unused 32       usage 576 / 608 (94 %)
0x200023F4 BT ECC     (real size 1068): unused 104      usage 964 / 1068 (90 %)
0x200024DC BT TX      (real size 992):  unused 96       usage 896 / 992 (90 %)
0x200041C0 sysworkq   (real size 992):  unused 128      usage 864 / 992 (87 %)
0x200017E4 shell_rtt  (real size 2016): unused 816      usage 1200 / 2016 (59 %)
0x200013F0 logging    (real size 992):  unused 760      usage 232 / 992 (23 %)
0x20003C24 idle       (real size 96):   unused 40       usage 56 / 96 (58 %)
prio recv thread stack (real size 640): unused 32       usage 608 / 640 (95 %)
recv thread stack (real size 2048):     unused 1096     usage 952 / 2048 (46 %)

The kernel stacks information is above. It seems that tx stack set to 1024 is actually not too big.

When I enabled shell I also got fault on PRI RX stack. I expanded this one to 640 and it also is not too big it seems.

So it seems that the problem may be caused by increased RAM consumption cause by log messages.

pdunaj · 2019-02-21T11:30:33Z

I have also tried to see why we have this logs enabled as we select BT_DEBUG_NONE... and found out that BT_DEBUG is being selected by ncs's discovery manager's option BT_GATT_DM_DATA_PRINT. If anything this option should depend on !BT_DEBUG_NONE being selected (any level except none) which actually would be depending on BT_DEBUG. @kapi-no

I will add a change to NCS and see if stack problems are gone if debugs are disabled.

There is still however an open question of how to expand this stacks - if debugs are enabled these will be too small and crash will occur. We have stack guard but unless somebody has it too they will not notice stack overflow. I think we should expand the stacks anyway but depend on BT_DEBUG option being set.

jhedberg · 2019-02-21T11:32:36Z

So it seems that the problem may be caused by increased RAM consumption cause by log messages.

I think it would be good to have someone from the log subsystem maintainers to look at this. One of the promises (or at least hopes) of moving log processing to a deferred thread was that the overhead we were suffering from due to printk would be gone, or at least drastically reduced. IIRC the problem with the new log subsystem is that it creates extra stack variables for every log call, so the more logs you have in a function the more stack it consumes. When I looked at this it seemed to me like the variables themselves (at least in the Bluetooth case) were identical in content for each c-file, so it should hopefully be possible to optimize this somehow.

pdunaj · 2019-02-21T11:42:55Z

I removed the config option that caused selection of BT_DEBUG - logs are gone. Still I need to keep the stack size increased.

0x20002D94 BT RX      (real size 2016): unused 1176     usage 840 / 2016 (41 %)
0x20002D0C BT RX pri  (real size 608):  unused 56       usage 552 / 608 (90 %)
0x200023F4 BT ECC     (real size 1068): unused 104      usage 964 / 1068 (90 %)
0x200024DC BT TX      (real size 992):  unused 96       usage 896 / 992 (90 %)
0x200041C0 sysworkq   (real size 992):  unused 128      usage 864 / 992 (87 %)
0x200017E4 shell_rtt  (real size 2016): unused 816      usage 1200 / 2016 (59 %)
0x200013F0 logging    (real size 992):  unused 760      usage 232 / 992 (23 %)
0x20003C24 idle       (real size 96):   unused 40       usage 56 / 96 (58 %)

pdunaj · 2019-02-21T12:54:24Z

I think I found the culprit (at least for RX pri thread). When I get rid of stack analysis code (look for CONFIG_INIT_STACKS in prio_recv_thread) I get the following stack usage:

0x20002D0C BT RX pri  (real size 608):  unused 432      usage 176 / 608 (28 %)

It seems that this is caused by printing of the value. As when I comment out printk from stack_analysis only I get 216/608 stack usage, which is still low.
Maybe using logger instead of printk would help. I will check that. Otherwise we could throw away this stack usage printing.

@nordic-krch

pdunaj · 2019-02-21T13:08:23Z

I obtain unused stack size in the thread and use logger to report it instead of printk. Below is the stack usage:

0x20002D0C BT RX pri  (real size 608):  unused 352      usage 256 / 608 (42 %)

So it seems we must get rid of printk or accept the fact that this thread will have to have bigger stack if CONFIG_INIT_STACKS is enabled. We can also remove code printing stack usage altogether (this info can be obtained from shell command).
@jhedberg , do you have any opinion on this one?

pdunaj · 2019-02-21T13:26:45Z

I changed STACK_ANALYZE macro so it would use LOG and values are as follows:

0x20002D94 BT RX      (real size 2016): unused 1168     usage 848 / 2016 (42 %)
0x20002D0C BT RX pri  (real size 608):  unused 352      usage 256 / 608 (42 %)
0x200023F4 BT ECC     (real size 1068): unused 72       usage 996 / 1068 (93 %)
0x200024DC BT TX      (real size 992):  unused 104      usage 888 / 992 (89 %)
0x200041C0 sysworkq   (real size 992):  unused 136      usage 856 / 992 (86 %)
0x200017E4 shell_rtt  (real size 2016): unused 816      usage 1200 / 2016 (59 %)
0x200013F0 logging    (real size 992):  unused 760      usage 232 / 992 (23 %)
0x20003C24 idle       (real size 96):   unused 40       usage 56 / 96 (58 %)

I would still prefer to update the macro as I see it being used in other BT modules in various occasions.

jhedberg · 2019-02-21T13:29:11Z

@pdunaj updating STACK_ANALYZE to use logger sounds like a good approach to me. All this kind of places in the code that do stack analysis exist mainly for apps without the ability to issue the "kernel stacks" shell command, so I'd consider making these places dependent on the kernel shell module (and whatever other options "kernel stacks" depends on). Anyway, that can be done as a follow-up work.

pdunaj · 2019-02-21T13:50:47Z

Change for stack analyze macro:
#13626

pdunaj · 2019-02-21T13:51:30Z

There is still an open question why tx thread need for RAM is higher. I will try to find it out and get back here.

pdunaj · 2019-02-26T13:59:14Z

I think I am done with the investigation. Stack is bumped up by around nearly 400B when pairing is complete. This happens as smp_check_complete is called from bt_conn_notify_tx (as a callback). When this is done keys are stored and this process not only goes deep but also declares some arrays on a stack. All in all stack usage goes up.

Hi @nashif , this is not low as it causes stack overflow. People will not even notice that unless they have stack guards enabled.

pdunaj · 2019-02-26T14:03:31Z

At the moment the only solution for that is to increase the stack for tx thread. This is however problematic as some SoCs will not have memory for that.

@sjanc , @jhedberg , maybe it would be possible to defer key store to worqueue thread?

pdunaj · 2019-02-26T14:14:19Z

#13789 created for stack expansion

pdunaj added bug The issue is a bug, or the PR is fixing a bug area: Bluetooth labels Feb 20, 2019

jhedberg assigned cvinayak and carlescufi Feb 20, 2019

pdunaj mentioned this issue Feb 21, 2019

samples: nrf_desktop: Expand workqueue stack on dongle nrfconnect/sdk-nrf#490

Merged

nashif added the priority: low Low impact/importance bug label Feb 21, 2019

pdunaj mentioned this issue Feb 27, 2019

bluetooth: host: Expand TX thread stack for BT_CENTRAL #13789

Merged

carlescufi closed this as completed in #13789 Mar 5, 2019

pdunaj mentioned this issue Mar 5, 2019

Save bonds in workqueue thread instead on tx thread #14065

Closed

pdunaj mentioned this issue Mar 5, 2019

drivers: spi: nRF SPI add control of MISO pin pull #13950

Merged

CONFIG_BT_HCI_TX_STACK_SIZE is too small #13585

CONFIG_BT_HCI_TX_STACK_SIZE is too small #13585

Comments

pdunaj commented Feb 20, 2019

pdunaj commented Feb 20, 2019

Uh oh!

jhedberg commented Feb 20, 2019

Uh oh!

carlescufi commented Feb 20, 2019

Uh oh!

pdunaj commented Feb 20, 2019

Uh oh!

jhedberg commented Feb 20, 2019

Uh oh!

jhedberg commented Feb 20, 2019

Uh oh!

jhedberg commented Feb 20, 2019

Uh oh!

pdunaj commented Feb 20, 2019

Uh oh!

pdunaj commented Feb 21, 2019

Uh oh!

pdunaj commented Feb 21, 2019

Uh oh!

jhedberg commented Feb 21, 2019

Uh oh!

pdunaj commented Feb 21, 2019

Uh oh!

pdunaj commented Feb 21, 2019

Uh oh!

pdunaj commented Feb 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pdunaj commented Feb 21, 2019

Uh oh!

jhedberg commented Feb 21, 2019

Uh oh!

pdunaj commented Feb 21, 2019

Uh oh!

pdunaj commented Feb 21, 2019

Uh oh!

pdunaj commented Feb 26, 2019

Uh oh!

pdunaj commented Feb 26, 2019

Uh oh!

pdunaj commented Feb 26, 2019

Uh oh!

pdunaj commented Feb 21, 2019 •

edited

Loading