Mesh network traffic overflow ungraceful stop. (MMFAR Adress: 0x0) #14714

WilliamGFish · 2019-03-20T11:20:09Z

When running a BLE Mesh with high interactivity program throws an unhandled exception, whilst pushing the boundaries of radio/network traffic saturation.
It appears that when the either the RX or TX buffer are overloaded they are unable to process acknowledgement messages therefore moves to remove them from list (dlist.h: sys_dlist_remove). Unfortunately there may be a NULL pointer to list node.

Consistently points to failing at:
//
static inline void sys_dlist_remove(sys_dnode_t *node)
{
node->prev->next = node->next; {fails at this point, see below}
node->next->prev = node->prev;
sys_dnode_init(node);
}
//

To Reproduce
Devices used 2x nrf52-PCA10040, nrf52840-PCA10056 and Particle Boron
Similar issues with BBC Micro-bit yet unable to debug

Steps to reproduce the behaviour:

Create Mesh with SRV Model with publish period of 300ms (Broadcast: 0xc000)
Define each board as Model Client and Server
Define each board as Repeater
Config & Run
Wait, can take 20 minutes

I have increased the buffer sizes and RX & TX settings which only delayed failure.
CONFIG_BT_MESH_ADV_BUF_COUNT=256
CONFIG_BT_MESH_TX_SEG_MAX=32
CONFIG_BT_MESH_TX_SEG_MSG_COUNT=24
CONFIG_BT_MESH_RX_SDU_MAX=384
CONFIG_BT_MESH_RX_SEG_MSG_COUNT=24

Expected behaviour
Gracefully catch error rather than "crash"

Impact
This is but an annoyance and potential showstopper as with larger number of nodes stressed networks will be increasing likely.

Environment (please complete the following information):

OS: Windows10
Toolchain: Zephyr SDK, latest Git 1.14.0 rc2
Booting Zephyr OS v1.14.0-rc1-1294-g3dea408405fd

Screenshots or console output

Debugged failure
***** MPU FAULT *****
Data Access Violation
MMFAR Address: 0x0
***** Hardware exception *****
Current thread ID = 0x20003360
Faulting instruction address = 0x13b44
Fatal fault in ISR! Spinning...

Original pre-debug (consistent address across 2 boards)
***** MPU FAULT *****
Data Access Violation
MMFAR Address: 0x0
***** Hardware exception *****
Current thread ID = 0x2000228c
Faulting instruction address = 0xe174
Fatal fault in ISR! Spinning...

carlescufi · 2019-03-20T11:52:37Z

@WilliamGFish can you enable CONFIG_HW_STACK_PROTECTION to see if it's a stack overflow?
Also, can you try increasing the advertising stack size like so: #14680

jhedberg · 2019-03-20T12:13:23Z

Are you able to get a backtrace of where exactly the sys_dlist_remove() call comes from? Am I right to interpret Fatal fault in ISR! Spinning... as meaning this all happens in ISR context? I don't think the host side has any ISR code, so this might mean that the issue is occurring in the controller.

carlescufi · 2019-03-20T12:28:05Z

@jhedberg right, but doesn't really make sense because we don't use dlist in the controller except for the HCI driver when using Controller to Host flow control and even then it's only from a thread, never from ISR.

WilliamGFish · 2019-03-20T12:43:01Z

Made changes as suggested and results are (minor editting):
--- 9999 messages dropped ---

[00:07:42.128,662] [1;33m bt_mesh_transport: Replay: src 0x0005 dst 0xc000 seq 0x0006eb[0m
[00:07:42.244,506] [1;33m bt_mesh_transport: Replay: src 0x000f dst 0xc000 seq 0x0002d6[0m
{'sensors':[{'nodeid': 22}, {'propid':'07a0'},{'0x07a0': 40442}]}

[00:07:42.289,306] [1;31m bt_mesh_net: Out of relay buffers[0m
[00:07:42.422,607] [1;33m bt_mesh_transport: Replay: src 0x0005 dst 0xc000 seq 0x0006ec[0m
{'sensors':[{'nodeid': 22}, {'propid':'07a0'},{'0x07a0': 21141}]}

[00:07:42.498,870] [1;31m bt_mesh_net: Out of relay buffers[0m
[00:07:42.599,151] [1;33m bt_mesh_transport: Replay: src 0x0005 dst 0xc000 seq 0x00078f[0m
[00:07:42.656,616] [1;33m bt_mesh_transport: Replay: src 0x0005 dst 0xc000 seq 0x000790[0m
***** MPU FAULT *****

Data Access Violation

MMFAR Address: 0x0

***** Hardware exception *****

Current thread ID = 0x20003358
Faulting instruction address = 0x13ab0

Fatal fault in ISR! Spinning...

jhedberg · 2019-03-20T13:09:25Z

@carlescufi host doesn't use dlist directly either. A couple of kernel functionalities that do use dlist (based on git grep), which we end up using indirectly: scheduler, timers, k_poll.

carlescufi · 2019-03-20T13:32:42Z

@WilliamGFish can you try to get a backtrace if possible as @jhedberg suggested to see who is calling that function?

WilliamGFish · 2019-03-20T13:39:10Z

I do my best, a newbie regards this. Using Ozone so I'll find out what i can.

carlescufi · 2019-03-20T13:56:42Z

@WilliamGFish sure, thanks for all the info so far! We really appreciate any help trying to figure out what the issue is. On Ozone you should have a View -> Call Stack menu you can use.

WilliamGFish · 2019-03-20T14:46:54Z

Guys is this meaningful?

WilliamGFish · 2019-03-20T14:50:53Z

This could be useful, the call stack post crash updated when clicking with:

@ FFFF FFFF same as caller: 0001 2C80

//
_for (slab = k_mem_slab_list_start;
00012C80 LDR R3, =sem
00012C82 LDR R0, =sem
00012C84 PUSH {R4-R6, LR}
//

WilliamGFish · 2019-03-21T14:05:06Z

Update:
MMFAR occurs when BT_SETTINGS =n

When BT_SETTINGS =y the board gets stuck in endless thread loop (see below):

jhedberg · 2019-03-21T14:53:55Z

Looks like some scheduling issue? I remember there being another one not so long ago where a list was pointing back to itself, resulting in an infinite loop. I wonder if this was fixed. @WilliamGFish are you using latest master?

WilliamGFish · 2019-03-21T15:34:39Z

Just did a fetch from Master this morning and again now.
Oddly when it boots it shows as 1.14.0 rc1 not rc2...

***** Booting Zephyr OS v1.14.0-rc1-1366-gcb44b7e76861 *****

jhedberg · 2019-03-21T15:39:05Z

@WilliamGFish I think #12726 is the other issue I was thinking of - the use case sounds quite similar as well

WilliamGFish · 2019-03-21T18:12:17Z

@andyross Maybe something that could help with:
Dead loop of the kernel during Bluetooth Mesh pressure communication #12726

Thoughts or ideas?

WilliamGFish · 2019-03-25T11:55:12Z

I have fetched the latest commits and have run three (3) boards for several hours and the issue no long appears to arise. I suspect that it was 66d339a that resolved the underlying problem.

If the issue resurfaces i shall reopen.

carlescufi · 2019-03-25T12:24:09Z

@WilliamGFish thank you very much for getting back to this and closing it.

WilliamGFish added the bug The issue is a bug, or the PR is fixing a bug label Mar 20, 2019

carlescufi assigned jhedberg Mar 20, 2019

carlescufi added the area: Bluetooth label Mar 20, 2019

rljordan-zz added the priority: medium Medium impact/importance bug label Mar 22, 2019

WilliamGFish closed this as completed Mar 25, 2019

Mesh network traffic overflow ungraceful stop. (MMFAR Adress: 0x0) #14714

Mesh network traffic overflow ungraceful stop. (MMFAR Adress: 0x0) #14714

Comments

WilliamGFish commented Mar 20, 2019

carlescufi commented Mar 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jhedberg commented Mar 20, 2019

Uh oh!

carlescufi commented Mar 20, 2019

Uh oh!

WilliamGFish commented Mar 20, 2019

Uh oh!

jhedberg commented Mar 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

carlescufi commented Mar 20, 2019

Uh oh!

WilliamGFish commented Mar 20, 2019

Uh oh!

carlescufi commented Mar 20, 2019

Uh oh!

WilliamGFish commented Mar 20, 2019

Uh oh!

WilliamGFish commented Mar 20, 2019

Uh oh!

WilliamGFish commented Mar 21, 2019

Uh oh!

jhedberg commented Mar 21, 2019

Uh oh!

WilliamGFish commented Mar 21, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jhedberg commented Mar 21, 2019

Uh oh!

WilliamGFish commented Mar 21, 2019

Uh oh!

WilliamGFish commented Mar 25, 2019

Uh oh!

carlescufi commented Mar 25, 2019

Uh oh!

carlescufi commented Mar 20, 2019 •

edited

Loading

jhedberg commented Mar 20, 2019 •

edited

Loading

WilliamGFish commented Mar 21, 2019 •

edited

Loading