CONFIG_LOG_IMMEDIATE leads to unobvious faults in unrelated routines due to stack overflow #13897

pfalcon · 2019-02-27T13:54:30Z

Describe the bug
Running samples/net/sockets/echo_server with CONFIG_LOG_IMMEDIATE=y leads to a fault on startup and/or on the first connection. E.g. on qemu_x86 the fault occurs soon (~0.3s) after startup:

SeaBIOS (version rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org)
Booting from ROM..***** Booting Zephyr OS zephyr-v1.13.0-5183-gddf744deee39 *****


[00:00:00.524,083] <inf> net_config: IPv6 address: fe80::200:5eff:fe00:538b
[00:00:00.560,762] <inf> net_config: IPv6 address: fe80::200:5eff:fe00:538b
uart:~$ ***** CPU Page Fault (error code 0x00000002)
Supervisor thread wrote address 0x00000000
PDE: 0x025 Present, Read-only, User, Execute Enabled
PTE: 0x00 Non-present, Read-only, Supervisor, Execute Enabled
Current thread ID = 0x004020d8
eax: 0x0040a2a8, ebx: 0x00000008, ecx: 0x00000000, edx: 0x00000004
esi: 0x00000001, edi: 0x00000216, ebp: 0x0040b8c8, esp: 0x0040b8c4
eflags: 0x00000087 cs: 0x0008
call trace:
eip: 0x0001ac75
     0x0001962d (0x401108)
     0x0000f1c1 (0x0)
     0x0000e202 (0x0)
     0x0000c35f (0x40dce0)
     0x00012cbc (0x409264)
     0x0000c912 (0x40dce0)
     0x000028cd (0x40dce0)
     0x0000288b (0x4020c0)
Fatal fault in thread 0x004020d8! Aborting.
[00:00:12.602,845] <inf> net_echo_server_sample: Run echo server
[00:00:12.613,100] <inf> net_echo_server_sample: Waiting for TCP connection (IPv6)...
[00:00:12.617,217] <inf> net_echo_server_sample: Waiting for TCP connection (IPv4)...
[00:00:12.621,280] <inf> net_echo_server_sample: Waiting for UDP packets (IPv6)...
[00:00:12.624,167] <inf> net_echo_server_sample: Waiting for UDP packets (IPv4)...
uart:~$

To Reproduce
Steps to reproduce the behavior:
0. master ddf744d
1. mkdir build; cd build
2. cmake -DBOARD=qemu_x86 ..
3. make run
4. See error above

On frdm_k64f, the fault happens on the first connection instead.

Expected behavior
No faults.

Impact
Can't use logging comfortably (#11655), d'oh.

Screenshots or console output

See above.

Environment (please complete the following information):

OS: Linux
Toolchain: Zephyr SDK 0.9.5.

pfalcon · 2019-02-27T13:56:36Z

eip: 0x0001ac75

void _handle_obj_poll_events(sys_dlist_t *events, u32_t state)
...
        node->prev->next = node->next;
   1ac70:       8b 48 04                mov    0x4(%eax),%ecx
   1ac73:       8b 18                   mov    (%eax),%ebx
   1ac75:       89 19                   mov    %ebx,(%ecx)
        node->next->prev = node->prev;
   1ac77:       8b 18                   mov    (%eax),%ebx
   1ac79:       89 4b 04                mov    %ecx,0x4(%ebx)
        node->next = NULL;
   1ac7c:       c7 00 00 00 00 00       movl   $0x0,(%eax)
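To make the listing easier to follow, here is a minimal, self-contained sketch of the unlink operation it shows. The struct and function names are hypothetical stand-ins, not the kernel's own definitions. The faulting instruction at 0x1ac75 is the store for `node->prev->next = node->next`, and the register dump shows ecx: 0x00000000, so `node->prev` was NULL at that point, which matches "wrote address 0x00000000". How the pointer became NULL (an overflowing stack clobbering the node, or something else) is not visible from the listing alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a doubly linked list node: 'next' first, 'prev'
 * second, matching the (%eax) and 0x4(%eax) accesses in the 32-bit listing. */
struct dnode {
	struct dnode *next;
	struct dnode *prev;
};

/* Initialise an empty circular list: the head points at itself. */
static void dlist_init(struct dnode *head)
{
	head->next = head;
	head->prev = head;
}

/* Append 'node' at the tail of the list. */
static void dlist_append(struct dnode *head, struct dnode *node)
{
	node->next = head;
	node->prev = head->prev;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical defensive check (an assumption, not kernel code): a node
 * whose links were zeroed cannot be unlinked safely. */
static bool dnode_is_linked(const struct dnode *node)
{
	return node->next != NULL && node->prev != NULL;
}

/* The same three stores as in the listing. With node->prev == NULL the
 * first store would write to address 0; the assert catches that case here. */
static void dlist_remove(struct dnode *node)
{
	assert(dnode_is_linked(node));
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = NULL;
}
```

Removing a properly linked node rewires its neighbours; calling `dlist_remove()` a second time on the same node trips the assert instead of dereferencing NULL.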

rlubos · 2019-02-27T14:14:45Z

Isn't this a duplicate of #13423? Enabling stack sentinel gives more specific output:

uart:~$ ***** Stack Check Fail! *****
Current thread ID = 0x004013a0
eax: 0x0040a42c, ebx: 0x00000247, ecx: 0x00000000, edx: 0x004021c4
esi: 0x00000247, edi: 0x00402140, ebp: 0x0040a574, esp: 0x0040a570
eflags: 0x00000007 cs: 0x0008
call trace:
eip: 0x0001a2cd
     0x00019936 (0x40dc20)
     0x0001a13b (0x4021d4)
     0x0001880d (0x40dc20)
     0x000134b1 (0x4020a0)
     0x0000288b (0x0)

rlubos · 2019-02-27T14:21:17Z

Or not, 0x004013a0 is actually net_mgmt thread. Increasing its stack fixed the issue under qemu_x86.

pfalcon · 2019-02-27T14:50:04Z

Or not, 0x004013a0 is actually net_mgmt thread.

@rlubos, any hint how to figure out which thread it is by its id "statically", e.g. by looking at zephyr.lst? I can't see a consistent way to do that, but by chance I see that 0x004020d8 in my stacktrace is sysworkq, will try to increase its stack size.

pfalcon · 2019-02-27T14:54:03Z

No, even using CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=4096 (up from the default 1K) doesn't help. And well, it's pretty clear from the quoted disassembly that something (logging? ;-) ) calls k_poll with some NULL param.

rlubos · 2019-02-27T14:55:38Z

Statically, I look into zephyr.map, and search for the address, here for example I got:

 .bss.mgmt_thread_data
                0x00000000004013a0       0x44 zephyr/subsys/net/ip/libsubsys__net__ip.a(net_mgmt.c.obj)

But typically, when running on hardware, I prefer info symbol 0x004013a0 from gdb.
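The manual map-file search can be sketched as a small helper. This is only an illustration under stated assumptions: `map_lookup` is a hypothetical name, and the parser handles only the two-line layout quoted above (a section-name line followed by an "address size object" line), not the full linker map format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scan linker-map text for the entry whose [start, start + size) range
 * contains 'addr' and copy the preceding section-name line into 'name'.
 * Returns 1 on a match, 0 otherwise. Assumes the two-line layout only. */
static int map_lookup(const char *map, unsigned long long addr,
		      char *name, size_t name_len)
{
	char last[128] = "";
	const char *line = map;

	assert(map != NULL && name != NULL && name_len > 0);

	while (*line != '\0') {
		size_t len = strcspn(line, "\n");
		char buf[256];
		char word[128];
		unsigned long long start, size;
		size_t copy = len < sizeof(buf) - 1 ? len : sizeof(buf) - 1;

		/* Work on a NUL-terminated copy of the current line. */
		memcpy(buf, line, copy);
		buf[copy] = '\0';

		if (sscanf(buf, " 0x%llx 0x%llx", &start, &size) == 2) {
			/* Address line: test it against the last name seen. */
			if (last[0] != '\0' && addr >= start &&
			    addr - start < size) {
				snprintf(name, name_len, "%s", last);
				return 1;
			}
		} else if (sscanf(buf, " %127s", word) == 1) {
			/* Anything else is treated as a section-name line. */
			snprintf(last, sizeof(last), "%s", word);
		}

		line += len;
		if (*line == '\n') {
			line++;
		}
	}
	return 0;
}
```

Fed the two map lines quoted above and the address 0x004013a0, the helper reports `.bss.mgmt_thread_data`, the same answer as the manual search.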

pfalcon · 2019-02-27T15:19:37Z

I see, so this boils down to the same infamous gnu ld misfeature, ignored for years: https://sourceware.org/bugzilla/show_bug.cgi?id=16566

jukkar · 2019-03-01T14:29:44Z

It seems that net_mgmt event thread stack is too small. With this setting CONFIG_NET_MGMT_EVENT_STACK_SIZE=1024 I did not see the crash any more.

Edit: @rlubos had already noticed the same fix, sorry for the noise, I missed some of the early discussion in this thread :)

nordic-krch · 2019-03-04T10:52:17Z

When LOG_IMMEDIATE is enabled, log processing happens in the user context, and that includes string formatting. Stack usage then rises by a couple of hundred bytes. Using _prf for string formatting plays the major part: switching to _vprintk decreases stack usage by 260 bytes. In #14036 I switched to using _vprintk by default.

nordic-krch · 2019-03-07T08:16:22Z

@pfalcon can we close it? When CONFIG_LOG_IMMEDIATE is used, thread stack usage increases because logging-related actions (with _vprintk playing the major role) happen in the user thread context.

I wonder if we shouldn't enable stack guard by default to avoid misinterpreted issues.

pfalcon · 2019-03-07T08:19:30Z

@nordic-krch: Thanks for all the info and changes, I would need to retest it, and then will be able to close. I remember that I have a few logging-related tickets to retest in my backlog. Thanks for your patience.

pfalcon · 2019-03-07T08:20:13Z

Dropped prio, reassigned to myself in the meantime.

nashif · 2019-03-07T12:15:55Z

every bug needs a priority, if this is not a bug, either close it or track it as something else please. Or better, close as is without removing priority (@nordic-krch should have done that with explanation, @pfalcon retests and reopens if bug still present).

pfalcon · 2019-03-07T13:38:13Z

every bug needs a priority, if this is not a bug, either close it or track it as something else please.

Well, actually I wanted to drop an RFC to the mailing list that we need a "waiting for verification" status for bugs. I skipped that, because there're too many fronts of work open already, and I'm not even sure if it should be a generic "waiting for feedback" status instead. And here it's not even that something was "fixed", it's that the issue was explained. As @nordic-krch writes:

I wonder if we shouldn't enable stack guard by default to avoid misinterpreted issues.

Deciding on things like that is what would amount to resolution of this issue.

Or better, close as is without removing priority (@nordic-krch should have done that with explanation, @pfalcon retests and reopens if bug still present).

I don't believe you propose to close unresolved bugs. And nope, I'm not a bug technician here to remember that there're some closed, but unverified bugs. But I definitely try to help with reporting bugs and managing them, that's why I update priorities, etc.

pfalcon · 2019-03-07T13:48:36Z

Ok, so the current status is with master 61bcd76, the issue still occurs (samples/net/sockets/echo_server with CONFIG_LOG_IMMEDIATE=y). So, something needs to be done about that.

I wonder if we shouldn't enable stack guard by default to avoid misinterpreted issues.

@nordic-krch, when you say "stack guard", which exactly Kconfig option do you mean?

pfalcon · 2019-03-07T14:16:44Z

when you say "stack guard", which exactly Kconfig option do you mean?

Ok, I assume you meant CONFIG_STACK_SENTINEL=y . (I'm somewhat mixed up with all those canaries and sentinels.)
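For readers mixed up in the same way, the idea behind a stack sentinel can be shown with a toy model: a known value sits at the lowest address of the stack and is checked periodically, so an overflow is reported as such instead of surfacing later as a fault in unrelated code. Everything below (names, sizes, the magic value) is illustrative and assumed for the sketch; it is not Zephyr's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SENTINEL    0xF0F0F0F0u /* illustrative magic value */
#define STACK_BYTES 256u        /* illustrative stack size */

/* Toy model of a descending stack: the sentinel occupies the lowest word,
 * so it is the first thing an overly deep call chain overwrites. */
struct toy_stack {
	uint32_t sentinel;
	uint8_t data[STACK_BYTES];
};

static void toy_stack_init(struct toy_stack *s)
{
	assert(s != NULL);
	s->sentinel = SENTINEL;
	memset(s->data, 0, sizeof(s->data));
}

/* Simulate a call chain that consumes 'used' bytes from the top of the
 * stack downwards. Writes are clamped to the toy object itself. */
static void toy_stack_use(struct toy_stack *s, size_t used)
{
	uint8_t *base = (uint8_t *)s;
	size_t total = sizeof(*s);

	if (used > total) {
		used = total;
	}
	memset(base + total - used, 0xAA, used);
}

/* The periodic check: true while the magic value is still intact. */
static bool toy_stack_ok(const struct toy_stack *s)
{
	return s->sentinel == SENTINEL;
}
```

In this model, using up to `STACK_BYTES` leaves the sentinel intact, while a single extra byte corrupts it. That is the kind of difference a couple of hundred extra bytes of string formatting can make on a small thread stack.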

pfalcon · 2019-03-07T14:38:40Z

So, here's an example of a change which stems from looking into how to address this ticket: #14155 . As you can imagine, that's a pretty "far" and partial change. 3-4 (or maybe 5-6) more changes like that are required to call this ticket resolved.

Sorry, but I don't have time to do all those 3-6 changes now, I'm already occupied with previous changes to make. So, this ticket is likely going to be open for a while. It's ok to move it into 1.15 timeframe of course, once it's clear it doesn't fit into 1.14. Thanks.

andrewboie · 2019-03-08T15:54:24Z

uart:~$ ***** CPU Page Fault (error code 0x00000002)
Supervisor thread wrote address 0x00000000
PDE: 0x025 Present, Read-only, User, Execute Enabled
PTE: 0x00 Non-present, Read-only, Supervisor, Execute Enabled
Current thread ID = 0x004020d8

Are you sure stack overflow is the culprit?
This looks like a thread is writing to a NULL pointer.

jukkar · 2019-05-22T19:32:22Z

Looks like this is no longer a valid issue -> closing.

pfalcon added the bug The issue is a bug, or the PR is fixing a bug label Feb 27, 2019

pfalcon assigned nordic-krch Feb 27, 2019

pfalcon added area: Networking area: Logging labels Feb 27, 2019

pfalcon mentioned this issue Feb 27, 2019

frdm_k64f: samples/net/sockets/echo_server doesn't work #13301

Closed

pfalcon changed the title ~~CONFIG_LOG_IMMEDIATE leads to fault in k_poll functions~~ CONFIG_LOG_IMMEDIATE leads to fault in one of k_poll sub-routines Feb 27, 2019

rljordan-zz added the priority: medium Medium impact/importance bug label Mar 1, 2019

nordic-krch mentioned this issue Mar 4, 2019

logging: Use vprintk for stirng formatting by default #14036

Merged

pfalcon removed the priority: medium Medium impact/importance bug label Mar 7, 2019

pfalcon assigned pfalcon and unassigned nordic-krch Mar 7, 2019

nashif added the priority: medium Medium impact/importance bug label Mar 7, 2019

pfalcon added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels Mar 7, 2019

pfalcon changed the title ~~CONFIG_LOG_IMMEDIATE leads to fault in one of k_poll sub-routines~~ CONFIG_LOG_IMMEDIATE leads to fault in one of k_poll sub-routines due to extended stack usage Mar 7, 2019

pfalcon mentioned this issue Mar 7, 2019

arch: x86: fatal: If possible, print thread name in crash dump #14155

Merged

pfalcon changed the title ~~CONFIG_LOG_IMMEDIATE leads to fault in one of k_poll sub-routines due to extended stack usage~~ CONFIG_LOG_IMMEDIATE leads to unobvious faults in unrelated routines due to stack overflow Mar 7, 2019

jukkar closed this as completed May 22, 2019
