Skip to content

CONFIG_LOG_IMMEDIATE leads to unobvious faults in unrelated rotines due to stack overflow #13897

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pfalcon opened this issue Feb 27, 2019 · 19 comments
Assignees
Labels
area: Logging area: Networking bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug

Comments

@pfalcon
Copy link
Collaborator

pfalcon commented Feb 27, 2019

Describe the bug
Running samples/net/sockets/echo_server with CONFIG_LOG_IMMEDIATE=y leads to fault on startup and/or on first connect. E.g. for qemu_x86 it leads to fault soon (~0.3s) after startup:

SeaBIOS (version rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org)
Booting from ROM..***** Booting Zephyr OS zephyr-v1.13.0-5183-gddf744deee39 *****


[00:00:00.524,083] <inf> net_config: IPv6 address: fe80::200:5eff:fe00:538b
[00:00:00.560,762] <inf> net_config: IPv6 address: fe80::200:5eff:fe00:538b
uart:~$ ***** CPU Page Fault (error code 0x00000002)
Supervisor thread wrote address 0x00000000
PDE: 0x025 Present, Read-only, User, Execute Enabled
PTE: 0x00 Non-present, Read-only, Supervisor, Execute Enabled
Current thread ID = 0x004020d8
eax: 0x0040a2a8, ebx: 0x00000008, ecx: 0x00000000, edx: 0x00000004
esi: 0x00000001, edi: 0x00000216, ebp: 0x0040b8c8, esp: 0x0040b8c4
eflags: 0x00000087 cs: 0x0008
call trace:
eip: 0x0001ac75
     0x0001962d (0x401108)
     0x0000f1c1 (0x0)
     0x0000e202 (0x0)
     0x0000c35f (0x40dce0)
     0x00012cbc (0x409264)
     0x0000c912 (0x40dce0)
     0x000028cd (0x40dce0)
     0x0000288b (0x4020c0)
Fatal fault in thread 0x004020d8! Aborting.
[00:00:12.602,845] <inf> net_echo_server_sample: Run echo server
[00:00:12.613,100] <inf> net_echo_server_sample: Waiting for TCP connection (IPv6)...
[00:00:12.617,217] <inf> net_echo_server_sample: Waiting for TCP connection (IPv4)...
[00:00:12.621,280] <inf> net_echo_server_sample: Waiting for UDP packets (IPv6)...
[00:00:12.624,167] <inf> net_echo_server_sample: Waiting for UDP packets (IPv4)...
uart:~$ 

To Reproduce
Steps to reproduce the behavior:
0. master ddf744d

  1. mkdir build; cd build
  2. cmake -DBOARD=qemu_x86
  3. make run
  4. See error above

On frdm_k64f, the fault happens of first connection instead.

Expected behavior
No faults.

Impact
Can't use logging comfortably (#11655), d'oh.

Screenshots or console output

See above.

Environment (please complete the following information):

  • OS: Linux
  • Toolchain: Zephyr SDK 0.9.5.
@pfalcon pfalcon added the bug The issue is a bug, or the PR is fixing a bug label Feb 27, 2019
@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 27, 2019

eip: 0x0001ac75

void _handle_obj_poll_events(sys_dlist_t *events, u32_t state)
...
        node->prev->next = node->next;
   1ac70:       8b 48 04                mov    0x4(%eax),%ecx
   1ac73:       8b 18                   mov    (%eax),%ebx
   1ac75:       89 19                   mov    %ebx,(%ecx)
        node->next->prev = node->prev;
   1ac77:       8b 18                   mov    (%eax),%ebx
   1ac79:       89 4b 04                mov    %ecx,0x4(%ebx)
        node->next = NULL;
   1ac7c:       c7 00 00 00 00 00       movl   $0x0,(%eax)

@rlubos
Copy link
Collaborator

rlubos commented Feb 27, 2019

Isn't this a duplicate of #13423? Enabling stack sentinel gives more specific output:

uart:~$ ***** Stack Check Fail! *****
Current thread ID = 0x004013a0
eax: 0x0040a42c, ebx: 0x00000247, ecx: 0x00000000, edx: 0x004021c4
esi: 0x00000247, edi: 0x00402140, ebp: 0x0040a574, esp: 0x0040a570
eflags: 0x00000007 cs: 0x0008
call trace:
eip: 0x0001a2cd
     0x00019936 (0x40dc20)
     0x0001a13b (0x4021d4)
     0x0001880d (0x40dc20)
     0x000134b1 (0x4020a0)
     0x0000288b (0x0)

@rlubos
Copy link
Collaborator

rlubos commented Feb 27, 2019

Or not, 0x004013a0 is actually net_mgmt thread. Increasing it's stack fixed the issue under qemu_x86.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 27, 2019

Or not, 0x004013a0 is actually net_mgmt thread.

@rlubos, any hint how to figure out what thread is it by its id "statically", e.g. by looking at zephyr.lst? I can't see a consistent way to do that, but by a chance I see that 0x004020d8 in my stacktrace is sysworkq, will try to increase its stack size.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 27, 2019

No, even using CONFIG_SYSTEM_WORKQUEUE_STACK_SIZE=4096 (from default 1K) doesn't help. And well, it's pretty clear by quoted disassembly that something (logging? ;-) ) calls k_poll with some NULL param.

@pfalcon pfalcon changed the title CONFIG_LOG_IMMEDIATE leads to fault in k_poll functions CONFIG_LOG_IMMEDIATE leads to fault in one of k_poll sub-routines Feb 27, 2019
@rlubos
Copy link
Collaborator

rlubos commented Feb 27, 2019

Statically, I look into zephyr.map, and search for the address, here for example I got:

 .bss.mgmt_thread_data
                0x00000000004013a0       0x44 zephyr/subsys/net/ip/libsubsys__net__ip.a(net_mgmt.c.obj)

But typically, when running on hardware, I prefer info symbol 0x004013a0 from gdb.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Feb 27, 2019

I see, so this boils down to the same infamous gnu ld misfeature, ignored for years: https://sourceware.org/bugzilla/show_bug.cgi?id=16566

@rljordan-zz rljordan-zz added the priority: medium Medium impact/importance bug label Mar 1, 2019
@jukkar
Copy link
Member

jukkar commented Mar 1, 2019

It seems that net_mgmt event thread stack is too small. With this setting CONFIG_NET_MGMT_EVENT_STACK_SIZE=1024 I did not see the crash any more.

Edit: @rlubos had already noticed the same fix, sorry for the noise, I missed some of the early discussion in this thread :)

@nordic-krch
Copy link
Collaborator

When LOG_IMMEDIATE is enabled then log processing happens in the user context, that includes string formatting. Stack usage will rise then by couple of hundreds bytes. Using _prf for string formatting plays the major part as when switched to _vprintk stack usage is decreased by 260 bytes. In #14036 I switched to use _vprintk by default.

@nordic-krch
Copy link
Collaborator

@pfalcon can we close it? When CONFIG_LOG_IMMEDIATE is used then thread stack usage will increase because logging related actions (_vprintk playing major role) happens in user thread context.

I wonder if we shouldn't enable stack guard by default to avoid misinterpreted issues.

@pfalcon
Copy link
Collaborator Author

pfalcon commented Mar 7, 2019

@nordic-krch: Thanks for all the info and changes, I would need to retest it, and them will be able to close. I remember that I have a few logging-related tickets to retest in my backlog. Thanks for your patience.

@pfalcon pfalcon removed the priority: medium Medium impact/importance bug label Mar 7, 2019
@pfalcon pfalcon assigned pfalcon and unassigned nordic-krch Mar 7, 2019
@pfalcon
Copy link
Collaborator Author

pfalcon commented Mar 7, 2019

Dropped prio, reassigned to myself in the meantime.

@nashif
Copy link
Member

nashif commented Mar 7, 2019

every bug needs a priority, if this is not a bug, either close it or track it as something else please. Or better, close as is without removing priority (@nordic-krch should have done that with explanation, @pfalcon retests and reopens if bug still present).

@nashif nashif added the priority: medium Medium impact/importance bug label Mar 7, 2019
@pfalcon pfalcon added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels Mar 7, 2019
@pfalcon
Copy link
Collaborator Author

pfalcon commented Mar 7, 2019

every bug needs a priority, if this is not a bug, either close it or track it as something else please.

Well, actually I wanted to drop an RFC to the mailing list that we need "waiting for verification" status for bugs. I skipped that, because there're to many fronts of work open already, and I'm not even sure if it should be generic "waiting for feedback" status instead. And here it's not even that something was "fixed", it's that issue was explained. As @nordic-krch writes:

I wonder if we shouldn't enable stack guard by default to avoid misinterpreted issues.

Deciding on things like that what would amount to resolution of this issue.

Or better, close as is without removing priority (@nordic-krch should have done that with explanation, @pfalcon retests and reopens if bug still present).

I don't believe you propose to close unresolved bugs. And nope, I'm not a bug technician here to remember that there're some closed, but unverified bugs. But I definitely try to help with reporting bugs and managing them, that's why I update priorities, etc.

@pfalcon pfalcon changed the title CONFIG_LOG_IMMEDIATE leads to fault in one of k_poll sub-routines CONFIG_LOG_IMMEDIATE leads to fault in one of k_poll sub-routines due to extended stack usage Mar 7, 2019
@pfalcon
Copy link
Collaborator Author

pfalcon commented Mar 7, 2019

Ok, so the current status is with master 61bcd76, the issue still occurs (samples/net/sockets/echo_server with CONFIG_LOG_IMMEDIATE=y). So, something needs to be done about that.

I wonder if we shouldn't enable stack guard by default to avoid misinterpreted issues.

@nordic-krch, when you say "stack guard", which exactly Kconfig option do you mean?

@pfalcon
Copy link
Collaborator Author

pfalcon commented Mar 7, 2019

when you say "stack guard", which exactly Kconfig option do you mean?

Ok, I assume you meant CONFIG_STACK_SENTINEL=y . (I'm somewhat mixed up with all those canaries and sentinels.)

@pfalcon
Copy link
Collaborator Author

pfalcon commented Mar 7, 2019

So, here's an example of changes which stems from looking into how to address this ticket: #14155 . As you imagine, that's pretty "far" and partial changes. 3-4 (or maybe 5-6) more changes like that are required to call this ticket resolved.

Sorry, but I don't have time to do all those 3-6 changes now, I'm already occupied with previous changes to make. So, this ticket is likely going to be open for a while. It's ok to move it into 1.15 timeframe of course, once it's clear it doesn't fit into 1.14. Thanks.

@pfalcon pfalcon changed the title CONFIG_LOG_IMMEDIATE leads to fault in one of k_poll sub-routines due to extended stack usage CONFIG_LOG_IMMEDIATE leads to unobvious faults in unrelated rotines due to stack overflow Mar 7, 2019
@andrewboie
Copy link
Contributor

uart:~$ ***** CPU Page Fault (error code 0x00000002)
Supervisor thread wrote address 0x00000000
PDE: 0x025 Present, Read-only, User, Execute Enabled
PTE: 0x00 Non-present, Read-only, Supervisor, Execute Enabled
Current thread ID = 0x004020d8

Are you sure stack overflow is the culprit?
This looks like a thread is writing to a NULL pointer.

@jukkar
Copy link
Member

jukkar commented May 22, 2019

Looks like this is no longer a valid issue -> closing.

@jukkar jukkar closed this as completed May 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Logging area: Networking bug The issue is a bug, or the PR is fixing a bug priority: low Low impact/importance bug
Projects
None yet
Development

No branches or pull requests

7 participants