
intermittent issue with tests/kernel/fatal #7291

Closed
andrewboie opened this issue May 1, 2018 · 3 comments
Labels: area: Kernel, bug, priority: low

Comments

@andrewboie
Contributor

starting test - test_fatal
test alt thread 1: generic CPU exception
***** CPU exception 6
Current thread ID = 0x00400060
eax: 0x00000000, ebx: 0x00000000, ecx: 0x80000000, edx: 0x0040e040
esi: 0x00000000, edi: 0x00004d85, ebp: 0x00405fd0, esp: 0x00405fd0
eflags: 0x00000206 cs: 0x0008
call trace:
eip: 0x00001322
     0x00004d95 (0x0)
Caught system error -- reason 6
test alt thread 2: initiate kernel oops
***** Kernel OOPS! *****
Current thread ID = 0x00400060
eax: 0x00000206, ebx: 0x00000000, ecx: 0x80000000, edx: 0x0040e040
esi: 0x00000000, edi: 0x00004d85, ebp: 0x00405fd0, esp: 0x00405fcc
eflags: 0x00000006 cs: 0x0008
call trace:
eip: 0x0000133a
     0x00004d95 (0x0)
Caught system error -- reason 7
test alt thread 3: initiate kernel panic
***** Kernel Panic! *****
Current thread ID = 0x00400060
eax: 0x00000206, ebx: 0x00000000, ecx: 0x80000000, edx: 0x0040e040
esi: 0x00000000, edi: 0x00004d85, ebp: 0x00405fd0, esp: 0x00405fcc
eflags: 0x00000006 cs: 0x0008
call trace:
eip: 0x00001344
     0x00004d95 (0x0)
Caught system error -- reason 8
test stack overflow - timer irq
***** Stack Check Fail! *****
Current thread ID = 0x00400060
eax: 0x00004d95, ebx: 0x00000000, ecx: 0x80000000, edx: 0x0040e040
esi: 0x00000000, edi: 0x00004d85, ebp: 0x00405fc8, esp: 0x00404fc8
eflags: 0x00000202 cs: 0x0008
call trace:
eip: 0x00001730
     0x00001764 (0x405fe8)
     0x00004d95 (0x0)
Caught system error -- reason 4249512

    Assertion failed at /projects/zephyr/tests/kernel/fatal/src/main.c:224: test_fatal: (crash_reason not equal to expected_reason)
bad reason code got 4249512 expected 4
FAIL - test_fatal.

Seems to be some kind of race? This only happens sometimes.

@nashif nashif added the bug The issue is a bug, or the PR is fixing a bug label May 2, 2018
@MaureenHelm MaureenHelm added priority: medium Medium impact/importance bug area: Kernel labels May 4, 2018
@nashif
Member

nashif commented May 29, 2018

Moving to low, since this is intermittent and not easily reproducible.

@nashif nashif added priority: low Low impact/importance bug and removed priority: medium Medium impact/importance bug labels May 29, 2018
@spoorthik
Contributor

spoorthik commented Jul 16, 2018

I also hit a similar crash when running sanitycheck:

Latest Commit ID: afad09d

***** Booting Zephyr OS v1.12.0-790-gafad09d *****
Running test suite fatal
===================================================================
starting test - test_fatal
test alt thread 1: generic CPU exception
***** CPU exception 6
Current thread ID = 0x00400060
eax: 0x00000000, ebx: 0x00000000, ecx: 0x80000000, edx: 0x00413060
esi: 0x00000000, edi: 0x000016fe, ebp: 0x00406fd0, esp: 0x00406fd0
eflags: 0x00000216 cs: 0x0008
call trace:
eip: 0x00001231
     0x0000170e (0x0)
Caught system error -- reason 6
test alt thread 2: initiate kernel oops
***** Kernel OOPS! *****
Current thread ID = 0x00400060
eax: 0x00000216, ebx: 0x00000000, ecx: 0x80000000, edx: 0x00413060
esi: 0x00000000, edi: 0x000016fe, ebp: 0x00406fd0, esp: 0x00406fcc
eflags: 0x00000016 cs: 0x0008
call trace:
eip: 0x00001249
     0x0000170e (0x0)
Caught system error -- reason 7
test alt thread 3: initiate kernel panic
***** Kernel Panic! *****
Current thread ID = 0x00400060
eax: 0x00000216, ebx: 0x00000000, ecx: 0x80000000, edx: 0x00413060
esi: 0x00000000, edi: 0x000016fe, ebp: 0x00406fd0, esp: 0x00406fcc
eflags: 0x00000016 cs: 0x0008
call trace:
eip: 0x00001253
     0x0000170e (0x0)
Caught system error -- reason 8
test stack overflow - timer irq
***** General Protection Fault
***** Exception code: 0x60
Current thread ID = 0x00400060
eax: 0x004127f0, ebx: 0x00000000, ecx: 0x00000000, edx: 0x00000216
esi: 0x00000000, edi: 0x00000216, ebp: 0x004127fc, esp: 0x004127d0
eflags: 0x00000046 cs: 0x0008
call trace:
eip: 0x00002d1f
Caught system error -- reason 6

    Assertion failed at zephyr.git/tests/kernel/fatal/src/main.c:234: test_fatal: (crash_reason not equal to _N
bad reason code got 6 expected 4

FAIL - test_fatal
===================================================================
===================================================================
PROJECT EXECUTION FAILED

andrewboie pushed a commit to andrewboie/zephyr that referenced this issue Feb 13, 2019
In the event of a double fault, we do a HW task switch to
a special _df_tss hardware task which resets the stack
pointer to the interrupt stack and otherwise restores
the main hardware task to a runnable state so that
_df_handler_bottom() can run.

However, we need to make sure that _df_handler_bottom()
runs with interrupts locked, otherwise another IRQ could
corrupt the interrupt stack resulting in undefined
behavior.

We have very little stack space to work with in this
context, so just zero EFLAGS. It's a fatal error for the
thread in any event.

Fixes: zephyrproject-rtos#7291

Signed-off-by: Andrew Boie <[email protected]>
@andrewboie andrewboie self-assigned this Feb 13, 2019
@andrewboie
Contributor Author

I think this is an unlucky timer interrupt arriving while the double-fault handler's bottom half is using the interrupt stack, which corrupts the reason code. The bottom half doesn't ensure that interrupts are locked, so an interrupt that fires at that point will clobber its context.

Sent a patch to zero EFLAGS while _df_handler_bottom() is running; previously it inherited whatever flags were set when the double fault happened.

andrewboie pushed a commit that referenced this issue Feb 13, 2019