Skip to content

kernel/sched: More nonatomic swap fixes #13770

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Feb 27, 2019

Conversation

andyross
Copy link
Collaborator

[This was popping up all over the place on ARM tests, both in hardware and emulation. Seems like a likely cause for #13536, #12559 and #12352 at least.]

Nonatomic swap strikes again. These issues are all longstanding, but
were unmasked by the dlist work in commit d40b8ce ("sys: dlist:
Add sys_dnode_is_linked") where list node pointers become nulls on
removal.

The previous fix was for a specific case where a timeslicing interrupt
would try to slice out the "wrong" current thread because the thread
has "just" pended itself. That was incomplete, because the parallel
code in k_sleep() didn't flag itself the same way.

And beyond that, it turns out to be basically impossible (now that I'm
thinking about it correctly) to prevent interrupt code from calling
into the scheduler to suspend a "just pended but not quite" current
and/or preempt away to another thread. In any of these cases, the
scheduler modifications to the state bits remain correct but the queue
nodes may be corrupt because the thread was already removed from the
ready queue. So we have to test and correct this at the lowest level,
where a thread is being removed from a priq: check that it's (1) the
ready queue and not a waitq, (2) the current thread, and (3) already
marked suspended and thus not in the queue.

There are lots of existing issues filed in the last few months all
pointing to odd instability on ARM platforms. I'm reasonably certain
this is the root cause for most or all of them.

Signed-off-by: Andy Ross [email protected]

Nonatomic swap strikes again.  These issues are all longstanding, but
were unmasked by the dlist work in commit d40b8ce ("sys: dlist:
Add sys_dnode_is_linked") where list node pointers become nulls on
removal.

The previous fix was for a specific case where a timeslicing interrupt
would try to slice out the "wrong" current thread because the thread
has "just" pended itself.  That was incomplete, because the parallel
code in k_sleep() didn't flag itself the same way.

And beyond that, it turns out to be basically impossible (now that I'm
thinking about it correctly) to prevent interrupt code from calling
into the scheduler to suspend a "just pended but not quite" current
and/or preempt away to another thread.  In any of these cases, the
scheduler modifications to the state bits remain correct but the queue
nodes may be corrupt because the thread was already removed from the
ready queue.  So we have to test and correct this at the lowest level,
where a thread is being removed from a priq: check that it's (1) the
ready queue and not a waitq, (2) the current thread, and (3) already
marked suspended and thus not in the queue.

There are lots of existing issues filed in the last few months all
pointing to odd instability on ARM platforms.  I'm reasonably certain
this is the root cause for most or all of them.

Signed-off-by: Andy Ross <[email protected]>
@codecov-io
Copy link

Codecov Report

Merging #13770 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #13770   +/-   ##
=======================================
  Coverage   52.26%   52.26%           
=======================================
  Files         307      307           
  Lines       45408    45408           
  Branches    10504    10504           
=======================================
  Hits        23733    23733           
  Misses      16886    16886           
  Partials     4789     4789
Impacted Files Coverage Δ
kernel/sched.c 84.01% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0f6b6fc...277d52f. Read the comment docs.

@andrewboie andrewboie merged commit dff6b71 into zephyrproject-rtos:master Feb 27, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants