kernel/sched: More nonatomic swap fixes #13770
Merged
[This was popping up all over the place on ARM tests, both in hardware and emulation. Seems like a likely cause for #13536, #12559 and #12352 at least.]
Nonatomic swap strikes again. These issues are all longstanding, but
were unmasked by the dlist work in commit d40b8ce ("sys: dlist:
Add sys_dnode_is_linked") where list node pointers are set to NULL on
removal.
The previous fix was for a specific case where a timeslicing interrupt
would try to slice out the "wrong" current thread because the thread
had "just" pended itself. That fix was incomplete, because the parallel
code in k_sleep() didn't flag itself the same way.
And beyond that, it turns out to be basically impossible (now that I'm
thinking about it correctly) to prevent interrupt code from calling
into the scheduler to suspend a "just pended but not quite" current
and/or preempt away to another thread. In any of these cases, the
scheduler modifications to the state bits remain correct but the queue
nodes may be corrupt because the thread was already removed from the
ready queue. So we have to test and correct this at the lowest level,
where a thread is being removed from a priq: check that it's (1) the
ready queue and not a waitq, (2) the current thread, and (3) already
marked suspended and thus not in the queue.
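The three-part check described above can be sketched roughly as follows. This is an illustrative model only, not the code from this patch: the struct layouts, the `suspended` flag, and the helper names (`priq_init`, `priq_add`, `priq_remove`, `node_is_linked`) are all hypothetical stand-ins for the kernel's real types and functions. It assumes, as the dlist change does, that a node's pointers are NULL once it has been unlinked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct node { struct node *next, *prev; };

struct thread {
    struct node qnode;
    bool suspended; /* assumed flag: thread already marked pended/suspended */
};

struct priq { struct node head; };

static struct priq ready_q;     /* the one ready queue */
static struct thread *current;  /* the currently running thread */

static void priq_init(struct priq *q)
{
    q->head.next = q->head.prev = &q->head;
}

/* A node whose pointers were nulled on removal is not in any list. */
static bool node_is_linked(const struct node *n)
{
    return n->next != NULL;
}

/* Append a thread at the tail of the queue. */
static void priq_add(struct priq *q, struct thread *t)
{
    struct node *n = &t->qnode;

    n->prev = q->head.prev;
    n->next = &q->head;
    q->head.prev->next = n;
    q->head.prev = n;
}

/*
 * Remove a thread from a queue. Returns false and touches nothing when
 * all three conditions hold: (1) the queue is the ready queue, not a
 * waitq, (2) the thread is the current thread, and (3) it is already
 * marked suspended, meaning it was unlinked earlier.
 */
static bool priq_remove(struct priq *q, struct thread *t)
{
    struct node *n = &t->qnode;

    if (q == &ready_q && t == current && t->suspended) {
        return false;
    }

    assert(node_is_linked(n));
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->next = n->prev = NULL; /* pointers become NULL on removal */
    return true;
}
```

Without the early return, a second removal of the "just pended" current thread would dereference the NULL pointers left behind by the first removal, which is the kind of corruption the dlist change exposed.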
There are lots of existing issues filed in the last few months all
pointing to odd instability on ARM platforms. I'm reasonably certain
this is the root cause for most or all of them.
Signed-off-by: Andy Ross [email protected]