
mempool allocator can return with no allocation even if memory is available #14508


Closed
pdunaj opened this issue Mar 14, 2019 · 10 comments · Fixed by #14514
Labels: area: Kernel, bug, priority: medium

Comments

@pdunaj
Collaborator

pdunaj commented Mar 14, 2019

Describe the bug
I was wondering if #14504 could happen even when memory is actually available. I reviewed the code again and I think it can. This is a theoretical scenario; I have not yet tried to reproduce it.

_sys_mem_pool_block_alloc has two loops. The first finds the level at which free blocks are available (free_l) and the level at which we would like to make the allocation (alloc_l). The second performs the allocation and, if the allocation is done at a higher level, splits the allocated block.

Assume preconditions where no allocation has been made and only the biggest block is available (i.e. only level 0 is non-empty). Assume we want to allocate a small block from the bottom level (anything but the biggest block, actually).
Imagine the following race. Context A (e.g. a thread) traverses the first loop. As nothing has been broken down yet, it finds that the only non-empty level (i.e. with free blocks) is level 0. It also finds the level at which it would like to allocate; that level will be greater than zero, but its exact value is not important.
After finding the two indexes, but before context A can take the lock and do any allocation, a high-priority context B (e.g. an interrupt) is scheduled and takes over the CPU. It performs the same traversal as context A and finds that only level 0 is non-empty. It takes the lock and enters the second loop. Only level 0 had a non-empty list of free blocks, so it loops once. Block allocation succeeds, data is not NULL, and since a smaller chunk was requested the block is broken down, filling in the lists on the higher levels. Note that level 0 had only one block, so allocating it left the list empty. Context B is done and context A is scheduled back. It resumes just before the second loop. It found earlier that only level 0 had a non-empty list, so it loops only once. It tries to allocate but fails because the list is now empty (context B took the only available element). As this was the only iteration, the loop finishes with nothing allocated: data is NULL and success is reported (see the other bug).
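To make the window concrete, here is a minimal, self-contained toy model of the two-phase structure described above. It is not the Zephyr code: the names (toy_pool, toy_alloc), the per-level free counters and the four-way split are invented for illustration, and the lock is only indicated in comments.

```c
#define N_LEVELS 4

struct toy_pool {
	/* number of free blocks per level; level 0 = biggest blocks */
	int free_count[N_LEVELS];
};

/* Returns the level allocated from, or -1 for "no memory". */
int toy_alloc(struct toy_pool *p, int want_level)
{
	int free_l = -1;

	/* Loop 1 (runs WITHOUT the pool lock): find the first level,
	 * up to want_level, that currently has a free block.
	 */
	for (int i = 0; i <= want_level; i++) {
		if (p->free_count[i] > 0) {
			free_l = i;
			break;
		}
	}
	if (free_l < 0) {
		return -1;
	}

	/* RACE WINDOW: a higher-priority context can run this same code
	 * here, take the only block on free_l and split it, moving the
	 * free memory to levels that loop 1 has already decided on.
	 */

	/* Loop 2 (the real code takes the lock around each step): take a
	 * block from free_l and split it down to want_level.
	 */
	for (int i = free_l; i <= want_level; i++) {
		if (p->free_count[i] == 0) {
			return -1;	/* stale free_l: give up too early */
		}
		p->free_count[i]--;	/* take one block at this level */
		if (i < want_level) {
			/* split it: four children appear one level down */
			p->free_count[i + 1] += 4;
		}
	}
	return want_level;
}
```

With free_count = {1, 0, 0, 0}, if context B runs toy_alloc(p, 2) inside the marked window, context A's second loop starts from its stale free_l = 0, finds that level empty, and returns -1 even though levels 1 and 2 now hold free blocks.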

To Reproduce
Steps to reproduce the behavior:
See description

Expected behavior
Allocation should always return a block if memory is available.

Impact
showstopper

Screenshots or console output
N/A

Environment (please complete the following information):
ncs zephyr 745326266d9c32ba3aa67d5af1f727059d9a88e3 (from d9876be upstream)

Additional context
N/A

@pdunaj added the bug and area: Kernel labels on Mar 14, 2019
@pdunaj
Collaborator Author

pdunaj commented Mar 14, 2019

There is another bug; I will not create a separate issue as the effect is the same (allocation will fail while memory is available).

Assume the following: level 0 is empty, level 1 is 3/4 free, and 1 chunk is allocated.
Reproduction is as follows. Context A starts traversing the first loop. It checks level 0, sees that it is empty, and continues. At this point a high-priority context B takes over and performs a free. When the free is done, all blocks on level 1 are merged and put back into level 0 as one big block. Context B finishes and context A is scheduled back. It continues the loop at level 1 and sees that there are no free blocks on this level nor on any higher level. After the loop is done it finishes with -ENOMEM.
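In terms of the toy model sketched in the issue description, a hypothetical free path with merging shows this second miss as well (again invented names; the merge condition is simplified to "all four siblings free"):

```c
/* Companion to the toy model above: freeing a block may merge a complete
 * set of four siblings back into one parent block one level up.
 */
void toy_free(struct toy_pool *p, int level)
{
	p->free_count[level]++;
	while (level > 0 && p->free_count[level] == 4) {
		p->free_count[level] = 0;	/* four siblings...     */
		p->free_count[level - 1]++;	/* ...become one parent */
		level--;
	}
}
```

Timeline: the pool starts at {0, 3, 0, 0} with one level-1 block in use; context A's first loop checks level 0 and sees it empty; context B calls toy_free(p, 1), merging everything back to {1, 0, 0, 0}; context A resumes at level 1, finds every remaining level empty, and fails (-1 in the toy, -ENOMEM in the real code) although a whole level-0 block is free.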

@pdunaj
Collaborator Author

pdunaj commented Mar 14, 2019

FYI @andyross

@pdunaj
Collaborator Author

pdunaj commented Mar 14, 2019

I have pushed the fix.
Shame I did not see the problem the previous time I reviewed this code... :(

@andyross
Collaborator

Heh, never mind. You already have a patch up. You take it :)

@andyross
Collaborator

Yeah, we can't do it this way. The fix effectively wraps the whole allocation in a lock, as I understand it. That's counter to the original design intent. Maybe try flagging the existence of a race somehow with a "last thread to enter allocator" state variable or something as a cue to retry?

@pdunaj
Collaborator Author

pdunaj commented Mar 14, 2019

Hi @andyross, if you check again you will notice that the amount of code under the lock has not really increased. What was added is (essentially) a check whether the free list at level n is empty. And you need to do that under a lock anyway, as otherwise the content of the list may be corrupted. The costly operations of allocating and breaking blocks are already under a lock.

Yes, there is an alternative. Each time the allocator state is updated, a token could be changed. The block allocator could take the token, do a barrier, try allocating and, in case of nomem, grab the current token and retry if it changed. I was thinking about this solution, but the main problem is that it is unbounded. Second, as I wrote above, the change brings under the lock only the checks for which levels are empty (which is a really cheap operation). Third, such solutions are prone to bugs.

The current state is already away from the original design, as locks were introduced to handle various race conditions. Please review the code again; the number of operations added under the lock is really small.
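For illustration, the token alternative described above could look roughly like the following, reusing toy_alloc from the sketch in the issue description. This is a hypothetical sketch, not proposed code; the names (pool_update_token, note_pool_update) are invented.

```c
#include <stdatomic.h>

/* Bumped on every allocator state change (alloc, split, free, merge). */
static atomic_uint pool_update_token;

void note_pool_update(void)
{
	atomic_fetch_add(&pool_update_token, 1);
}

int toy_alloc_tokened(struct toy_pool *p, int want_level)
{
	unsigned int seen;
	int level;

	do {
		seen = atomic_load(&pool_update_token);
		level = toy_alloc(p, want_level);

		/* Success, or a failure with an untouched pool, is final. */
		if (level >= 0 || atomic_load(&pool_update_token) == seen) {
			return level;
		}
		/* The token moved, so the failure may be the race: retry.
		 * Note the loop is unbounded, which is the objection above.
		 */
	} while (1);
}
```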

@andyross
Collaborator

I swear I see it holding the lock around a for() loop, no? Can you split the patch to separate the fixes from each other, and move the changes to formatting and naming into an early refactoring patch? I'll take a closer look.

But the core requirement here is that the lock needs to be held for absolutely minimum amounts of time.

FWIW: I just very quickly threw together a "retry" version of the fix. It seems to pass tests, and looks like this: https://gist.github.com/andyross/e94fbb8f2635e26c9c714106dd4006d5

Obviously it involves extra CPU work in the case of contention, but the worst case latency is 100% unchanged. I'm bailing out after four tries right now, which IMHO is acceptable (trying and failing to get memory in the face of that much contention is almost certainly a legitimate runtime failure!).

But even that could be handled one level up at the kernel mempool layer, where we can block. We could provide a "block new allocators" and "wakeup" signal at that level, which might be a better idea still.

It's not 100% clear to me if this patch alone fixes the issues you saw or if there's more complexity. But broadly, I think this is where we want to be as far as design goes.
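The overall shape of the bounded-retry approach (a sketch of the idea only, not the contents of the gist; the bound of four tries is taken from the comment above, and the wrapper reuses toy_alloc from the sketch in the issue description):

```c
#define ALLOC_MAX_TRIES 4

/* Re-run the scan-then-allocate sequence a bounded number of times.  A
 * failure caused by the race is cured by rescanning, because the
 * concurrent split or merge left the memory on some other level; a
 * failure that persists across all tries is treated as out-of-memory.
 */
int toy_alloc_retry(struct toy_pool *p, int want_level)
{
	for (int tries = 0; tries < ALLOC_MAX_TRIES; tries++) {
		int level = toy_alloc(p, want_level);

		if (level >= 0) {
			return level;
		}
	}
	return -1;
}
```

As noted above, only contended callers pay for the extra passes; the worst-case latency of a single attempt is unchanged.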

@pdunaj
Collaborator Author

pdunaj commented Mar 14, 2019

@andyross under the for loop you have block_alloc, which first checks if any block is available at the level. Only the last block_alloc is expensive.
Token check and retry is unbounded. You have no guarantee when a started allocation will finish. Also, you cannot simply bail out when memory is available in the system (although I agree the probability of this is low, I don't agree it is a legitimate failure; e.g. a low-priority background task can be preempted by everything in the system, increasing the probability of the failure). In the end this will bring more problems than it solves.

I would say what I said before: nobody would like to use a very fast tool that gives incorrect results.

@andyross
Collaborator

andyross commented Mar 14, 2019

I would say what I said before: nobody would like to use a very fast tool that gives incorrect results.

Please please please don't start fights. I'm trying to understand your patch, I promise. Calm down. I can't merge what I don't understand.

Please split the patch. I see at least three separate issues (I think):

  1. There's a wrong return value from z_sys_mem_pool_block_alloc() when it exhausts the levels but data is still NULL.
  2. There's a race you're fixing
  3. You're moving comments around and making formatting changes to unrelated code.

Numbers 1. and 3. seem like easy things to review and approve, making the discussion about #2 a little more obvious.

@rljordan-zz added the priority: medium label on Mar 15, 2019
@pdunaj
Collaborator Author

pdunaj commented Mar 15, 2019

don't start fights

This was not my intention. I will try to be more careful with comments next time.

Please split the patch.

I will.
I will also try to benchmark the influence of the change and put some numbers. When I got back home I realized this was the thing I should do in the first place.

Be back later today.
