
Commit 3812c8c

hnaz authored and torvalds committed
mm: memcg: do not trap chargers with full callstack on OOM
The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Comparably, any other task
that enters the charge path at this point will go to a waitqueue right
then and there and sleep until the OOM situation is resolved.  The
problem is that these tasks may hold filesystem locks and the mmap_sem;
locks that the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer was
about to charge a page cache page during a write(), which holds the
i_mutex.  The OOM killer selected a task that was just entering
truncate() and trying to acquire the i_mutex:

OOM invoking task:
  mem_cgroup_handle_oom+0x241/0x3b0
  mem_cgroup_cache_charge+0xbe/0xe0
  add_to_page_cache_locked+0x4c/0x140
  add_to_page_cache_lru+0x22/0x50
  grab_cache_page_write_begin+0x8b/0xe0
  ext3_write_begin+0x88/0x270
  generic_file_buffered_write+0x116/0x290
  __generic_file_aio_write+0x27c/0x480
  generic_file_aio_write+0x76/0xf0            # takes ->i_mutex
  do_sync_write+0xea/0x130
  vfs_write+0xf3/0x1f0
  sys_write+0x51/0x90
  system_call_fastpath+0x18/0x1d

OOM kill victim:
  do_truncate+0x58/0xa0                       # takes i_mutex
  do_last+0x250/0xa30
  path_openat+0xd7/0x440
  do_filp_open+0x49/0xa0
  do_sys_open+0x106/0x240
  sys_open+0x20/0x30
  system_call_fastpath+0x18/0x1d

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg is
disabled and a userspace task is in charge of resolving OOM situations.
In this case, ALL tasks that enter the OOM path will be made to sleep on
the OOM waitqueue and wait for userspace to free resources or increase
the group's limit.  But a userspace OOM handler is prone to deadlock
itself on the locks held by the waiting tasks.  For example, one of the
sleeping tasks may be stuck in a brk() call with the mmap_sem held for
writing, but the userspace handler, in order to pick an optimal victim,
may need to read files from /proc/<pid>, which tries to acquire the same
mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM and
makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
   fault instead of looping on the charge attempt.  This way, the OOM
   victim can not get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it (either
   the kernel OOM killer or a userspace handler), don't go to sleep in
   the charge context.  Instead, remember the OOMing memcg in the task
   struct and then fully unwind the page fault stack with -ENOMEM.
   pagefault_out_of_memory() will then call back into the memcg code to
   check if the -ENOMEM came from the memcg, and then either put the
   task to sleep on the memcg's OOM waitqueue or just restart the fault.
   The OOM victim can no longer get stuck on any lock a sleeping task
   may hold.

Debugged by Michal Hocko.

Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: azurIt <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
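[Editor's note] To make the two-step scheme above concrete, here is a
minimal userspace C model of the protocol: the charge path records the
OOM state in the task and unwinds with -ENOMEM, and only after the
fault stack (and every lock on it) is gone does the task sleep or
restart.  All types, names, and the free_charges bookkeeping below are
illustrative stand-ins, not kernel code:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the kernel structures; illustrative only. */
struct memcg { int id; };

struct task {
	bool may_oom;            /* memcg OOM handling enabled for this fault */
	bool in_memcg_oom;       /* a memcg OOM was hit during the charge */
	struct memcg *wait_on;   /* set when someone else is resolving it */
};

static struct task current_task = { .may_oom = true };
static int free_charges;         /* models memory freed by the OOM kill */

/* Charge path: never loops or sleeps; records state and unwinds. */
static int try_charge(struct memcg *memcg)
{
	if (free_charges > 0) {
		free_charges--;
		return 0;                    /* charge succeeded */
	}
	if (!current_task.may_oom)
		return -1;                   /* plain -ENOMEM, no OOM handling */
	current_task.in_memcg_oom = true;
	current_task.wait_on = memcg;        /* someone else handles the OOM */
	return -1;                           /* unwind the stack with -ENOMEM */
}

/* Runs only after the fault stack (and all its locks) is unwound. */
static bool fault_oom_synchronize(void)
{
	if (!current_task.in_memcg_oom)
		return false;                /* global OOM, not ours */
	if (current_task.wait_on) {
		printf("sleeping on memcg %d waitqueue with no locks held\n",
		       current_task.wait_on->id);
		free_charges = 1;            /* model: the OOM kill frees memory */
	}
	current_task.in_memcg_oom = false;
	current_task.wait_on = NULL;
	return true;                         /* tell the caller to restart */
}

int main(void)
{
	struct memcg m = { .id = 1 };

	while (try_charge(&m) != 0) {        /* the "page fault" */
		if (!fault_oom_synchronize())
			break;               /* real -ENOMEM, give up */
		puts("restarting fault");
	}
	puts("fault completed");
	return 0;
}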
1 parent fb2a6fc commit 3812c8c

File tree

5 files changed: +140 / -49 lines changed


include/linux/memcontrol.h

Lines changed: 21 additions & 0 deletions
@@ -157,6 +157,10 @@ extern void mem_cgroup_replace_page_cache(struct page *oldpage,
  *
  * Toggle whether a failed memcg charge should invoke the OOM killer
  * or just return -ENOMEM. Returns the previous toggle state.
+ *
+ * NOTE: Any path that enables the OOM killer before charging must
+ * call mem_cgroup_oom_synchronize() afterward to finalize the
+ * OOM handling and clean up.
  */
 static inline bool mem_cgroup_toggle_oom(bool new)
 {
@@ -182,6 +186,13 @@ static inline void mem_cgroup_disable_oom(void)
 	WARN_ON(old == false);
 }
 
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+	return p->memcg_oom.in_memcg_oom;
+}
+
+bool mem_cgroup_oom_synchronize(void);
+
 #ifdef CONFIG_MEMCG_SWAP
 extern int do_swap_account;
 #endif
@@ -427,6 +438,16 @@ static inline void mem_cgroup_disable_oom(void)
 {
 }
 
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+	return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+	return false;
+}
+
 static inline void mem_cgroup_inc_page_stat(struct page *page,
 					    enum mem_cgroup_page_stat_item idx)
 {

include/linux/sched.h

Lines changed: 4 additions & 0 deletions
@@ -1395,6 +1395,10 @@ struct task_struct {
 	unsigned int memcg_kmem_skip_account;
 	struct memcg_oom_info {
 		unsigned int may_oom:1;
+		unsigned int in_memcg_oom:1;
+		unsigned int oom_locked:1;
+		int wakeups;
+		struct mem_cgroup *wait_on_memcg;
 	} memcg_oom;
 #endif
 #ifdef CONFIG_UPROBES

mm/memcontrol.c

Lines changed: 107 additions & 47 deletions
@@ -255,6 +255,7 @@ struct mem_cgroup {
 
 	bool oom_lock;
 	atomic_t under_oom;
+	atomic_t oom_wakeups;
 
 	int swappiness;
 	/* OOM-Killer disable */
@@ -2020,6 +2021,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,
 
 static void memcg_wakeup_oom(struct mem_cgroup *memcg)
 {
+	atomic_inc(&memcg->oom_wakeups);
 	/* for filtering, pass "memcg" as argument. */
 	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
@@ -2031,32 +2033,26 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 }
 
 /*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
  */
-static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
-				  int order)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	struct oom_wait_info owait;
 	bool locked;
+	int wakeups;
 
-	owait.memcg = memcg;
-	owait.wait.flags = 0;
-	owait.wait.func = memcg_oom_wake_function;
-	owait.wait.private = current;
-	INIT_LIST_HEAD(&owait.wait.task_list);
+	if (!current->memcg_oom.may_oom)
+		return;
+
+	current->memcg_oom.in_memcg_oom = 1;
 
 	/*
 	 * As with any blocking lock, a contender needs to start
 	 * listening for wakeups before attempting the trylock,
 	 * otherwise it can miss the wakeup from the unlock and sleep
 	 * indefinitely. This is just open-coded because our locking
 	 * is so particular to memcg hierarchies.
-	 *
-	 * Even if signal_pending(), we can't quit charge() loop without
-	 * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
-	 * under OOM is always welcomed, use TASK_KILLABLE here.
 	 */
-	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+	wakeups = atomic_read(&memcg->oom_wakeups);
 	mem_cgroup_mark_under_oom(memcg);
 
 	locked = mem_cgroup_oom_trylock(memcg);
@@ -2066,15 +2062,95 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
 
 	if (locked && !memcg->oom_kill_disable) {
 		mem_cgroup_unmark_under_oom(memcg);
-		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, mask, order);
+		mem_cgroup_oom_unlock(memcg);
+		/*
+		 * There is no guarantee that an OOM-lock contender
+		 * sees the wakeups triggered by the OOM kill
+		 * uncharges. Wake any sleepers explicitly.
+		 */
+		memcg_oom_recover(memcg);
 	} else {
-		schedule();
-		mem_cgroup_unmark_under_oom(memcg);
-		finish_wait(&memcg_oom_waitq, &owait.wait);
+		/*
+		 * A system call can just return -ENOMEM, but if this
+		 * is a page fault and somebody else is handling the
+		 * OOM already, we need to sleep on the OOM waitqueue
+		 * for this memcg until the situation is resolved.
+		 * Which can take some time because it might be
+		 * handled by a userspace task.
+		 *
+		 * However, this is the charge context, which means
+		 * that we may sit on a large call stack and hold
+		 * various filesystem locks, the mmap_sem etc. and we
+		 * don't want the OOM handler to deadlock on them
+		 * while we sit here and wait. Store the current OOM
+		 * context in the task_struct, then return -ENOMEM.
+		 * At the end of the page fault handler, with the
+		 * stack unwound, pagefault_out_of_memory() will check
+		 * back with us by calling
+		 * mem_cgroup_oom_synchronize(), possibly putting the
+		 * task to sleep.
+		 */
+		current->memcg_oom.oom_locked = locked;
+		current->memcg_oom.wakeups = wakeups;
+		css_get(&memcg->css);
+		current->memcg_oom.wait_on_memcg = memcg;
 	}
+}
+
+/**
+ * mem_cgroup_oom_synchronize - complete memcg OOM handling
+ *
+ * This has to be called at the end of a page fault if the memcg
+ * OOM handler was enabled and the fault is returning %VM_FAULT_OOM.
+ *
+ * Memcg supports userspace OOM handling, so failed allocations must
+ * sleep on a waitqueue until the userspace task resolves the
+ * situation. Sleeping directly in the charge context with all kinds
+ * of locks held is not a good idea, instead we remember an OOM state
+ * in the task and mem_cgroup_oom_synchronize() has to be called at
+ * the end of the page fault to put the task to sleep and clean up the
+ * OOM state.
+ *
+ * Returns %true if an ongoing memcg OOM situation was detected and
+ * finalized, %false otherwise.
+ */
+bool mem_cgroup_oom_synchronize(void)
+{
+	struct oom_wait_info owait;
+	struct mem_cgroup *memcg;
+
+	/* OOM is global, do not handle */
+	if (!current->memcg_oom.in_memcg_oom)
+		return false;
+
+	/*
+	 * We invoked the OOM killer but there is a chance that a kill
+	 * did not free up any charges. Everybody else might already
+	 * be sleeping, so restart the fault and keep the rampage
+	 * going until some charges are released.
+	 */
+	memcg = current->memcg_oom.wait_on_memcg;
+	if (!memcg)
+		goto out;
+
+	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+		goto out_memcg;
+
+	owait.memcg = memcg;
+	owait.wait.flags = 0;
+	owait.wait.func = memcg_oom_wake_function;
+	owait.wait.private = current;
+	INIT_LIST_HEAD(&owait.wait.task_list);
 
-	if (locked) {
+	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+	/* Only sleep if we didn't miss any wakeups since OOM */
+	if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+		schedule();
+	finish_wait(&memcg_oom_waitq, &owait.wait);
+out_memcg:
+	mem_cgroup_unmark_under_oom(memcg);
+	if (current->memcg_oom.oom_locked) {
 		mem_cgroup_oom_unlock(memcg);
 		/*
 		 * There is no guarantee that an OOM-lock contender
@@ -2083,11 +2159,10 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
 		 */
 		memcg_oom_recover(memcg);
 	}
-
-	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
-		return false;
-	/* Give chance to dying process */
-	schedule_timeout_uninterruptible(1);
+	css_put(&memcg->css);
+	current->memcg_oom.wait_on_memcg = NULL;
+out:
+	current->memcg_oom.in_memcg_oom = 0;
 	return true;
 }
 
@@ -2400,12 +2475,11 @@ enum {
 	CHARGE_RETRY,		/* need to retry but retry is not bad */
 	CHARGE_NOMEM,		/* we can't do more. return -ENOMEM */
 	CHARGE_WOULDBLOCK,	/* GFP_WAIT wasn't set and no enough res. */
-	CHARGE_OOM_DIE,		/* the current is killed because of OOM */
 };
 
 static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 				unsigned int nr_pages, unsigned int min_pages,
-				bool oom_check)
+				bool invoke_oom)
 {
 	unsigned long csize = nr_pages * PAGE_SIZE;
 	struct mem_cgroup *mem_over_limit;
@@ -2462,14 +2536,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (mem_cgroup_wait_acct_move(mem_over_limit))
 		return CHARGE_RETRY;
 
-	/* If we don't need to call oom-killer at el, return immediately */
-	if (!oom_check || !current->memcg_oom.may_oom)
-		return CHARGE_NOMEM;
-	/* check OOM */
-	if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
-		return CHARGE_OOM_DIE;
+	if (invoke_oom)
+		mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));
 
-	return CHARGE_RETRY;
+	return CHARGE_NOMEM;
 }
 
 /*
@@ -2572,22 +2642,16 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 	}
 
 	do {
-		bool oom_check;
+		bool invoke_oom = oom && !nr_oom_retries;
 
 		/* If killed, bypass charge */
 		if (fatal_signal_pending(current)) {
 			css_put(&memcg->css);
 			goto bypass;
 		}
 
-		oom_check = false;
-		if (oom && !nr_oom_retries) {
-			oom_check = true;
-			nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
-		}
-
-		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
-					   oom_check);
+		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
+					   nr_pages, invoke_oom);
 		switch (ret) {
 		case CHARGE_OK:
 			break;
@@ -2600,16 +2664,12 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 			css_put(&memcg->css);
 			goto nomem;
 		case CHARGE_NOMEM: /* OOM routine works */
-			if (!oom) {
+			if (!oom || invoke_oom) {
 				css_put(&memcg->css);
 				goto nomem;
 			}
-			/* If oom, we never return -ENOMEM */
 			nr_oom_retries--;
 			break;
-		case CHARGE_OOM_DIE: /* Killed by OOM Killer */
-			css_put(&memcg->css);
-			goto bypass;
 		}
 	} while (ret != CHARGE_OK);

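[Editor's note] The mem_cgroup_oom()/mem_cgroup_oom_synchronize() pair
above splits a classic wait across two contexts: snapshot
memcg->oom_wakeups before publishing the OOM state, and later go to
sleep only if the counter is still unchanged, so a wakeup that fires in
between is never lost.  Below is a runnable pthread model of that
lost-wakeup guard (assumptions: a condition variable stands in for the
kernel waitqueue, a mutex for the atomics; all names are illustrative):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int wakeups;                     /* plays memcg->oom_wakeups */

static void *oom_handler(void *arg)
{
	(void)arg;
	sleep(1);                       /* pretend to free charges */
	pthread_mutex_lock(&lock);
	wakeups++;                      /* models memcg_wakeup_oom() */
	pthread_cond_broadcast(&cond);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t handler;
	int snapshot;

	/* mem_cgroup_oom(): snapshot before publishing the OOM state */
	pthread_mutex_lock(&lock);
	snapshot = wakeups;
	pthread_mutex_unlock(&lock);

	pthread_create(&handler, NULL, oom_handler, NULL);

	/* ...unwind the fault stack, dropping all locks... */

	/* mem_cgroup_oom_synchronize(): sleep only if nothing was missed */
	pthread_mutex_lock(&lock);
	while (wakeups == snapshot)
		pthread_cond_wait(&cond, &lock);
	pthread_mutex_unlock(&lock);

	puts("woken up: OOM situation resolved");
	pthread_join(handler, NULL);
	return 0;
}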
mm/memory.c

Lines changed: 3 additions & 0 deletions
@@ -3867,6 +3867,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (flags & FAULT_FLAG_USER)
 		mem_cgroup_disable_oom();
 
+	if (WARN_ON(task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM)))
+		mem_cgroup_oom_synchronize();
+
 	return ret;
 }
 

mm/oom_kill.c

Lines changed: 5 additions & 2 deletions
@@ -678,9 +678,12 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
  */
 void pagefault_out_of_memory(void)
 {
-	struct zonelist *zonelist = node_zonelist(first_online_node,
-						  GFP_KERNEL);
+	struct zonelist *zonelist;
 
+	if (mem_cgroup_oom_synchronize())
+		return;
+
+	zonelist = node_zonelist(first_online_node, GFP_KERNEL);
 	if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
 		out_of_memory(NULL, 0, 0, NULL, false);
 		clear_zonelist_oom(zonelist, GFP_KERNEL);

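[Editor's note] For orientation, a small model (plain C, not kernel
code; the *_model names are hypothetical) of how VM_FAULT_OOM from
handle_mm_fault() reaches the patched pagefault_out_of_memory(), which
now lets the memcg finish its OOM handling before falling back to the
global OOM killer:

#include <stdbool.h>
#include <stdio.h>

#define VM_FAULT_OOM 0x0001    /* flag value illustrative */

/* Models mem_cgroup_oom_synchronize(): true if the -ENOMEM was a memcg OOM. */
static bool mem_cgroup_oom_synchronize_model(void)
{
	return true;               /* assume a memcg-confined fault for the demo */
}

/* Models the patched pagefault_out_of_memory(). */
static void pagefault_out_of_memory_model(void)
{
	if (mem_cgroup_oom_synchronize_model())
		return;            /* memcg case: slept or will restart the fault */
	puts("global OOM: invoking the system-wide OOM killer");
}

int main(void)
{
	unsigned int fault = VM_FAULT_OOM;   /* as returned by handle_mm_fault() */

	if (fault & VM_FAULT_OOM)
		pagefault_out_of_memory_model();
	puts("fault path done");
	return 0;
}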