# cgroup v2 OOM group

One annoying issue with setting a memory limit for a container is that the kernel OOM killer can leave the container in an inconsistent state, with only some of its processes terminated.

When the system or the cgroup runs out of memory, the OOM killer is triggered and the kernel will try to free some memory.

The kernel iterates over the candidate processes to terminate: every process on the host when the whole system is out of memory, or only the processes in the cgroup when the OOM is local to that cgroup. For each candidate it calculates a badness score and then kills the process with the highest score.

The badness heuristic has changed a few times; in its current form it takes into account how much memory the process uses, whether the process may be killed at all, and adjusts the score by a per-process value configurable from user space (`/proc/<pid>/oom_score_adj`).
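The selection logic can be sketched as follows. This is a simplified, illustrative Python model of the kernel's heuristic (`oom_badness()` in `mm/oom_kill.c`), not the kernel code itself: the function names, the candidate representation, and the clamping are assumptions made for the sketch.

```python
OOM_SCORE_ADJ_MIN = -1000  # a process with this adjustment is never killed

def badness(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Return a badness score (higher = more likely to be killed),
    or None when the process is exempt from OOM killing."""
    if oom_score_adj == OOM_SCORE_ADJ_MIN:
        return None  # marked unkillable from user space
    # Base score: the memory footprint of the process, in pages.
    points = rss_pages + swap_pages + pgtable_pages
    # The user-space adjustment is scaled to the size of the allowed memory.
    points += oom_score_adj * total_pages // 1000
    return max(points, 1)

def pick_victim(candidates, total_pages):
    """Iterate over the candidates and return the pid with the highest score."""
    victim, best = None, 0
    for pid, info in candidates.items():
        score = badness(info["rss"], info["swap"], info["pgtables"],
                        info["oom_score_adj"], total_pages)
        if score is not None and score > best:
            victim, best = pid, score
    return victim
```

Note how a process with `oom_score_adj` set to -1000 is skipped entirely, while a positive adjustment inflates the score in proportion to the total memory available to the candidates.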

The OOM killer works in the same way whether the entire system is running low on memory or a memory cgroup limit has been reached. The difference is in the set of processes considered for termination.

If the cgroup has reached its memory limit, only one process is terminated. In most cases this leaves the container in an inconsistent state, with the remaining processes still running.

A new knob was added for cgroup v2 with the patch:

```
commit 3d8b38eb81cac81395f6a823f6bf401b327268e6
Author: Roman Gushchin <[email protected]>
Date:   Tue Aug 21 21:53:54 2018 -0700

    mm, oom: introduce memory.oom.group

    For some workloads an intervention from the OOM killer can be painful.
    Killing a random task can bring the workload into an inconsistent state.

    ....
```


If memory.oom.group is set to 1, an OOM kill of any process in the cgroup terminates the entire cgroup as an indivisible unit.

Unfortunately OCI containers cannot take advantage of this feature yet, as there is no way to specify the setting in the current version of the OCI runtime specs.

The memory.oom.group setting can be specified at any level in the cgroup hierarchy.
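Since memory.oom.group is just a file in cgroupfs, enabling it is a single write. A minimal sketch in Python, assuming a cgroup v2 hierarchy mounted at the usual `/sys/fs/cgroup` location (the helper name and the example path are illustrative):

```python
from pathlib import Path

def enable_oom_group(cgroup_dir):
    """Write "1" to memory.oom.group so that an OOM kill in this cgroup
    (or any of its descendants) takes down the whole group."""
    Path(cgroup_dir, "memory.oom.group").write_text("1\n")

# Example (requires root and a cgroup v2 hierarchy):
# enable_oom_group("/sys/fs/cgroup/mycontainer")
```

Setting it on an intermediate node applies the all-or-nothing behavior to the whole subtree below it, which is why the knob is useful at any level of the hierarchy.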