cgroup v2 OOM group


One annoying issue with setting a memory limit for a container is that the OOM killer can leave the container in an inconsistent state, with only some of its processes terminated.

When the system or the cgroup runs out of memory, the OOM killer is triggered and the kernel will try to free some memory.

The kernel iterates over the candidate processes: every process on the host when the whole system is out of memory, or only the processes in the cgroup when the OOM is local to a cgroup. For each candidate it calculates a badness score and then kills the process with the highest score.

The badness heuristic has changed a few times; in its current form it takes into account how much memory the process uses, whether the process is killable, and adjusts the score by a value that can be set from user space (oom_score_adj).
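As a concrete illustration of the user-space knob, on a Linux system with procfs mounted the score and its adjustment can be inspected per process; raising the adjustment is unprivileged, lowering it requires CAP_SYS_RESOURCE:

```shell
# Read the badness score the OOM killer currently computes for this shell.
cat /proc/self/oom_score

# Bias the score: the adjustment is added to the heuristic, so +1000 makes
# the process the preferred victim and -1000 makes it effectively unkillable.
echo 500 > /proc/self/oom_score_adj
cat /proc/self/oom_score_adj
```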

The OOM killer works in a similar way whether the entire system is running low on memory or a memory cgroup limit is being exceeded. The difference is in the set of processes considered for termination.

If the cgroup has reached its memory limit, only one process is terminated. In most cases this leaves the container in an inconsistent state, with the remaining processes still running.
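To observe the default behavior, a cgroup with a memory limit can be set up as follows. This is a privileged sketch; the group name demo and the mount point /sys/fs/cgroup are assumptions:

```shell
# Create a cgroup v2 group with a 64 MiB hard limit (requires root).
mkdir /sys/fs/cgroup/demo
echo "64M" > /sys/fs/cgroup/demo/memory.max

# Move the current shell (and thus its children) into the group.
echo $$ > /sys/fs/cgroup/demo/cgroup.procs

# When the group exceeds 64 MiB, the OOM killer picks the single
# worst-scoring process in the group and kills only that one,
# leaving any sibling processes running.
```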

A new knob was added for cgroup v2 with the patch:

commit 3d8b38eb81cac81395f6a823f6bf401b327268e6
Author: Roman Gushchin <[email protected]>
Date:   Tue Aug 21 21:53:54 2018 -0700

    mm, oom: introduce memory.oom.group

    For some workloads an intervention from the OOM killer can be painful.
    Killing a random task can bring the workload into an inconsistent state.

    ....

If memory.oom.group is set, the entire cgroup is killed as an indivisible unit.
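Assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup and a group named demo (both hypothetical), the knob is a single write, requiring root and a kernel with the patch above (4.19+):

```shell
# Enable group OOM kill: on an OOM event in this cgroup,
# every process in the group is killed as an indivisible unit.
echo 1 > /sys/fs/cgroup/demo/memory.oom.group
cat /sys/fs/cgroup/demo/memory.oom.group
```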

Unfortunately OCI containers cannot take advantage of this feature yet, as there is no way to specify the setting in the current version of the OCI runtime specs.

Adoption in OCI containers

The discussion for adding cgroup v2 support to the runtime specs is still under review: runtime-specs cgroup v2 support

Once that lands, we can extend the container runtimes to set this configuration when it is the desired behavior.

The memory.oom.group setting can be specified at any level in the cgroup hierarchy.

In the Kubernetes world, we could support both a per-container and a per-pod OOM group mode. In the per-container mode, only the processes of a single container are terminated on OOM. If instead the setting is configured for the pod, an OOM event terminates the entire pod without leaving any process behind.
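A sketch of the two modes, using hypothetical cgroup paths for a pod with two containers:

```shell
# Hypothetical pod layout under cgroup v2:
#   /sys/fs/cgroup/pod42/        <- pod cgroup
#   /sys/fs/cgroup/pod42/ctr-a/  <- container A
#   /sys/fs/cgroup/pod42/ctr-b/  <- container B

# Per-container mode: an OOM in ctr-a kills all of ctr-a; ctr-b survives.
echo 1 > /sys/fs/cgroup/pod42/ctr-a/memory.oom.group

# Per-pod mode: set the knob at the pod level instead, so an OOM anywhere
# under pod42 takes down every process in the whole pod.
echo 1 > /sys/fs/cgroup/pod42/memory.oom.group
```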

The main difficulty with the per-pod configuration is that the shim processes that usually run in the pod cgroup must be moved somewhere else, otherwise they too will be terminated as part of the OOM kill.
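The workaround can be sketched as follows, with a hypothetical sibling cgroup for shims and a hypothetical $SHIM_PID:

```shell
# Move the shim out of the pod cgroup so it does not die with the pod.
# /sys/fs/cgroup/shims is an assumed cgroup reserved for shim processes.
mkdir -p /sys/fs/cgroup/shims
echo "$SHIM_PID" > /sys/fs/cgroup/shims/cgroup.procs
```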