I have been working on a new functionality for the prctl syscall that addresses a common security concern with container runtimes. The /proc/self/exe symlink, which points to the executable of the running process, was the key ingredient in CVE-2019-5736, a vulnerability that allowed a malicious container to overwrite the container runtime binary on the host. The workaround deployed at the time — re-execing from a copy or using a read-only bind mount — treats the symptom rather than the cause.
Posts for: #Kernel
The journey to speed up running OCI containers
When I started working on crun, I was looking at a faster way to start up and stop containers by improving the OCI runtime, the component in the OCI stack that is responsible for talking to the kernel and setting up the environment where the container runs. Over roughly five years, a combination of kernel patches and userspace fixes reduced the time to start and stop a container from around 160 ms to just over 5 ms — nearly a 30x improvement — through targeted work on network namespace teardown, mqueue mount overhead, IPC namespace cleanup, and seccomp profile compilation.
An interesting issue handling the seccomp listener
A bug report filed against crun a few days ago exposed a deadlock: under certain seccomp profiles, the runtime would hang indefinitely before the container process ever started. The root cause is a subtle sequencing problem between installing a seccomp filter that intercepts a syscall and then immediately using that same syscall to hand off the resulting listener file descriptor to the userspace handler — the very handler that has not yet received the descriptor it needs to process the interception.
Cgroup v2 OOM group
One annoying issue with setting a memory limit for a container is that the OOM killer can leave the container in an inconsistent state with only some of its processes terminated. When a cgroup hits its memory limit, the kernel selects a single process to kill based on a badness score, not all the processes in the cgroup. This means that a multi-process container — for example, one running a web server and several worker processes — may continue running in a broken state after the OOM event rather than being cleanly torn down.
Avoid a memory page allocation on mount(2)
While working on crun, I got surprised by how much time the kernel spent in the copy_mount_options function. A container runtime issues a large number of mount(2) syscalls during startup — bind mounts, proc, sysfs, devtmpfs, and more — many of them with no extra options to pass. It turned out that passing an empty string instead of NULL for the data argument caused the kernel to allocate a full memory page and attempt a copy from user space on every one of those calls, adding measurable overhead.