When I started working on crun, I was looking at a faster way to start up and stop containers by improving the OCI runtime, the component in the OCI stack that is responsible for talking to the kernel and setting up the environment where the container runs. Over roughly five years, a combination of kernel patches and userspace fixes reduced the time to start and stop a container from around 160 ms to just over 5 ms — nearly a 30x improvement — through targeted work on network namespace teardown, mqueue mount overhead, IPC namespace cleanup, and seccomp profile compilation.
Posts for: #Performance
Avoid a memory page allocation on mount(2)
While working on crun, I got surprised by how much time the kernel spent in the copy_mount_options function. A container runtime issues a large number of mount(2) syscalls during startup — bind mounts, proc, sysfs, devtmpfs, and more — many of them with no extra options to pass. It turned out that passing an empty string instead of NULL for the data argument caused the kernel to allocate a full memory page and attempt a copy from user space on every one of those calls, adding measurable overhead.
C is a better fit for tools like an OCI runtime
I’ve spent some of the last weeks working on a replacement for runC, the most used/known OCI runtime for running containers. It might not be very well known, but it is a key component for running containers. Every Docker container ultimately runs through runC. The OCI runtime is the thin layer between the container engine and the kernel: it reads a JSON configuration file, creates the necessary namespaces and cgroups, sets up mounts and capabilities, and finally execs the container process. Because it runs for such a short time and its workload is almost entirely syscalls, the implementation language matters for startup latency.