Locking composefs images with flock

2026-07-10 Giuseppe Scrivano

For the last couple of weeks, I’ve been working on garbage collection for composefs repositories. A composefs repository is a content-addressed object store, and it needs a garbage collector to reclaim space from images that are no longer referenced. The hard part is deciding what “referenced” actually means when an image has no ref in the repository but is mounted somewhere, possibly from another process, possibly in a different mount namespace, possibly by the running system itself. In composefs-rs#346, it was suggested to use flock() on the EROFS backing file. Mounters take a shared lock and the GC probes with a non-blocking exclusive one: if the exclusive lock cannot be acquired, the image is still in use.
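The locking scheme can be sketched in a few lines of C. This is only an illustration of the idea, not the composefs-rs implementation: the function names `image_lock_shared` and `image_in_use` are hypothetical, and the sketch assumes the mounter keeps the returned file descriptor open for as long as the image is mounted.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Mounter side (hypothetical helper): open the backing file and take a
 * shared lock. The returned fd must stay open while the image is in use,
 * because the lock is released when the open file description is closed.
 * Returns the fd, or -1 with errno set. */
int image_lock_shared(const char *path)
{
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_SH | LOCK_NB) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* GC side (hypothetical helper): probe with a non-blocking exclusive lock.
 * Returns 1 if someone holds a lock (image in use), 0 if the exclusive
 * lock could be taken (image unused), -1 on any other error. */
int image_in_use(const char *path)
{
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    int ret;
    if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
        ret = 0;
        flock(fd, LOCK_UN);
    } else if (errno == EWOULDBLOCK) {
        ret = 1;
    } else {
        ret = -1;
    }
    close(fd);
    return ret;
}
```

Note that this probe releases the exclusive lock before returning, so a mounter could grab the image right afterwards. A real collector would presumably keep the exclusive lock held while it deletes the object.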


Hide the current process executable file

2022-12-21 Giuseppe Scrivano

#kernel #security #prctl #cve

I have been working on a new feature for the prctl syscall that addresses a common security concern with container runtimes. The /proc/self/exe symlink, which points to the executable of the running process, was the key ingredient in CVE-2019-5736, a vulnerability that allowed a malicious container to overwrite the container runtime binary on the host. The workarounds deployed at the time (re-executing from a copy of the binary, or using a read-only bind mount) treat the symptom rather than the cause.


The journey to speed up running OCI containers

2022-09-21 Giuseppe Scrivano

#oci #crun #performance #seccomp #kernel

When I started working on crun, I was looking at a faster way to start up and stop containers by improving the OCI runtime, the component in the OCI stack that is responsible for talking to the kernel and setting up the environment where the container runs. Over roughly five years, a combination of kernel patches and userspace fixes reduced the time to start and stop a container from around 160 ms to just over 5 ms — nearly a 30x improvement — through targeted work on network namespace teardown, mqueue mount overhead, IPC namespace cleanup, and seccomp profile compilation.


An interesting issue handling the seccomp listener

2022-09-05 Giuseppe Scrivano

#seccomp #crun #kernel

A bug report filed against crun a few days ago exposed a deadlock: under certain seccomp profiles, the runtime would hang indefinitely before the container process ever started. The root cause is a subtle sequencing problem between installing a seccomp filter that intercepts a syscall and then immediately using that same syscall to hand off the resulting listener file descriptor to the userspace handler — the very handler that has not yet received the descriptor it needs to process the interception.


Cgroup v2 OOM group

2020-08-14 Giuseppe Scrivano

#cgroups v2 #oom #kernel

One annoying issue with setting a memory limit for a container is that the OOM killer can leave the container in an inconsistent state with only some of its processes terminated. When a cgroup hits its memory limit, the kernel selects a single process to kill based on a badness score, not all the processes in the cgroup. This means that a multi-process container — for example, one running a web server and several worker processes — may continue running in a broken state after the OOM event rather than being cleanly torn down.
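The teaser does not name the knob, but the title presumably refers to the cgroup v2 `memory.oom.group` file: writing `1` to it asks the kernel to treat the cgroup as a single unit and kill all of its processes together on OOM. A minimal sketch of enabling it follows. The helper name `cgroup_set_oom_group` is hypothetical, as is any cgroup path passed to it.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write "1" or "0" to <cgroup_dir>/memory.oom.group.
 * cgroup_dir is the cgroup's directory under the cgroup v2 mount, for
 * example something like /sys/fs/cgroup/<name> (an assumed layout).
 * Returns 0 on success, -1 on failure. */
int cgroup_set_oom_group(const char *cgroup_dir, int enable)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/memory.oom.group", cgroup_dir);
    if (n < 0 || (size_t) n >= sizeof(path))
        return -1;

    /* The kernel creates this file; do not create it if it is missing. */
    int fd = open(path, O_WRONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    const char *val = enable ? "1" : "0";
    ssize_t written = write(fd, val, 1);
    close(fd);
    return written == 1 ? 0 : -1;
}
```

With the setting enabled on the container's cgroup, an OOM event should take down the whole container rather than leave it half alive.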


Avoid a memory page allocation on mount(2)

2019-12-27 Giuseppe Scrivano

#mount #kernel #performance #crun

While working on crun, I was surprised by how much time the kernel spent in the copy_mount_options function. A container runtime issues a large number of mount(2) syscalls during startup (bind mounts, proc, sysfs, devtmpfs, and more), many of them with no extra options to pass. It turned out that passing an empty string instead of NULL for the data argument caused the kernel to allocate a full memory page and attempt a copy from user space on every one of those calls, adding measurable overhead.
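On the userspace side, the fix amounts to normalizing an empty options string to NULL before calling mount(2). The sketch below illustrates that idea and is not crun's actual code: `mount_data_or_null` and `do_mount` are hypothetical names.

```c
#include <stddef.h>
#include <sys/mount.h>

/* Hypothetical helper: map a NULL or empty options string to NULL so the
 * kernel has no data argument to copy from user space. */
const void *mount_data_or_null(const char *data)
{
    return (data != NULL && data[0] != '\0') ? data : NULL;
}

/* Hypothetical wrapper around mount(2) that never passes "" as data. */
int do_mount(const char *source, const char *target, const char *fstype,
             unsigned long flags, const char *data)
{
    return mount(source, target, fstype, flags, mount_data_or_null(data));
}
```

A bind mount with no options would then be issued as `do_mount(src, dst, NULL, MS_BIND, "")` and reach the kernel with a NULL data pointer.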


Posts for: #Kernel
