Image sealing with composefs

Composefs achieves whole-filesystem integrity verification through image sealing: a single cryptographic digest authenticates an entire filesystem, covering both file contents and metadata (directory structure, permissions, ownership, symlinks, and xattrs). This goes further than fs-verity alone, which can only verify individual file contents, and avoids the fixed-partition requirement of dm-verity. The mechanism combines an EROFS image for metadata, a content-addressed object store for file data, and overlayfs with verity=require to enforce integrity checks on every file access at the kernel level.

[read more]

GitChronicler: Write commit messages with AI

I started working on GitChronicler mostly to learn how I could integrate AI into my workflow in a way that would actually spare me doing boring stuff, like writing the git commit message. The tool feeds a patch to a language model via the OpenRouter API and gets back a commit message that reflects what the code actually does — saving the mechanical step of describing changes that are already fully visible in the diff, while still leaving the developer in control of what gets committed.

[read more]

Why do I have two /sys/fs/cgroup in my container

It happened a few times in the past that users wonder why they see two /sys/fs/cgroup mounts in their unprivileged container. When working with unprivileged containers in Podman, users often notice two /sys/fs/cgroup mounts if the container is not using a new network namespace. The duplication is not a bug but an intentional consequence of how the kernel handles bind mounts that cross user namespace boundaries, combined with the need to provide the container with a writable cgroup view that is scoped to its own slice.

[read more]

Hide the current process executable file

I have been working on a new functionality for the prctl syscall that addresses a common security concern with container runtimes. The /proc/self/exe symlink, which points to the executable of the running process, was the key ingredient in CVE-2019-5736, a vulnerability that allowed a malicious container to overwrite the container runtime binary on the host. The workaround deployed at the time — re-execing from a copy or using a read-only bind mount — treats the symptom rather than the cause.

[read more]

The journey to speed up running OCI containers

When I started working on crun, I was looking at a faster way to start up and stop containers by improving the OCI runtime, the component in the OCI stack that is responsible for talking to the kernel and setting up the environment where the container runs. Over roughly five years, a combination of kernel patches and userspace fixes reduced the time to start and stop a container from around 160 ms to just over 5 ms — nearly a 30x improvement — through targeted work on network namespace teardown, mqueue mount overhead, IPC namespace cleanup, and seccomp profile compilation.

[read more]

An interesting issue handling the seccomp listener

A bug report filed against crun a few days ago exposed a deadlock: under certain seccomp profiles, the runtime would hang indefinitely before the container process ever started. The root cause is a subtle sequencing problem between installing a seccomp filter that intercepts a syscall and then immediately using that same syscall to hand off the resulting listener file descriptor to the userspace handler — the very handler that has not yet received the descriptor it needs to process the interception.

[read more]

Composefs - a file system for container images

For the last couple of weeks, I’ve been playing on a PoC implementation of a file system for the Linux kernel. The goal is to address a fundamental limitation in how container images are stored: the existing overlay model deduplicates at the layer level, but once you want per-file deduplication — so that identical files across different images share a single copy on disk and in the page cache — the current architecture gets in the way and requires awkward workarounds involving hard links or filesystem-specific reflinks.

[read more]

Seccomp made easy

Seccomp is a kernel feature that restricts what syscalls can be used by a process. The allowed syscalls are described as a BPF program that the kernel evaluates on every syscall entry. While effective, writing and maintaining seccomp profiles in the JSON format expected by OCI runtimes is tedious, and the underlying libseccomp API has surprising constraints — particularly around combining per-argument rules for the same syscall — that make complex policies difficult to express correctly.

Almost every container runs with seccomp enabled to restrict its access to syscalls.

[read more]

Cgroup v2 OOM group

One annoying issue with setting a memory limit for a container is that the OOM killer can leave the container in an inconsistent state with only some of its processes terminated. When a cgroup hits its memory limit, the kernel selects a single process to kill based on a badness score, not all the processes in the cgroup. This means that a multi-process container — for example, one running a web server and several worker processes — may continue running in a broken state after the OOM event rather than being cleanly torn down.

[read more]

Playing with seccomp notifications in the OCI runtime

A couple weekends ago I’ve played with seccomp user notifications and how they can be used in the OCI containers stack. Seccomp user notifications are a Linux kernel feature that lets a privileged monitor process intercept specific syscalls made by a less-privileged container, inspect the arguments, and either emulate the syscall or return an error. This opens up possibilities for safely expanding what unprivileged containers can do — for example, emulating mknod — without granting broad kernel capabilities to the container itself.

Seccomp user notifications are a powerful Linux kernel feature, that delegates syscalls handling to a userland program.

[read more]