Posts for: #Seccomp

The journey to speed up running OCI containers

When I started working on crun, I was looking at a faster way to start up and stop containers by improving the OCI runtime, the component in the OCI stack that is responsible for talking to the kernel and setting up the environment where the container runs. Over roughly five years, a combination of kernel patches and userspace fixes reduced the time to start and stop a container from around 160 ms to just over 5 ms — nearly a 30x improvement — through targeted work on network namespace teardown, mqueue mount overhead, IPC namespace cleanup, and seccomp profile compilation.

[read more]

An interesting issue handling the seccomp listener

A bug report filed against crun a few days ago exposed a deadlock: under certain seccomp profiles, the runtime would hang indefinitely before the container process ever started. The root cause is a subtle sequencing problem between installing a seccomp filter that intercepts a syscall and then immediately using that same syscall to hand off the resulting listener file descriptor to the userspace handler — the very handler that has not yet received the descriptor it needs to process the interception.

[read more]

Seccomp made easy

Seccomp is a kernel feature that restricts what syscalls can be used by a process. The allowed syscalls are described as a BPF program that the kernel evaluates on every syscall entry. While effective, writing and maintaining seccomp profiles in the JSON format expected by OCI runtimes is tedious, and the underlying libseccomp API has surprising constraints — particularly around combining per-argument rules for the same syscall — that make complex policies difficult to express correctly.

Almost every container runs with seccomp enabled to restrict its access to syscalls.

[read more]

Playing with seccomp notifications in the OCI runtime

A couple weekends ago I’ve played with seccomp user notifications and how they can be used in the OCI containers stack. Seccomp user notifications are a Linux kernel feature that lets a privileged monitor process intercept specific syscalls made by a less-privileged container, inspect the arguments, and either emulate the syscall or return an error. This opens up possibilities for safely expanding what unprivileged containers can do — for example, emulating mknod — without granting broad kernel capabilities to the container itself.

Seccomp user notifications are a powerful Linux kernel feature, that delegates syscalls handling to a userland program.

[read more]