A couple of weekends ago I played with seccomp user notifications and how they can be used in the OCI container stack.
Seccomp user notifications are a powerful Linux kernel feature that delegates syscall handling to a userland program.
Conceptually seccomp notifications work in a similar way to FUSE for file systems.
The handler program, which usually runs with higher privileges than the watched process, is notified when certain syscalls are performed. It is then expected to handle the notification and report back the result. The set of syscalls that trigger a notification is specified in the seccomp profile.
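The receive/answer cycle described above can be sketched with the two ioctls the kernel exposes on the listener file descriptor. This is a minimal sketch, not the code used by crun or conmon: the helper names fill_response and handle_one are made up for illustration, and a real handler would first query the structure sizes with SECCOMP_GET_NOTIF_SIZES instead of relying on the compile-time ones.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/seccomp.h>

/* Hypothetical helper: fill a response that makes the intercepted syscall
   fail with `err` (a positive errno value), or return 0 when err is 0. */
void fill_response(const struct seccomp_notif *req,
                   struct seccomp_notif_resp *resp, int err)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = req->id;   /* must match the request being answered */
    resp->val = 0;        /* return value seen by the caller on success */
    resp->error = -err;   /* the kernel expects a negated errno here */
    resp->flags = 0;
}

/* Hypothetical helper: receive one notification from listener_fd and
   answer it with EPERM. Returns 0 on success, -1 on error. */
int handle_one(int listener_fd)
{
    struct seccomp_notif req;
    struct seccomp_notif_resp resp;

    /* The kernel rejects a receive buffer that is not zeroed. */
    memset(&req, 0, sizeof(req));
    if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0)
        return -1;

    fill_response(&req, &resp, EPERM);
    return ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) < 0 ? -1 : 0;
}
```

A real handler would inspect req.data.nr and req.data.args before deciding what to answer, instead of denying everything.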
One interesting use case is to delegate syscalls like mknod from an unprivileged container to a more privileged handler, which can emulate it either by calling mknod itself, if it has enough privileges to do so, or through a bind mount.
If you are interested in more details, I'd suggest taking a look at the great blog post The Seccomp Notifier – New Frontiers in Unprivileged Container Development.
The main difficulty with using this feature is that it requires a daemon-like process to handle these notifications.
In the OCI world, such a daemon is not standardized and every container engine has developed a different (and incompatible) way of monitoring container processes. Podman and CRI-O, for example, use a small C program, conmon, to monitor the container and record its exit status.
This seems like the natural place to handle seccomp notifications, so there is no need to create yet another daemon.
Setting up the seccomp file descriptor
The OCI runtime is ultimately responsible for setting up the seccomp profile for the container, and when this happens, it can also ask the kernel to create the file descriptor where notifications are received.
That is done by setting the
SECCOMP_FILTER_FLAG_NEW_LISTENER flag when the filter is installed.
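As a rough illustration of what the runtime does at this step, the sketch below loads a tiny BPF filter through the raw seccomp syscall and asks for a listener descriptor. It is not crun's code: the names notify_filter and install_notify_filter are hypothetical, the filter intercepts mknodat rather than mknod because mknod is not defined on every architecture, and a production filter should also check seccomp_data.arch.

```c
#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* BPF program: notify userspace for mknodat, allow everything else. */
struct sock_filter notify_filter[] = {
    /* Load the syscall number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* If it is mknodat fall through, otherwise skip one instruction. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mknodat, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Install the filter on the calling process. Returns the listener file
   descriptor on success, -1 on error (errno is set). */
int install_notify_filter(void)
{
    struct sock_fprog prog = {
        .len = sizeof(notify_filter) / sizeof(notify_filter[0]),
        .filter = notify_filter,
    };

    /* Needed to load a filter without CAP_SYS_ADMIN. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
        return -1;

    return (int) syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                         SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

The returned descriptor has to reach the handler process before the filtered process performs an intercepted syscall, otherwise that syscall blocks with nobody to answer it.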
There is already a proposal for adding support for seccomp notifications to the OCI runtime specs: Seccomp userspace notifications PR, so this will likely be implemented by all the OCI runtimes in a compatible way.
Until that happens though, I've added a custom annotation to the crun
OCI runtime for specifying a socket to which the seccomp
notifications file descriptor is sent once it is created. If the annotation
run.oci.seccomp.receiver=PATH or the environment variable
RUN_OCI_SECCOMP_RECEIVER=PATH is set, crun creates the seccomp
listener file descriptor and writes it to the specified path, which is
expected to be a UNIX socket. The idea is that conmon configures the
UNIX socket, sets the
variable, and gets back the seccomp notification file descriptor from crun.
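The hand-off of the descriptor over the UNIX socket can be sketched as follows. This assumes the descriptor travels as SCM_RIGHTS ancillary data, which is the standard way to pass file descriptors between processes over a UNIX socket; the helper names send_fd and recv_fd are hypothetical and are not taken from crun or conmon.

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Properly aligned buffer for one file descriptor of ancillary data. */
union fd_ctl {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Sender side (the runtime in this scenario). Returns 0 or -1. */
int send_fd(int sock, int fd)
{
    char byte = 0;   /* at least one byte of regular data must be sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

/* Receiver side (the monitor). Returns the new descriptor or -1. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET
        || cmsg->cmsg_type != SCM_RIGHTS
        || cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The descriptor number on the receiving side generally differs from the one on the sending side; both refer to the same underlying seccomp listener.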
Setting up the seccomp listener file descriptor is only one half of the
problem. It is also necessary to specify which
syscalls are going to be intercepted, and that is done at a much higher
level in the OCI stack.
Podman and CRI-O maintain a default seccomp profile at
/usr/share/containers/seccomp.json, which can be overridden per each
Each syscall to intercept must be specified by setting its action to
How to handle these notifications?
The most interesting part is how to handle these notifications. There are so many possible ways to handle them that it seemed too difficult to hardcode a specific behavior in either the OCI runtime or conmon. So I've opted for a plugin mechanism that allows users to load different plugins for handling the notifications, taking that responsibility away from the OCI runtime and the conmon program.
The API is still under discussion but currently it looks like:
typedef int (*run_oci_seccomp_notify_start_cb)(void **opaque, struct libcrun_load_seccomp_notify_conf_s *conf, size_t size_configuration);
/* Try to handle a single request. It MUST be defined.
HANDLED specifies how the request was handled by the plugin:
0: not handled, try next plugin or return ENOTSUP if it is the last plugin.
RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE: sresp filled and ready to be notified to seccomp.
RUN_OCI_SECCOMP_NOTIFY_HANDLE_DELAYED_RESPONSE: the notification will be handled internally by the plugin and forwarded to seccomp_fd. It is useful for asynchronous handling. */
typedef int (*run_oci_seccomp_notify_handle_request_cb)(void *opaque, struct seccomp_notif_sizes *sizes, struct seccomp_notif *sreq, struct seccomp_notif_resp *sresp, int seccomp_fd, int *handled);
/* Stop the plugin. The opaque value is the return value from run_oci_seccomp_notify_start. */
typedef int (*run_oci_seccomp_notify_stop_cb)(void *opaque);
/* Retrieve the API version used by the plugin. It MUST return 1. */
typedef int (*run_oci_seccomp_notify_plugin_version_cb)();
These methods, exposed by a plugin, are called by the conmon process whenever it receives a seccomp notification.
run_oci_seccomp_notify_start_cb is called at startup and allows
the plugin to do its initial configuration and register an opaque
pointer to maintain its state. The opaque pointer is then passed to every
subsequent call into the plugin.
When a notification is received, the plugin is notified through the
run_oci_seccomp_notify_handle_request_cb callback. The plugin is
expected to set
*handled to one of these possible values:
/* The plugin doesn't know how to handle the request. */
# define RUN_OCI_SECCOMP_NOTIFY_HANDLE_NOT_HANDLED 0
/* The plugin filled the response and it is ready to write. */
# define RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE 1
/* The plugin will handle the request and write directly to the fd. */
# define RUN_OCI_SECCOMP_NOTIFY_HANDLE_DELAYED_RESPONSE 2
/* Specify SECCOMP_USER_NOTIF_FLAG_CONTINUE in the flags. */
# define RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE_AND_CONTINUE 3
that tells the plugins handler how the notification was handled.
If the notification was not handled by the plugin, the plugins handler
will try another plugin and if no plugin was able to satisfy the
request the syscall fails with
Plugins won't have to worry about setting up the seccomp watcher or how to retrieve it from the OCI runtime. Another advantage of such a plugin mechanism is that handlers are not limited to being written in C, as long as they can satisfy the C ABI. In fact, I've added an example plugin fully written in Rust.
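For comparison, a tiny C plugin against this API might look like the sketch below. Several things here are assumptions: the exported symbol names are inferred from the callback typedef names (only run_oci_seccomp_notify_start is named in the API comments), a return value of 0 is taken to mean success, the configuration struct is treated as opaque, and the policy (deny mknodat with EPERM, ignore everything else) and the plugin_state struct are purely illustrative.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* Values as listed in this post. */
#define RUN_OCI_SECCOMP_NOTIFY_HANDLE_NOT_HANDLED 0
#define RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE 1

/* Defined by the host program; not inspected by this sketch. */
struct libcrun_load_seccomp_notify_conf_s;

/* Hypothetical per-plugin state kept behind the opaque pointer. */
struct plugin_state {
    unsigned long denied;
};

int run_oci_seccomp_notify_plugin_version(void)
{
    return 1;
}

int run_oci_seccomp_notify_start(void **opaque,
                                 struct libcrun_load_seccomp_notify_conf_s *conf,
                                 size_t size_configuration)
{
    (void) conf;
    (void) size_configuration;

    struct plugin_state *st = calloc(1, sizeof(*st));
    if (st == NULL)
        return -ENOMEM;
    *opaque = st;
    return 0;
}

int run_oci_seccomp_notify_handle_request(void *opaque,
                                          struct seccomp_notif_sizes *sizes,
                                          struct seccomp_notif *sreq,
                                          struct seccomp_notif_resp *sresp,
                                          int seccomp_fd, int *handled)
{
    struct plugin_state *st = opaque;
    (void) sizes;
    (void) seccomp_fd;

    if (sreq->data.nr != __NR_mknodat) {
        /* Not ours: let the next plugin try. */
        *handled = RUN_OCI_SECCOMP_NOTIFY_HANDLE_NOT_HANDLED;
        return 0;
    }

    memset(sresp, 0, sizeof(*sresp));
    sresp->id = sreq->id;    /* answer this specific request */
    sresp->error = -EPERM;   /* negated errno, as the kernel expects */
    st->denied++;
    *handled = RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE;
    return 0;
}

int run_oci_seccomp_notify_stop(void *opaque)
{
    free(opaque);
    return 0;
}
```

Built as a shared object (for example with cc -shared -fPIC), a file like this is the kind of artifact the loader expects.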
These plugins must be loaded at startup time. That is currently
implemented in two ways, either as an explicit command line argument
--seccomp-notify-plugins=plugin-a.so:plugin-b.so or using the
CONMON_SECCOMP_NOTIFY_PLUGINS environment variable. The env
variable makes it easier to set up a static configuration used
by all containers. The
/etc/containers/containers.conf file has
a setting for overriding environment variables for conmon.
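With more than one plugin loaded, the handler has to walk the list and apply the fallback rules described earlier. The sketch below shows one way that loop could look; it is not conmon's implementation. It assumes plugins are tried in the order they were listed, and the loaded_plugin struct, the dispatch_notification function and the two toy handlers are hypothetical names used only for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <linux/seccomp.h>

#define RUN_OCI_SECCOMP_NOTIFY_HANDLE_NOT_HANDLED 0
#define RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE 1
#define RUN_OCI_SECCOMP_NOTIFY_HANDLE_DELAYED_RESPONSE 2
#define RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE_AND_CONTINUE 3

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
# define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0) /* older kernel headers */
#endif

typedef int (*run_oci_seccomp_notify_handle_request_cb)(
    void *opaque, struct seccomp_notif_sizes *sizes, struct seccomp_notif *sreq,
    struct seccomp_notif_resp *sresp, int seccomp_fd, int *handled);

/* Hypothetical bookkeeping for one loaded plugin. */
struct loaded_plugin {
    void *opaque;
    run_oci_seccomp_notify_handle_request_cb handle_request;
};

/* Returns 1 if sresp must be sent by the caller, 0 if a plugin will answer
   later on its own, or a negative errno if a plugin failed. */
int dispatch_notification(struct loaded_plugin *plugins, size_t n,
                          struct seccomp_notif_sizes *sizes,
                          struct seccomp_notif *sreq,
                          struct seccomp_notif_resp *sresp, int seccomp_fd)
{
    for (size_t i = 0; i < n; i++) {
        int handled = RUN_OCI_SECCOMP_NOTIFY_HANDLE_NOT_HANDLED;
        int ret = plugins[i].handle_request(plugins[i].opaque, sizes, sreq,
                                            sresp, seccomp_fd, &handled);
        if (ret < 0)
            return ret;

        switch (handled) {
        case RUN_OCI_SECCOMP_NOTIFY_HANDLE_NOT_HANDLED:
            continue; /* try the next plugin */
        case RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE:
            return 1;
        case RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE_AND_CONTINUE:
            sresp->flags |= SECCOMP_USER_NOTIF_FLAG_CONTINUE;
            return 1;
        case RUN_OCI_SECCOMP_NOTIFY_HANDLE_DELAYED_RESPONSE:
            return 0;
        default:
            return -EINVAL;
        }
    }

    /* No plugin handled the request: fail the syscall with ENOTSUP. */
    memset(sresp, 0, sizeof(*sresp));
    sresp->id = sreq->id;
    sresp->error = -ENOTSUP;
    return 1;
}

/* Toy handler that never handles anything. */
int toy_decline(void *opaque, struct seccomp_notif_sizes *sizes,
                struct seccomp_notif *sreq, struct seccomp_notif_resp *sresp,
                int seccomp_fd, int *handled)
{
    (void) opaque; (void) sizes; (void) sreq; (void) sresp; (void) seccomp_fd;
    *handled = RUN_OCI_SECCOMP_NOTIFY_HANDLE_NOT_HANDLED;
    return 0;
}

/* Toy handler that denies every request with EPERM. */
int toy_deny(void *opaque, struct seccomp_notif_sizes *sizes,
             struct seccomp_notif *sreq, struct seccomp_notif_resp *sresp,
             int seccomp_fd, int *handled)
{
    (void) opaque; (void) sizes; (void) seccomp_fd;
    memset(sresp, 0, sizeof(*sresp));
    sresp->id = sreq->id;
    sresp->error = -EPERM;
    *handled = RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE;
    return 0;
}
```

When the function returns 1, the caller would write sresp to the listener descriptor with the SECCOMP_IOCTL_NOTIF_SEND ioctl.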
The RUN_OCI_SECCOMP_NOTIFY_HANDLE_DELAYED_RESPONSE mode tells the
plugins manager that the seccomp notification will be handled in an
asynchronous way by the plugin. The plugin is responsible for writing
directly to the seccomp listener file descriptor once the response is
Debugging plugins loaded by conmon can be difficult, since the conmon process forks and runs in the background. Plugins run from that same context.
To facilitate debugging, I've added yet another annotation to crun:
run.oci.seccomp.plugins=PATH, which works in the same way as the
setting conmon uses. The plugins are handled by crun, using the same
API. Since crun doesn't fork itself, at least not the part that waits
for the container process and handles notifications/plugins, it is
much easier to debug plugins there. Another point is that these plugins are
essentially low-level handlers and they can potentially share code
with the crun OCI runtime so they can live in the same code
Standardize the API?
It would be nice if other implementations could use the same API, or standardize on a similar one that conmon and crun could adopt. That would make it easier to share seccomp handlers among different runtimes.
Open Pull Requests