playing with seccomp notifications in the OCI runtime

A couple of weekends ago I played with seccomp user notifications and how they can be used in the OCI container stack.

Seccomp user notifications are a powerful Linux kernel feature that delegates syscall handling to a userland program.

Conceptually seccomp notifications work in a similar way to FUSE for file systems.

The notified program, which usually runs with higher privileges than the watched process, is notified when certain syscalls are performed. It is then expected to handle the notification and report back the result. The set of syscalls that trigger a notification is specified in the seccomp profile.

One interesting use case is to delegate syscalls like mknod from an unprivileged container to a more privileged handler that can emulate them, either by calling mknod itself if it has enough privileges, or through a bind mount.

If you are interested in more details, I'd suggest taking a look at the great blog post The Seccomp Notifier – New Frontiers in Unprivileged Container Development.

The main difficulty with using this feature is that it requires a daemon-like process to handle these notifications.

In the OCI world, such a daemon is not standardized and every container engine has developed a different (and incompatible) way of monitoring container processes. Podman and CRI-O, for example, use a small C program, conmon, to monitor the container and record its exit status.

This seems like the natural place to handle the seccomp notifications, so that there is no need to create yet another daemon.

Setting up the seccomp file descriptor

The OCI runtime is ultimately responsible for setting up the seccomp profile for the container, and when this happens, it can also ask the kernel to create the file descriptor where notifications are received.

That is done by passing the SECCOMP_FILTER_FLAG_NEW_LISTENER flag to the seccomp syscall.
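To make the flag concrete, here is a minimal sketch that builds a tiny BPF filter by hand and installs it with the raw seccomp syscall. The helper names build_notify_filter and install_notify_filter are made up for this example, and a real runtime would normally generate the filter through libseccomp rather than write BPF directly. The sketch also skips the seccomp_data.arch check that any production filter needs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older kernel headers (values from the kernel UAPI).  */
#ifndef SECCOMP_RET_USER_NOTIF
# define SECCOMP_RET_USER_NOTIF 0x7fc00000U
#endif
#ifndef SECCOMP_FILTER_FLAG_NEW_LISTENER
# define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
#endif

#define NOTIFY_FILTER_LEN 4

/* Fill OUT with a BPF program that notifies userspace for syscall NR and
   allows everything else.  Hypothetical helper; the architecture check is
   omitted to keep the sketch short.  */
void
build_notify_filter (int nr, struct sock_filter out[NOTIFY_FILTER_LEN])
{
  assert (out != NULL);
  /* Load the syscall number from struct seccomp_data.  */
  out[0] = (struct sock_filter) BPF_STMT (BPF_LD | BPF_W | BPF_ABS,
                                          offsetof (struct seccomp_data, nr));
  /* On a match fall through to the notify action, otherwise skip it.  */
  out[1] = (struct sock_filter) BPF_JUMP (BPF_JMP | BPF_JEQ | BPF_K, nr, 0, 1);
  out[2] = (struct sock_filter) BPF_STMT (BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF);
  out[3] = (struct sock_filter) BPF_STMT (BPF_RET | BPF_K, SECCOMP_RET_ALLOW);
}

/* Install the filter on the calling thread.  Returns the listener file
   descriptor on success, -1 with errno set on failure.  */
int
install_notify_filter (int nr)
{
  struct sock_filter insns[NOTIFY_FILTER_LEN];
  struct sock_fprog prog = { .len = NOTIFY_FILTER_LEN, .filter = insns };

  build_notify_filter (nr, insns);

  /* Needed unless the caller has CAP_SYS_ADMIN.  */
  if (prctl (PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
    return -1;

  /* With NEW_LISTENER the return value is the notification fd.  */
  return (int) syscall (SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

The file descriptor returned by install_notify_filter is what later has to reach the process that handles the notifications.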

There is already a proposal for adding support for seccomp notifications to the OCI runtime specs: Seccomp userspace notifications PR, so this will likely be implemented by all the OCI runtimes in a compatible way.

Until that happens though, I've added a custom annotation to the crun OCI runtime for specifying a socket to which the seccomp notification file descriptor is sent once it is created. If the annotation run.oci.seccomp.receiver=PATH or the environment variable RUN_OCI_SECCOMP_RECEIVER=PATH is set, crun creates the seccomp listener file descriptor and writes it to the specified path, which is expected to be a UNIX socket. The idea is that conmon configures the UNIX socket, sets the RUN_OCI_SECCOMP_RECEIVER environment variable and gets back the seccomp notification file descriptor from crun.
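Passing a file descriptor between two processes over a UNIX socket is done with an SCM_RIGHTS control message. The sketch below shows both ends of that exchange. The function names send_fd and recv_fd are invented for the example, and I am assuming a connected stream socket; the exact message layout that crun and conmon agree on may differ.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Properly aligned buffer for one control message carrying one int.  */
union fd_ctrl
{
  char buf[CMSG_SPACE (sizeof (int))];
  struct cmsghdr align;
};

/* Send FD over the connected UNIX socket SOCK.  Returns 0 or -1.  */
int
send_fd (int sock, int fd)
{
  char data = 0; /* at least one byte of real data should go with the fd */
  struct iovec iov = { .iov_base = &data, .iov_len = 1 };
  union fd_ctrl ctrl;
  struct msghdr msg;
  struct cmsghdr *cmsg;

  assert (fd >= 0);
  memset (&ctrl, 0, sizeof (ctrl));
  memset (&msg, 0, sizeof (msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl.buf;
  msg.msg_controllen = sizeof (ctrl.buf);

  cmsg = CMSG_FIRSTHDR (&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN (sizeof (int));
  memcpy (CMSG_DATA (cmsg), &fd, sizeof (int));

  return sendmsg (sock, &msg, 0) < 0 ? -1 : 0;
}

/* Receive a file descriptor from SOCK.  Returns the new fd or -1.  */
int
recv_fd (int sock)
{
  char data;
  struct iovec iov = { .iov_base = &data, .iov_len = 1 };
  union fd_ctrl ctrl;
  struct msghdr msg;
  struct cmsghdr *cmsg;
  int fd;

  memset (&ctrl, 0, sizeof (ctrl));
  memset (&msg, 0, sizeof (msg));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctrl.buf;
  msg.msg_controllen = sizeof (ctrl.buf);

  if (recvmsg (sock, &msg, 0) <= 0)
    return -1;

  cmsg = CMSG_FIRSTHDR (&msg);
  if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET
      || cmsg->cmsg_type != SCM_RIGHTS)
    return -1;

  memcpy (&fd, CMSG_DATA (cmsg), sizeof (int));
  return fd;
}
```

In this picture the runtime would play the send_fd side after installing the filter, and the monitor process would call recv_fd on a connection accepted from the socket it created.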

Seccomp profile

Setting up the seccomp listener file descriptor is only one half of the problem. It is also necessary to specify which syscalls are going to be intercepted, and that is done at a much higher level in the OCI stack. Podman and CRI-O maintain a default seccomp profile at /usr/share/containers/seccomp.json, which can be overridden for each container through --security-opt seccomp=/path/to/profile.json.

Each syscall to intercept must be specified by setting its action to SCMP_ACT_NOTIFY.

How to handle these notifications?

The most interesting part is how to handle these notifications. There are so many possible ways to handle them that it seemed too difficult to hardcode a specific behavior either in the OCI runtime or in conmon. So I've opted for a plugin mechanism that allows users to load and use different plugins for handling the notifications, taking that responsibility away from the OCI runtime and the conmon program.

Plugins API

The API is still under discussion but currently it looks like:

typedef int (*run_oci_seccomp_notify_start_cb)(void **opaque, struct libcrun_load_seccomp_notify_conf_s *conf, size_t size_configuration);

/* Try to handle a single request.  It MUST be defined.
   HANDLED specifies how the request was handled by the plugin:
   0: not handled, try next plugin or return ENOTSUP if it is the last plugin.
   RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE: sresp filled and ready to be notified to seccomp.
   RUN_OCI_SECCOMP_NOTIFY_HANDLE_DELAYED_RESPONSE: the notification will be handled internally by the plugin and forwarded to seccomp_fd. It is useful for asynchronous handling.
*/
typedef int (*run_oci_seccomp_notify_handle_request_cb)(void *opaque, struct seccomp_notif_sizes *sizes, struct seccomp_notif *sreq, struct seccomp_notif_resp *sresp, int seccomp_fd, int *handled);

/* Stop the plugin.  The opaque value is the return value from run_oci_seccomp_notify_start.  */
typedef int (*run_oci_seccomp_notify_stop_cb)(void *opaque);

/* Retrieve the API version used by the plugin.  It MUST return 1. */
typedef int (*run_oci_seccomp_notify_plugin_version_cb)();

These methods, exposed by a plugin, are called by the conmon process whenever it receives a seccomp notification.

The run_oci_seccomp_notify_start_cb callback is called at startup and allows the plugin to do its initial configuration and register an opaque pointer to maintain its state. The opaque pointer is passed back to the plugin on every later request.

When a notification is received, the plugin is notified through the run_oci_seccomp_notify_handle_request_cb callback. The plugin is expected to set *handled to one of these possible values:

/* The plugin doesn't know how to handle the request.  */
/* The plugin filled the response and it is ready to write.  */
/* The plugin will handle the request and write directly to the fd.  */
/* Specify SECCOMP_USER_NOTIF_FLAG_CONTINUE in the flags.  */

This value tells the plugin handler how the notification was handled.
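As an illustration of the callbacks, here is a sketch of a very small plugin that denies mknodat with EPERM and leaves every other syscall to the next plugin. Several things in it are assumptions on my side: the exported symbol names other than run_oci_seccomp_notify_start are derived from the typedef names, the numeric value of RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE should come from the real plugin header, the negative-errno return convention is a guess, and struct demo_state is purely hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

/* Assumed value: the text only gives 0 for "not handled".  Use the
   definition from the real plugin header when building a plugin.  */
#ifndef RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE
# define RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE 1
#endif

/* Only used through a pointer here, so a forward declaration is enough.  */
struct libcrun_load_seccomp_notify_conf_s;

/* Hypothetical per-plugin state kept behind the opaque pointer.  */
struct demo_state
{
  unsigned long denied;
};

int
run_oci_seccomp_notify_plugin_version (void)
{
  return 1;
}

int
run_oci_seccomp_notify_start (void **opaque,
                              struct libcrun_load_seccomp_notify_conf_s *conf,
                              size_t size_configuration)
{
  (void) conf;
  (void) size_configuration;
  assert (opaque != NULL);
  *opaque = calloc (1, sizeof (struct demo_state));
  return *opaque ? 0 : -ENOMEM; /* error convention is an assumption */
}

int
run_oci_seccomp_notify_handle_request (void *opaque,
                                       struct seccomp_notif_sizes *sizes,
                                       struct seccomp_notif *sreq,
                                       struct seccomp_notif_resp *sresp,
                                       int seccomp_fd, int *handled)
{
  struct demo_state *state = opaque;
  (void) sizes;
  (void) seccomp_fd;

  if (sreq->data.nr != SYS_mknodat)
    {
      *handled = 0; /* not ours: let the next plugin try */
      return 0;
    }

  memset (sresp, 0, sizeof (*sresp));
  sresp->id = sreq->id;  /* the response must echo the request id */
  sresp->error = -EPERM; /* the container sees mknodat fail with EPERM */
  state->denied++;
  *handled = RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE;
  return 0;
}

int
run_oci_seccomp_notify_stop (void *opaque)
{
  free (opaque);
  return 0;
}
```

A plugin that actually emulates mknod would replace the EPERM branch with the privileged operation and report its result in sresp->val or sresp->error.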

If the notification was not handled by the plugin, the plugin handler tries the next plugin. If no plugin is able to satisfy the request, the syscall fails with ENOTSUP.
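The fallback logic can be sketched as a loop over the loaded plugins. This is not the code from conmon or crun: struct plugin, dispatch_notification and the two toy handlers are names I made up for the illustration, the value 1 for RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE is assumed, and aborting on a plugin error is only one possible policy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <linux/seccomp.h>

/* Assumed value; the real one comes from the plugin header.  */
#ifndef RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE
# define RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE 1
#endif

typedef int (*handle_request_cb) (void *opaque,
                                  struct seccomp_notif_sizes *sizes,
                                  struct seccomp_notif *sreq,
                                  struct seccomp_notif_resp *sresp,
                                  int seccomp_fd, int *handled);

/* Hypothetical bookkeeping for one loaded plugin.  */
struct plugin
{
  void *opaque;
  handle_request_cb handle_request;
};

/* Ask each plugin in turn.  If one claims the request, *HANDLED holds the
   value it reported.  If none does, *HANDLED is 0 and SRESP is filled so
   that the syscall fails with ENOTSUP.  */
int
dispatch_notification (struct plugin *plugins, size_t n_plugins,
                       struct seccomp_notif_sizes *sizes,
                       struct seccomp_notif *sreq,
                       struct seccomp_notif_resp *sresp,
                       int seccomp_fd, int *handled)
{
  size_t i;

  assert (handled != NULL);
  for (i = 0; i < n_plugins; i++)
    {
      int ret;

      *handled = 0;
      ret = plugins[i].handle_request (plugins[i].opaque, sizes, sreq,
                                       sresp, seccomp_fd, handled);
      if (ret != 0)
        return ret; /* assumed policy: a plugin error stops the chain */
      if (*handled != 0)
        return 0;
    }

  *handled = 0;
  memset (sresp, 0, sizeof (*sresp));
  sresp->id = sreq->id;
  sresp->error = -ENOTSUP;
  return 0;
}

/* Toy handler that never claims a request.  */
int
ignore_request (void *opaque, struct seccomp_notif_sizes *sizes,
                struct seccomp_notif *sreq, struct seccomp_notif_resp *sresp,
                int seccomp_fd, int *handled)
{
  (void) opaque; (void) sizes; (void) sreq; (void) sresp; (void) seccomp_fd;
  *handled = 0;
  return 0;
}

/* Toy handler that fails every request with EACCES.  */
int
deny_request (void *opaque, struct seccomp_notif_sizes *sizes,
              struct seccomp_notif *sreq, struct seccomp_notif_resp *sresp,
              int seccomp_fd, int *handled)
{
  (void) opaque; (void) sizes; (void) seccomp_fd;
  memset (sresp, 0, sizeof (*sresp));
  sresp->id = sreq->id;
  sresp->error = -EACCES;
  *handled = RUN_OCI_SECCOMP_NOTIFY_HANDLE_SEND_RESPONSE;
  return 0;
}
```

After dispatch_notification returns, the caller would write sresp to the seccomp file descriptor, except when a plugin reported that it will send the response itself.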

Plugins won't have to worry about setting up the seccomp watcher or how to retrieve it from the OCI runtime. Another advantage of such a plugin mechanism is that handlers are not limited to being written in C, as long as they can satisfy the C ABI. In fact, I've added an example plugin written entirely in Rust.

These plugins must be loaded at startup time. That is currently implemented in two ways, either as an explicit command line argument or through the environment variable CONMON_SECCOMP_NOTIFY_PLUGINS. The environment variable makes it easier to set up a static configuration used by all containers. The /etc/containers/containers.conf file has a setting for overriding the environment variables passed to conmon.

The RUN_OCI_SECCOMP_NOTIFY_HANDLE_DELAYED_RESPONSE mode tells the plugin manager that the plugin will handle the seccomp notification asynchronously. The plugin is then responsible for writing directly to the seccomp listener file descriptor once the response is ready.
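For the delayed case, the write to the listener file descriptor is an ioctl. The sketch below shows what a plugin could run, for example from a worker thread, once its result is ready. The helper names fill_response and send_delayed_response are hypothetical; the two ioctl requests are the ones the kernel defines in linux/seccomp.h.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Fill RESP for request ID.  ERROR is a positive errno value, or 0 for
   success; VAL is the syscall return value used when ERROR is 0.  */
void
fill_response (struct seccomp_notif_resp *resp, __u64 id, int error,
               __s64 val)
{
  assert (resp != NULL);
  memset (resp, 0, sizeof (*resp));
  resp->id = id;
  resp->error = -error; /* the kernel expects a negative errno */
  resp->val = error ? 0 : val;
}

/* Send RESP on the seccomp listener fd.  Returns 0 or -1 with errno set.  */
int
send_delayed_response (int seccomp_fd, const struct seccomp_notif_resp *resp)
{
  __u64 id = resp->id;

  /* The target may have exited or been interrupted while the plugin was
     working; in that case the id is no longer valid.  */
  if (ioctl (seccomp_fd, SECCOMP_IOCTL_NOTIF_ID_VALID, &id) < 0)
    return -1;

  if (ioctl (seccomp_fd, SECCOMP_IOCTL_NOTIF_SEND, resp) < 0)
    return -1;

  return 0;
}
```

Checking the id first does not remove every race, since the target can still go away between the two ioctl calls, so the plugin has to tolerate a failing send as well.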

Debugging plugins

Debugging plugins loaded by conmon can be difficult, since the conmon process forks and runs in the background. Plugins run from that same context.

To facilitate debugging, I've added yet another annotation to crun: run.oci.seccomp.plugins=PATH, which works in the same way as the setting conmon uses. The plugins are then handled by crun, using the same API. Since crun doesn't fork itself, at least not the part that waits for the container process and handles notifications and plugins, debugging plugins becomes much easier. Another point is that these plugins are essentially low-level handlers that can potentially share code with the crun OCI runtime, so they can live in the same code repository.

Standardize the API?

It would be nice if other implementations could use the same API, or standardize on a similar one that conmon and crun could adopt. That would make it easier to share seccomp handlers among different runtimes.

Open Pull Requests

crun PR conmon PR