seccomp made easy


seccomp is a kernel feature that restricts which syscalls a process can use.

Almost every container runs with seccomp enabled to restrict its access to syscalls.

The seccomp profile defined for a container is ultimately converted to a BPF program that the kernel runs on each syscall to decide whether to allow it or how else to handle it.

An OCI runtime, such as crun or runc, gets the seccomp configuration as part of the OCI JSON configuration file, then generates the BPF program using libseccomp.

OCI container engines, such as Podman, CRI-O or Moby, instead use a higher-level JSON file to define the seccomp profile. The engine then converts this profile into the configuration passed to the OCI runtime.

The higher-level configuration makes it possible to customize the seccomp profile according to the container configuration. For example, it is possible to allow or deny a syscall only when a specific capability is also granted to the container.
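
As an illustration of that conversion step, here is a minimal Python sketch of how an engine might select entries based on the granted capabilities. The profile is a made-up, trimmed-down example modeled on the layout of the containers seccomp.json ("includes"/"excludes" with a "caps" list), and the resolve function is hypothetical, not the code used by Podman or CRI-O.

```python
import json

# Hypothetical, trimmed-down profile modeled on the containers seccomp.json layout.
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write"], "action": "SCMP_ACT_ALLOW"},
    {"names": ["socket"], "action": "SCMP_ACT_ALLOW",
     "includes": {"caps": ["CAP_AUDIT_WRITE"]}},
    {"names": ["socket"], "action": "SCMP_ACT_ERRNO",
     "args": [{"index": 0, "value": 16, "op": "SCMP_CMP_EQ"},
              {"index": 2, "value": 9, "op": "SCMP_CMP_EQ"}],
     "excludes": {"caps": ["CAP_AUDIT_WRITE"]}}
  ]
}
""")


def resolve(profile, granted_caps):
    """Keep only the entries whose capability conditions match the container."""
    granted = set(granted_caps)
    selected = []
    for entry in profile["syscalls"]:
        wanted = set(entry.get("includes", {}).get("caps", []))
        unwanted = set(entry.get("excludes", {}).get("caps", []))
        # "includes": every listed capability must be granted.
        if not wanted <= granted:
            continue
        # "excludes": none of the listed capabilities may be granted.
        if unwanted & granted:
            continue
        # Drop the engine-only keys; the runtime only sees plain rules.
        selected.append({k: v for k, v in entry.items()
                         if k not in ("includes", "excludes")})
    return selected


for caps in ([], ["CAP_AUDIT_WRITE"]):
    print(caps, "->", [(e["names"], e["action"]) for e in resolve(PROFILE, caps)])
```

The point of the sketch is only that the capability checks happen in the engine, before the OCI runtime ever sees the profile.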

Writing a seccomp profile in JSON is painful on its own, but there are also several limitations in the libseccomp API that make it even more difficult to write. As the seccomp_rule_add(3) man page says:

All of the filter rules supplied by the calling application are
combined into a union, with additional logic to eliminate redundant
syscall filters. For example, if a rule is added which allows a given
syscall with a specific set of argument values and later a rule is
added which allows the same syscall regardless the argument values
then the first, more specific rule, is effectively dropped from the
filter by the second more generic rule.

This behavior turns out to be quite difficult to handle when the same syscall must be treated in different ways depending on its arguments.

If you don't believe it, look at what we had to do just to return EINVAL when the first argument to the socket syscall is equal to 16 (AF_NETLINK) and the third one is equal to 9 (NETLINK_AUDIT) (why it is done this way is a topic for another time):

https://github.com/containers/common/blob/26511cc1709f80bc3de89edcfc4ac465fb21c106/pkg/seccomp/seccomp.json#L728-L832

This also doesn't scale: if we want to add another condition, we need to provide the configuration for each combination of values.
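
To see where the growth comes from, here is a small Python sketch, under the assumption that the profile can only be written as a union of conjunctive ALLOW rules. Saying "allow unless this condition matches" then means negating each deny condition and expanding the result. The allow_rules helper and the second deny condition are made up for illustration, and the sketch does not try to drop redundant rules.

```python
from itertools import product


def allow_rules(deny_conditions):
    """Expand NOT(d1) AND NOT(d2) ... into a union of conjunctive ALLOW rules.

    Each deny condition is a list of (arg_index, value) equality tests that
    must all hold. Negating one means "at least one test fails", so every
    ALLOW rule picks one failing test from each deny condition.
    """
    rules = []
    for choice in product(*deny_conditions):
        rules.append([f"arg{i} != {v}" for i, v in choice])
    return rules


# The condition from the socket example: arg0 == 16 && arg2 == 9.
one = allow_rules([[(0, 16), (2, 9)]])
# Adding a second, made-up condition doubles the number of ALLOW rules.
two = allow_rules([[(0, 16), (2, 9)], [(0, 10), (1, 3)]])

print(len(one), one)
print(len(two), two)
```

With k deny conditions of n tests each, the expansion produces n^k rules, which is why every additional condition makes the JSON noticeably longer.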

So what to do?

Last week I started working on easyseccomp. It is still a PoC, but it already seems to work quite well.

The goal is to have an easier-to-use language to define a seccomp profile.

libseccomp is not used to generate the BPF bytecode (although it is still needed to look up the syscall numbers).

To give an example, the socket syscall example above would look like:

#ifndef CAP_AUDIT_WRITE
$syscall == @socket && $arg0 == 16 && $arg2 == 9 => ERRNO(EINVAL);
#endif
$syscall == @socket => ALLOW();

That's it.
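
For readers who prefer to see the intent spelled out, here is a toy Python model of how the two rules read. It assumes the rules are evaluated top to bottom and that the first match wins, which is an assumption about easyseccomp semantics and not something taken from its implementation. The "NO_MATCH" verdict is a placeholder for whatever the rest of a real profile would do.

```python
import errno

AF_NETLINK = 16      # first argument of socket(2) in the example
NETLINK_AUDIT = 9    # third argument of socket(2) in the example


def evaluate(syscall, args, cap_audit_write):
    """Toy model of the two-rule profile, assuming first match wins."""
    if not cap_audit_write:  # body of "#ifndef CAP_AUDIT_WRITE"
        if (syscall == "socket" and args[0] == AF_NETLINK
                and args[2] == NETLINK_AUDIT):
            return ("ERRNO", errno.EINVAL)
    if syscall == "socket":
        return ("ALLOW", None)
    # Not covered by the excerpt; placeholder verdict chosen for this sketch.
    return ("NO_MATCH", None)


print(evaluate("socket", [AF_NETLINK, 3, NETLINK_AUDIT], cap_audit_write=False))
print(evaluate("socket", [AF_NETLINK, 3, NETLINK_AUDIT], cap_audit_write=True))
```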

How to use it?

Since the seccomp configuration is passed to the OCI runtime as part of the OCI configuration file and doesn't allow any customization, we need (at least for now) a side channel to pass it. Annotations are a mechanism to pass arbitrary information to the OCI runtime, so I've added a custom annotation to crun. When the annotation is present, crun ignores the seccomp configuration in the OCI file and loads the raw BPF bytecode from the specified file.

The PR is here: https://github.com/containers/crun/pull/578

Once the BPF filter is generated by easyseccomp, the raw result can be passed to crun using the new annotation. For example, with Podman:

$ easyseccomp < config > /path/to/the/filter.bpf
$ podman run --annotation run.oci.seccomp_bpf_file=/path/to/the/filter.bpf ...
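
Before handing a filter to the runtime, it can be handy to look inside it. The Python sketch below decodes classic BPF instructions, assuming the file is a plain array of the kernel's struct sock_filter (u16 code, u8 jt, u8 jf, u32 k) in host byte order. That layout is an assumption about the generated file, and decode_filter is a hypothetical helper, not part of easyseccomp or crun.

```python
import struct

# Layout of the kernel's struct sock_filter: u16 code, u8 jt, u8 jf, u32 k.
SOCK_FILTER = struct.Struct("=HBBI")


def decode_filter(raw):
    """Split raw bytes into (code, jt, jf, k) classic BPF instructions."""
    if len(raw) % SOCK_FILTER.size != 0:
        raise ValueError("length is not a multiple of 8 bytes")
    return [SOCK_FILTER.unpack_from(raw, off)
            for off in range(0, len(raw), SOCK_FILTER.size)]


# Self-contained sample: a one-instruction filter, "return SECCOMP_RET_ALLOW".
BPF_RET_K = 0x06                # BPF_RET | BPF_K
SECCOMP_RET_ALLOW = 0x7FFF0000
sample = SOCK_FILTER.pack(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW)

for code, jt, jf, k in decode_filter(sample):
    print(f"code={code:#06x} jt={jt} jf={jf} k={k:#010x}")
```

On a real file, the same function could be called on the bytes read from the path given to the annotation; the number of decoded instructions is also a quick way to compare filter sizes.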

The container engine has all the logic to convert the high level JSON configuration to the OCI version, including the logic of looking at what capabilities are granted to the container.

For now we need to take care of this step when the easyseccomp profile is generated.

easyseccomp supports customizations of the profile with a mechanism similar to the C preprocessor:

#ifndef CAP_AUDIT_WRITE
$syscall == @socket && $arg0 == 16 && $arg2 == 9 => ERRNO(EINVAL);
#endif

These definitions can be passed to easyseccomp with the -d option, or left out:

$ easyseccomp -d CAP_AUDIT_WRITE
$ easyseccomp

If CAP_AUDIT_WRITE is specified to easyseccomp, the code between the #ifndef directive and the #endif is ignored; if it is not specified, the code is included.

Conversely, #ifdef DIRECTIVE makes it possible to specify code that is included only when the specified DIRECTIVE is present.

The #if(n)def/#endif directive mechanism is a replacement for the excludes/includes rules used in the JSON file.
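
The conditional mechanism can be sketched in a few lines of Python. This is a minimal model of #ifdef/#ifndef/#endif processing and not the easyseccomp implementation; in particular, the preprocess function is hypothetical, and whether easyseccomp accepts nested blocks the way this sketch does is an assumption.

```python
def preprocess(text, defined):
    """Keep the lines whose enclosing #ifdef/#ifndef conditions all hold."""
    stack = []   # one boolean per open conditional block
    out = []
    for line in text.splitlines():
        words = line.split()
        if words and words[0] in ("#ifdef", "#ifndef"):
            present = words[1] in defined
            stack.append(present if words[0] == "#ifdef" else not present)
        elif words and words[0] == "#endif":
            stack.pop()
        elif all(stack):
            # Outside any block, or inside blocks that are all enabled.
            out.append(line)
    return "\n".join(out)


EASY_PROFILE = """\
#ifndef CAP_AUDIT_WRITE
$syscall == @socket && $arg0 == 16 && $arg2 == 9 => ERRNO(EINVAL);
#endif
$syscall == @socket => ALLOW();
"""

print(preprocess(EASY_PROFILE, set()))
print("---")
print(preprocess(EASY_PROFILE, {"CAP_AUDIT_WRITE"}))
```

Passing an empty set keeps the EINVAL rule, while passing CAP_AUDIT_WRITE leaves only the ALLOW rule, mirroring the two easyseccomp invocations with and without -d.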

To facilitate the conversion between an existing JSON configuration file and the new language, I've added a Python script convert-from-containers-policy.py that can be used as:

$ convert-from-containers-policy.py < /usr/share/containers/seccomp.json > seccomp_profile

The conversion is best-effort, but it is a good starting point.

Given the new profile, the BPF can be generated (assuming an AMD64 machine) as:

$ easyseccomp -d ARCH_AMD64 < seccomp_profile > generated_bpf
$ podman run --annotation run.oci.seccomp_bpf_file=/path/to/the/generated_bpf ...

Generated BPF

Running with seccomp enabled has a runtime overhead on each syscall performed by a process. The overhead depends on the generated BPF.

The BPF generated by easyseccomp, at least the one created from the profile above, seems to perform better than the one generated by libseccomp.

On my machine, using kernel 5.9.16-200.fc33.x86_64 and a crun version that supports loading the raw BPF filter, I've used this simple C program to benchmark the seccomp overhead:

#include <unistd.h>
#include <sys/syscall.h>

int
main ()
{
  int i;

  for (i = 0; i < 10000000; i++)
    {
      syscall (SYS_getpid);
      syscall (SYS_getuid);
      syscall (SYS_getgid);
    }
  return 0;
}

and I get:

$ podman run --security-opt seccomp=unconfined seccomp-benchmark sh -c 'time /usr/local/bin/benchmark'

real	0m1.682s
user	0m0.683s
sys	0m0.993s

$ podman run seccomp-benchmark sh -c 'time /usr/local/bin/benchmark'

real	0m2.979s
user	0m0.721s
sys	0m2.249s

$ podman run --annotation run.oci.seccomp_bpf_file=/tmp/generated_bpf seccomp-benchmark sh -c 'time /usr/local/bin/benchmark'

real	0m2.591s
user	0m0.646s
sys	0m1.938s

The first command disables seccomp, the second one uses the filter generated by libseccomp, and the third one uses the filter generated by easyseccomp.
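
To put the three runs in perspective, the "real" times can be turned into a rough per-syscall cost. The Python snippet below does only that arithmetic; it is based on the single measurement reported above, so the results are an order of magnitude and not a precise figure.

```python
CALLS = 10_000_000 * 3   # loop iterations x syscalls per iteration

# "real" times in seconds, copied from the three runs.
unconfined, libseccomp, easyseccomp = 1.682, 2.979, 2.591


def overhead_ns(confined):
    """Extra wall-clock time per syscall compared to the unconfined run."""
    return (confined - unconfined) / CALLS * 1e9


print(f"libseccomp:  {overhead_ns(libseccomp):.1f} ns per syscall")
print(f"easyseccomp: {overhead_ns(easyseccomp):.1f} ns per syscall")
```

That works out to roughly 43 ns of added cost per syscall with the libseccomp filter and roughly 30 ns with the easyseccomp one.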

EDIT:

Linux 5.11 added constant-action bitmaps for seccomp, so the performance in the example above is the same for both the libseccomp and the easyseccomp versions. Since the constant-action kernel optimization works only for ALLOW rules, the smaller BPF generated by easyseccomp (with the containers default profile, it is down to 20% of the size of the libseccomp version) still performs better in all other cases.