An interesting issue handling the seccomp listener


An interesting issue was opened for crun a couple of days ago.

The issue reports that:

runc (v1.1.4) accepts the following .linux.seccomp configuration (sendmsg is in the SCMP_ACT_NOTIFY list), but crun (v1.5, also tested v0.19) just hangs.

    "seccomp": {
      "defaultAction": "SCMP_ACT_ALLOW",
      "listenerPath": "/tmp/foo.sock",
      "syscalls": [
        {
          "names": [
            "sendmsg"
          ],
          "action": "SCMP_ACT_NOTIFY"
        }
      ]
    }

seccomp has a feature, user-space notifications, that allows intercepting syscalls and handling them in a custom way in userspace. If the flags argument passed to the seccomp(2) syscall contains the SECCOMP_FILTER_FLAG_NEW_LISTENER flag, the kernel creates a file descriptor and returns it to the caller. That file descriptor, the seccomp listener, is used to receive a notification from the kernel for every intercepted syscall, that is, every syscall for which the filter returns SCMP_ACT_NOTIFY.
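To make the mechanism concrete, here is a minimal sketch in plain C, written against the raw kernel interface rather than taken from crun. The helper names build_notify_filter and install_filter_with_listener are made up for this illustration, and the filter deliberately skips the architecture check that a production filter needs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#define NOTIFY_FILTER_LEN 4

/* Hypothetical helper: fill `out` with a BPF program that asks for a
   user-space notification on `syscall_nr` and allows everything else.
   Assumption: no seccomp_data.arch check, to keep the sketch short.  */
void
build_notify_filter (int syscall_nr, struct sock_filter out[NOTIFY_FILTER_LEN])
{
  const struct sock_filter insns[NOTIFY_FILTER_LEN] = {
    /* Load the syscall number.  */
    BPF_STMT (BPF_LD | BPF_W | BPF_ABS, offsetof (struct seccomp_data, nr)),
    /* Equal: fall through to the notify return.  Not equal: skip it.  */
    BPF_JUMP (BPF_JMP | BPF_JEQ | BPF_K, (unsigned int) syscall_nr, 0, 1),
    BPF_STMT (BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
    BPF_STMT (BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
  };

  memcpy (out, insns, sizeof (insns));
}

/* Hypothetical helper: install the filter on the calling thread.
   Returns the listener fd on success, -1 (with errno set) on failure.  */
int
install_filter_with_listener (struct sock_filter *insns, unsigned short len)
{
  struct sock_fprog prog = { .len = len, .filter = insns };

  /* Required unless the caller has CAP_SYS_ADMIN.  */
  if (prctl (PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
    return -1;

  /* With SECCOMP_FILTER_FLAG_NEW_LISTENER the return value is the fd.  */
  return (int) syscall (SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}
```

SECCOMP_RET_USER_NOTIF is the kernel-level name of the action that libseccomp, and therefore the OCI configuration, calls SCMP_ACT_NOTIFY. Once install_filter_with_listener returns, every matching syscall made by the calling thread blocks until somebody answers it through the returned fd.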

The OCI runtime doesn't handle these notifications directly, so the file descriptor is passed to a different process.

In the OCI configuration file consumed by the OCI runtime, the listenerPath is the path to a UNIX socket that will receive the seccomp listener file descriptor once crun has it.
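As an illustration of the two ends of that socket, the sketch below shows an agent that listens on a path and a client that connects to it. It is not crun's code: the function names are hypothetical, and the use of SOCK_STREAM is an assumption made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill `addr` for `path`; fails if the path does not fit in sun_path.  */
int
make_unix_addr (const char *path, struct sockaddr_un *addr)
{
  if (strlen (path) >= sizeof (addr->sun_path))
    return -1;
  memset (addr, 0, sizeof (*addr));
  addr->sun_family = AF_UNIX;
  strcpy (addr->sun_path, path);
  return 0;
}

/* Agent side: create the socket at `path` and start listening.  */
int
listen_on_path (const char *path)
{
  struct sockaddr_un addr;
  int fd;

  if (make_unix_addr (path, &addr) < 0)
    return -1;
  fd = socket (AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
  if (fd < 0)
    return -1;
  unlink (path); /* Drop a stale socket left by a previous run.  */
  if (bind (fd, (struct sockaddr *) &addr, sizeof (addr)) < 0
      || listen (fd, 1) < 0)
    {
      close (fd);
      return -1;
    }
  return fd;
}

/* Runtime side: connect to the socket named by listenerPath.  */
int
connect_to_path (const char *path)
{
  struct sockaddr_un addr;
  int fd;

  if (make_unix_addr (path, &addr) < 0)
    return -1;
  fd = socket (AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
  if (fd < 0)
    return -1;
  if (connect (fd, (struct sockaddr *) &addr, sizeof (addr)) < 0)
    {
      close (fd);
      return -1;
    }
  return fd;
}
```

The agent would call listen_on_path and then accept(2) before the container is started, while the runtime side only needs the connected socket returned by connect_to_path to hand the listener over.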

What crun did, and what caused the hang, was to naively use sendmsg(2) to send the listener fd to the specified socket right after the seccomp filter was installed. The sendmsg call itself is then intercepted, but no other process has the listener file descriptor yet, so nobody can answer the notification and the call hangs.

What to do?

The problem we need to solve is how to send the file descriptor from an environment where sendmsg is not intercepted.

This is easily achieved with a helper process that is created just before the seccomp filter is installed. A seccomp filter applies only to the process that installs it and to the children it creates afterwards, so the helper is not subject to it. The helper process is responsible for sending the file descriptor to the specified socket.

From the issue report, it seems that runc has already solved the problem by using a pipe to tell the helper process which fd number holds the seccomp listener, and then letting the helper process retrieve the file descriptor with the pidfd_getfd(2) syscall.

Two issues with this approach are:

  • it requires a relatively new kernel feature, pidfd_getfd(2), available since Linux 5.6.
  • it still expects write(2) not to be intercepted by the seccomp filter.

The first issue can be solved with a different approach: instead of using pidfd_getfd(2), we can create the helper process with the CLONE_FILES flag, so that it shares the file descriptor table with the parent process. Any file descriptor the parent opens later, including the seccomp listener, is then visible in the helper under the same number.

We still need to solve the second issue, but we can do that by using a shared memory region and letting the helper process busy-loop on the region until it contains the file descriptor number.
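The two ideas combined can be tried out in isolation. The sketch below is a stand-alone demo, not crun's code: a pipe stands in for the seccomp listener, an anonymous shared mapping stands in for the memfd, and demo_clone_files_handoff is a hypothetical name. It assumes the raw clone argument order used on x86-64 and arm64.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the byte the helper wrote into a pipe that the parent created
   only after the helper was already running, or -1 on error.  */
int
demo_clone_files_handoff (void)
{
  int pipefd[2], status = 0, result = -1;
  char byte = 0;
  pid_t helper;
  atomic_int *slot = mmap (NULL, sizeof (*slot), PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);

  if (slot == MAP_FAILED)
    return -1;
  atomic_store (slot, -1); /* -1 means "no fd published yet".  */

  /* Separate address space, shared fd table.  */
  helper = (pid_t) syscall (SYS_clone, CLONE_FILES | SIGCHLD,
                            NULL, NULL, NULL, NULL);
  if (helper < 0)
    {
      munmap (slot, sizeof (*slot));
      return -1;
    }
  if (helper == 0)
    {
      int fd;

      /* Busy-wait until the parent publishes an fd number.  */
      while ((fd = atomic_load (slot)) == -1)
        usleep (1000);
      /* The fd was opened after the clone, yet it is valid here.  */
      _exit (write (fd, "k", 1) == 1 ? 0 : 1);
    }

  if (pipe (pipefd) < 0)
    {
      kill (helper, SIGKILL);
      waitpid (helper, &status, 0);
      munmap (slot, sizeof (*slot));
      return -1;
    }

  /* A plain memory store: no syscall is needed to notify the helper.  */
  atomic_store (slot, pipefd[1]);

  if (waitpid (helper, &status, 0) == helper && WIFEXITED (status)
      && WEXITSTATUS (status) == 0 && read (pipefd[0], &byte, 1) == 1)
    result = byte;

  close (pipefd[0]);
  close (pipefd[1]);
  munmap (slot, sizeof (*slot));
  return result;
}
```

If the helper had been created with a plain fork(2), the write would fail with EBADF, because the pipe did not exist yet when the file descriptor table was copied.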

Shared memory

The shared memory region is backed by a memfd created as:

      /* memfd_create(2) takes MFD_* flags, not open(2) flags.  */
      memfd = memfd_create ("seccomp-helper-memfd", MFD_CLOEXEC);
      if (UNLIKELY (memfd < 0))
        return crun_make_error (err, errno, "memfd_create");

      ret = ftruncate (memfd, sizeof (atomic_int));
      if (UNLIKELY (ret < 0))
        return crun_make_error (err, errno, "ftruncate seccomp memfd");

      ret = libcrun_mmap (&mmap_region, NULL, sizeof (atomic_int),
                          PROT_WRITE | PROT_READ, MAP_SHARED, memfd, 0, err);
      if (UNLIKELY (ret < 0))
        return ret;

The first block creates the memfd, the second one resizes it to the size of an atomic int, and the third one maps it into memory.
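The same three steps can be written without crun's internal helpers. The following is only a sketch: it uses plain mmap(2) instead of libcrun_mmap, and map_shared_int is a hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create a memfd sized for one atomic_int and map it shared.
   Returns the mapping (NULL on error) and stores the memfd in
   *memfd_out.  */
atomic_int *
map_shared_int (int *memfd_out)
{
  void *addr;
  int memfd = memfd_create ("shared-int-sketch", MFD_CLOEXEC);

  if (memfd < 0)
    return NULL;

  /* A new memfd is empty; give it room for exactly one atomic_int.  */
  if (ftruncate (memfd, sizeof (atomic_int)) < 0)
    {
      close (memfd);
      return NULL;
    }

  /* MAP_SHARED makes stores visible to every process mapping the file.  */
  addr = mmap (NULL, sizeof (atomic_int), PROT_READ | PROT_WRITE,
               MAP_SHARED, memfd, 0);
  if (addr == MAP_FAILED)
    {
      close (memfd);
      return NULL;
    }

  *memfd_out = memfd;
  return addr;
}
```

Note that a freshly truncated memfd is zero-filled. Since 0 is a valid file descriptor number, a sentinel such as -1 has to be stored in the region before the other process starts polling it.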

Helper process

Now that the two processes have a way to communicate without using any syscall, we can look at the helper process, which just does:

      /* CLONE_FILES: the helper shares the fd table with the parent.  */
      helper_proc = syscall_clone (CLONE_FILES | SIGCHLD, NULL);
      if (UNLIKELY (helper_proc < 0))
        return crun_make_error (err, errno, "clone seccomp listener helper process");

      if (helper_proc == 0)
        {
          int fd;

          /* Get killed if the parent dies first.  */
          prctl (PR_SET_PDEATHSIG, SIGKILL);
          /* Poll the shared region until the parent stores the listener
             fd; -1 means "not published yet".  */
          for (;;)
            {
              fd = *fd_received;
              if (fd == -1)
                {
                  usleep (1000);
                  continue;
                }
              break;
            }
          ret = send_fd_to_socket_with_payload (listener_receiver_fd, fd,
                                                receiver_fd_payload,
                                                receiver_fd_payload_len,
                                                err);
          if (UNLIKELY (ret < 0))
            _exit (crun_error_get_errno (err));
          _exit (0);
        }

The prctl(2) call makes sure that the helper process does not outlive its parent process.

Once the fd is retrieved from the shared memory region, the send_fd_to_socket_with_payload function sends it to the receiver socket using the sendmsg(2) syscall.
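For readers who have not used fd passing before, here is a minimal sketch of the underlying technique, an SCM_RIGHTS control message sent with sendmsg(2). It is not crun's send_fd_to_socket_with_payload: send_fd and recv_fd are hypothetical names, and the payload is reduced to a single byte.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send `fd` over the connected UNIX socket `sock` with a one-byte
   payload.  Returns 0 on success, -1 on failure.  */
int
send_fd (int sock, int fd)
{
  char payload = 'F';
  struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
  union { char buf[CMSG_SPACE (sizeof (int))]; struct cmsghdr align; } ctl;
  struct msghdr msg = { 0 };
  struct cmsghdr *cmsg;

  memset (&ctl, 0, sizeof (ctl));
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctl.buf;
  msg.msg_controllen = sizeof (ctl.buf);

  /* The fd travels in the ancillary data, not in the payload.  */
  cmsg = CMSG_FIRSTHDR (&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN (sizeof (int));
  memcpy (CMSG_DATA (cmsg), &fd, sizeof (int));

  return sendmsg (sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from `sock`.  Returns the new fd, or -1 on failure.  */
int
recv_fd (int sock)
{
  char payload;
  struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
  union { char buf[CMSG_SPACE (sizeof (int))]; struct cmsghdr align; } ctl;
  struct msghdr msg = { 0 };
  struct cmsghdr *cmsg;
  int fd = -1;

  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = ctl.buf;
  msg.msg_controllen = sizeof (ctl.buf);

  if (recvmsg (sock, &msg, 0) <= 0)
    return -1;

  cmsg = CMSG_FIRSTHDR (&msg);
  if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET
      || cmsg->cmsg_type != SCM_RIGHTS
      || cmsg->cmsg_len != CMSG_LEN (sizeof (int)))
    return -1;

  /* The kernel installed a new fd in this process for the same file.  */
  memcpy (&fd, CMSG_DATA (cmsg), sizeof (int));
  return fd;
}
```

The receiver gets its own descriptor number for the same open file, so the process listening on listenerPath can keep reading notifications from it after the sender has exited.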

Main process

The main process, the one that will eventually execve the container program, just does:

  ret = syscall_seccomp (SECCOMP_SET_MODE_FILTER, flags, &seccomp_filter);
  if (UNLIKELY (ret < 0))
    return crun_make_error (err, errno, "seccomp (SECCOMP_SET_MODE_FILTER)");
  if (listener_receiver_fd >= 0)
    {
      atomic_int *fd_to_send = mmap_region->addr;
      int status = 0;

      *fd_to_send = listener_fd = ret;

      ret = waitpid (helper_proc, &status, 0);
      ...
    }

The syscall_seccomp function is a wrapper around the seccomp(2) syscall that installs the seccomp filter and returns the listener fd.

The *fd_to_send = listener_fd = ret; assignment writes the listener file descriptor number to the shared memory, where the helper process picks it up. The waitpid call then blocks until the helper has sent the file descriptor and exited.

Conclusion

With all of this in place, crun accepts a seccomp profile with no limitations on which syscalls can be intercepted with SCMP_ACT_NOTIFY. The process that receives the seccomp listener must still make sure that every syscall the runtime issues up to the execve(2) syscall is allowed to proceed; otherwise the OCI runtime will fail to start the container.