use bubblewrap as an unprivileged user to run systemd images

bubblewrap is a sandboxing tool that allows unprivileged users to run containers. I was recently working on a way to allow unprivileged users, to take advantage of bubblewrap to run regular system images that are using systemd. To do so, it was necessary to modify bubblewrap to keep some capabilities in the sandbox.

Capabilities are the way, since Linux 2.2, that the kernel uses to split the root power into a finer grained set of permissions that each thread can have. Together with Linux namespaces it is fine to leave unprivileged users the possibility to use some of them. To give an example, CAP_SETUID, which allows the calling process to make manipulations of process UIDs, is fine to be used in a new user namespace as the set of permitted UIDs is restricted to those UIDs that exist in the new user namespace.

The changes required in bubblewrap are not yet merged upstream. In the rest of post I will refer to the modified bubblewrap simply as bubblewrap.

The patches for bubblewrap are available here: https://github.com/giuseppe/bubblewrap/compare/privileged-systemd, this is the version used for the test. There is already a pull request for these changes to get merged in.

The set of capabilities that bubblewrap leaves in the process is regulated with –cap-add, new namespaces are required to use these caps. The special value ALL, adds all the caps that are allowed by bubblewrap.

A development version of systemd is required to run in the modified bubblewrap. There are patches in systemd upstream that allows systemd to run without requiring CAP_AUDIT_* and to not fail when **setgroups** is disabled, as it is the case when running inside bubblewrap (to address CVE-2014-8989). The **setgroups** restriction may be lifted in future in some cases, this is still under discussion.

For my tests, I’ve used Docker to compose the container, in the following Dockerfile there are no metadata directives as anyway they are not used when exporting the rootfs.

FROM fedora
RUN dnf -y install httpd; dnf clean all; systemctl enable httpd.service

To compose the container and export its content to a directory rootfs, you can do as root:

docker build -t httpd .
docker start --name=httpd httpd
mkdir rootfs
cd rootfs
docker export httpd | tar xf -
rootfs=$(pwd)

To install the latest systemd, once you’ve cloned its repository, from the source directory you can simply do:

./autogen.sh
make -j $(nproc)
make install DESTDIR=$(rootfs)

to install it in the container rootfs.

If the files /etc/subuid and /etc/subgid are present, the first interval of additional UIDs and GIDs for the unprivileged user invoking bubblewrap is used to set the additional users and groups available in the container. This is required for the system users needed for systemd.

At this point, everything is in place and we can use bubblewrap to run the new container as an unprivileged user:

bwrap --uid 0 --gid 0 --bind rootfs / --sys /sys  --proc /proc --dev /dev --ro-bind /sys/fs/cgroup /sys/fs/cgroup --bind /sys/fs/cgroup/systemd /sys/fs/cgroup/systemd --ro-bind /sys/fs/cgroup/cpuset /sys/fs/cgroup/cpuset --ro-bind /sys/fs/cgroup/hugetlb /sys/fs/cgroup/hugetlb --ro-bind /sys/fs/cgroup/devices /sys/fs/cgroup/devices --ro-bind /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct --ro-bind /sys/fs/cgroup/freezer /sys/fs/cgroup/freezer --ro-bind /sys/fs/cgroup/pids /sys/fs/cgroup/pids --ro-bind /sys/fs/cgroup/blkio /sys/fs/cgroup/blkio --ro-bind /sys/fs/cgroup/net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio --ro-bind /sys/fs/cgroup/perf_event /sys/fs/cgroup/perf_event --ro-bind /sys/fs/cgroup/memory /sys/fs/cgroup/memory --bind /sys/fs/cgroup/systemd /sys/fs/cgroup/systemd  --tmpfs /dev/shm --mqueue /dev/mqueue --dev-bind /dev/tty /dev/tty --chdir / --setenv PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin --setenv TERM xterm --setenv container docker  --tmpfs /var --tmpfs /run --tmpfs /tmp --tmpfs /var/www/html --tmpfs /var/log/httpd --bind rootfs/etc /etc  --hostname systemd --unshare-pid --unshare-net --unshare-ipc --unshare-user --unshare-uts --remount-ro / --cap-add ALL --no-reaper /usr/lib/systemd/systemd --system

systemd uses the signal SIGRTMIN+3 to terminate its execution, to kill the bubblewrap container, you can use kill -37 $PID, where $PID is the systemd process in the container.