network namespaces for unprivileged users

a couple of weekends ago I’ve played with libslirp and put together slirp-forwarder.

SliRP emulates in userspace a TCP/IP stack. It can be used to circumvent the limitation of creating TAP/TUN devices in the host namespace for an unprivileged user. The program could run in the host namespace, receive messages from the network namespace where a TAP device is configured, and forward them to the outside world using unprivileged operations such as opening another connection to the destination host. Privileged operations are still not possible outside of the emulated network, as the helper program doesn’t gain any additional privilege that running as an unprivileged user.

Once the PoC was ready, I discovered there was already another tool by Akihiro Suda (@AkihiroSuda), slirp4netns that was doing exactly the same thing, and it was already using the better slirp implementation in QEMU, that is used for configuring unprivileged virtual machines.

slirp4netns was added to the rootlesscontainers github organization, and its repo can be found here:

With some small changes to slirp4netns, it was possible to integrate slirp4netns into Podman for the configuration of an unprivileged network namespace. For example, we needed a way to terminate the slirp4netns program once the container exits, allow to configure the interface and notify Podman back once the configuration is done.

$ podman run --rm alpine ifconfig -a
lo        Link encap:Local Loopback  
          inet addr:  Mask:
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

tap0      Link encap:Ethernet  HWaddr CE:CE:E1:0A:4B:F9  
          inet addr:  Bcast:  Mask:
          inet6 addr: fe80::ccce:e1ff:fe0a:4bf9/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:90 (90.0 B)

This is how it looks from the host, the arguments to slirp4netns in addition to some fd used for the synchronization, are the PID of a process in the network namespace to configure and the device name.

$ bin/podman run --rm alpine sleep 10 &
[1] 10360
$ pgrep -fa slirp
10460 /usr/bin/slirp4netns -c -e 3 -r 4 10447 tap0

become-root in an user namespace

I’ve cleaned up some C files I was using locally for hacking with user namespaces and uploaded them to a new repository on github:

Creating an user namespace can be easily done with unshare(1) and get the current user mapped to root with unshare -r COMMAND but it doesn’t support the mapping of multiple uids/gids. For doing that it is necessary to use the suid newuidmap and newgidmap tools, that allocates multiple uids/gids to unprivileged users accordingly to the configuration files:

  • /etc/subuid: for additional UIDs
  • /etc/subgid: for additional GIDs
    • $ grep gscrivano /etc/subuid
      $ become-root cat /proc/self/uid_map 
               0       1000          1
               1     110000      65536

      The uid_map file under /proc shows the mappings used by the process.

      become-root doesn’t allow any customization, it statically maps the current user to the root in the user namespace and any additional uid/gid are mapped starting from 1.

      One feature that might be nice to have is to allow the creation of other namespaces as part of the same unshare syscall, such as creating a mount or network namespace, but I’ve not added this feature as I am not using it, I rely on unshare(1) for more features. PR are welcome.

fuse-overlayfs moved to

The project I was working on in the last weeks was moved under the umbrella.

With Linux 4.18 it will be possible to mount a FUSE file system in an user namespace. fuse-overlayfs is an implementation in user space of the overlay file system already present in the Linux kernel, but that can be mounted only by the root user. Union file systems were around for a long time, allowing multiple layers to be stacked on top of each other where usually the last one is the only writeable.
Overlay is an union file system widely used for mounting OCI image. Each OCI image is made up of different layers, each layer can be used by different images. A list of layers, stacked on each other gives the final image that is used by a container. The last level, that is writeable, is specific for the container. This model enables different containers to use the same image that is accessible as read-only from the lower layers of the overlay file system.

The current implementation of the overlay file system is done directly in the kernel, at a very low level, allowing non privileged users to use it directly poses some security risks. In the longer term, once the security aspect is resolved, non privileged users will probably be able to mount directly an overlay file system.

For now, given the new feature in Linux 5.18, having an implementation of the overlay union in user space will enable rootless containers to use the same storage as containers running as root.

On Fedora Rawhide, where Linux 4.18 is available, it is already possible to take a taste of it with:

podman --storage-opt overlay2.fuse_program=/usr/bin/fuse-overlayfs run ...

The previous command tells podman to mount an overlay file system using the specified FUSE helper instead of mounting it directly through the kernel.

Current status (and problems) of running Buildah as non root

Having Buildah running in an user namespace opens the possibility of building container images as a not root user. I’ve done some work to get Buildah running in an user container.

There are still some open issues to get it fully working. The biggest open one is that overlayfs cannot be currently used as non root user. There is some work going on, but this will require changes in the kernel and the way extended attributes work for overlay. The alternative is far from ideal and it is to use the vfs storage driver, but it is a good starting point to get things moving and see how far we get. (Another possibility that doesn’t require changes in the kernel would be an OSTree storage for Buildah, but that is a different story).

Circumvented the first obstacle, the other big issue was to get a container, that is created for every buildah run command, the Buildah version of the RUN directive in a Dockerfile. That means run a container inside of a container.

The default runtime for atomic –user is bwrap-oci, a tool that converts a subset of the OCI configuration file to a command line for bubblewrap, the real engine for running the container. There is an open issue with bubblewrap, that as part of the container setup, move the container in a chroot. This will prevent further containers to be created as for the unshare(2) man page, you can get an EPERM if:

EPERM (since Linux 3.9)
CLONE_NEWUSER was specified in flags and the caller is in a chroot environment (i.e., the caller’s root directory does not match the root directory of the mount namespace in which it resides).

This problem is tracked here: Once that is merged, together with some other small changes in bwrap-oci I got the container running and bubblewrap could be used both as the runtime for running the Buildah container that for the runtime for managing the containers created by Buildah.

I wanted to give it a try with runc as well as the container runtime. There is a lot of development going on upstream for running containers as not root user, but it also failed to run in an user namespace when it tried to setup the cgroups.

To get a better understanding of what could the solution for having a full OCI runtime managing these containers, I wrote some patches for crun, partly because it is my pet project and also as it is still experimental, it is much easier to quickly throw a bunch of patches at it and not be worried to make someone sad. I’ve added some code to detect when the container is running in an user namespace and relax some error conditions to deal with the limitations in such environment. Even if the user id is 0 the runtime doesn’t still have full control of the system.

The container image that I’ve prepared is hosted on Docker hub at

Given you use the latest version crun from git and of the atomic CLI tool (that supports –runtime) you can run the container as:

$ atomic run --runtime /usr/bin/crun --storage ostree /host/$(pwd)/

The script looks very similar to the example on the Buildah github page. It is a shell script that looks like:

#!/bin/bash -x

export HOME=/host/$(pwd)

ctr1=`buildah --storage-driver vfs from --pull ${}`

buildah --storage-driver vfs run --runtime /host/usr/bin/crun --runtime-flag systemd-cgroup $ctr1 -- dnf  upgrade -y

buildah --storage-driver vfs run --runtime /host/usr/bin/crun --runtime-flag systemd-cgroup $ctr1 -- dnf install -y lighttpd

buildah --storage-driver vfs config $ctr1 --annotation ""

buildah --storage-driver vfs config $ctr1 --cmd "/usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf"
buildah --storage-driver vfs config $ctr1 --port 80

buildah --storage-driver vfs commit $ctr1  giuseppe/lighttpd

We got very close, but it doesn’t work yet the last `commit` command fails as vfs got broken upstream: We’ve built a container in an user namespace, but we cannot share it with anyone šŸ™‚

New COPR repository for crun

I made a new COPR repository for CRUN so that it can be easily tested on Fedora:

To install crun on Fedora, it is enough to:

# dnf install 'dnf-command(copr)'
# dnf -y copr enable gscrivano/crun
# dnf install -y crun

a recent change in the atomic tool, which didn’t still get into a release, allows to easily override the OCI runtime for system containers. Assuming you are using atomic from the upstream repository, you can use crun as:

# atomic install --system ----runtime /usr/bin/crun
# systemctl start etcd

It will install etcd as a system container which runs through crun!

You might need to disable SELinux as the /usr/bin/crun executable is not yet labelled correctly.

C is a better fit for tools like an OCI runtime

I’ve spent some of the last weeks working on a replacement for runC, the most used/known OCI runtime for running containers. It might not be very well known, but it is a key component for running containers. Every Docker container ultimately runs through runC.

Having containers running through some common specs allow some pieces to be replaced without having any difference in behavior.

The OCI runtime specs describe how a container looks like once it is running, for instance it lists all the mount points, the capabilities left to the process, the process that must be executed, the namespaces to create and so on.

While the rest of the containers ecosystem is written in Go, from Docker to Kubernetes, I think that for such a low level tool C still makes more sense. runC itself uses C for its lower level tasks forking itself once the configuration done and setting up the environment in C before launching the container process.

I’ve tried running sequentially 100 times a container that runs only /bin/true and the results are quite good:

|                                       | crun      | runC      |  %    |
| 100 /bin/true (no network namespace)  | 0m4.449s  | 0m7.514s  | 40.7% |
| 100 /bin/true (new network namespace) | 0m15.850s | 0m18.986s | 16.5% |

Most of the time for running a container seems to be in the creation of a network namespace. I had expected some costs in the Go->C process handling but I am surprised by the results when the network namespace is not used as crun is almost double as fast as runC.

For the parsing of the OCI spec file crun uses

crun is still experimental and some features are missing, but if you are intested you can take a look here: and open a PR if you have any improvements

OpenShift on system containers

It is still an ongoing work not ready for production, but the upstream version of OpenShift origin has already an experimental support for running OpenShift Origin using system containers. The “latest” Docker image for origin, node and openvswitch, the 3 components we need, are automatically pushed to, so we can use these for our test. The rhel7/etcd system container image instead is pulled from the Red Hat registry.

This demo is based on these blog posts and with some differences for the provision of the VMs and obviously running system containers instead of Docker containers.

The files used for the provision and the configuration can also be found here:, if you find it easier than copy/paste from a web browser.

In order to give it a try, we need the latest version of openshift-ansible for the installation. Let’s use a known commit that worked for me.

$ git clone
$ git checkout a395b2b4d6cfd65e1a2fb45a75d72a0c1d9c65bc

To provision the VMs for the OpenShift cluster, I’ve used this simple Vagrantfile:

BOX_IMAGE = "fedora/25-atomic-host"

# Workaround for (which is not yet merged while writing this)
SCRIPT = "sed -i -e 's|^Defaults.*secure_path.*$|Defaults    secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin|g' /etc/sudoers"

Vagrant.configure("2") do |config|
  config.vm.define "master" do |subconfig|
    subconfig.vm.hostname = "master" :private_network, ip: ""
  (1..NODE_COUNT).each do |i|
    config.vm.define "node#{i}" do |subconfig|
      subconfig.vm.hostname = "node#{i}" :private_network, ip: "10.0.0.#{10 + i}"

  config.vm.synced_folder "/tmp", "/vagrant", disabled: 'true'
  config.vm.provision :shell, :inline  => SCRIPT = BOX_IMAGE

  config.vm.provider "libvirt" do |v|
    v.memory = 1024
    v.cpus = 2

  config.vm.provision "shell" do |s|
    ssh_pub_key = File.readlines(ENV['HOME'] + "/.ssh/").first.strip
    s.inline = <<-SHELL
      mkdir -p /home/vagrant/.ssh /root/.ssh
      echo #{ssh_pub_key} >> /home/vagrant/.ssh/authorized_keys
      echo #{ssh_pub_key} >> /root/.ssh/authorized_keys
      lvextend -L10G /dev/atomicos/root
      xfs_growfs -d /dev/mapper/atomicos-root

The Vagrantfile will provision three virtual machines based on the `fedora/25-atomic-host` image. One machine will be used for the master node, the other two will be used as nodes. I am using static IPs for them so that it is easier to refer to them from the Ansible playbook and to require DNS configuration.

The machines can finally be provisioned with vagrant as:

# vagrant up --provider libvirt

At this point you should be able to login into the VMs as root using your ssh key:

for host in 10.0.0.{10,11,12};
    ssh -q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no [email protected]$host "echo yes I could login on $host"

yes I could login on
yes I could login on
yes I could login on

Our VMs are ready. Let’ install OpenShift!

This is the inventory file used for openshift-ansible, store it in a file origin.inventory:

# Create an OSEv3 group that contains the masters and nodes groups

# Set variables common for all OSEv3 hosts


#######SYSTEM CONTAINERS###################################

# enable htpasswd auth
openshift_master_identity_providers=[{'name': 'htpasswd_auth', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider', 'filename': '/etc/origin/master/htpasswd'}]
openshift_master_htpasswd_users={'admin': '$apr1$zgSjCrLt$1KSuj66CggeWSv.D.BXOA1', 'user': '$apr1$.gw8w9i1$ln9bfTRiD6OwuNTG5LvW50'}

# host group for masters
[masters] openshift_hostname=

# host group for etcd, should run on a node that is not schedulable
[etcd] openshift_ip=

# host group for worker nodes, we list master node here so that
# openshift-sdn gets installed. We mark the master node as not
# schedulable.
[nodes] openshift_hostname= openshift_schedulable=true openshift_node_labels="{'region': 'primary', 'router':'true'}" openshift_hostname= openshift_schedulable=true openshift_node_labels="{'region': 'primary', 'registry':'true'}"

The new configuration required to run system containers is quite visible in the inventory file. `use_system_containers=True` is required to tell the installer to use system containers, `system_images_registry` specifies the registry from where the system containers must be pulled.

And we can finally run the installer, using python3, from the directory where we forked ansible-openshift:

$ ansible-playbook -e 'ansible_python_interpreter=/usr/bin/python3' -v -i origin.inventory ./playbooks/byo/config.yml

After some time, if everything went well, OpenShift should be installed.

To copy the oc client to the local machine I’ve used this command from the directory with the Vagrantfile:

$ scp -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i .vagrant/machines/master/libvirt/private_key [email protected]:/usr/local/bin/oc /usr/local/bin/

As non root, let’s login into the cluster:

$ oc login --insecure-skip-tls-verify=false  -u user -p OriginUser

Login successful.

You don't have any projects. You can try to create a new project, by running

    oc new-project 

$ oc new-project test

Now using project "test" on server "".

You can add applications to this project with the 'new-app' command. For example, try:

    oc new-app centos/ruby-22-centos7~

to build a new example application in Ruby.

$ oc new-app

--> Found Docker image 1f8ec11 (6 days old) from Docker Hub for "fedora"

    * An image stream will be created as "fedora:latest" that will track the source image
    * A Docker build using source code from will be created
      * The resulting image will be pushed to image stream "hello-openshift-plus:latest"
      * Every time "fedora:latest" changes a new build will be triggered
    * This image will be deployed in deployment config "hello-openshift-plus"
    * Ports 8080, 8888 will be load balanced by service "hello-openshift-plus"
      * Other containers can access this service through the hostname "hello-openshift-plus"
    * WARNING: Image "fedora" runs as the 'root' user which may not be permitted by your cluster administrator

--> Creating resources ...
    imagestream "fedora" created
    imagestream "hello-openshift-plus" created
    buildconfig "hello-openshift-plus" created
    deploymentconfig "hello-openshift-plus" created
    service "hello-openshift-plus" created
--> Success
    Build scheduled, use 'oc logs -f bc/hello-openshift-plus' to track its progress.
    Run 'oc status' to view your app.

After some time, we can see our service running:

oc get service

NAME                   CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
hello-openshift-plus           8080/TCP,8888/TCP   46m

Are we really running on system containers? Let’s check it out on master and one node:

(The atomic command upstream has a breaking change so with future versions of atomic we will need -f backend=ostree to filter system containers, as clearly ostree is not a runtime)

for host in 10.0.0.{10,11};
    ssh -q -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no [email protected]$host "atomic containers list --no-trunc -f runtime=ostree"

   CONTAINER ID  IMAGE                                     COMMAND                                    CREATED          STATE     RUNTIME   
   etcd              /usr/bin/ /usr/bin/etcd         2017-02-23 11:01 running   ostree    
   origin-master /usr/local/bin/ 2017-02-23 11:10 running   ostree    
   CONTAINER ID IMAGE                                          COMMAND                                    CREATED          STATE     RUNTIME   
   origin-node        /usr/local/bin/ 2017-02-23 11:17 running   ostree    
   openvswitch /usr/local/bin/ 2017-02-23 11:18 running   ostree

And to finally destroy the cluster:

vagrant destroy

Facebook detox?Ā 

I have been using Facebook for the last years to fill every dead time:waiting for the bus, ads on TV, compiling, etc.  The quality of the information coming from Facebook is inferior to any other social network, at least to my experience (it can be I follow/know the wrong people), though the part of the brain that controls procrastination seems addicted to this lower quality information and the chattering there.  Also, I don’t want to simply delete my Facebook account and move on, most of the people I know are present only there, neither I want to be more “asocial”.

The Android market has always a solution.    An app let you define rules on how long are you permitted to use each app.  I am self limiting myself to ten minutes per day of Facebook.  Second day and the rule is still in place without exceptions! 

use bubblewrap as an unprivileged user to run systemd images

bubblewrap is a sandboxing tool that allows unprivileged users to run containers. I was recently working on a way to allow unprivileged users, to take advantage of bubblewrap to run regular system images that are using systemd. To do so, it was necessary to modify bubblewrap to keep some capabilities in the sandbox.

Capabilities are the way, since Linux 2.2, that the kernel uses to split the root power into a finer grained set of permissions that each thread can have. Together with Linux namespaces it is fine to leave unprivileged users the possibility to use some of them. To give an example, CAP_SETUID, which allows the calling process to make manipulations of process UIDs, is fine to be used in a new user namespace as the set of permitted UIDs is restricted to those UIDs that exist in the new user namespace.

The changes required in bubblewrap are not yet merged upstream. In the rest of post I will refer to the modified bubblewrap simply as bubblewrap.

The patches for bubblewrap are available here:, this is the version used for the test. There is already a pull request for these changes to get merged in.

The set of capabilities that bubblewrap leaves in the process is regulated with –cap-add, new namespaces are required to use these caps. The special value ALL, adds all the caps that are allowed by bubblewrap.

A development version of systemd is required to run in the modified bubblewrap. There are patches in systemd upstream that allows systemd to run without requiring CAP_AUDIT_* and to not fail when setgroups is disabled, as it is the case when running inside bubblewrap (to address CVE-2014-8989). The setgroups restriction may be lifted in future in some cases, this is still under discussion.

For my tests, I’ve used Docker to compose the container, in the following Dockerfile there are no metadata directives as anyway they are not used when exporting the rootfs.

FROM fedora
RUN dnf -y install httpd; dnf clean all; systemctl enable httpd.service

To compose the container and export its content to a directory rootfs, you can do as root:

docker build -t httpd .
docker start --name=httpd httpd
mkdir rootfs
cd rootfs
docker export httpd | tar xf -

To install the latest systemd, once you’ve cloned its repository, from the source directory you can simply do:

make -j $(nproc)
make install DESTDIR=$(rootfs)

to install it in the container rootfs.

If the files /etc/subuid and /etc/subgid are present, the first interval of additional UIDs and GIDs for the unprivileged user invoking bubblewrap is used to set the additional users and groups available in the container. This is required for the system users needed for systemd.

At this point, everything is in place and we can use bubblewrap to run the new container as an unprivileged user:

bwrap --uid 0 --gid 0 --bind rootfs / --sys /sys  --proc /proc --dev /dev --ro-bind /sys/fs/cgroup /sys/fs/cgroup --bind /sys/fs/cgroup/systemd /sys/fs/cgroup/systemd --ro-bind /sys/fs/cgroup/cpuset /sys/fs/cgroup/cpuset --ro-bind /sys/fs/cgroup/hugetlb /sys/fs/cgroup/hugetlb --ro-bind /sys/fs/cgroup/devices /sys/fs/cgroup/devices --ro-bind /sys/fs/cgroup/cpu,cpuacct /sys/fs/cgroup/cpu,cpuacct --ro-bind /sys/fs/cgroup/freezer /sys/fs/cgroup/freezer --ro-bind /sys/fs/cgroup/pids /sys/fs/cgroup/pids --ro-bind /sys/fs/cgroup/blkio /sys/fs/cgroup/blkio --ro-bind /sys/fs/cgroup/net_cls,net_prio /sys/fs/cgroup/net_cls,net_prio --ro-bind /sys/fs/cgroup/perf_event /sys/fs/cgroup/perf_event --ro-bind /sys/fs/cgroup/memory /sys/fs/cgroup/memory --bind /sys/fs/cgroup/systemd /sys/fs/cgroup/systemd  --tmpfs /dev/shm --mqueue /dev/mqueue --dev-bind /dev/tty /dev/tty --chdir / --setenv PATH /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin --setenv TERM xterm --setenv container docker  --tmpfs /var --tmpfs /run --tmpfs /tmp --tmpfs /var/www/html --tmpfs /var/log/httpd --bind rootfs/etc /etc  --hostname systemd --unshare-pid --unshare-net --unshare-ipc --unshare-user --unshare-uts --remount-ro / --cap-add ALL --no-reaper /usr/lib/systemd/systemd --system

systemd uses the signal SIGRTMIN+3 to terminate its execution, to kill the bubblewrap container, you can use kill -37 $PID, where $PID is the systemd process in the container.