run containers without pulling images

CRFS is a Google project that aims to run containers without pre-pulling their images first.
The idea is quite smart: an OCI layer (which is basically a compressed tarball) is modified so that it is possible to seek inside of it and access a single file.
It is designed around the stargz (seekable tar.gz) format. Instead of compressing the whole tar stream at once, stargz concatenates the gzipped stream of each file. Old clients are still able to handle a stargz'ed stream as a regular .tar.gz file.

In an attempt to support CRFS with fuse-overlayfs, I’ve worked on adding a plugin system to fuse-overlayfs (https://github.com/containers/fuse-overlayfs/pull/119). It will make it possible to extend fuse-overlayfs and support different ways to retrieve data for the lower layers.

The second step is a plugin that can handle CRFS. It is still a PoC, but it seems to work quite nicely: https://github.com/giuseppe/crfs-plugin

To create a stargz image, you can use stargzify:

# go get -u github.com/google/crfs/stargz/stargzify

Once stargzify is installed, an image can be converted as:

# stargzify docker.io/fedora docker.io/gscrivano/test:stargz
2019/10/24 20:33:33 pushed blob: sha256:c7155ae298b145d79e75c396ab5cb917023c4fd8b9cf8c7ff2f0332b41ef8651
2019/10/24 20:33:34 pushed blob: sha256:5a419d36bce538fa32fc21cbe11134ccbd70597379d9320f3a32eb6be78e4ad5
2019/10/24 20:33:35 docker.io/gscrivano/test:stargz: digest: sha256:ca6723c15c5b3b0947deef12048ee64126ed237e112cfbde300ce0f4066a4b4d size: 428

The image was pushed to the registry. Let’s create a container:

# mkdir lower upper workdir merged
# export DATA=$(echo -n docker://docker.io/gscrivano/test:stargz | base64 -w0)
# fuse-overlayfs -o fast_ino=1,plugins=/path/to/crfs-plugin.so,lowerdir=//crfs/$DATA/lower,upperdir=upper,workdir=workdir merged

The image reference, passed to fuse-overlayfs encoded in base64, is mounted at the merged directory.

# ls merged/
bin   dev  home   lib    lost+found  mnt  proc  run   srv   sys  usr
boot  etc  hosts  lib64  media       opt  root  sbin  tmp  var

To run the container, we can take advantage of the Podman --rootfs feature. It tells Podman not to manage the storage for the container, but to use the specified path as its rootfs.

# podman run --rm -ti --rootfs merged /bin/sh
sh-5.0#

Now we are in a container where files from the lower layers are loaded on demand, only when requested.

rootless resources management with Podman on Fedora 30

I’ve finally opened some PRs for conmon and libpod that enable resources management for Podman rootless containers on Fedora 30 when using crun.

The only change for the default Fedora 30 configuration is to enable the cgroup v2 unified hierarchy. It can be done with:

# grubby --update-kernel=ALL --args="systemd.unified_cgroup_hierarchy=1"

and a reboot.

systemd by default enables only the pids and memory controllers for unprivileged users. If you want to enable more controllers, you need a drop-in configuration file under /etc/systemd/system/[email protected]/, that looks like:

[Service]
Delegate=cpu cpuacct io blkio memory devices pids

I’ve not found a way to enable the cpuset controller using only the systemd configuration. It must be done manually, or by providing a service file that writes directly to /sys/fs/cgroup/cgroup.subtree_control and /sys/fs/cgroup/user.slice/cgroup.subtree_control, making sure the setting is propagated down to [email protected].

With the updated versions of crun, Podman and conmon:

$ podman --runtime /usr/local/bin/crun run  --memory=100M --rm -ti fedora bash
# cat /proc/self/cgroup 
0::/user.slice/user-1000.slice/[email protected]/80adb7152d9f299cb7bfd383aa7ae2543534d7925c96d486f046e185d09d0946-39898.scope
# cat /sys/fs/cgroup/user.slice/user-1000.slice/[email protected]/80adb7152d9f299cb7bfd383aa7ae2543534d7925c96d486f046e185d09d0946-39898.scope/memory.max
104857600

resources management with rootless containers and cgroups v2

cgroups v2 will finally allow unprivileged users to manage a cgroup hierarchy in a safe manner without requiring any additional permission.

systemd has been mounting cgroups v2 under /sys/fs/cgroup/unified for a long time now, although by default no controllers are enabled there and everything still works using cgroups v1.

It is also possible to use cgroups v2 only; this is known as the unified model. To enable it, it is necessary to specify systemd.unified_cgroup_hierarchy=1 on the kernel command line: systemd will then mount only cgroups v2 at /sys/fs/cgroup.

There is an issue in D-Bus when the user is running inside of a user namespace. The D-Bus request includes the result of geteuid(), but since that is relative to the namespace instead of the user on the host, it won’t match and the request fails. If you are going to play with this and launch the container from within a user namespace, be sure to use this patch: https://github.com/systemd/systemd/pull/11785.

To get it working, I had to manually enable some of the controllers for the unprivileged users, as root:

echo +cpu +cpuset +io +memory +pids > /sys/fs/cgroup/user.slice/cgroup.subtree_control

You’ll need to propagate this down the hierarchy to the user service slice.

Be sure there are no real-time processes running, or the cpu controller cannot be enabled. If you hit an error like error: Invalid argument while enabling the cgroups v2 controllers, try disabling PulseAudio and rtkit-daemon. If it still doesn’t work, check whether other real-time processes are running; you can find them with:

ps ax -L -o 'pid tid cls rtprio comm' | grep RR

I’ve added some basic support for cgroups v2 to the crun OCI runtime (https://github.com/giuseppe/crun/pull/11). The implementation is not complete yet, but it already supports the cpu, io, memory and pids controllers. The devices controller must be implemented through eBPF, and the freezer controller is still being worked on in the kernel. In the crun implementation, systemd, when present, is used only for the delegation of the hierarchy; all the configuration happens by writing directly to the cgroup files. This will enable crun to work with cgroups v2 even if systemd is not available.

Since the OCI runtime specification was designed with cgroups v1 in mind, I have tried to convert the cgroups v1 configuration to cgroups v2. For instance, blkio.weight is converted linearly from the range 10-1000 to the range 1-10000 that io.weight expects.

With that in place, we can now ask systemd to delegate an entire cgroups v2 subtree to the container and manage it directly as an unprivileged user.

Using an OCI configuration that includes:

{
...
    "process": {
        "args": [
            "cat", "/sys/fs/cgroup/memory.max"
        ]
    },
...
    "linux": {
        "resources": {
            "memory": {
                "limit": 1000000000
            }
        }
    }
...
}

As an unprivileged user, we can now do:

$ crun --systemd run foo
 999997440

Next steps:

  • support cgroups v2 in Podman and conmon. Since the OCI runtime configuration won’t change, there probably won’t be much to fix here.
  • add support to crun for more controllers using eBPF.

SUID binaries from a user namespace

Additional IDs that are allocated to a user through /etc/subuid and /etc/subgid must be considered as permanently allocated and never reused for any other user.

Even if the container/user namespace where they are used is destroyed, it is possible to forge a SUID binary that will keep access to any ID present in the user namespace.

This simple C program is enough to keep access to a UID that was allocated to a user namespace:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/types.h>

int main (int argc, char **argv)
{
	/* make the effective UID (the one this binary is SUID to)
	   also the real and saved UID, then run the given command */
	uid_t u = geteuid ();
	setresuid (u, u, u);
	execvp (argv[1], argv + 1);
	return 1;  /* only reached if execvp failed */
}

With that in place, from the user namespace:

$ id -u # ID 0 is mapped to ID 1000 in the host
0
$ gcc program.c -o keep_id
$ chown 10:10 keep_id
$ chmod +s keep_id

Even once the user namespace is destroyed, and even if the range of subids allocated to the user has changed, from the host we can still get access to whatever ID was mapped to user 10 in the user namespace:

$ id -u
1000
$ ls -l keep_id
-rwsr-sr-x. 1 100009 100009 18432 Jan 10 22:23 keep_id
$ ./keep_id id -u
100009


disposable rootless sessions

It would be nice to have a way to “fork” the current session and be able to revert all the changes done, without any leftover on the file system.

Playing with fuse-overlayfs, a FUSE implementation of the overlay file system that is usable by rootless users, I realized how easy that is to achieve: just set the overlay lowerdir to '/' and use a temporary directory for the upper dir.

The upper dir, where all the overlay changes are written, can be deleted once the session is over, or re-used to get back to the previously created session.

This simple setup also enables the use case of an unprivileged user installing packages using the existing system as a base. With a few caveats (e.g. /var/log must be writable) I managed to run dnf and install a few packages on top of my system without needing the root user. Obviously the rest of the system didn’t notice any change, as these files were visible only from the fuse-overlayfs mount and the mount namespace using it.

Perhaps a tool could help manage similar setups. The biggest problem is how to address the assumption that the lower layer won’t change, or at least not enough to cause any breakage in the layered session.

An Emacs mode for rust

I was looking for an Emacs mode that could help me to hack on rust.

Rust-mode itself doesn’t have enough features to help me with a language I am not really proficient with yet.

I wanted to give racer a try; it is available in the Emacs package list.

The Rust toolchain available on Fedora 29 doesn’t seem able to build racer, so the first step was to install Rust from rustup.rs, pretending it is completely fine to pipe curl into sh.

Once rustup was installed, I then needed the nightly toolchain and the rust-src component, so that racer is able to navigate the Rust source code.

$ curl https://sh.rustup.rs -sSf | sh
$ cargo +nightly install racer
$ rustup component add rust-src


At this point I installed the racer-mode package from Emacs.  To do so, M-x then list-packages; in the new buffer find racer-mode, then press I to select it and X to install it.  If you don’t find it in the list of packages, you might need to configure the package archives to use.  This is what I am using from ~/.emacs:

(when (>= emacs-major-version 24)
  (require 'package)
  (add-to-list
   'package-archives
   '("melpa" . "http://melpa.org/packages/")
   t)
  (package-initialize))

Finally I had to configure in my ~/.emacs file the path for racer to look for source files:

(setq racer-rust-src-path "~/.rustup/toolchains/stable-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/src/")

At this point everything is configured to start using racer-mode.

You can open a .rs file, and set the mode with M-x racer-mode (if you are happy with it, you can configure it for any file with the .rs extension).

For a quick try, using M-. on a stdlib function name should bring you to the definition of the function, M-, to go back to its usage.



rootless podman from upstream on CentOS 7

This is the recipe I use to build Podman from upstream on CentOS 7 and use rootless containers. We need an updated version of the shadow utils, as newuidmap and newgidmap are not present on CentOS 7. The shadow utils are installed using “make install”, which is not the clean way to install packages and also overwrites the existing binaries, but it is fine on a development system. Podman is already present on CentOS 7, and in fact we install it so we don’t have to worry about conmon and the other dependencies.

$ sudo yum install -y golang runc git ostree-devel gpgme-devel device-mapper-devel btrfs-progs-devel libassuan-devel libseccomp-devel automake autoconf gettext-devel libtool libxslt libsemanage-devel bison libcap-devel podman
$ go get -u github.com/containers/libpod/cmd/podman

$ (git clone https://github.com/shadow-maint/shadow; cd shadow; ./autogen.sh --prefix=/usr --enable-man; make && sudo make -C src install)

$ (git clone https://github.com/rootless-containers/slirp4netns.git; cd slirp4netns; ./autogen.sh; ./configure --prefix=/usr; make -j $(nproc); sudo make install)

$ sudo bash -c 'echo 10000 > /proc/sys/user/max_user_namespaces'

$ sudo bash -c "echo $(whoami):110000:65536 > /etc/subuid"

$ sudo bash -c "echo $(whoami):110000:65536 > /etc/subgid"

And then:

$ go/bin/podman pull alpine
$ go/bin/podman run --net host --rm -ti alpine echo hello
hello

network namespaces for unprivileged users

A couple of weekends ago I played with libslirp and put together slirp-forwarder.

Slirp emulates a TCP/IP stack in userspace. It can be used to work around the fact that an unprivileged user cannot create TAP/TUN devices in the host namespace. The helper program runs in the host namespace, receives packets from the network namespace where a TAP device is configured, and forwards them to the outside world using unprivileged operations, such as opening a new connection to the destination host. Privileged operations are still not possible outside of the emulated network, as the helper program doesn’t gain any additional privilege over running as an unprivileged user.

Once the PoC was ready, I discovered there was already another tool by Akihiro Suda (@AkihiroSuda), slirp4netns, that does exactly the same thing, and it was already using the better slirp implementation from QEMU, the one used for configuring unprivileged virtual machines.

slirp4netns was added to the rootlesscontainers github organization, and its repo can be found here: https://github.com/rootless-containers/slirp4netns

With some small changes, it was possible to integrate slirp4netns into Podman for the configuration of an unprivileged network namespace. For example, we needed a way to terminate the slirp4netns process once the container exits, to allow configuring the interface, and to notify Podman back once the configuration is done.

$ podman run --rm alpine ifconfig -a
lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

tap0      Link encap:Ethernet  HWaddr CE:CE:E1:0A:4B:F9  
          inet addr:10.0.2.100  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::ccce:e1ff:fe0a:4bf9/64 Scope:Link
          UP BROADCAST RUNNING  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 B)  TX bytes:90 (90.0 B)

This is how it looks from the host. The arguments to slirp4netns, in addition to some fds used for synchronization, are the PID of a process in the network namespace to configure and the device name.

$ bin/podman run --rm alpine sleep 10 &
[1] 10360
$ pgrep -fa slirp
10460 /usr/bin/slirp4netns -c -e 3 -r 4 10447 tap0