Why do I have two /sys/fs/cgroup mounts in my container?


Users have asked a few times why they see two /sys/fs/cgroup mounts in their unprivileged container.

With Podman, this happens when an unprivileged container does not use a new network namespace, for example when it is started with --network=host.

The Limitation of Unprivileged Users

An unprivileged user, by definition, lacks some of the permissions available to root. One consequence is that, inside a new user namespace, the user cannot mount a fresh sysfs on /sys unless two conditions hold: a sysfs instance is already mounted and fully visible in the current mount namespace, and the new user namespace owns the current network namespace.

When these conditions are not met, Podman gives the container a /sys by bind mounting the host's /sys filesystem instead.
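The two conditions and the fallback can be probed from an ordinary shell. The following is a minimal sketch, assuming util-linux unshare(1) and mount(8) and a kernel that allows unprivileged user namespaces. try_sys_mount is a hypothetical helper, not code from Podman or its OCI runtime: it tries a fresh sysfs first and falls back to a recursive bind mount of /sys, mirroring the behaviour described above.

```shell
#!/bin/sh
# Hypothetical helper (not Podman's actual code): inside a new user and mount
# namespace, try to mount a fresh sysfs on a scratch directory; if the kernel
# refuses, fall back to a recursive bind mount of the existing /sys.
# Extra arguments (e.g. --net) are passed through to unshare(1).
try_sys_mount() {
    dir=$(mktemp -d) || return 1
    unshare --user --map-root-user --mount "$@" sh -s "$dir" 2>/dev/null <<'EOF' || echo "unshare failed"
if mount -t sysfs sysfs "$1" 2>/dev/null; then
    echo "fresh sysfs"
elif mount --rbind /sys "$1" 2>/dev/null; then
    echo "bind mount of /sys"
else
    echo "no /sys mount possible"
fi
EOF
    # The mounts vanish with the namespace; only the empty directory remains.
    rmdir "$dir"
}

# Same network namespace as the caller: the new user namespace does not own it.
echo "same network namespace: $(try_sys_mount)"
# New network namespace, owned by the new user namespace.
echo "new network namespace:  $(try_sys_mount --net)"
```

On a typical host, the first call would be expected to fall back to the bind mount and the second to get a fresh sysfs; results may differ inside containers (where /sys is often partly masked) or on systems that restrict user namespaces.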

Cross-Namespace Bind Mounts

When a bind mount crosses two user namespaces, the kernel automatically ‘locks’ the new mount together with its submounts and treats them as a single unit. As a result, the inner container cannot unmount the /sys/fs/cgroup mount, because it is considered part of the /sys mount itself.
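The locking can be observed with a similar sketch, under the same assumptions (util-linux tools, unprivileged user namespaces enabled). check_locked_umount is a hypothetical helper: it recursively bind mounts /sys on a scratch directory inside a new user and mount namespace, then tries to unmount the cgroup submount that came along with it. Since those mounts originate from a more privileged namespace, the umount is expected to fail.

```shell
#!/bin/sh
# Hypothetical helper: rbind /sys inside a new user + mount namespace and try
# to remove the cgroup submount. Locked mounts cannot be unmounted on their
# own, so "umount failed" is the expected result on a cgroup host.
check_locked_umount() {
    dir=$(mktemp -d) || return 1
    unshare --user --map-root-user --mount sh -s "$dir" 2>/dev/null <<'EOF' || echo "setup failed"
mount --rbind /sys "$1" || exit 1
if umount "$1/fs/cgroup" 2>/dev/null; then
    echo "unmounted"
else
    echo "umount failed"
fi
EOF
    rmdir "$dir"
}

check_locked_umount
```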

New cgroup mount

The /sys/fs/cgroup mount embedded in this /sys mount is the host's cgroup mount, not one that belongs to the container. The container therefore needs a fresh /sys/fs/cgroup mount, and since the embedded one cannot be removed, the new mount is placed on top of it.
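The stacking can be reproduced by hand. This is a minimal sketch, assuming a cgroup v2 host, util-linux unshare(1) with --cgroup, and unprivileged user namespaces; stack_cgroup_mount is a hypothetical helper. It creates a new cgroup namespace owned by the new user namespace, which is expected to permit the unprivileged cgroup2 mount, and counts the mounts on /sys/fs/cgroup before and after (field 5 of /proc/self/mountinfo is the mount point).

```shell
#!/bin/sh
# Hypothetical helper: mount a fresh cgroup2 on top of the inherited, locked
# /sys/fs/cgroup inside new user, mount and cgroup namespaces, and count how
# many mounts are stacked on that path before and after.
stack_cgroup_mount() {
    unshare --user --map-root-user --mount --cgroup sh -s 2>/dev/null <<'EOF' || echo "cgroup2 mount failed"
# Field 5 of mountinfo is the mount point.
count() { awk '$5 == "/sys/fs/cgroup"' /proc/self/mountinfo | wc -l; }
echo "before: $(count)"
mount -t cgroup2 cgroup2 /sys/fs/cgroup || exit 1
echo "after: $(count)"
EOF
}

stack_cgroup_mount
```

If the mount succeeds, the "after" count should be one higher than the "before" count, matching the two entries seen in the container.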

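As a result, two /sys/fs/cgroup mounts appear inside the container, as the following example shows. Here a rootless Podman runs inside a Podman container and starts the inner container with --network=host, so no new network namespace is created: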

```
$ podman run --rm -ti --user podman quay.io/podman/stable podman run --rm --network=host fedora findmnt -R /sys
Resolved "fedora" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull registry.fedoraproject.org/fedora:latest...
Getting image source signatures
Copying blob 718a00fe3212 done   | 
Copying config 368a084ba1 done   | 
Writing manifest to image destination
TARGET                            SOURCE  FSTYPE  OPTIONS
/sys                              sysfs   sysfs   ro,nosuid,nodev,noexec,relatime,seclabel
|-/sys/fs/cgroup                  cgroup2 cgroup2 ro,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot
| `-/sys/fs/cgroup                cgroup2 cgroup2 ro,nosuid,nodev,noexec,relatime,seclabel,nsdelegate,memory_recursiveprot
|-/sys/firmware                   tmpfs   tmpfs   ro,relatime,context="system_u:object_r:container_file_t:s0:c210,c329",size=0k,uid=1000,gid=1000,inode64
| `-/sys/firmware                 tmpfs   tmpfs   ro,relatime,seclabel,size=0k,uid=100999,gid=100999,inode64
|-/sys/fs/selinux                 tmpfs   tmpfs   ro,relatime,context="system_u:object_r:container_file_t:s0:c210,c329",size=0k,uid=1000,gid=1000,inode64
| `-/sys/fs/selinux               tmpfs   tmpfs   ro,relatime,seclabel,size=0k,uid=100999,gid=100999,inode64
|-/sys/dev/block                  tmpfs   tmpfs   ro,relatime,context="system_u:object_r:container_file_t:s0:c210,c329",size=0k,uid=1000,gid=1000,inode64
| `-/sys/dev/block                tmpfs   tmpfs   ro,relatime,seclabel,size=0k,uid=100999,gid=100999,inode64
|-/sys/devices/virtual/powercap   tmpfs   tmpfs   ro,relatime,context="system_u:object_r:container_file_t:s0:c210,c329",size=0k,uid=1000,gid=1000,inode64
| `-/sys/devices/virtual/powercap tmpfs   tmpfs   ro,relatime,seclabel,size=0k,uid=100999,gid=100999,inode64
`-/sys/kernel                     tmpfs   tmpfs   ro,relatime,seclabel,size=0k,uid=100999,gid=100999,inode64
```
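To spot stacked mounts without reading the tree by eye, a small helper can count how often each mount point appears in /proc/self/mountinfo, where the fifth field is the mount point. stacked_mounts is a hypothetical name, and the sketch assumes awk and sort are available in the container image. Run inside the inner container of the example, it would be expected to report /sys/fs/cgroup together with the other paths listed twice in the output above, such as /sys/firmware.

```shell
#!/bin/sh
# Hypothetical helper: print "<count> <mount point>" for every mount point that
# appears more than once in a mountinfo-format file, i.e. mounts stacked on
# top of each other. Defaults to /proc/self/mountinfo; field 5 is the mount point.
stacked_mounts() {
    awk '{ n[$5]++ } END { for (m in n) if (n[m] > 1) print n[m], m }' \
        "${1:-/proc/self/mountinfo}" | sort -k2
}

stacked_mounts
```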