composefs - a file system for container images

For the last couple of weeks, I've been playing with a PoC implementation of a file system for the Linux kernel.

The work I was doing on zstd:chunked is already merged in the upstream tools and usable, although it is still an experimental feature that must be manually enabled and images must be built with the new format to be usable.

The zstd:chunked format is based on the same idea as stargz, but it uses zstd compression instead of gzip. I'll skip the details as it deserves a blog post of its own, but the overall idea is to move from per-layer deduplication to per-file deduplication in the containers storage.

Today, when you run podman pull $IMAGE or docker pull $IMAGE, the container engine first fetches the image metadata, which contains a list of OCI layers. It then pulls the layers that are not already present locally: for each missing layer, it downloads a tarball and extracts it locally.

The current model makes perfect sense with the overlay file system since deduplication happens at a layer level. One layer is either present or missing, and when it is not present, it must be pulled from the registry and extracted locally.

The overlay backend became the de facto standard for container images because it fits naturally with the OCI container image format.

The existing model shows its limits once we move to a per-file deduplication model.

A directory checkout must still be fully created to fit into the existing “overlay” scheme. So no matter how many files are already present locally, we still need to recreate the directory checkout as overlay expects it, which mirrors how it is stored in the OCI layer tarballs.

Data deduplication doesn't fit well into this scheme, and the container engine must handle it at a different level, using one of:

- reflinks, which some file systems such as Btrfs and XFS support, and which allow sharing data among different inodes.
- the venerable hard links, which are supported on every POSIX file system.

On a related note, reading the release notes for Coreutils 9.0, released last month, I was reminded of the "cp: accept the --reflink option" patch that I added more than 12 years ago! I've been obsessed with such things for quite a while! The new Coreutils version switched the default to use --reflink where supported. You are likely already using reflinks each time you cp a file without even realizing it.

Each of these solutions has its own set of problems:

hard links:
- Since they share an inode, deduplication can happen only when the file content is the same (obviously!) as well as the entire inode metadata, so stuff like UID, GID, mode, and extended attributes must all be the same before a file can be deduplicated in a transparent way.
- They can never be really transparent since the number of hard links is reflected in the st_nlink attribute of the inode.
reflinks:
- not all file systems support them.
- they are effectively different inodes, so while the data is deduplicated on disk, when the files are used, even if reflinked, they are loaded in memory multiple times.
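
The st_nlink side effect of hard links is easy to observe from user space. The following is a minimal sketch, not part of the PoC: the helper name hardlink_demo and the struct hardlink_report are made up for illustration, and it assumes a writable /tmp on a file system that supports hard links.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical report type: link counts before and after link(2), and
   whether both names resolve to the same inode.  */
struct hardlink_report {
	nlink_t nlink_before;
	nlink_t nlink_after;
	int same_inode;
};

/* Create a file in a fresh temporary directory, hard link it, and record
   how st_nlink changes.  Returns 0 on success, -1 on error.  */
static int hardlink_demo(struct hardlink_report *out)
{
	char dir[] = "/tmp/hardlink-demo-XXXXXX";
	char a[64], b[64];
	struct stat st_a, st_b;
	int fd, ret = -1;

	if (mkdtemp(dir) == NULL)
		return -1;
	snprintf(a, sizeof(a), "%s/a", dir);
	snprintf(b, sizeof(b), "%s/b", dir);

	fd = open(a, O_CREAT | O_EXCL | O_WRONLY, 0644);
	if (fd < 0)
		goto out_rmdir;
	close(fd);

	if (stat(a, &st_a) < 0)
		goto out_unlink_a;
	out->nlink_before = st_a.st_nlink;

	/* Both names now refer to one inode, so metadata is shared too.  */
	if (link(a, b) < 0)
		goto out_unlink_a;

	if (stat(a, &st_a) == 0 && stat(b, &st_b) == 0) {
		out->nlink_after = st_a.st_nlink;
		out->same_inode = st_a.st_dev == st_b.st_dev &&
				  st_a.st_ino == st_b.st_ino;
		ret = 0;
	}
	unlink(b);
out_unlink_a:
	unlink(a);
out_rmdir:
	rmdir(dir);
	return ret;
}
```

Calling hardlink_demo() and comparing nlink_before with nlink_after shows the count a container would see change as soon as the storage deduplicates a file behind its back.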

How can this be improved?

My proposal is to create a new file system for the Linux kernel that creates an overlay of files instead of directories; the file system basically composes a view of the files that are present in the read-only container image: composefs.

When working with directories, it is reasonable to expect just a few of them and that their paths can fit in a memory page (4096 bytes) so they can be passed down to the kernel as the mount options, but that is not possible when working with individual files. Some sort of metadata file must be passed down to the kernel.

For my PoC, I've used a very simple format to describe a file system view:

#define LCFS_VERSION 1

#define LCFS_USE_TIMESPEC 0

typedef u32 lcfs_off_t;

typedef lcfs_off_t lcfs_c_str_t;

struct lcfs_vdata_s {
	lcfs_off_t off;
	lcfs_off_t len;
} __attribute__((packed));

struct lcfs_header_s {
	u8 version;
	u8 unused1;
	u16 unused2;
	u32 unused3;
} __attribute__((packed));

struct lcfs_inode_data_s {
	u32 st_mode; /* File type and mode.  */
	u32 st_nlink; /* Number of hard links.  */
	u32 st_uid; /* User ID of owner.  */
	u32 st_gid; /* Group ID of owner.  */
	u32 st_rdev; /* Device ID (if special file).  */
} __attribute__((packed));

struct lcfs_inode_s {
	/* Index of struct lcfs_inode_data_s. */
	lcfs_off_t inode_data_index;

	/* stat data.  */
	union {
		/* Offset and length to the content of the directory.  */
		struct {
			lcfs_off_t off;
			lcfs_off_t len;
		} dir;

		struct {
			/* Total size, in bytes.  */
			u64 st_size;
			lcfs_c_str_t payload;
		} file;
	} u;

#if LCFS_USE_TIMESPEC
	struct timespec st_mtim; /* Time of last modification.  */
	struct timespec st_ctim; /* Time of last status change.  */
#else
	u64 st_mtim; /* Time of last modification.  */
	u64 st_ctim; /* Time of last status change.  */
#endif

	/* Variable len data.  */
	struct lcfs_vdata_s xattrs;
} __attribute__((packed));

struct lcfs_dentry_s {
	/* Index of struct lcfs_inode_s */
	lcfs_off_t inode_index;

	/* Variable len data.  */
	lcfs_c_str_t name;

} __attribute__((packed));

/* xattr representation.  */
struct lcfs_xattr_header_s {
	struct lcfs_vdata_s key;
	struct lcfs_vdata_s value;
} __attribute__((packed));

The format is still under development, but so far, it seems enough to work with container images.
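
To give an idea of how a consumer might sanity-check such a blob before trusting it, here is a rough user-space sketch. It makes assumptions the format description above does not state: that struct lcfs_header_s sits at the very start of the blob, and that a struct lcfs_vdata_s is an offset and length relative to some data region of known size. The helpers lcfs_check_header and lcfs_vdata_in_bounds are hypothetical names, not functions from the PoC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* User-space stand-ins for the kernel integer types used by the format.  */
typedef uint8_t u8;
typedef uint16_t u16;
typedef uint32_t u32;

#define LCFS_VERSION 1

typedef u32 lcfs_off_t;

struct lcfs_vdata_s {
	lcfs_off_t off;
	lcfs_off_t len;
} __attribute__((packed));

struct lcfs_header_s {
	u8 version;
	u8 unused1;
	u16 unused2;
	u32 unused3;
} __attribute__((packed));

/* Hypothetical helper: accept a blob only if it is large enough to hold
   a header and the version matches.  Assumes the header is at offset 0.
   Returns 0 if the header looks valid, -1 otherwise.  */
static int lcfs_check_header(const void *blob, size_t size)
{
	struct lcfs_header_s header;

	if (size < sizeof(header))
		return -1;
	/* Copy instead of casting, so an unaligned blob is not a problem.  */
	memcpy(&header, blob, sizeof(header));
	return header.version == LCFS_VERSION ? 0 : -1;
}

/* Hypothetical helper: return 1 if [off, off + len) lies entirely inside
   a region of region_size bytes, 0 otherwise.  Written so that
   off + len is never computed and therefore cannot overflow.  */
static int lcfs_vdata_in_bounds(const struct lcfs_vdata_s *v,
				size_t region_size)
{
	return v->off <= region_size && v->len <= region_size - v->off;
}
```

Since the blob comes from user space, every offset stored in it would need this kind of bounds check before being dereferenced in the kernel.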

The idea is that the container engine creates a description of what the file system must look like, then passes this blob down to the kernel. The payload for files is not stored as part of the file system description blob; instead, the blob provides a path to a file that contains the real payload.

The payload mechanism is closer to how symlinks work; for this reason, composefs supports deduplication across several file systems, something that is not possible with either reflinks or hard links.

Memory page sharing works across several images as long as they use the same files for the payload.

A directory checkout does not need to be created. This saves inodes on the underlying file system, and it makes containers start up faster.

One downside is that there will be more inodes in memory in the kernel since now we have an extra layer (storage file system / composefs / overlay for the writeable layer). Perhaps this can be solved by somehow combining composefs and overlay? Or by making composefs part of the overlay mount itself?

Future plans

Assuming the idea is accepted upstream, the long-term plan is to move to a containers storage model that looks more like OSTree, where each file, even if it appears in different commits, is stored only once in the repository.

There are several advantages. For example, when pulling an image, the container engine can easily look up which files are already present locally with a simple access(2) syscall (e.g., checking whether the $REPOSITORY/CHECKSUM[0:2]/CHECKSUM[2:] file exists) and create only the missing ones. At mount time, creating the description blob is also trivial since the payload path is known without any access to the file system.
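
The lookup could be sketched roughly as follows. This is only an illustration of the $REPOSITORY/CHECKSUM[0:2]/CHECKSUM[2:] layout mentioned above: the function names object_path and object_exists are invented here, and the checksum is assumed to be a plain hexadecimal string.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: build "$REPOSITORY/CHECKSUM[0:2]/CHECKSUM[2:]"
   into buf.  Returns 0 on success, -1 if the checksum is too short to be
   split or the result does not fit in buf.  */
static int object_path(char *buf, size_t size, const char *repository,
		       const char *checksum)
{
	int n;

	if (strlen(checksum) < 3)
		return -1;
	/* "%.2s" prints only the first two characters of the checksum.  */
	n = snprintf(buf, size, "%s/%.2s/%s", repository, checksum,
		     checksum + 2);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Hypothetical helper: return 1 if the object is already present in the
   repository, 0 if it is missing, -1 if the path cannot be built.  */
static int object_exists(const char *repository, const char *checksum)
{
	char path[4096];

	if (object_path(path, sizeof(path), repository, checksum) < 0)
		return -1;
	return access(path, F_OK) == 0;
}
```

Because the path depends only on the checksum, the same object_path() call can be used at mount time to fill in the payload field without touching the disk.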

In the best-case scenario, where all the files for an image are already present in the local storage, pulling and extracting an image can be done in a few milliseconds instead of the several seconds or minutes it takes today.