# the journey to speed up running OCI containers

When I started working on crun, I was looking for a faster way to start and stop containers by improving the OCI runtime, the component in the OCI stack that is ultimately responsible for talking to the kernel and setting up the environment where the container runs.

The OCI runtime runs for a very limited time, and its job consists mostly of executing a series of syscalls that map directly to the OCI configuration file.

I was surprised to find out that such a trivial task could take so long.

DISCLAIMER: for my tests I've used the default kernels and libraries available in the Fedora installation. In addition to the fixes described in this blog post, there could have been other changes over these years that affected the overall performance.

The version of crun used for all the tests below is the same.

For all the tests, I've used hyperfine, installed through cargo.

# How things were in 2017

To check how far we've come from what we had in the past, we need to travel back to 2017, or just install an old Fedora image. For the tests below I've used Fedora 24, which was based on Linux kernel 4.5.5.

On a freshly installed Fedora 24 with crun built from the main branch:

```
# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
Time (mean ± σ):     159.2 ms ±  21.8 ms    [User: 43.0 ms, System: 16.3 ms]
Range (min … max):    73.9 ms … 194.9 ms    39 runs
```


160ms is a lot, and to the best of my memory it is similar to what I was observing five years ago.

Profiling the OCI runtime immediately showed that most of the user time was spent by libseccomp compiling the seccomp filter.

To verify that, let's try running a container with the same configuration but without the seccomp profile:

```
# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
Time (mean ± σ):     139.6 ms ±  20.8 ms    [User: 4.1 ms, System: 22.0 ms]
Range (min … max):    61.8 ms … 177.0 ms    47 runs
```


We now use about a tenth of the user time we needed before, and the overall time also improved!

So there are two main problems: 1) the system time is quite high, and 2) the user time is dominated by libseccomp. We need to tackle both of them.

Let's concentrate on the system time for now; we will get back to seccomp later.

There are a few culprits responsible for most of the time spent in the kernel.

# System Time

## Create and destroy a network namespace

Creating and destroying a network namespace used to be very expensive. The issue can be reproduced using just the unshare tool; on Fedora 24 I get:

```
# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
Time (mean ± σ):      47.7 ms ±  51.4 ms    [User: 0.6 ms, System: 3.2 ms]
Range (min … max):     0.0 ms … 190.5 ms    365 runs
```


That is a lot of time!

I attempted to fix it in the kernel and suggested a patch. Florian Westphal rewrote it as a series in a much better way, and it was merged into the Linux kernel:

```
commit 8c873e2199700c2de7dbd5eedb9d90d5f109462b
Author: Florian Westphal <[email protected]>
Date:   Fri Dec 1 00:21:04 2017 +0100

netfilter: core: free hooks with call_rcu

Giuseppe Scrivano says:
"SELinux, if enabled, registers for each new network namespace 6
netfilter hooks."

Cost for this is high.  With synchronize_net() removed:
"The net benefit on an SMP machine with two cores is that creating a
new network namespace takes -40% of the original time."

This patch replaces synchronize_net+kvfree with call_rcu().
We store rcu_head at the tail of a structure that has no fixed layout,
i.e. we cannot use offsetof() to compute the start of the original
allocation.  Thus store this information right after the rcu head.

We could simplify this by just placing the rcu_head at the start
of struct nf_hook_entries.  However, this structure is used in
packet processing hotpath, so only place what is needed for that
at the beginning of the struct.

Reported-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

commit 26888dfd7e7454686b8d3ea9ba5045d5f236e4d7
Author: Florian Westphal <[email protected]>
Date:   Fri Dec 1 00:21:03 2017 +0100

netfilter: core: remove synchronize_net call if nfqueue is used

since commit 960632ece6949b ("netfilter: convert hook list to an array")
nfqueue no longer stores a pointer to the hook that caused the packet
to be queued.  Therefore no extra synchronize_net() call is needed after
dropping the packets enqueued by the old rule blob.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>

Author: Florian Westphal <[email protected]>
Date:   Fri Dec 1 00:21:02 2017 +0100

netfilter: core: make nf_unregister_net_hooks simple wrapper again

("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").

Nothing wrong with it.  However, followup patch will delay freeing of hooks
with call_rcu, so all synchronize_net() calls become obsolete and there
is no need anymore for this batching.

This revert causes a temporary performance degradation when destroying
network namespace, but its resolved with the upcoming call_rcu conversion.

Signed-off-by: Florian Westphal <[email protected]>
Signed-off-by: Pablo Neira Ayuso <[email protected]>
```


These patches make a huge difference: on a modern 5.19.15 kernel, the time to create and destroy a network namespace has dropped to a ridiculously small amount:

```
# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
Time (mean ± σ):       1.5 ms ±   0.5 ms    [User: 0.3 ms, System: 1.3 ms]
Range (min … max):     0.8 ms …   6.7 ms    1907 runs
```


## Mounting mqueue

Mounting mqueue was also a relatively expensive operation.

On Fedora 24, it looked like this:

```
# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
Time (mean ± σ):      16.8 ms ±   3.1 ms    [User: 2.6 ms, System: 5.0 ms]
Range (min … max):     9.3 ms …  26.8 ms    261 runs
```


In this case as well, I tried to fix it and proposed a patch. It was not accepted, but Al Viro came up with a better version to fix the issue:

```
commit 36735a6a2b5e042db1af956ce4bcc13f3ff99e21
Author: Al Viro <[email protected]>
Date:   Mon Dec 25 19:43:35 2017 -0500

mqueue: switch to on-demand creation of internal mount

Instead of doing that upon each ipcns creation, we do that the first
time mq_open(2) or mqueue mount is done in an ipcns.  What's more,
doing that allows to get rid of mount_ns() use - we can go with
considerably cheaper mount_nodev(), avoiding the loop over all
mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting
instance in O(1) time instead of O(instances) mount_ns() would've
cost us.

Based upon the version by Giuseppe Scrivano <[email protected]>; I've
that area) and added a switch to mount_nodev().

Signed-off-by: Al Viro <[email protected]>
```


After this patch, the cost of creating a mqueue mount dropped as well:

```
# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
Time (mean ± σ):       0.7 ms ±   0.5 ms    [User: 0.5 ms, System: 0.6 ms]
Range (min … max):     0.0 ms …   3.1 ms    772 runs
```


## Create and destroy an IPC namespace

I procrastinated on container startup time for a couple of years, and got back to it at the beginning of 2020. Another issue I was aware of was the time needed to create and destroy an IPC namespace.

As with the network namespace, the issue can be reproduced using only the unshare tool:

```
# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
Time (mean ± σ):      10.9 ms ±   2.1 ms    [User: 0.5 ms, System: 1.0 ms]
Range (min … max):     4.2 ms …  17.2 ms    310 runs
```


Unlike the previous two attempts, this time the version I sent was accepted upstream:

```
commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
Author: Giuseppe Scrivano <[email protected]>
Date:   Sun Jun 7 21:40:10 2020 -0700

ipc/namespace.c: use a work queue to free_ipc

the reason is to avoid a delay caused by the synchronize_rcu() call in
kern_umount() when the mqueue mount is freed.

the code:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <error.h>
    #include <errno.h>
    #include <stdlib.h>

    int main()
    {
        int i;

        for (i = 0; i < 1000; i++)
            if (unshare(CLONE_NEWIPC) < 0)
                error(EXIT_FAILURE, errno, "unshare");
    }

goes from

    Command being timed: "./ipc-namespace"
    User time (seconds): 0.00
    System time (seconds): 0.06
    Percent of CPU this job got: 0%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.05

to

    Command being timed: "./ipc-namespace"
    User time (seconds): 0.00
    System time (seconds): 0.02
    Percent of CPU this job got: 96%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03

Signed-off-by: Giuseppe Scrivano <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Paul E. McKenney <[email protected]>
Reviewed-by: Waiman Long <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Manfred Spraul <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
```


With this patch in place, the time to create and destroy an IPC namespace dropped significantly, as outlined in the commit message. On a modern 5.19.15 kernel I now get:

```
# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
Time (mean ± σ):       0.1 ms ±   0.2 ms    [User: 0.2 ms, System: 0.4 ms]
Range (min … max):     0.0 ms …   1.5 ms    1966 runs
```


# User Time

Kernel time seems under control now. What can we do to reduce the user time?

As we already found out earlier, libseccomp is the main culprit here, so we need to tackle it first; this work happened just after the IPC fix in the kernel.

Most of the cost with libseccomp is due to the syscall lookup code. The OCI configuration file contains a list of syscalls by name; each of these syscalls is looked up through the seccomp_syscall_resolve_name function, which returns the syscall number given the syscall name.

libseccomp used to perform a linear search through the syscall table for each syscall name, e.g. for x86_64 it looked like this:

```c
/* NOTE: based on Linux v5.4-rc4 */
const struct arch_syscall_def x86_64_syscall_table[] = {
	{ "_llseek", __PNR__llseek },
	{ "_newselect", __PNR__newselect },
	{ "_sysctl", 156 },
	{ "accept", 43 },
	{ "accept4", 288 },
	{ "access", 21 },
	{ "acct", 163 },
	.....
};

int x86_64_syscall_resolve_name(const char *name)
{
	unsigned int iter;
	const struct arch_syscall_def *table = x86_64_syscall_table;

	/* XXX - plenty of room for future improvement here */
	for (iter = 0; table[iter].name != NULL; iter++) {
		if (strcmp(name, table[iter].name) == 0)
			return table[iter].num;
	}

	return __NR_SCMP_ERROR;
}
```


Building up the seccomp profile through libseccomp therefore had a complexity of O(n*m), where n is the number of syscalls in the profile and m is the number of syscalls known to libseccomp.

I followed the advice in the code comment and spent some time trying to fix it. In January 2020, I worked on a patch for libseccomp that solves the issue by using a perfect hash function to look up syscall names.

The patch for libseccomp is this one:

```
commit 9b129c41ac1f43d373742697aa2faf6040b9dfab
Author: Giuseppe Scrivano <[email protected]>
Date:   Thu Jan 23 17:01:39 2020 +0100

arch: use gperf to generate a perfact hash to lookup syscall names

This patch significantly improves the performance of
seccomp_syscall_resolve_name since it replaces the expensive strcmp
for each syscall in the database, with a lookup table.

The complexity for syscall_resolve_num is not changed and it
uses the linear search, that is anyway less expensive than
seccomp_syscall_resolve_name as it uses an index for comparison
instead of doing a string comparison.

On my machine, calling 1000 seccomp_syscall_resolve_name_arch and
seccomp_syscall_resolve_num_arch over the entire syscalls DB passed
from ~0.45 sec to ~0.06s.

changes, some substantial, the highlights include:
* various style tweaks
* .gitignore fixes
* fixed subject line, tweaked the description
* dropped the arch-syscall-validate changes as they were masking
other problems
* extracted the syscalls.csv and file deletions to other patches
to keep this one more focused
* fixed the x86, x32, arm, all the MIPS ABIs, s390, and s390x ABIs as
the syscall offsets were not properly incorporated into this change
* cleaned up the ABI specific headers
* cleaned up generate_syscalls_perf.sh and renamed to
arch-gperf-generate
* fixed problems with automake's file packaging

Signed-off-by: Giuseppe Scrivano <[email protected]>
Reviewed-by: Tom Hromatka <[email protected]>
[PM: see notes in the "PM" section above]
Signed-off-by: Paul Moore <[email protected]>
```


That patch has been merged and released; building the seccomp profile now has complexity O(n), where n is the number of syscalls in the profile.

The improvement is significant, with a new enough libseccomp:

```
# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
Time (mean ± σ):      28.9 ms ±   5.9 ms    [User: 16.7 ms, System: 4.5 ms]
Range (min … max):    19.1 ms …  41.6 ms    73 runs
```


The user time is just 16.7ms. It used to be more than 40ms, and it is around 4ms when seccomp is not used at all.

So using 4.1ms as the user time cost without seccomp, we have:

```
time_used_by_seccomp_before = 43.0ms - 4.1ms = 38.9ms
time_used_by_seccomp_after  = 16.7ms - 4.1ms = 12.6ms
```

More than 3x faster! And the syscall lookup is only a part of what libseccomp does; another considerable amount of time is spent compiling the BPF filter.

# BPF filter compilation

Can we do even better than that?

The BPF filter compilation is done by the seccomp_export_bpf function, and it is still quite expensive.

One simple observation is that most containers are reusing the same seccomp profile over and over, with little customizations happening.

So it'd make sense to cache the result of the compilation, and reuse it when possible.

There is a new crun feature to cache the result of the BPF filter compilation. The patch is not merged at the time of writing, although it is almost at the finish line.

With that in place, the cost of compiling the seccomp profile is paid only when the generated BPF filter is not already in the cache. This is what we have now:

```
# hyperfine 'crun-from-the-future run foo'
Benchmark 1: 'crun-from-the-future run foo'
Time (mean ± σ):       5.6 ms ±   3.0 ms    [User: 1.0 ms, System: 4.5 ms]
Range (min … max):     4.2 ms …  26.8 ms    101 runs
```


# Conclusion

Over five years, the total time needed to create and destroy an OCI container has gone from almost 160ms to little more than 5ms.

That is almost a 30x improvement!