
How I decreased the time to create and destroy an OCI container to 5 milliseconds

The journey to speed up running OCI containers took longer than expected, but the effort was worth it.

When I started working on crun, an Open Container Initiative (OCI)-compliant alternative to runc for Linux container runtimes in 2017, I was looking for a faster way to start and stop containers. I was working to improve the OCI runtime, which is the component in the OCI stack responsible for talking to the kernel and setting the environment where the container runs.


The OCI runtime runs for a very limited time, and its job consists mostly of executing a series of syscalls that map directly to the OCI configuration file.

I was surprised to find that such a trivial task could take so long.

DISCLAIMER: For my tests, I used the default kernels and libraries available in the Fedora installation. In addition to the fixes described in this blog post, there could be other changes that affect the overall performance.

I used the same version of crun for all the tests below.

For benchmarking, I used hyperfine, installed through Cargo.

How things were in 2017

To check how far we have come, you'd need to time travel back to 2017 (or just install an old Fedora image). For the tests below, I used Fedora 24, based on the Linux kernel 4.5.5.

On a freshly installed Fedora 24 with crun built from the main branch, I observed these benchmarks:

# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
  Time (mean ± σ):     159.2 ms ±  21.8 ms    [User: 43.0 ms, System: 16.3 ms]
  Range (min … max):    73.9 ms … 194.9 ms    39 runs

160ms is a lot, and I recall that it is similar to what I observed five years ago.

I profiled the OCI runtime, and the profile immediately showed that most of the user time was spent by libseccomp compiling the seccomp filter.

To verify this, I ran a container with the same configuration but without the seccomp profile:

# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
  Time (mean ± σ):     139.6 ms ±  20.8 ms    [User: 4.1 ms, System: 22.0 ms]
  Range (min … max):    61.8 ms … 177.0 ms    47 runs

It used roughly a tenth of the user time needed before, and the overall time improved as well!

So there are two problems: 1) system time is quite high, and 2) libseccomp dominates user time. I needed to tackle both of them.

Reducing system time

Very few culprits are responsible for most of the time wasted in the kernel. I'll work on the system time first and return to seccomp later.

Create and destroy a network namespace

Creating and destroying a network namespace used to be very expensive. I can reproduce the issue by using the unshare tool. On Fedora 24, I get the following result:

# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
  Time (mean ± σ):      47.7 ms ±  51.4 ms    [User: 0.6 ms, System: 3.2 ms]
  Range (min … max):     0.0 ms … 190.5 ms    365 runs

That is a lot of time!

I attempted to fix it in the kernel and suggested a patch. Florian Westphal rewrote it as a series in a much better way, and it was merged into the Linux kernel:

commit 8c873e2199700c2de7dbd5eedb9d90d5f109462b
Author: Florian Westphal <fw@strlen.de>
Date:   Fri Dec 1 00:21:04 2017 +0100

    netfilter: core: free hooks with call_rcu
    
    Giuseppe Scrivano says:
      "SELinux, if enabled, registers for each new network namespace 6
        netfilter hooks."
    
    Cost for this is high.  With synchronize_net() removed:
       "The net benefit on an SMP machine with two cores is that creating a
       new network namespace takes -40% of the original time."
    
    This patch replaces synchronize_net+kvfree with call_rcu().
    We store rcu_head at the tail of a structure that has no fixed layout,
    i.e. we cannot use offsetof() to compute the start of the original
    allocation.  Thus store this information right after the rcu head.
    
    We could simplify this by just placing the rcu_head at the start
    of struct nf_hook_entries.  However, this structure is used in
    packet processing hotpath, so only place what is needed for that
    at the beginning of the struct.
    
    Reported-by: Giuseppe Scrivano <gscrivan@redhat.com>
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit 26888dfd7e7454686b8d3ea9ba5045d5f236e4d7
Author: Florian Westphal <fw@strlen.de>
Date:   Fri Dec 1 00:21:03 2017 +0100

    netfilter: core: remove synchronize_net call if nfqueue is used
    
    since commit 960632ece6949b ("netfilter: convert hook list to an array")
    nfqueue no longer stores a pointer to the hook that caused the packet
    to be queued.  Therefore no extra synchronize_net() call is needed after
    dropping the packets enqueued by the old rule blob.
    
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

commit 4e645b47c4f000a503b9c90163ad905786b9bc1d
Author: Florian Westphal <fw@strlen.de>
Date:   Fri Dec 1 00:21:02 2017 +0100

    netfilter: core: make nf_unregister_net_hooks simple wrapper again
    
    This reverts commit d3ad2c17b4047
    ("netfilter: core: batch nf_unregister_net_hooks synchronize_net calls").
    
    Nothing wrong with it.  However, followup patch will delay freeing of hooks
    with call_rcu, so all synchronize_net() calls become obsolete and there
    is no need anymore for this batching.
    
    This revert causes a temporary performance degradation when destroying
    network namespace, but its resolved with the upcoming call_rcu conversion.
    
    Signed-off-by: Florian Westphal <fw@strlen.de>
    Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

These patches make a huge difference. The time to create and destroy a network namespace on a modern 5.19.15 kernel dropped to a negligible amount:

# hyperfine 'unshare -n true'
Benchmark 1: 'unshare -n true'
  Time (mean ± σ):       1.5 ms ±   0.5 ms    [User: 0.3 ms, System: 1.3 ms]
  Range (min … max):     0.8 ms …   6.7 ms    1907 runs


Mounting mqueue

Mounting mqueue also used to be a relatively expensive operation.

On Fedora 24, it used to be like this:

# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
  Time (mean ± σ):      16.8 ms ±   3.1 ms    [User: 2.6 ms, System: 5.0 ms]
  Range (min … max):     9.3 ms …  26.8 ms    261 runs

I also tried to fix this issue and proposed a patch. It was not accepted, but Al Viro came up with a better version to fix the problem:

commit 36735a6a2b5e042db1af956ce4bcc13f3ff99e21
Author: Al Viro <viro@zeniv.linux.org.uk>
Date:   Mon Dec 25 19:43:35 2017 -0500

    mqueue: switch to on-demand creation of internal mount
    
    Instead of doing that upon each ipcns creation, we do that the first
    time mq_open(2) or mqueue mount is done in an ipcns.  What's more,
    doing that allows to get rid of mount_ns() use - we can go with
    considerably cheaper mount_nodev(), avoiding the loop over all
    mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting
    instance in O(1) time instead of O(instances) mount_ns() would've
    cost us.
    
    Based upon the version by Giuseppe Scrivano <gscrivan@redhat.com>; I've
    added handling of userland mqueue mounts (original had been broken in
    that area) and added a switch to mount_nodev().
    
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

After this patch, the cost to create a mqueue mount dropped as well:

# mkdir /tmp/mqueue; hyperfine 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'; rmdir /tmp/mqueue
Benchmark 1: 'unshare --propagation=private -m mount -t mqueue mqueue /tmp/mqueue'
  Time (mean ± σ):       0.7 ms ±   0.5 ms    [User: 0.5 ms, System: 0.6 ms]
  Range (min … max):     0.0 ms …   3.1 ms    772 runs

Create and destroy an IPC namespace

I put off working on container startup time for a couple of years, and got back to it at the beginning of 2020. Another issue was the time to create and destroy an IPC namespace.

As with the network namespace, you can reproduce the issue using the unshare tool:

# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
  Time (mean ± σ):      10.9 ms ±   2.1 ms    [User: 0.5 ms, System: 1.0 ms]
  Range (min … max):     4.2 ms …  17.2 ms    310 runs

This time, the patch version I sent was accepted upstream:

commit e1eb26fa62d04ec0955432be1aa8722a97cb52e7
Author: Giuseppe Scrivano <gscrivan@redhat.com>
Date:   Sun Jun 7 21:40:10 2020 -0700

    ipc/namespace.c: use a work queue to free_ipc
    
    the reason is to avoid a delay caused by the synchronize_rcu() call in
    kern_umount() when the mqueue mount is freed.
    
    the code:
    
        #define _GNU_SOURCE
        #include <sched.h>
        #include <error.h>
        #include <errno.h>
        #include <stdlib.h>
    
        int main()
        {
            int i;
    
            for (i = 0; i < 1000; i++)
                if (unshare(CLONE_NEWIPC) < 0)
                    error(EXIT_FAILURE, errno, "unshare");
        }
    
    goes from
    
            Command being timed: "./ipc-namespace"
            User time (seconds): 0.00
            System time (seconds): 0.06
            Percent of CPU this job got: 0%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.05
    
    to
    
            Command being timed: "./ipc-namespace"
            User time (seconds): 0.00
            System time (seconds): 0.02
            Percent of CPU this job got: 96%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03
    
    Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
    Reviewed-by: Waiman Long <longman@redhat.com>
    Cc: Davidlohr Bueso <dave@stgolabs.net>
    Cc: Manfred Spraul <manfred@colorfullife.com>
    Link: http://lkml.kernel.org/r/20200225145419.527994-1-gscrivan@redhat.com
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

With this patch in place, the time to create and destroy an IPC namespace has been significantly reduced, as the commit message reports. On a modern 5.19.15 kernel, I now get:

# hyperfine 'unshare -i true'
Benchmark 1: 'unshare -i true'
  Time (mean ± σ):       0.1 ms ±   0.2 ms    [User: 0.2 ms, System: 0.4 ms]
  Range (min … max):     0.0 ms …   1.5 ms    1966 runs

Reducing user time

Kernel time seems under control now. What can I do to reduce the user time?

I found out earlier that libseccomp is the main culprit here, so I tackled it right after the fix for IPC in the kernel.

Most of the cost with libseccomp is due to the syscall lookup code. The OCI configuration file contains a list of syscalls by their name. The seccomp_syscall_resolve_name function call looks up each syscall and returns the syscall number based on the syscall name.

Libseccomp used to perform a linear search through the syscall table for each syscall name. For example, it looked like this for x86_64:

/* NOTE: based on Linux v5.4-rc4 */
const struct arch_syscall_def x86_64_syscall_table[] = {
	{ "_llseek", __PNR__llseek },
	{ "_newselect", __PNR__newselect },
	{ "_sysctl", 156 },
	{ "accept", 43 },
	{ "accept4", 288 },
	{ "access", 21 },
	{ "acct", 163 },
	/* ... */
};

int x86_64_syscall_resolve_name(const char *name)
{
	unsigned int iter;
	const struct arch_syscall_def *table = x86_64_syscall_table;

	/* XXX - plenty of room for future improvement here */
	for (iter = 0; table[iter].name != NULL; iter++) {
		if (strcmp(name, table[iter].name) == 0)
			return table[iter].num;
	}

	return __NR_SCMP_ERROR;
}

Building up the seccomp profile through libseccomp had a complexity of O(n*m), where n is the number of syscalls in the profile and m is the number of syscalls known to libseccomp.

I followed the advice in the code comment and spent some time trying to fix it. In January 2020, I worked on a patch for libseccomp to solve the issue using a perfect hash function to look up syscall names.

Here is the libseccomp patch:

commit 9b129c41ac1f43d373742697aa2faf6040b9dfab
Author: Giuseppe Scrivano <gscrivan@redhat.com>
Date:   Thu Jan 23 17:01:39 2020 +0100

    arch: use gperf to generate a perfact hash to lookup syscall names
    
    This patch significantly improves the performance of
    seccomp_syscall_resolve_name since it replaces the expensive strcmp
    for each syscall in the database, with a lookup table.
    
    The complexity for syscall_resolve_num is not changed and it
    uses the linear search, that is anyway less expensive than
    seccomp_syscall_resolve_name as it uses an index for comparison
    instead of doing a string comparison.
    
    On my machine, calling 1000 seccomp_syscall_resolve_name_arch and
    seccomp_syscall_resolve_num_arch over the entire syscalls DB passed
    from ~0.45 sec to ~0.06s.
    
    PM: After talking with Giuseppe I made a number of additional
    changes, some substantial, the highlights include:
    * various style tweaks
    * .gitignore fixes
    * fixed subject line, tweaked the description
    * dropped the arch-syscall-validate changes as they were masking
      other problems
    * extracted the syscalls.csv and file deletions to other patches
      to keep this one more focused
    * fixed the x86, x32, arm, all the MIPS ABIs, s390, and s390x ABIs as
      the syscall offsets were not properly incorporated into this change
    * cleaned up the ABI specific headers
    * cleaned up generate_syscalls_perf.sh and renamed to
      arch-gperf-generate
    * fixed problems with automake's file packaging
    
    Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
    Reviewed-by: Tom Hromatka <tom.hromatka@oracle.com>
    [PM: see notes in the "PM" section above]
    Signed-off-by: Paul Moore <paul@paul-moore.com>

That patch has been merged and released. Building the seccomp profile now has O(n) complexity, where n is the number of syscalls in the profile.

The improvement is significant, with a new enough libseccomp:

# hyperfine 'crun run foo'
Benchmark 1: 'crun run foo'
  Time (mean ± σ):      28.9 ms ±   5.9 ms    [User: 16.7 ms, System: 4.5 ms]
  Range (min … max):    19.1 ms …  41.6 ms    73 runs

The user time is just 16.7ms. It used to be more than 40ms with seccomp and around 4ms without it.

So, using 4.1ms as the user-time cost without seccomp, I have:

time_used_by_seccomp_before = 43.0ms - 4.1ms = 38.9ms
time_used_by_seccomp_after  = 16.7ms - 4.1ms = 12.6ms

More than three times faster! And syscall lookup is only part of what libseccomp does; it also spends a considerable amount of time compiling the BPF filter.

BPF filter compilation

Can I do even better than that? The seccomp_export_bpf function does BPF filter compilation, which is still quite expensive.

One simple observation is that most containers reuse the same seccomp profile repeatedly, with few customizations. So it makes sense to cache the compilation result and reuse it when possible.


There is a new crun feature to cache the BPF filter compilation's result. The patch is not merged at the time of this writing, although it is almost at the finish line.

With that in place, the cost of compiling the seccomp profile is paid only when the generated BPF filter is not in the cache. This is what I have now:

# hyperfine 'crun-from-the-future run foo'
Benchmark 1: 'crun-from-the-future run foo'
  Time (mean ± σ):       5.6 ms ±   3.0 ms    [User: 1.0 ms, System: 4.5 ms]
  Range (min … max):     4.2 ms …  26.8 ms    101 runs

Conclusion

Over five years, the total time needed to create and destroy an OCI container has dropped from almost 160ms to a little more than 5ms.

That is almost a 30-fold improvement!



This article is adapted from The journey to speed up OCI containers and is republished with permission.


Giuseppe Scrivano

Giuseppe is an engineer in the containers runtime team at Red Hat. He enjoys working on everything that is low level. He contributes to projects like Podman and CRI-O.
