Pull container images faster with partial pulls

November 12, 2021 | Giuseppe Scrivano, Dan Walsh | 6-minute read

Have you ever wondered why it takes so long to pull a container image from a container registry with a container tool like Podman?

$ time podman pull fedora
Resolved "fedora" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull registry.fedoraproject.org/fedora:latest...
Getting image source signatures
Copying blob 944c4b241113 done   
Copying config 191682d672 done   
Writing manifest to image destination
Storing signatures
191682d6725209667efcfd197c4dc93be5ab33729b7a4a2a45d5cf2bc1f589e0
 
real    0m19.329s
user    0m4.213s
sys    0m0.829s

This Fedora base image is fairly small, yet pulling it takes about 20 seconds on a high-speed internet connection. I have heard about some huge images taking minutes to pull. Worse, every time the Fedora image (or any image) is updated, you have to pull the entire image again, not just the differences.

Large images also lead to storage problems. Have you ever examined how much space your container images are taking on disk? Some users run out of space in their home directories just because they have pulled down hundreds or thousands of container images. These images often contain many duplicate files.

Another thing to think about is kernel memory. The Linux kernel is smart enough to know that if two different processes load the same content (such as a shared library) into memory, it should be loaded just once. For example, if you have 10 different programs running simultaneously that use libc, the /usr/lib/libc.so.6 code loads into kernel memory only once. When you run containers, however, and the same /usr/lib/libc.so.6 is present in multiple different images, the kernel cannot tell that the files are identical and loads several copies of the same content into memory, wasting resources. Because of the way we currently store container images, this is very common.

This article shows new technologies that have been merged into container tools to:

Make pulling images way faster
Make storing files on disk much leaner
Let the Linux kernel know when content can be shared in memory

Container administrators can do all of this by using content previously stored on disk rather than pulling and duplicating the content.

How we've been pulling container images

Open Container Initiative (OCI) container images are distributed as a series of layers. Each layer contains a subset of the files in the image. These layers are stored on a registry as a compressed tarball archive.

When using the overlay backend, the container runtime fetches and extracts each layer to a different directory.

At runtime, each layer is used as a lower layer for the final container overlay filesystem mount.

A Containerfile (Dockerfile) made of the following lines results in a series of different layers:

FROM fedora results in the base Fedora image layer
RUN yum install -y nginx contains all the new files created by yum
COPY ./config/foo /etc/nginx contains the file /etc/nginx/foo

Currently, users can only perform per-layer deduplication. Images can share layers, but it requires discipline. Having more layers can facilitate deduplication, but it has a runtime cost because the overlay filesystem needs to scan more layers for each lookup and build a directory listing.
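The lookup cost described above can be pictured with a toy model. This is only a sketch, not how overlayfs works inside the kernel: it assumes each layer is a plain mapping from path to content and that a lookup walks the layers from the top down. The names lookup and layers are hypothetical.

```python
def lookup(layers, path):
    """Return (content, layers_scanned) for path, searching the top layer first."""
    scanned = 0
    for layer in reversed(layers):  # the last element is the topmost layer
        scanned += 1
        if path in layer:
            return layer[path], scanned
    return None, scanned

# Hypothetical three-layer image matching the Containerfile above.
layers = [
    {"/usr/lib/libc.so.6": "libc"},        # FROM fedora
    {"/usr/sbin/nginx": "nginx binary"},   # RUN yum install -y nginx
    {"/etc/nginx/foo": "config"},          # COPY ./config/foo /etc/nginx
]

print(lookup(layers, "/etc/nginx/foo"))      # found in the top layer
print(lookup(layers, "/usr/lib/libc.so.6"))  # found only after scanning every layer
```

In this model, a file that lives in the base layer costs one check per layer above it, which is why adding layers for the sake of deduplication is not free.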

What the new model is trying to solve

The new storage model is trying to solve a series of problems by making the following changes:

Containerfile authors do not need to worry about how the registry will store the image and optimize deduplication. They can also create squashed images and not worry about deduplication.
A container engine doesn't have to pull files that are already present locally.
Files that are present in multiple layers can be stored only once (this requires filesystem support).
Read-only files that are used by multiple layers or containers can be mapped in memory just once.


Image format options for the new model

The current image format used by container engines, a gzip-compressed tarball, doesn't have enough information to allow for these optimizations.

The first step is to create the layers in a way that allows for individual file retrieval.

Two candidates are currently supported: eStargz and zstd:chunked. These new formats keep the metadata for each file contained in the tarball, including their checksum.
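To make the idea of per-file metadata concrete, here is a minimal sketch that walks a tar archive and records a name, size, and SHA-256 checksum for each regular file. The entry layout and the helper names build_toc and make_tar are assumptions for illustration; the real eStargz and zstd:chunked metadata have their own schemas.

```python
import hashlib
import io
import tarfile

def build_toc(tar_bytes):
    """Return one {name, size, sha256} entry per regular file in a tar archive."""
    toc = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            toc.append({
                "name": member.name,
                "size": member.size,
                "sha256": hashlib.sha256(data).hexdigest(),
            })
    return toc

def make_tar(files):
    """Build an in-memory uncompressed tar from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

# Hypothetical layer content, used only to exercise the functions.
layer = make_tar({
    "etc/nginx/foo": b"worker_processes 1;\n",
    "usr/share/doc/README": b"hello\n",
})
toc = build_toc(layer)
```

With a table like this, a client can compare checksums against files it already has before deciding what to download.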

eStargz

eStargz is a file format used by containerd for lazy image pulls. It is based on the Google proof-of-concept project CRFS.

The eStargz format transforms a gzip-compressed layer into an equivalent tarball where each file is compressed individually. The system can retrieve each file without having to fetch and decompress the entire tarball.

The metadata for the layer is appended to the tarball file itself as part of the tar stream. This is a breaking change since the resulting tarball, once decompressed, has a different digest and contains an additional file.
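The per-file compression idea can be tried with nothing more than the standard gzip module. The sketch below is a simplification: it compresses raw file contents rather than tar entries, and the index dictionary is a hypothetical stand-in for the layer metadata. It relies on the fact that concatenated gzip members still form a valid gzip stream.

```python
import gzip

files = {
    "etc/hostname": b"fedora\n",
    "etc/motd": b"welcome\n" * 100,
}

# Compress each file as its own gzip member and record where it starts.
blob = b""
index = {}
for name, data in files.items():
    member = gzip.compress(data)
    index[name] = (len(blob), len(member))  # (offset, compressed length)
    blob += member

def read_one(blob, index, name):
    """Decompress a single file using only its byte range of the blob."""
    offset, length = index[name]
    return gzip.decompress(blob[offset:offset + length])

# A client that does not know about the index can still decompress everything.
everything = gzip.decompress(blob)
```

A client that does know the offsets can fetch and decompress only the bytes of the one file it needs.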

Given some limitations in the container and image APIs, our container tools can already consume these images but cannot create them yet.

zstd:chunked

We've created a new solution named zstd:chunked to address two issues with the eStargz format: it changes the DiffID, and it adds the metadata as part of the tarball. zstd:chunked takes advantage of the zstd compression format.

In the zstd:chunked format, the same metadata used by eStargz is added within the compression stream. The zstd decompressor ignores the additional metadata so that the digest for the uncompressed file doesn't change. In addition, zstd is much faster and compresses better than gzip.
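One way a decompressor can ignore extra data is the skippable frame defined by the zstd format: a magic number in the range 0x184D2A50 to 0x184D2A5F, a 4-byte little-endian length, and then arbitrary bytes. The sketch below assumes the metadata travels in such frames and shows how a reader steps over them. It does not compress anything (the "compressed-data" bytes are a placeholder), the helper names are hypothetical, and placing the frame first is a simplification for the demonstration.

```python
import struct

SKIPPABLE_MAGIC_MIN = 0x184D2A50  # skippable frame magics span 0x184D2A50..0x184D2A5F
ZSTD_FRAME_MAGIC = 0xFD2FB528     # magic number of a regular zstd frame

def skippable_frame(payload, variant=0):
    """Wrap payload in a skippable frame: magic, 4-byte little-endian size, data."""
    return struct.pack("<II", SKIPPABLE_MAGIC_MIN + variant, len(payload)) + payload

def split_frames(stream):
    """Return (skippable_payloads, remaining_bytes) for leading skippable frames."""
    payloads = []
    pos = 0
    while pos + 8 <= len(stream):
        magic, size = struct.unpack_from("<II", stream, pos)
        if not (SKIPPABLE_MAGIC_MIN <= magic <= SKIPPABLE_MAGIC_MIN + 0xF):
            break  # a regular frame (or something else) starts here
        payloads.append(stream[pos + 8:pos + 8 + size])
        pos += 8 + size
    return payloads, stream[pos:]

meta = b'{"entries": []}'  # placeholder metadata
stream = skippable_frame(meta) + struct.pack("<I", ZSTD_FRAME_MAGIC) + b"compressed-data"
payloads, rest = split_frames(stream)
```

A metadata-aware tool reads the payload, while an ordinary decompressor skips the frame and sees only the regular frames.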

There are some issues related to adopting this format, though:

The Moby project recently merged a pull request that will add support for zstd in the next version of Docker. Images using zstd won't work on older versions of Docker.
Quay.io doesn't yet accept OCI images, but the issue is being addressed.

How to implement the new formats

When a layer stored in one of these two formats is pulled, the container engine takes these steps:

It retrieves the metadata for the layer from the registry. It is a JSON file that describes the content of the image layer and what files it contains.
Files that are already known locally are duplicated using reflinks. This is currently supported on XFS and BTRFS. If reflinks are not supported, the file is copied, and storage deduplication is not performed.
The container engine prepares an HTTP multirange request that specifies all the files that are not already known locally and requests them from the registry.
New files are created from the data that the registry sends.
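The planning part of these steps can be sketched as follows. The metadata fields (digest, offset, length) and the function names are assumptions for illustration, not the actual schema, and the sketch only builds the Range header value rather than contacting a registry.

```python
def plan_fetch(toc, local_digests):
    """Split metadata entries into those reusable locally and those to download."""
    reuse = [e for e in toc if e["digest"] in local_digests]
    fetch = [e for e in toc if e["digest"] not in local_digests]
    return reuse, fetch

def range_header(entries):
    """Build a multirange HTTP Range header value for the given entries."""
    spans = sorted((e["offset"], e["offset"] + e["length"] - 1) for e in entries)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1] + 1:  # adjacent or overlapping: coalesce
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return "bytes=" + ",".join(f"{s}-{e}" for s, e in merged)

# Hypothetical layer metadata: offsets and lengths refer to the compressed blob.
toc = [
    {"name": "usr/lib/libc.so.6", "digest": "sha256:aaa", "offset": 0,    "length": 1000},
    {"name": "usr/sbin/nginx",    "digest": "sha256:bbb", "offset": 1000, "length": 500},
    {"name": "etc/nginx/foo",     "digest": "sha256:ccc", "offset": 1500, "length": 100},
    {"name": "usr/share/doc/x",   "digest": "sha256:ddd", "offset": 1600, "length": 50},
]
local = {"sha256:bbb"}  # digests of files already present on disk
reuse, fetch = plan_fetch(toc, local)
header = range_header(fetch)
```

Coalescing adjacent ranges keeps the header short; here the three missing files collapse into two byte ranges.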

So, deduplication happens at pull time, because known files are not retrieved, and in storage, because identical files are deduplicated with reflinks (where supported).
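The reflink-with-fallback behavior can be approximated in a few lines. This sketch assumes Linux and uses the FICLONE ioctl, whose numeric value below is the one used on common architectures such as x86_64 and aarch64. The function name clone_or_copy is hypothetical, and this is not the container engine's own code.

```python
import fcntl
import os
import shutil
import tempfile

FICLONE = 0x40049409  # assumption: Linux ioctl number on common architectures

def clone_or_copy(src, dst):
    """Try to reflink src to dst; fall back to a byte copy. Returns the method used."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return "reflink"
        except OSError:
            # Filesystem without reflink support: copy, with no storage deduplication.
            shutil.copyfileobj(fsrc, fdst)
            return "copy"

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "known-file")
dst = os.path.join(workdir, "new-layer-file")
with open(src, "wb") as f:
    f.write(b"shared content\n")
method = clone_or_copy(src, dst)
```

Either way the destination is a distinct file with its own inode; only the reflink case shares the underlying data blocks.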

The partial pull feature doesn't require an additional store for the object files, as it reads them directly from the final checked-out directory where they are stored.

How to extract the tarball

Currently, tarball extraction happens in a separate process that runs in a chroot. This prevents specially crafted images from taking advantage of symlink resolution to create files outside the target directory.

Since the new deduplication feature needs to access files outside the target directory, the container runtime cannot use the existing extraction code.


The partial-pull feature uses a new extractor. It needs the openat2 syscall, which was added in Linux kernel 5.6. openat2 can restrict file lookups to a directory tree, much as chroot does. If the extractor cannot use the openat2 syscall, the code falls back to the old mechanism and pulls the entire layer.
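Python's standard library has no wrapper for openat2, so the following is only a user-space approximation of the containment being enforced. It resolves symlinks and checks that the result still lies under the target directory. Unlike a kernel-enforced lookup, a check like this is open to races between the check and the actual open, so it illustrates the problem rather than solving it. The function name resolves_inside is hypothetical.

```python
import os
import tempfile

def resolves_inside(root, relative_path):
    """Return True if relative_path, after following symlinks, stays under root."""
    root = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root, relative_path))
    return os.path.commonpath([root, target]) == root

root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "etc"))
# A malicious layer could ship a symlink that points outside the target directory.
os.symlink("/", os.path.join(root, "escape"))

print(resolves_inside(root, "etc/nginx.conf"))     # stays inside the target
print(resolves_inside(root, "escape/etc/passwd"))  # escapes through the symlink
```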

How to deduplicate the host

On systems using OSTree, you can do deduplication with the system files already hashed. For this feature to work, you must enable OSTree tracking by payload checksum, like:

$ ostree --repo=/ostree/repo config set core.payload-link-threshold 100

How to deduplicate memory

Files created with reflinks have different inodes, and the Linux Virtual Filesystem (VFS) layer is not aware that they share data, because reflinks are handled directly by the filesystem.

When two inodes using reflinks are accessed, the kernel will end up loading the same data in memory twice, even if it is stored just once in the filesystem.

When you need memory deduplication, you can configure the container engine to use hard links instead of reflinks.

We suggest using hard-link deduplication only for limited use cases where memory is scarce. It is a breaking change in the storage model. Some images might behave differently since all the inode metadata (like atime, mtime, ctime) is shared among the files deduplicated with the same inode. Also, the link count (st_nlink) will reflect how many times the file has been deduplicated.
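The side effects of hard links are easy to observe. The sketch below uses throwaway files in a temporary directory (the file names are hypothetical) to compare a hard link with a separate copy: the hard link shares the inode, raises the link count, and exposes metadata changes through every name.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "libc.so.6")
with open(original, "wb") as f:
    f.write(b"pretend this is a shared library\n")

hard = os.path.join(workdir, "hardlinked-copy")
copy = os.path.join(workdir, "separate-copy")
os.link(original, hard)          # same inode as the original
shutil.copyfile(original, copy)  # new inode, as a reflinked file would also have

st_orig, st_hard, st_copy = os.stat(original), os.stat(hard), os.stat(copy)
print(st_orig.st_ino == st_hard.st_ino)  # the hard link shares the inode
print(st_orig.st_nlink)                  # link count now includes the hard link

# Changing timestamps through one hard-linked name is visible through the other.
os.utime(hard, (0, 0))
print(os.stat(original).st_mtime)
```

Because the two hard-linked names are the same inode, the kernel caches the content once, which is the memory saving; the shared timestamps and link count are the price.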

Avoid locking during extraction

A longstanding issue with container storage is that it keeps a lock while the tarball is extracted and the new layer is created. This happens because the storage drivers need to know the digest for each extracted layer to apply the next one.

As a side effect of the new extractor feature, this issue has been solved because the checkout is created in a separate staging directory. Each layer can be extracted in parallel. Locking is required only for the atomic move of the staging directory to its final destination.
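The staging-directory pattern can be sketched with threads and an in-process lock standing in for the storage lock. Everything here is hypothetical (the store layout, the layer names, the extract_layer helper); the point is only that the slow work happens without the lock and the lock covers just the rename.

```python
import os
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor

store = tempfile.mkdtemp()     # hypothetical layer store
store_lock = threading.Lock()  # stands in for the storage lock

def extract_layer(layer_id, files):
    """Write a layer into a private staging dir, then publish it under the lock."""
    staging = tempfile.mkdtemp(prefix="staging-", dir=store)
    for name, data in files.items():  # slow part: no lock held
        path = os.path.join(staging, name)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
    final = os.path.join(store, layer_id)
    with store_lock:                  # fast part: atomic rename only
        os.rename(staging, final)
    return final

layers = {
    "layer-a": {"usr/lib/libc.so.6": b"libc"},
    "layer-b": {"usr/sbin/nginx": b"nginx"},
    "layer-c": {"etc/nginx/foo": b"config"},
}
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda item: extract_layer(*item), layers.items()))
```

Keeping the staging directory on the same filesystem as the final location is what lets the rename be atomic.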

Build a zstd:chunked image

Buildah gained some new options for building a zstd:chunked image. The compression format is specified when the image is pushed to a registry.

$ buildah bud --squash --format oci -t example.com/my-new-zstd-chunked-image
$ buildah push --compression-format zstd:chunked example.com/my-new-zstd-chunked-image

Enable and use partial pulls

The partial-pulls feature is still experimental, and it is not enabled by default.

To enable it, you must add the following configuration in the storage.conf file under storage.options:

pull_options = {enable_partial_images = "true", use_hard_links = "false", ostree_repos = ""}

These additional flags control how the deduplication is performed:

use_hard_links tells the container engine to use hard links for the deduplication.
ostree_repos is a colon-separated list of OSTree repositories to use for looking up files.

Wrap up

The new storage model attempts to use disk space better and reduce memory consumption. Pulls may be more efficient and therefore quicker, too. For more insight, watch this demo that shows how partial pulls can improve Podman pulls.

About the authors

Giuseppe Scrivano

Giuseppe is an engineer in the containers runtime team at Red Hat. He enjoys working on everything that is low level. He contributes to projects like Podman and CRI-O.


Dan Walsh

Senior Distinguished Engineer

Daniel Walsh has worked in the computer security field for over 30 years. Dan is a Senior Distinguished Engineer at Red Hat. He joined Red Hat in August 2001. Dan has led the Red Hat Container Engineering team since August 2013, but has been working on container technology for several years.

Dan helped develop sVirt (Secure Virtualization) as well as the SELinux Sandbox back in RHEL 6, an early desktop container tool. Previously, Dan worked at Netect/Bindview on vulnerability assessment products and at Digital Equipment Corporation on the Athena Project and AltaVista Firewall/Tunnel (VPN) products. Dan has a BA in Mathematics from the College of the Holy Cross and an MS in Computer Science from Worcester Polytechnic Institute.
