Have you ever wondered why it takes so long to pull a container image from a container registry with a container tool like Podman?
$ time podman pull fedora
Resolved "fedora" as an alias (/etc/containers/registries.conf.d/000-shortnames.conf)
Trying to pull registry.fedoraproject.org/fedora:latest...
Getting image source signatures
Copying blob 944c4b241113 done
Copying config 191682d672 done
Writing manifest to image destination
Storing signatures
191682d6725209667efcfd197c4dc93be5ab33729b7a4a2a45d5cf2bc1f589e0
real 0m19.329s
user 0m4.213s
sys 0m0.829s
This Fedora base image is fairly small, yet it takes nearly 20 seconds to pull on a high-speed internet connection. I have heard about some huge images taking minutes to pull. Worse, every time the Fedora image (or any other image) updates, you have to pull the entire image again, not just the differences.
Large images also lead to storage problems. Have you ever examined how much space your container images are taking on disk? Some users run out of space in their home directories just because they have pulled down hundreds or thousands of container images. These images often contain many duplicate files.
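If you have never checked, Podman can summarize how much disk space your images, containers, and volumes consume:
$ podman system df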
Another thing to think about is kernel memory. The Linux kernel is smart enough to know that if two different processes load the same content (such as a shared library) into memory, it should be loaded only once. For example, if you have 10 different programs running simultaneously that use libc, the code in /usr/lib/libc.so.6 is loaded into memory only once. When you run containers, however, if the same /usr/lib/libc.so.6 exists in multiple different images, the kernel has no way to know the files are identical, so it loads several copies of the same content into memory, wasting resources. Because of the way we currently store container images, this is very common.
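A rough way to see this duplication on disk is to mount two images and compare inode numbers for the same library; IMAGE_A and IMAGE_B below are placeholders for two images you know ship the same glibc build:
# Run as root, or inside a podman unshare session for rootless storage
$ a=$(podman image mount IMAGE_A)
$ b=$(podman image mount IMAGE_B)
# Different inode numbers mean the page cache keeps two separate copies
$ stat -c '%i %n' "$a/usr/lib64/libc.so.6" "$b/usr/lib64/libc.so.6"
$ podman image unmount IMAGE_A IMAGE_B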
This article shows new technologies that have been merged into container tools to:
- Make pulling images way faster
- Make storing files on disk much leaner
- Let the Linux kernel know when content can be shared in memory
Container administrators can do all of this by using content previously stored on disk rather than pulling and duplicating the content.
How we've been pulling container images
Open Container Initiative (OCI) container images are distributed as a series of layers. Each layer contains a subset of the files in the image. These layers are stored on a registry as compressed tarball archives.
When using the overlay backend, the container runtime fetches and extracts each layer to a different directory.
At runtime, each layer is used as a lower layer for the final container overlay filesystem mount.
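You can see this in practice by inspecting a running container's overlay configuration; the lower directories list one entry per image layer (a quick sketch, where the container name and image are only examples):
$ podman run -d --name layer-demo registry.fedoraproject.org/fedora sleep infinity
$ podman inspect --format '{{ .GraphDriver.Data.LowerDir }}' layer-demo
$ podman rm -f layer-demo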
A Containerfile (Dockerfile) made of the following lines results in a series of different layers:
- FROM fedora results in the base Fedora image layer.
- RUN yum install -y nginx contains all the new files created by yum.
- COPY ./config/foo /etc/nginx contains the file /etc/nginx/foo.
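If you build a Containerfile with these three lines, you can list the resulting layers, including the base image layer pulled in by FROM (the image tag is only an example):
$ podman build -t example.com/layered-nginx .
$ podman history example.com/layered-nginx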
Currently, users can only perform per-layer deduplication. Images can share layers, but it requires discipline. Having more layers can facilitate deduplication, but it has a runtime cost because the overlay filesystem needs to scan more layers for each lookup and build a directory listing.
What the new model is trying to solve
The new storage model is trying to solve a series of problems by making the following changes:
- Containerfile authors do not need to worry about how the registry will store the image or how to lay out layers to optimize deduplication. They can even create squashed images without giving up deduplication.
- A container engine doesn't have to pull files that are already present locally.
- Files that are present in multiple layers can be stored only once (this requires filesystem support).
- Read-only files that are used by multiple layers or containers can be mapped in memory just once.
Image format options for the new model
The current image format used by container engines, a gzip-compressed tarball, doesn't have enough information to allow for these optimizations.
The first step is to create the layers in a way that allows for individual file retrieval.
Two candidates are currently supported: eStargz and zstd:chunked. These new formats keep metadata for each file contained in the tarball, including each file's checksum.
eStargz
eStargz is a file format used by containerd for lazy image pulls. It is based on the Google proof-of-concept project CRFS.
The eStargz format transforms a gzip-compressed layer into an equivalent tarball where each file is compressed individually. The system can retrieve each file without having to fetch and decompress the entire tarball.
The metadata for the layer is appended to the tarball file itself as part of the tar stream. This is a breaking change since the resulting tarball, once decompressed, has a different digest and contains an additional file.
Given some limitations in the containers/image APIs, our container tools can already consume these images but cannot create them yet.
zstd:chunked
We've created a new solution named zstd:chunked to address the issue with the eStargz format that changes the DiffID and adds the metadata as part of the tarball. zstd:chunked takes advantage of the zstd compression format.
In the zstd:chunked format, the same metadata used by eStargz is added within the compression stream. The zstd decompressor ignores the additional metadata so that the digest for the uncompressed file doesn't change. In addition, zstd is much faster and compresses better than gzip.
There are some issues related to adopting this format, though:
- The Moby project recently merged a pull request that adds support for zstd in the next version of Docker. Images using zstd won't work on older versions of Docker.
- Quay.io doesn't yet accept OCI images, but the issue is being addressed.
How to implement the new formats
When a layer stored in one of these two formats is pulled, the container engine takes these steps:
- It retrieves the metadata for the layer from the registry. It is a JSON file that describes the content of the image layer and what files it contains.
- Files that are already known locally are duplicated using reflinks. This is currently supported on XFS and BTRFS. If reflinks are not supported, the file is copied, and storage deduplication is not performed.
- The container engine prepares an HTTP multirange request that specifies all the files that are not already known locally and requests them from the registry.
- New files are created from the data that the registry sends.
Deduplication thus happens both at pull time, because known files are not retrieved again, and in storage, because identical files are deduplicated with reflinks where the filesystem supports them.
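A quick way to check whether the filesystem backing your container storage can do reflinks (the paths are examples for the default root storage under /var/lib/containers; adjust them for rootless setups):
# On XFS, reflinks must be enabled at mkfs time; look for reflink=1 in the
# output for the mount point that holds the storage
$ xfs_info / | grep reflink
# Or simply attempt a reflink copy within the same filesystem; cp fails
# cleanly if reflinks are unsupported
$ touch /var/lib/containers/reflink-test
$ cp --reflink=always /var/lib/containers/reflink-test /var/lib/containers/reflink-test.copy
$ rm -f /var/lib/containers/reflink-test /var/lib/containers/reflink-test.copy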
The partial pull feature doesn't require an additional store for the object files, as it reads them directly from the final checked-out directory where they are stored.
How to extract the tarball
Currently, tarball extraction happens in a separate process that runs in a chroot. This prevents specially crafted images from abusing symlink resolution to create files outside the target directory.
Since the new deduplication feature needs to access files outside the target directory, the container runtime cannot use the existing extraction code.
A new extractor is used with the partial-pull feature. It needs the openat2 syscall, which was added in Linux kernel 5.6. openat2 allows restricting file lookups with the same behavior as chroot does. If the extractor cannot use the openat2 syscall, the code falls back to the old mechanism and pulls the entire layer.
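A quick check of whether your kernel offers the fast path (anything at 5.6 or later will do):
$ uname -r
On older kernels the tools simply fall back to pulling whole layers, so nothing breaks.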
How to deduplicate the host
On systems using OSTree, you can also deduplicate against the system files, which OSTree has already hashed. For this feature to work, you must configure OSTree to track objects by payload checksum, for example:
$ ostree --repo=/ostree/repo config set core.payload-link-threshold 100
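You can read the setting back to confirm it was stored in the repository configuration:
$ ostree --repo=/ostree/repo config get core.payload-link-threshold
100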
How to deduplicate memory
Reflinked files have different inodes, and the Linux Virtual File System (VFS) layer doesn't know that they share data, because reflinks are handled directly by the filesystem.
When two inodes that share data through reflinks are accessed, the kernel ends up loading the same data into memory twice, even though it is stored only once in the filesystem.
When you also need memory deduplication, you can configure the storage to use hard links instead of reflinks.
We suggest using hard-link deduplication only for limited use cases where memory is scarce, because it is a breaking change in the storage model. Some images might behave differently, since all the inode metadata (such as atime, mtime, and ctime) is shared among the files deduplicated to the same inode. Also, the nlink attribute will track how many times the file has been deduplicated.
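With hard links enabled, deduplicated files share a single inode, so identical inode numbers and a link count greater than one give them away (the layer directories below are placeholders inside the overlay storage):
$ stat -c '%h %i %n' \
    /var/lib/containers/storage/overlay/<layer-a>/diff/usr/lib64/libc.so.6 \
    /var/lib/containers/storage/overlay/<layer-b>/diff/usr/lib64/libc.so.6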
Avoid locking during extraction
A longstanding issue with container storage is that it keeps a lock while the tarball is extracted and the new layer is created. This happens because the storage drivers need to know the digest for each extracted layer to apply the next one.
As a side effect of the new extractor feature, this issue has been solved because the checkout is created in a separate staging directory. Each layer can be extracted in parallel. Locking is required only for the atomic move of the staging directory to its final destination.
Build a zstd:chunked image
Buildah gained some new options for building a zstd:chunked image. The compression format is specified when the image is pushed to a registry.
$ buildah bud --squash --format oci -t example.com/my-new-zstd-chunked-image
$ buildah push --compression-format zstd:chunked example.com/my-new-zstd-chunked-image
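To confirm the push used zstd, inspect the manifest on the registry; zstd-compressed OCI layers report the media type application/vnd.oci.image.layer.v1.tar+zstd (this assumes skopeo and jq are installed and reuses the example image name from above):
$ skopeo inspect --raw docker://example.com/my-new-zstd-chunked-image | jq '.layers[].mediaType'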
Enable and use partial pulls
The partial-pulls feature is still experimental, and it is not enabled by default.
To enable it, you must add the following configuration to the storage.conf file under storage.options:
pull_options = {enable_partial_images = "true", use_hard_links = "false", ostree_repos = ""}
These additional flags control how the deduplication is performed:
- use_hard_links tells the container engine to use hard links for the deduplication.
- ostree_repos is a colon-separated list of OSTree repositories to use for looking up files.
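Putting it together, here is a minimal sketch of the relevant part of the file (rootless users edit ~/.config/containers/storage.conf instead; keep the rest of your existing settings as they are):
$ cat /etc/containers/storage.conf
[storage]
driver = "overlay"

[storage.options]
pull_options = {enable_partial_images = "true", use_hard_links = "false", ostree_repos = ""}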
Wrap up
The new storage model attempts to use disk space better and reduce memory consumption. Pulls may be more efficient and therefore quicker, too. For more insight, watch this demo that shows how partial pulls can improve Podman pulls.
About the authors
Giuseppe is an engineer in the containers runtime team at Red Hat. He enjoys working on everything that is low level. He contributes to projects like Podman and CRI-O.
Daniel Walsh has worked in the computer security field for over 30 years. Dan is a Senior Distinguished Engineer at Red Hat. He joined Red Hat in August 2001 and has led the Red Hat Container Engineering team since August 2013, but he has been working on container technology for several years longer.
Dan helped develop sVirt (Secure Virtualization) as well as the SELinux Sandbox, an early desktop container tool, back in RHEL 6. Previously, Dan worked at Netect/Bindview on vulnerability assessment products and at Digital Equipment Corporation on the Athena Project and the AltaVista Firewall/Tunnel (VPN) products. Dan has a BA in Mathematics from the College of the Holy Cross and an MS in Computer Science from Worcester Polytechnic Institute.