Disclaimer: the article is highly opinionated and the author is a total newbie to the topic.
Last update: 2020-08-09
Containers gave birth to more advanced server-side architectures and sophisticated deployment techniques. Containers nowadays are so widespread that there is already a bunch of standard-alike specifications (1, 2, 3, 4, ...) describing different aspects of the containers universe. Of course, on the lowest level lie Linux primitives such as namespaces and cgroups. But containerization software is already so massive that it would be barely possible to implement it without its own concern separation layers. What I'm trying to achieve in this ongoing effort is to guide myself starting from the lowest layers to the topmost ones, having as much practice (code, installation, configuration, integration, etc) and, of course, fun as possible. The content of this page is going to be changing over time, reflecting my understanding of the topic.
I want to start the journey from the lowest level non-kernel primitive - container runtime. The word runtime is a bit ambiguous in the containerverse. Each project, company or community has its own and usually context-specific understanding of the term container runtime. Mostly, the hallmark of the runtime is defined by the set of responsibilities varying from a bare minimum (creating namespaces, starting init process) to comprehensive container management including (but not limiting) images operation. A good overview of runtimes can be found in this article.
This section is dedicated to low-level container runtimes. A group of big players forming an Open Container Initiative standartized the low-level runtime in the OCI runtime specification. Making a long story short - a low-level container runtime is a piece of software that takes as an input a folder containing rootfs and a configuration [file] describing container parameters (such as resource limits, mount points, process to start, etc) and as a result the runtime starts an isolated process, i.e. container.
As of 2019, the most widely used container runtime is runc. This project started as a part of Docker (hence it's written in Go) but eventually was extracted and transformed into a self-sufficient CLI tool. It is difficult to overestimate the importance of this component - runc is basically a reference implementation of the OCI runtime specification. During our journey we will work a lot with runc and here is an introductory article.
One more notable OCI runtime implementation out there is crun. It's written in C and can be used both as an executable and as a library. It's been originated by Red Hat and other Red Hat projects like buildah or podman tend to use
crun instead of
UPDATE Aug 2020: Historically, Docker was advocating for a "single container - single process" model and despite of some constant demand to bring containers somewhat closer to traditional virtual machines (with
systemd or alike manager inside), it has never been officially supported by runc (while it's still possible via running tailored privileged containers). Luckily, there is a fresh (~90%) OCI-compatible container runtime called sysbox aiming to solve this problem.
Sysbox is an open-source container runtime (aka runc), originally developed by Nestybox, that enables Docker containers to act as virtual servers capable of running software such as Systemd, Docker, and Kubernetes in them, easily and with proper isolation. This allows you to use containers in new ways, and provides a faster, more efficient, and more portable alternative to virtual machines in many scenarios.
Using runc in command line we can launch as many containers as we want. But what if we need to automate this process? Imagine we need to launch tens of containers keeping track of their statuses. Some of them need to be restarted on failure, resources need to be released on termination, images have to be pulled from registries, inter-containers networks need to be configured and so on. This is already a slightly higher-level job and it's a responsibility of a container manager. To be honest, I have no idea whether this term is in common use or not, but I found it convenient to structure things this way. I would classify the following projects as container managers: containerd, cri-o, dockerd, and podman.
As with runc, we again can see a Docker's heritage here - containerd used to be a part of the original Docker project. Nowadays though containerd is another self-sufficient piece of software. It calls itself a container runtime but obviously, it's not the same kind of runtime runc is. Not only the responsibilities of containerd and runc differ but also the organizational form does. While runc is a just a command-line tool, containerd is a long-living daemon. An instance of runc cannot outlive an underlying container process. Normally it starts its life on
create invocation and then just
execs the specified file from container's rootfs on
start. On the other hand, containerd can outlive thousands of containers. It's rather a server listening for incoming requests to start, stop, or report the status of the containers. And under the hood containerd uses runc. However, containerd is more than just a container lifecycle manager. It is also responsible for image management (pull & push images from a registry, store images locally, etc), cross-container networking management and some other functions.
Another container manager example is cri-o. While containerd evolved as a result of Docker re-architecting, cri-o has its roots in the Kubernetes realm. Back in the day, Kubernetes (ab)used Docker to manage containers. However, with rising of rkt some brave people added support of interchangeably container runtimes in Kubernetes, allowing container management to be done by Docker and/or rkt. This change, however, led to a huge number of conditional code in Kubernetes and nobody likes too many
ifs in the code. As a result, Container Runtime Interface (CRI) has been introduced to Kubernetes making it possible to any CRI-compliant higher-level runtimes (i.e. container managers) to be used without any code change on the Kubernetes side. And cri-o is a Red Hat's implementation of CRI-complaint runtime. As with containerd cri-o is also a daemon exposing a [gRPC] server with endpoints to create, start, stop (and many more other actions) containers. Under the hood, cri-o can use any OCI-compliant [low-level] runtimes to work with containers but the default one is
again runc. The main focus of cri-o is being a Kubernetes container runtime. The versioning is the same as k8s versioning, the scope of the project is well-defined and the code base is expectedly smaller (as of July 2019 it's around 20 CLOC and it's approximately 5 times less than containerd).
cri-o architecture (image from cri-o.io)
The nice thing about specifications is that everything complaint can be used interchangeably. Once CRI has been introduced, a plugin for containerd, implementing CRI gRPC server on top of containerd's functionality appeared. The idea happened to be viable and later on, containerd itself got a native CRI support. Thus, Kubernetes can use both cri-o and containerd as a runtime.
One more daemon here is dockerd. This daemon is manifold. On the one hand, it exposes an API for Docker command-line client which gives us all these famous Docker workflows (
docker stats, etc). But since we already know, that this piece of functionality has been extracted to containerd it's not a surprise that under the hood dockerd relies on containerd. But that basically would mean that dockerd is just a front-end adapter converting containerd API to a historically widely used docker engine API.
However, dockerd also provides
compose and swarm things in an attempt to solve container orchestration problem, including multi-machine clusters of containers. As we can see with Kubernetes, this problem is rather hard to address. And having two big responsibilities for a single dockerd daemon doesn't sound good to me.
UPDATE from November, 2019 - Mirantis acquired Docker Enterprise:
A blog post announcing the acquisition implied that Docker Swarm will be phased out, with Ionel writing that “The primary orchestrator going forward is Kubernetes. Mirantis is committed to providing an excellent experience to all Docker Enterprise platform customers and currently expects to support Swarm for at least two years, depending on customer input into the roadmap. Mirantis is also evaluating options for making the transition to Kubernetes easier for Swarm users.”
dockerd is a part in front of containerd (image from Docker Blog)
An interesting exception from this daemons list is podman. It's yet another Red Hat project and the aim is to provide a library (not a daemon) called
libpod to manage images, container lifecycles, and pods (groups of containers). And
podman is a management command-line tool built on top of this library. As a low-level container runtime this project
as usual uses runc. There is a lot in common between podman and cri-o (both are Red Hat projects) from the code standpoint. For instance, they both heavily use outstanding storage and image libraries internally. It seems that there's an ongoing effort to use libpod instead of runc directly in cri-o. Another interesting feature of podman is a drop-in replacement of some (most popular?)
docker commands in a daily workflow. The project claims compatibility (to some extent of course) of the docker CLI API.
Why start a project like this when we already have dockerd, containerd or cir-o? The problem with daemons as container managers is that most of the time a daemon has to be run with root privileges. Even though 90% of the daemon's functionality can be hypothetically done without having root rights in the system since daemon is a monolithic thing, the remaining 10% requires that the daemon is launched as root. With podman, it's finally possible to have rootless containers utilizing Linux user namespaces. And this can be a big deal, especially in extensive CI or multi-tenant environments, because even non-privileged Docker containers are actually only one kernel bug away from gaining root access on the system.
Here is my (WiP) project aiming at the implementation of a tiny container manager. It's primarily for educational purposes, but the ultimate goal though is to make it CRI-compatible and launch a Kubernetes cluster with conman as a container runtime. Here you can find an introductory article about the project.
Containers can be long-running while a container manager may need to be restarted due to a crash, update, or some other unforeseen reasons without killing (or losing control of) the managed containers. Thus, the container's process has to be fully independent of the manager's process while still preserving some communicational means. If you try to implement it yourself, you will find that with runc as a container runtime it gets complicated pretty quickly. Following is a list of difficulties need to be addressed.
Keep container's stdio streams open on container manager restart
We need to forward the data written by the container to its STDOUT and STDERR to logs at any given moment of time, regardless of the state of the manager. Surprisingly, the design of runc makes it a non-trivial task because runc tend to pass through the caller's stdio streams to the container. Thus, making the runc process independent of its parent becomes barely possible. If we launch runc in the detached mode and then unluckily terminate the parent, an attempt to read from STDIN or write to STDOUT/STDERR can cause the container to be killed by SIGPIPE.
Keep track of container exit code
OCI runtime specification standartizes the launching of a container as a two-steps procedure. First, the container needs to be created (i.e. fully initialized) and only then started. On container creation step, runc deliberately daemonizes the container process by forking and then exiting the foreground process. Having containers detached leads to an absence of container status update. One way to address this problem is to make the process launching runc a subreaper. Then we can teach it to wait for the container's process termination and report its exit code to a predefined location (eg. a file on the disk). Of course, the subreaper process needs to stay alive until the end of the container's existence.
Synchronize container manager and runc container creation
Since runc daemonizes the container creation process we need a side-channel (eg. a Unix socket) to communicate the actual start (or failure) of the container back to the container manager.
Attaching to a running container
We need to provide a way to stream some data in and out from the container, including PTY-controlled scenarios. It sounds like we need a listening server accepting attaching connections and performing the streaming to and from the container's stdio. And again this server has to be long-lived.
To address all these problems (and probably some others) so-called runtime shims are usually used. A shim is a lightweight daemon controlling a running container. The shim's process is tightly bound to the container's process but is completely detached from the manager's process. All the communications between the container and the manager happen through the shim. Examples of the shims out there are conmon and containerd runtime shim. I spent some time trying to implement my own shim as a part of the conman project and the results can be found in the article "Implementing container runtime shim".
Since we have multiple container runtimes (or managers) with overlapping responsibilities it's pretty much obvious that we either need to extract networking-related code to a dedicated project and then reuse it, or each runtime should have its own way to configure NIC devices, IP routing, firewalls, and other networking aspects. For instance, both cri-o and containerd have to create Linux network namespaces and setup Linux
veth devices to create sandboxes for Kubernetes pods. To address this problem, the Container Network Interface project was introduced.
The CNI project provides a Container Network Interface Specification defining a CNI Plugin. A plugin is an executable [sic] which is supposed to be called by container runtime (or manager) to set up (or release) a network resource. Plugins can be used to create a network interface, manage IP addresses allocation, or do some custom configuration of the system. CNI project is language-agnostic, and since a plugin defined as an executable, it can be used in a runtime management system implemented in any programming language. However, CNI project also provides a set of reference plugin implementations for the most popular use cases shipped as a separate repository named plugins. Examples are bridge, loopback, flannel, etc.
Orchestration of the containers is an extra-large topic. In reality, the biggest part of the Kubernetes code addresses rather the orchestration problem than containerization. Thus, orchestration deserves its own article (or a few). Hopefully, they will follow soon.
Buildah is a command-line tool to work with OCI container images. It's a part of a group of projects (podman, skopeo, buildah) started by RedHat with an aim at redesigning Docker's way to work with containers (primarily to switch from monolithic and daemon-based to more fine-grained approach).
CNI Project defines a Container Network Interface plugin specification as well as some Go tools to work with it. For a more in-depth explanation see the corresponding section of the article.
A home repository for the most popular CNI plugins (such as bridge, host-device, loopback, dhcp, firewall, etc). For a more in-depth explanation see the corresponding section of the article.
A higher-level container runtime (or container manager) started as a part of Docker and extracted to an independent project. For a more in-depth explanation see the corresponding section of the article.
A tiny OCI runtime shim written in C and used primarily by cri-o. It provides synchronization between a parent process (cri-o) and the starting containers, tracking of container exit codes, PTY forwarding, and some other features. For a more in-depth explanation see the corresponding section of the article.
Kubernetes-focused container manager following Kubernetes Container Runtime Interface (CRI) specification. The versioning is same as k8s versioning. For a more in-depth explanation see the corresponding section of the article.
Yet another OCI runtime spec implementation. It claims to be a "...fast and low-memory footprint OCI Container Runtime fully written in C." But the most importantly it can be used as a library from any C/C++ code (or providing bindings - from other languages). It allows avoiding some runc specific drawbacks caused by its daemon-nature. See Runtime Shims section for more.
An underrated (warn: opinions!) Go library powered such well-known projects as cri-o, podman and skopeo. Probably it's easy to guess by its name - the aim is at working in various way with containers' images and container image registries.
An alternative and low-level container runtime written in C.
A higher-level container runtime (or container manager) written in Go. Under the hood, it uses lxc as low-level runtime.
A higher-level container runtime (or container manager) formerly known as
docker/docker. Provides a well-known Docker engine API based on containerd functionality. For a more in-depth explanation see the corresponding section of the article.
A specification of a container image distribution (WiP).
A specification of a container image.
A specification of a [low level] container runtime.
A daemon-less Docker replacement. The frontman project of Docker redesign effort. More explanation can be found on RedHat developers blog.
Another container management system. It provides a low-level runtime as well as a higher-level management interface. It advertises to be pod-native. An idea to add rkt support to Kubernetes gave birth to CRI specification. The project was started by CoreOS team ~5 years ago, but after its acquisition by RedHat, it is rather stagnating. As of August 2019, the last commit to the project is about 2 months old. UPDATE: On August, 16th, CNCF announced that the Technical Oversight Committee (TOC) has voted to archive the rkt project.
A low-level container runtime and a reference implementation of OCI runtime spec. Started as a part of Docker and extracted to an independent project. Extremely ubiquitous. For a more in-depth explanation see the corresponding section of the article.
Skopeo is a command-line utility that performs various operations on container images and image repositories. It's a part of RedHat effort to redesign Docker (see also podman and buildah) by extracting its responsibilities to dedicated and independent tools.
An underrated (warn: opinions!) Go library powered such well-known projects as cri-o, podman and skopeo. is a Go library which aims to provide methods for storing filesystem layers, container images, and containers (on disk). It also manages mounting of bundles.
An open-source container runtime that enables Docker containers to act as virtual servers capable of running software such as Systemd with proper isolation.