Linux Containers: What, Why, How

Jan 15, 2018

It’s about 5 years after the container hype started, and to be honest, I’m still at the what-even-are-these stage. So last week I spent some time looking at them! In this post I mostly want to summarize what problems containers address and how they work.

To start with, containers (Linux Containers, Docker, rkt) refer to a method of runtime isolation, adding spawned processes to Linux namespaces (isolation) and cgroups (resource limits). Commonly, when people talk about containers, they’re also referring to an archive format to ship around the necessary data to instantiate a container.

What problems do containers address?

Packaging

We’d like to be able to capture all the runtime dependencies (code libraries, artifacts, config files, and runtime environment such as environment variables) needed to run an application into an archive that can be freely delivered to another machine. How this is done varies a lot across languages and ecosystems.

In an ideal world, there would be no implicit dependencies, no environment dependent bugs, and no dependency version conflicts with other applications.

Runtime Isolation

Once we deploy an application, we want consistent behavior regardless of whatever other applications are running on the same machine. Ideally, the process is isolated from the host and other applications, the process does not have access to unspecified resources (i.e. the packaging format truly captures all dependencies) and applications cannot interfere with each other.

Other benefits

Usually containers are mentioned for a bunch of other benefits too:

Containers sound great, but their effectiveness comes down to how well applications are packaged and isolated in reality. Other tools approach these problems differently: virtual machines for runtime isolation, configuration management (e.g. Puppet/Ansible) for host configuration and dependencies, or tarballs/RPMs as packages.

How they work

Under the hood, containers are just another archive format for packaging plus a combination of Linux namespaces and cgroups for isolation. Docker seems to be the default container choice, so I dug into what it does.

Packaging

Docker packs everything you need for your application into a Docker Image. Images include metadata (e.g. entry point, environment variables, args, resource limits) and all the files/directories needed for the contained application. Rather than creating an archive by hand, you specify an image with a Dockerfile, which contains instructions to build it (e.g. a base, directories, how to install dependencies), like this example:

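# Start from a slim Python 2.7 base image
FROM python:2.7-slim
# Run the following instructions from /app inside the image
WORKDIR /app
# Copy the build context (the current directory) into the image
ADD . /app
# Install the Python dependencies listed in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Default command when a container starts from this image
CMD ["python", "app.py"]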

Really, an image could just be a tarball, and the build file would just describe how to copy files before archiving. But Docker optimizes images by building them from read-only layers. An image starts with a base, and each subsequent layer specifies a diff against the one below it. Rather than shipping redundant tarballs, many images with the same base can share large portions of the image. If a container modifies a file during its runtime, the file is copied into a small writeable layer, and Docker uses specific storage drivers that can resolve a file reference into its location in the layers.
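To make the layering idea concrete, here is a minimal sketch in Python. It is not how Docker's storage drivers are implemented: the LayeredFS class and the file paths are made up for illustration, each layer is modeled as a plain dict, and deletions are not handled at all.

```python
class LayeredFS:
    """Toy model of an image: read-only layers plus one writable layer."""

    def __init__(self, layers):
        # layers[0] is the base; each later entry is a diff on top of it.
        self.layers = layers
        # Per-container writable layer (hypothetical model), starts empty.
        self.writable = {}

    def read(self, path):
        # Resolve a path by searching from the top layer down to the base.
        for layer in [self.writable] + self.layers[::-1]:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Changes land only in the writable layer; shared layers stay intact.
        self.writable[path] = data


# Two containers started from the same two read-only layers.
base = {"/etc/os-release": "debian", "/usr/bin/python": "python2.7"}
app = {"/app/app.py": "print('hi')"}

c1 = LayeredFS([base, app])
c2 = LayeredFS([base, app])
c1.write("/etc/os-release", "patched")
```

After the write, c1 sees its own version of the file while c2 and the shared base are untouched, which is the property that lets many containers share one copy of the image.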

The advantages are that builds are faster (layers can be cached), images take less space (e.g. all containers may share the same base), and startup and teardown are faster (only the writeable space has to be allocated).
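Here is a rough sketch of why layer caching speeds up builds. It assumes a cache key derived only from the parent layer and the instruction text, which is a simplification of my own (a real builder also has to look at the contents of copied files). The names layer_id and build, and the server.py file in the second build, are hypothetical.

```python
import hashlib


def layer_id(parent_id, instruction):
    # Assumed cache key: hash of the parent layer id plus the instruction.
    data = (parent_id + "\n" + instruction).encode()
    return hashlib.sha256(data).hexdigest()


def build(instructions, cache):
    """Return (layer ids, cache hits) for a list of build instructions."""
    parent, ids, hits = "", [], 0
    for instruction in instructions:
        lid = layer_id(parent, instruction)
        if lid in cache:
            hits += 1  # layer already built, reuse it
        else:
            cache.add(lid)  # stand-in for actually running the instruction
        ids.append(lid)
        parent = lid
    return ids, hits


cache = set()
first = [
    "FROM python:2.7-slim",
    "WORKDIR /app",
    "ADD . /app",
    'CMD ["python", "app.py"]',
]
# Same build, but with a different final instruction.
second = first[:3] + ['CMD ["python", "server.py"]']

ids1, hits1 = build(first, cache)
ids2, hits2 = build(second, cache)
```

The second build reuses the first three layers and only produces one new layer, because a layer's id depends on everything below it.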

Runtime Isolation

The fundamental Linux features that isolate a process are cgroups (resource limits) and namespaces (isolated views of resources).

Cgroups

Cgroups provide a way to specify and enforce resource limits, most commonly on memory and CPU utilization. For example, you can specify that a process group gets 1GB of memory before it swaps, or 10% of CPU time.
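As a sketch of what those two example limits look like in practice, the snippet below computes the values for the cgroup v1 control files memory.limit_in_bytes, cpu.cfs_period_us and cpu.cfs_quota_us. The function names, the directory layout under /sys/fs/cgroup and the group name are assumptions for illustration. Actually writing the values needs root and a cgroup v1 hierarchy, so the example only computes them.

```python
import os


def cgroup_v1_settings(memory_bytes, cpu_fraction, period_us=100000):
    """Map cgroup v1 control files to values expressing the limits."""
    quota_us = int(round(period_us * cpu_fraction))
    return {
        "memory/{group}/memory.limit_in_bytes": str(memory_bytes),
        "cpu/{group}/cpu.cfs_period_us": str(period_us),
        # CPU time allowed per period; quota/period is the CPU share.
        "cpu/{group}/cpu.cfs_quota_us": str(quota_us),
    }


def apply_settings(settings, group, root="/sys/fs/cgroup"):
    """Write the values out. Assumes root and a cgroup v1 layout."""
    for template, value in settings.items():
        path = os.path.join(root, template.format(group=group))
        # On cgroupfs, creating the directory creates the cgroup.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(value)


# 1GB of memory and 10% of one CPU.
settings = cgroup_v1_settings(memory_bytes=1024 ** 3, cpu_fraction=0.10)
```

Calling apply_settings(settings, "demo") as root would create a group named demo (a made-up name); a process joins the group when its pid is written to that group's cgroup.procs file.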

Linux namespaces

Linux namespaces let us isolate resources for a process group: a process cannot see resources outside its namespaces.
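One way to poke at this is to look at which namespaces a process belongs to. On Linux, each process has a /proc/<pid>/ns directory with one symlink per namespace. The sketch below reads it with the standard library; list_namespaces is a helper name I made up, and it returns an empty dict on systems without /proc.

```python
import os


def list_namespaces(pid="self"):
    """Return {namespace name: identifier} for a process, or {} if unavailable."""
    ns_dir = "/proc/{}/ns".format(pid)
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # no /proc, or no such process
    for name in names:
        try:
            # Each entry is a symlink whose target identifies the namespace.
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            pass  # entry not readable for this process
    return result


for name, ident in list_namespaces().items():
    print(name, ident)
```

Two processes that print the same identifier for an entry share that namespace, so comparing the output on the host with the output inside a container shows which ones differ.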

The possible namespaces, via the man page on namespaces, are cgroup, IPC, network, mount, PID, user, and UTS.

When docker instantiates an image into a container, it must:

But at the end of the day, it’s just a Linux process.

This article was also pretty helpful: it shows a simplified container by creating a child process, setting the namespaces, and pivoting the root.
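In the same spirit, here is a small sketch of the first two steps (child process, new namespaces) using ctypes to call the unshare syscall wrapper from libc. It is Linux-only, needs root to actually run, and skips pivoting the root entirely. The helper names are mine; the flag values are the CLONE_NEW* constants from the kernel headers.

```python
import ctypes
import os

# CLONE_NEW* flag values from <linux/sched.h>.
CLONE_FLAGS = {
    "mnt": 0x00020000,  # CLONE_NEWNS
    "uts": 0x04000000,  # CLONE_NEWUTS
    "ipc": 0x08000000,  # CLONE_NEWIPC
    "pid": 0x20000000,  # CLONE_NEWPID
    "net": 0x40000000,  # CLONE_NEWNET
}


def namespace_flags(names):
    """OR together the flags for the requested namespaces."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


def run_in_new_namespaces(names, argv):
    """Fork, unshare in the child, then exec argv. Needs root on Linux."""
    libc = ctypes.CDLL(None, use_errno=True)
    pid = os.fork()
    if pid == 0:
        try:
            if libc.unshare(namespace_flags(names)) == 0:
                os.execvp(argv[0], argv)
        finally:
            os._exit(1)  # only reached if unshare or exec failed
    _, status = os.waitpid(pid, 0)
    return status
```

For example, running run_in_new_namespaces(["uts"], ["hostname"]) as root starts hostname in its own UTS namespace. The child exits with status 1 if unshare or exec fails, which is what happens without the needed privileges.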

More?

I’m finding I learn best by building a mental model for a technology, and this is how I view containers. I also tried running strace on Docker while it started docker run hello-world, and thankfully found nothing too unexpected:

# Setting some cgroups
[pid 20418] mkdirat(AT_FDCWD, "/sys/fs/cgroup/memory/docker/03dd30fd0992dccf8d166d7963f150c96286be38e05c29ccb2947f2e317564f3", 0755) = 0
[pid 20418] openat(AT_FDCWD, "/sys/fs/cgroup/memory/docker/03dd30fd0992dccf8d166d7963f150c96286be38e05c29ccb2947f2e317564f3/cgroup.procs", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0700) = 6
[pid 20418] fcntl(6, F_GETFL)           = 0x8001 (flags O_WRONLY|O_LARGEFILE)
[pid 20418] fcntl(6, F_SETFL, O_WRONLY|O_NONBLOCK|O_LARGEFILE) = 0
[pid 20418] write(6, "20419", 5)        = 5
...
# Clone and create new namespaces
[pid 20419] clone(strace: Process 20420 attached
 <unfinished ...>
[pid 20420] getpid()                    = 20420
[pid 20420] close(6)                    = 0
[pid 20420] prctl(PR_SET_NAME, "runc:[1:CHILD]\0r") = 0
[pid 20420] unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWNET) = 0
...
# Mounting into image fs? Setting up environment?
[pid 20422] mount("/var/lib/docker/overlay2/2abdd3efa33d576f71671374bd4a8bfd9f1e5a90f7d9e4ea67573ec18d68c1f6/merged", "/var/lib/docker/overlay2/2abdd3efa33d576f71671374bd4a8bfd9f1e5a90f7d9e4ea67573ec18d68c1f6/merged", 0xc4200e5e20, MS_BIND|MS_REC, NULL <unfinished ...>
[pid 20422] <... mount resumed> )       = 0
[pid 20422] stat("/var/lib/docker/overlay2/2abdd3efa33d576f71671374bd4a8bfd9f1e5a90f7d9e4ea67573ec18d68c1f6/merged/proc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 20422] mount("proc", "/var/lib/docker/overlay2/2abdd3efa33d576f71671374bd4a8bfd9f1e5a90f7d9e4ea67573ec18d68c1f6/merged/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = 0
[pid 20422] stat("/var/lib/docker/overlay2/2abdd3efa33d576f71671374bd4a8bfd9f1e5a90f7d9e4ea67573ec18d68c1f6/merged/dev", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
...

There’s obviously a lot more if you want to dig into it, as hinted by the runc and containerd mentions in the strace output.
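If you want to skim a long strace log like the one above, a small script can help. This is only a sketch: it assumes lines in the "[pid N] syscall(...)" shape shown in the excerpt, skips the "resumed" continuation lines, and count_syscalls is a name I made up. The sample lines are copied from the output above.

```python
import re
from collections import Counter

# Matches lines such as: [pid 20420] close(6) = 0
SYSCALL = re.compile(r"^\[pid\s+(\d+)\]\s+(\w+)\(")


def count_syscalls(lines):
    """Count (pid, syscall name) pairs in strace -f style output."""
    counts = Counter()
    for line in lines:
        match = SYSCALL.match(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


sample = [
    '[pid 20418] write(6, "20419", 5)        = 5',
    "[pid 20420] getpid()                    = 20420",
    "[pid 20420] close(6)                    = 0",
    "[pid 20420] unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWNET) = 0",
    "[pid 20422] <... mount resumed> )       = 0",
]
counts = count_syscalls(sample)
for (pid, name), n in sorted(counts.items()):
    print(pid, name, n)
```

Grouping by pid makes it easier to see which process sets up the cgroups and which one calls unshare.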