Linux Containers: What, Why, How

It’s about 5 years after the container hype started, and to be honest, I’m still at the what-even-are-these stage. So last week I spent some time looking at them! In this post I mostly want to summarize:

Why containers
What/how containers (the mental model)
Other options?

To start with, containers (Linux Containers, Docker, rkt) refer to a method of runtime isolation: spawned processes are added to Linux namespaces (isolation) and cgroups (resource limits). Commonly, when people talk about containers, they’re also referring to an archive format for shipping around the data needed to instantiate a container.

What problems do containers address?

Packaging

We’d like to be able to capture all the runtime dependencies (code libraries, artifacts, config files, and the runtime environment, e.g. environment variables) needed to run an application into an archive that can be freely delivered to another machine. How this is done varies widely across languages and ecosystems.

In an ideal world, there would be no implicit dependencies, no environment dependent bugs, and no dependency version conflicts with other applications.

Runtime Isolation

Once we deploy an application, we want consistent behavior regardless of whatever other applications are running on the same machine. Ideally, the process is isolated from the host and other applications, the process does not have access to unspecified resources (i.e. the packaging format truly captures all dependencies) and applications cannot interfere with each other.

Other benefits

Usually containers are mentioned for a bunch of other benefits too:

Environment consistency: For application devs, containers give a way to specify a consistent, predictable environment. If it runs on a developer machine, it can run anywhere (e.g. test environments, production): no environment-specific bugs!
Standardized deployment unit: By standardizing an encompassing packaging format and a runtime for any application, containers become the unit of deployment, which enables higher-level infrastructure like schedulers.
Portability: Once you’ve packaged a container, anywhere that can run a container can run it without extra work.
Separation of concerns: Containers draw a line between infrastructure and application. An application developer just cares about dependencies, but shouldn’t need to know about the host machine’s configuration. On the infrastructure side, this limits the extent of weird quirky applications to build around.
Utilization: If we can isolate applications and define resource limits, containers become the unit of scaling instead of a host or virtual machine. We can pack applications more tightly onto a single machine with less wasted headroom. We can possibly run processes with different resource needs together (e.g. disk bound vs CPU bound? No idea if that’s actually a good idea in practice).
Security: In theory, isolating applications and limiting privileges sounds great, but security seems like a big point of contention between containers and alternatives (e.g. VMs, BSD Jails). I haven’t dug too far into this, but it seems like there’s a lot that can be done to harden containers (AppArmor and seccomp are words I’ve encountered).

Containers sound great, but their effectiveness comes down to how well applications are packaged and isolated in reality. Other tools approach these problems differently: virtual machines for runtime isolation, configuration management (e.g. puppet/ansible) for host configuration and dependencies, or tarballs/RPMs as packages.

How they work

Under the hood, containers are just another archive format for packaging, plus a combination of Linux namespaces and cgroups for isolation. Docker seems to be the default container choice, so I dug into what it does.

Packaging

Docker packs everything you need for your application into a Docker Image. Images include metadata (e.g. entry point, environment variables, args, resource limits) and all the files/directories needed for the contained application. Rather than manually creating an archive, you more conveniently specify an image with a Dockerfile, which contains instructions to build the image (e.g. a base, directories, how to install dependencies), like this example:

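# Start from a slim Python 2.7 base image
FROM python:2.7-slim
# Run later instructions from /app inside the image
WORKDIR /app
# Copy the build context (the current directory) into /app
ADD . /app
# Install the Python dependencies listed in requirements.txt
RUN pip install --trusted-host pypi.python.org -r requirements.txt
# Default command when a container starts from this image
CMD ["python", "app.py"]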

Really, an image could just be a tarball, and the build file would just describe how to copy files before archiving. But Docker optimizes images by building them as read-only layers. An image starts with a base, and each subsequent layer specifies a diff on top of the previous one. Rather than shipping redundant tarballs, many images with the same base can share large portions of the image. If a container modifies a file at runtime, the file is copied into a small writable layer, and Docker uses specific storage drivers that resolve a file reference to its location in the layers.

The advantages are that builds are faster (layers can be cached), images take less space (e.g. all containers may share the same base), and startup and teardown are faster (only the writable space needs to be allocated).
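Here is a rough sketch of the layer lookup and copy-on-write idea described above. It is a toy model with plain dictionaries standing in for layers, not how Docker's storage drivers are actually implemented; the `LayeredFS` class and the file contents are made up for illustration.

```python
from typing import Dict, List, Optional


class LayeredFS:
    """Toy union filesystem: read-only image layers plus one writable layer."""

    def __init__(self, image_layers: List[Dict[str, str]]):
        # image_layers[0] is the base; each later layer is a diff on top of it.
        self.image_layers = image_layers
        # Every container gets its own small writable layer.
        self.writable: Dict[str, str] = {}

    def read(self, path: str) -> Optional[str]:
        # Search from the top (writable) layer down to the base.
        for layer in [self.writable, *reversed(self.image_layers)]:
            if path in layer:
                return layer[path]
        return None  # the path exists in no layer

    def write(self, path: str, data: str) -> None:
        # Copy-on-write: the shared image layers are never modified.
        self.writable[path] = data


base = {"/bin/sh": "shell", "/etc/motd": "hello from base"}
app_layer = {"/app/app.py": "print('hi')", "/etc/motd": "hello from app"}

container_a = LayeredFS([base, app_layer])
container_b = LayeredFS([base, app_layer])  # shares the same image layers

container_a.write("/etc/motd", "changed in a")
print(container_a.read("/etc/motd"))
print(container_b.read("/etc/motd"))
```

Both containers point at the same `base` and `app_layer` objects, so the only per-container cost is the writable dictionary, which is the space and startup saving the layered design is after.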

Runtime Isolation

The fundamental Linux features that isolate a process are cgroups (resource limits) and namespaces (isolated views of resources).

Cgroups

Cgroups provide a way to specify and enforce resource limits, namely memory limits and CPU utilization. For example, you can specify that a process group gets 1GB of memory before it swaps, or 10% of CPU time.
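To make the 1GB / 10% example concrete, here is a minimal sketch that computes which cgroup files would be written and with what values. It assumes the cgroup v1 layout mounted at /sys/fs/cgroup and expresses "10% of CPU time" as a CFS quota over a 100ms period, which is one way to do it (relative CPU shares are another). The group name `demo` and the helper name are hypothetical, and cgroup v2 uses different file names (e.g. memory.max, cpu.max). Nothing is written to disk; actually applying the values needs root.

```python
from typing import Dict

CGROUP_ROOT = "/sys/fs/cgroup"  # conventional cgroup v1 mount point (assumed)


def cgroup_v1_settings(group: str, memory_bytes: int, cpu_fraction: float,
                       period_us: int = 100_000) -> Dict[str, str]:
    """Return {file path: value} for a memory limit and a CPU quota."""
    # The quota is the CPU time (in microseconds) allowed per period.
    quota_us = round(period_us * cpu_fraction)
    return {
        f"{CGROUP_ROOT}/memory/{group}/memory.limit_in_bytes": str(memory_bytes),
        f"{CGROUP_ROOT}/cpu/{group}/cpu.cfs_period_us": str(period_us),
        f"{CGROUP_ROOT}/cpu/{group}/cpu.cfs_quota_us": str(quota_us),
    }


# "demo" is a hypothetical group name: 1GB of memory, 10% of one CPU.
settings = cgroup_v1_settings("demo", memory_bytes=1024 ** 3, cpu_fraction=0.10)
for path, value in settings.items():
    print(f"{value:>12} -> {path}")
```

A process joins the group when its PID is written to the group's cgroup.procs file, which is the same kind of write that shows up in the strace output later in this post.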

Linux namespaces

Linux namespaces let us isolate resources for a process group: a process cannot see the other resources outside its namespace.
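One way to poke at this is to look at which namespaces a process belongs to. The sketch below reads the symlinks under /proc/<pid>/ns, whose targets look like `pid:[4026531836]`; two processes are in the same namespace when the number matches. The helper names are my own, and on a system without /proc the function simply returns an empty result.

```python
import os
import re
from typing import Dict, Tuple

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a link target such as 'pid:[4026531836]' into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid: str = "self") -> Dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace inode number."""
    ns_dir = f"/proc/{pid}/ns"
    result: Dict[str, int] = {}
    if not os.path.isdir(ns_dir):  # not Linux, or /proc is not mounted
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        result[name] = inode
    return result


for name, inode in namespaces_of().items():
    print(f"{name:20} {inode}")
```

Running this on the host and again inside a container should show different numbers for the namespaces the container runtime created.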

The possible namespaces via the man page on namespaces:

Network (Network devices, stacks, ports, etc.): Isolate processes into their own network (by isolating network devices, IP stacks, ports, firewalls, and more). It is possible to create a network link between namespaces, providing “a pipe-like abstraction”
Mount (Mount points): A namespace gets its own mount table, and can mount/unmount directories without affecting others.
PID (Process IDs): Isolates process IDs. Namespaced PIDs get mapped to an actual PID outside the namespace. Notably, the first process gets PID 1 within a container (if a process dies, its orphaned children get reparented to PID 1).
User (User and group IDs): Isolates user and group IDs. Like PIDs, these are mapped to actual IDs outside the namespace, so a process can have full privileges within the user namespace as UID 0 while being mapped to a normal, unprivileged user externally.
UTS (Hostname and domain name): Isolate system identifiers from gethostname and getdomainname
IPC (System V IPC, POSIX message queues): Isolate interprocess communication resources like message queues.
Cgroup (Cgroup root directory): Here for completeness, but minor: limits the view of other cgroups
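The ID mapping for user namespaces can be sketched in a few lines. The kernel exposes the mapping in /proc/<pid>/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. The example mapping below (`0 100000 65536`) is made up for illustration, and the function names are hypothetical.

```python
from typing import List, Optional, Tuple

IdRange = Tuple[int, int, int]  # (first inside ID, first outside ID, length)


def parse_id_map(text: str) -> List[IdRange]:
    """Parse the contents of a uid_map or gid_map file."""
    entries: List[IdRange] = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def to_outside_id(entries: List[IdRange], inside_id: int) -> Optional[int]:
    """Translate an ID seen inside the namespace to the ID outside it."""
    for inside, outside, length in entries:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None  # this ID has no mapping


# Made-up mapping: IDs 0..65535 inside map to 100000..165535 outside.
entries = parse_id_map("0 100000 65536\n")
print(to_outside_id(entries, 0))     # root inside the namespace
print(to_outside_id(entries, 1000))  # an ordinary user inside the namespace
```

With a mapping like this, UID 0 inside the namespace corresponds to an unprivileged UID on the host, which is the "root inside, normal user outside" behaviour described for user namespaces above.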

When Docker instantiates an image into a container, it must:

create the process
set cgroup limits
move the process into new namespaces and new root
instantiate the namespaces and environment, particularly mounting the filesystem from the image and setting up the network to link to the external world.

But at the end of the day, it’s just a Linux process.
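The "move the process into new namespaces" step boils down to a bitmask handed to clone or unshare, with one flag per namespace type. Below is a small sketch that builds and decodes such a mask. The flag values are the ones from the Linux headers as far as I know them, and the helper names are my own; nothing here actually creates a namespace, since that would need privileges.

```python
from typing import Iterable, List

# Namespace flags for clone(2)/unshare(2), values as in <linux/sched.h>.
CLONE_FLAGS = {
    "CLONE_NEWNS": 0x00020000,      # mount namespace
    "CLONE_NEWCGROUP": 0x02000000,  # cgroup namespace
    "CLONE_NEWUTS": 0x04000000,     # hostname and domain name
    "CLONE_NEWIPC": 0x08000000,     # System V IPC, POSIX message queues
    "CLONE_NEWUSER": 0x10000000,    # user and group IDs
    "CLONE_NEWPID": 0x20000000,     # process IDs
    "CLONE_NEWNET": 0x40000000,     # network
}


def combine(names: Iterable[str]) -> int:
    """OR the named flags together into one bitmask."""
    mask = 0
    for name in names:
        mask |= CLONE_FLAGS[name]
    return mask


def decode(mask: int) -> List[str]:
    """List the namespace flags that are set in a bitmask."""
    return [name for name, bit in CLONE_FLAGS.items() if mask & bit]


# The same set of flags that appears in the unshare call of the strace output.
mask = combine(["CLONE_NEWNS", "CLONE_NEWUTS", "CLONE_NEWIPC",
                "CLONE_NEWPID", "CLONE_NEWNET"])
print(hex(mask))
print(decode(mask))
```

Recent Python versions expose similar constants and an unshare call in the os module, but I have kept this version free of them so it runs anywhere.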

This article was also pretty helpful: it builds a simplified container by creating a child process, setting its namespaces, and pivoting the root.

More?

I’m finding I learn best by finding a mental model for a technology, and this is how I view containers. I also ran strace on Docker while it started docker run hello-world, and thankfully found nothing too unexpected:

# Setting some cgroups
[pid 20418] mkdirat(AT_FDCWD, "/sys/fs/cgroup/memory/docker/03dd30fd0992dccf8d166d7963f150c96286be38e05c29ccb2947f2e317564f3", 0755) = 0
[pid 20418] openat(AT_FDCWD, "/sys/fs/cgroup/memory/docker/03dd30fd0992dccf8d166d7963f150c96286be38e05c29ccb2947f2e317564f3/cgroup.procs", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0700) = 6
[pid 20418] fcntl(6, F_GETFL)           = 0x8001 (flags O_WRONLY|O_LARGEFILE)
[pid 20418] fcntl(6, F_SETFL, O_WRONLY|O_NONBLOCK|O_LARGEFILE) = 0
[pid 20418] write(6, "20419", 5)        = 5
...
# Clone and create new namespaces
[pid 20419] clone(strace: Process 20420 attached
 <unfinished ...>
[pid 20420] getpid()                    = 20420
[pid 20420] close(6)                    = 0
[pid 20420] prctl(PR_SET_NAME, "runc:[1:CHILD]\0r") = 0
[pid 20420] unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWPID|CLONE_NEWNET) = 0
...
# Mounting into image fs? Setting up environment?
[pid 20422] mount("/var/lib/docker/overlay2/2abdd3efa33d576f71671374bd4a8bfd9f1e5a90f7d9e4ea67573ec18d68c1f6/merged", "/var/lib/docker/overlay2/2abdd3efa33d576f71671374bd4a8bfd9f1e5a90f7d9e4ea67573ec18d68c1f6/merged", 0xc4200e5e20, MS_BIND|MS_REC, NULL <unfinished ...>
[pid 20422] <... mount resumed> )       = 0
[pid 20422] stat("/var/lib/docker/overlay2/2abdd3efa33d576f71671374bd4a8bfd9f1e5a90f7d9e4ea67573ec18d68c1f6/merged/proc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 20422] mount("proc", "/var/lib/docker/overlay2/2abdd3efa33d576f71671374bd4a8bfd9f1e5a90f7d9e4ea67573ec18d68c1f6/merged/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC, NULL) = 0
[pid 20422] stat("/var/lib/docker/overlay2/2abdd3efa33d576f71671374bd4a8bfd9f1e5a90f7d9e4ea67573ec18d68c1f6/merged/dev", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
...

There’s obviously a lot more if you want to dig into it, as hinted at by the runc and containerd references in the strace output.

Seena Burns

DEVLOG

Jan 15 - Linux Containers: What, Why, How
