I’ve wanted to understand more about the process of how source code gets compiled and packaged with its dependencies into a deployable artifact. I’m starting with C, since most things either follow the C way of doing things or get compared to it.

I’d like to start filling in some gaps in my knowledge like:

File types

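When dealing with C, we have four different types of files: source files (.c), header files (.h), object files (.o) and binaries (executables).

But wait, here’s another file type for free: libraries (.a and .so), which package object files together.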

Building our source code

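Building our code is the process of taking our source code to an executable. Without digging into the internals of a compiler, for C this involves preprocessing each source file (pulling in its headers), compiling the result into an object file, and linking the object files and libraries into an executable.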

The C build process is actually even simpler logically than the file types we mentioned.

Header files are just source files that get preprocessed into other files, not a separate concept. By convention, header files contain just function declarations, but you can include anything a source file can, and people commonly do (like with single header file libraries).
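To make the "anything a source file can" point concrete, here’s a minimal sketch of a header that carries a full definition next to a plain declaration. The file name shapes.h and the functions square and twice are made up for illustration, and the include guard is just the usual convention, not something the preprocessor requires.

```c
/* shapes.h: hypothetical header, names invented for this sketch */
#ifndef SHAPES_H
#define SHAPES_H

/* Declaration only: the definition is expected to live in some .c file. */
int square(int x);

/* A full definition inside a header. `static inline` gives every source
   file that includes this header its own private copy, so the linker
   never sees two competing definitions of `twice`. */
static inline int twice(int x) {
  return x + x;
}

#endif /* SHAPES_H */
```

Any source file that includes this can call twice right away, while calling square still needs an object file that defines it at link time.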

Libraries, again, are just object files packaged together. A library is like an uncompressed zip or tar archive, and I like to think of it as a bunch of object files cat’d together with an index at the top.

So really what we have are source files (source and header files), intermediates (object files and libraries) and the final target (binaries). You may consider a library the final target of your build, depending on whether you’re building an executable or not.

Building in action

Let’s check out how this maps to the simplest of examples:

// main.c
int main() {
  return 0;
}
# compile main.c into an object file, main.o
gcc -c main.c
# link main.o into an executable
gcc main.o -o main

Cool! We’ve compiled a source file into an object file (main.o) and then linked it into an executable (main).

Single source dependency

Okay now let’s add a source file dependency:

// add.h
int add(int a, int b);
// add.c
int add(int a, int b) {
  return a + b;
}
// main.c
#include "add.h"

int main() {
  return add(0,1);
}

We can use gcc -E to see the output of the preprocessor:

> gcc -E main.c
# 1 "main.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "main.c"

# 1 "add.h" 1
int add(int a, int b);
# 3 "main.c" 2

int main() {
  return add(0,1);
}

I haven’t dug into what all the output is, but we can see that the preprocessor copies add.h into main.c as we thought.

However, using the same compile commands fails:

> gcc -c main.c
> gcc main.o -o main
main.o: In function `main':
main.c:(.text+0xf): undefined reference to `add'
collect2: error: ld returned 1 exit status

Let’s run nm on main.o to see what symbols are used.

# nm shows symbols in an object file
# man nm shows all the symbol types
# briefly T = symbol is in the code section, U = undefined
> nm main.o
                 U add
0000000000000000 T main

Here we see add is undefined, which makes sense since we never compiled the add function to binary. We need to go through the same process to compile add.c into an object file and then link it with main.o.

# compile object files
gcc -c main.c
gcc -c add.c
# link
gcc main.o add.o -o main
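The undefined reference above can be modelled with a toy symbol table. This is only an analogy for what the linker checks, not how ld is actually implemented; struct symbol and count_unresolved are names I made up for the sketch.

```c
#include <stddef.h>
#include <string.h>

/* One row of a toy symbol table: a symbol name and the object file that
   defines it. defined_in == NULL plays the role of nm's "U" (undefined). */
struct symbol {
  const char *name;
  const char *defined_in;
};

/* Counts the symbols that stay undefined after "linking" all n rows.
   An undefined row is resolved when some other row with the same name
   carries a definition. */
size_t count_unresolved(const struct symbol *syms, size_t n) {
  size_t unresolved = 0;
  for (size_t i = 0; i < n; i++) {
    if (syms[i].defined_in != NULL) continue; /* already defined */
    int found = 0;
    for (size_t j = 0; j < n; j++) {
      if (syms[j].defined_in != NULL &&
          strcmp(syms[i].name, syms[j].name) == 0) {
        found = 1;
        break;
      }
    }
    if (!found) unresolved++;
  }
  return unresolved;
}
```

Feeding it only the rows from main.o (add undefined, main defined) leaves one unresolved symbol, which is the situation that produced the linker error. Adding a row for add defined in add.o brings the count to zero.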

Building our own static library

Now let’s add mult.h/c and build our own static library.

// mult.h
int mult(int a, int b);
// mult.c
int mult(int a, int b) {
  return a * b;
}

Before, we would have had to do something like:

# compile object files
gcc -c main.c
gcc -c add.c
gcc -c mult.c
# link
gcc main.o add.o mult.o -o main

But now we will package add.o and mult.o into a single library:

gcc -c main.c
gcc -c add.c
gcc -c mult.c
# create library
ar rcs libmath.a add.o mult.o
# link
gcc main.o libmath.a -o main

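ar creates an archive from our object files: r inserts (or replaces) the given files, c creates the archive without complaining if it doesn’t exist yet, and s writes an index into it. Let’s run nm on it, with -s so it also prints the archive index:

> nm -s libmath.a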
Archive index:
add in add.o
mult in mult.o

add.o:
0000000000000000 T add

mult.o:
0000000000000000 T mult

So it looks like what we expected: it includes an index from symbol to object file and then the contents of each object file. We end up using it exactly the same as an object file when linking.
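To see how plain the container is, here’s a rough sketch that pokes at the raw bytes of an archive. It assumes the common ar layout as I understand it from GNU binutils: an 8-byte signature, then a fixed 60-byte text header in front of each member. The function names is_ar_archive and ar_member_size are mine, not part of any library.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define AR_MAGIC "!<arch>\n" /* signature at the start of an ar archive */
#define AR_MAGIC_LEN 8
#define AR_HEADER_LEN 60     /* fixed-size header before each member */

/* Returns 1 if buf starts with the ar signature, 0 otherwise. */
int is_ar_archive(const char *buf, size_t len) {
  return len >= AR_MAGIC_LEN && memcmp(buf, AR_MAGIC, AR_MAGIC_LEN) == 0;
}

/* Reads the size field of the member header that starts at `offset`.
   Assumed header layout (all ASCII, space padded):
   name[16] mtime[12] uid[6] gid[6] mode[8] size[10] end[2]
   Returns -1 if the header is truncated or lacks its end marker. */
long ar_member_size(const char *buf, size_t len, size_t offset) {
  char field[11];
  if (offset + AR_HEADER_LEN > len) return -1;
  const char *hdr = buf + offset;
  if (hdr[58] != '`' || hdr[59] != '\n') return -1; /* end marker */
  memcpy(field, hdr + 48, 10); /* size is a decimal string at offset 48 */
  field[10] = '\0';
  return strtol(field, NULL, 10);
}
```

Reading libmath.a into a buffer and calling ar_member_size(buf, len, 8) should give the size of the first member, which with an index present is the index itself rather than add.o.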

Building our own dynamic library

Dynamic libraries (aka shared libraries) do change things a little: they let us defer symbol resolution until runtime. This lets us do cool stuff like hot reloading code and letting multiple binaries load the same shared library.

Continuing from the same example as before, our compiling now looks like:

# compile object files
# -fPIC makes it position independent
# positions are relative, so it can be relocated in memory when loaded
gcc -c main.c
gcc -c -fPIC add.c
gcc -c -fPIC mult.c
# create library
gcc -shared -o libmath.so.1 add.o mult.o
# link
# -L. adds the current dir to the library search path
# you can also use -lmath to link libmath.so
gcc main.o -o main -L. -l:libmath.so.1

When running, we also need to specify the library search path (where the loader looks for dynamic libraries):

# Run
> LD_LIBRARY_PATH=. ./main

# Show dynamic library dependency resolution
> LD_LIBRARY_PATH=. ldd main
...
    libmath.so.1 => ./libmath.so.1 (0x00007fa023369000)
...

Dynamic loading is a pretty big topic of its own, but it still serves the same purpose of resolving symbols like an object file, just with some magic so we can do that after compile time. Unfortunately, this complicates deploying build artifacts since you need to have the library in place with the final binary.
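For the "hot reloading" flavour of this, a program can also ask the loader for a symbol explicitly at runtime. The sketch below uses dlopen, dlsym and dlclose from <dlfcn.h>; the helper call_binary_op and the binary_op typedef are made up for illustration. This is a different route from the -l linking above, where the loader resolves everything for us at startup. On older glibc versions you may need to link with -ldl.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Shape of the functions we expect to find, e.g. int add(int, int). */
typedef int (*binary_op)(int, int);

/* Loads the shared library at `path`, looks up `name`, calls it with
   (a, b) and stores the result in *out.
   Returns 0 on success, -1 if the library or the symbol is missing. */
int call_binary_op(const char *path, const char *name,
                   int a, int b, int *out) {
  void *handle = dlopen(path, RTLD_NOW); /* resolve symbols immediately */
  if (handle == NULL) return -1;

  binary_op op;
  /* POSIX idiom for turning dlsym's void* into a function pointer. */
  *(void **)(&op) = dlsym(handle, name);
  if (op == NULL) {
    dlclose(handle);
    return -1;
  }

  *out = op(a, b);
  dlclose(handle);
  return 0;
}
```

With the library built above, call_binary_op("./libmath.so.1", "add", 0, 1, &result) should find add without main having been linked against libmath at all.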

Printing and libc

We’re going to get a little crazy here and actually output text. This time we’re just going to have main.c but include stdio.h.

// main.c
#include <stdio.h> // puts

int main() {
  puts("Hello");
}
# compile main.c into an object file, main.o
gcc -c main.c
# link main.o into an executable
gcc main.o -o main
./main
# outputs: Hello

We never defined stdio.h or puts but everything works fine. Running gcc -E main.c produces an enormous output, but it shows stdio.h being pulled in from a system include directory. Let’s run nm on the object file and the binary to see the symbols in each.

> nm main.o
0000000000000000 T main
                 U puts
> nm main
...
                 U __libc_start_main@@GLIBC_2.2.5
0000000000400526 T main
                 U puts@@GLIBC_2.2.5
00000000004004a0 t register_tm_clones
...

Looks like puts is referenced but not defined in main.o, and nm main points us to GLIBC. libc is the standard library for C, and glibc is the GNU implementation that most Linux distributions ship (it is a separate project from gcc). It turns out gcc implicitly links against it, dynamically by default, on every build.

Running ldd on the gcc output shows us dynamic library dependencies (also called shared objects), confirming gcc is linking more than just our main.o.

# ldd prints "shared object dependencies" (dynamic libraries)
> ldd main
    linux-vdso.so.1 =>  (0x00007fff945b0000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0c8d0f7000)
    /lib64/ld-linux-x86-64.so.2 (0x0000556bcd7fb000)

GCC is doing a lot more than just calling the linker ld with ld main.o -o main. We can see it all with gcc -v main.o -o main… it’s a lot. It seems hard to make ld work directly because of all the libraries and startup files we need to link against to actually make a C executable.

So even for 4 lines of code, we’ve got a lot going on. We found out GCC is doing a lot implicitly to build an executable that we glossed over before. Apparently we need vdso (a small shared object the kernel maps into every process so that some system calls, such as gettimeofday, can run without switching into kernel mode), libc (the standard C library) and ld-linux (the dynamic linker/loader).

BUT, it does still fall under our mental model. We build main.c into main.o, which has some undefined references. In order to make an executable, we combine our intermediate object files with libc dynamically (and a loader) and if every symbol is resolved, it works! It’s the same as the dynamic library example, with just implicit stuff happening that probably makes reliable building a headache.

Wrapping up

I’ve learned a lot about C builds from this, and I’m curious to see what other languages do. Thankfully, the mental model of source to object to binary (or library) target is pretty straightforward, even if we ended up doing a lot of work digging into some really simple builds.