I’ve wanted to understand more about the process of how source code gets compiled and packaged with its dependencies into a deployable artifact. I’m starting with C, since most things either follow the C way of doing things or get compared to it.
I’d like to start filling in some gaps in my knowledge like:
- What are the steps of building a C program? A compiler? Linker? What else?
- What are those .o files that come out?
- How does source code depend on other files? How does the compiler package dependencies?
- What are libraries? How are they different, how do I make them? Dynamic vs static?
File types
When dealing with C, we have four different types of files:
- Source code (*.c files)
  Source files contain function definitions.
- Header files (*.h files)
  If we don’t declare a function before using it, the compiler will complain. We include header files, which contain function declarations, when we want source files to reference externally defined functions.
- Object files (*.o files)
  Object files are the output of the compiler. They contain function definitions in binary form (machine code), but haven’t been linked into an executable yet and may contain unresolved references to symbols defined elsewhere.
- Binary executables
  Executables are the output of the linker, which links a number of object files together to form a file that can be directly executed. Sometimes just called binaries.
But wait, here’s another file type for free!
- Libraries (*.a for static libraries, *.so for dynamic libraries)
  Libraries are just object files joined together into one file. Conceptually, they do the same thing as object files: they contain binary forms of function definitions. They can be linked with other object files and libraries to form a binary.
  Static libraries are copied into the executable at link time like other object files. Dynamic libraries let us defer loading until runtime.
Building our source code
Building our code is the process of taking our source code to an executable. Without digging into the internals of a compiler, for C this involves:
- Preprocessor (source/header files -> expanded source)
  The preprocessor is responsible for transforming source code as indicated by the preprocessor directives. For example, the preprocessor replaces the line #include "header.h" with the entire contents of header.h. #define is another common directive, used for macros and constants, where the preprocessor replaces every instance of a defined name. The compiler invokes the preprocessor automatically before it runs, so all it sees are the processed source files.
- Compiler (expanded source -> object files)
  The compiler turns the preprocessed source into machine code, producing one object file per source file. Object files can be packaged together into a library by a separate tool.
- Linker (object files -> executable)
  The linker takes object files and libraries and combines them into an executable, resolving any external symbols in the process.
The C build process is actually even simpler logically than the file types above suggest.
Header files are just source files that get preprocessed into other files, not a separate concept. By convention, header files contain just function declarations, but you can include anything a source file can, and people commonly do (like with single header file libraries).
Libraries, again, are just object files packaged together. A library is like an uncompressed zip or tar archive, and I like to think of it as a bunch of object files cat’d together with an index at the top.
So really what we have are source files (source and header files), intermediates (object files and libraries) and the final target (binaries). You may consider a library the final target of your build depending on if you’re building an executable or not.
Building in action
Let’s check out how this maps to the simplest of examples:
// main.c
int main() {
    return 0;
}
# compile main.c into an object file, main.o
gcc -c main.c
# link main.o into an executable
gcc main.o -o main
Cool! We’ve compiled a source file into an object file (main.o) and then linked it into an executable (main).
Single source dependency
Okay now let’s add a source file dependency:
// add.h
int add(int a, int b);
// add.c
int add(int a, int b) {
    return a + b;
}
// main.c
#include "add.h"
int main() {
    return add(0,1);
}
We can use gcc -E to see the output of the preprocessor:
> gcc -E main.c
# 1 "main.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 1 "<command-line>" 2
# 1 "main.c"
# 1 "add.h" 1
int add(int a, int b);
# 3 "main.c" 2
int main() {
return add(0,1);
}
I haven’t dug into all of the output (the lines starting with # are linemarkers recording which file and line the following text came from), but we can see that the preprocessor copies add.h into main.c as we thought.
However, using the same compile commands fails:
> gcc -c main.c
> gcc main.o -o main
main.o: In function `main':
main.c:(.text+0xf): undefined reference to `add'
collect2: error: ld returned 1 exit status
Let’s run nm on main.o to see what symbols are used.
# nm shows symbols in an object file
# man nm shows all the symbol types
# briefly T = symbol is in the code section, U = undefined
> nm main.o
U add
0000000000000000 T main
Here we see add is undefined, which makes sense: main.c only contains the declaration of add, and we never compiled add.c or handed it to the linker. We need to go through the same process to compile add.c into an object file and then link it with main.o.
# compile object files
gcc -c main.c
gcc -c add.c
# link
gcc main.o add.o -o main
Building our own static library
Now let’s add mult.h and mult.c and build our own static library.
// mult.h
int mult(int a, int b);
// mult.c
int mult(int a, int b) {
    return a * b;
}
Before, we would have had to do something like:
# compile object files
gcc -c main.c
gcc -c add.c
gcc -c mult.c
# link
gcc main.o add.o mult.o -o main
But now we will package add.o and mult.o into a single library:
gcc -c main.c
gcc -c add.c
gcc -c mult.c
# create library
ar rcs libmath.a add.o mult.o
# link
gcc main.o libmath.a -o main
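ar creates an archive from our object files. In the rcs flags, r inserts the files into the archive (replacing any existing members), c creates the archive if it doesn’t exist, and s writes an index. Let’s run nm on it, with -s so it prints the archive index as well:
> nm -s libmath.a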
Archive index:
add in add.o
mult in mult.o
add.o:
0000000000000000 T add
mult.o:
0000000000000000 T mult
So it looks like what we expected: it includes an index from symbol to object file, followed by the contents of each object file. We end up using it exactly like an object file when linking.
Building our own dynamic library
Dynamic libraries (aka shared libraries) do change things a little: they let us defer symbol resolution until runtime. This lets us do cool stuff like hot reloading code and letting multiple binaries share the same library.
Continuing from the same example as before, our build now looks like:
# compile object files
# -fPIC makes it position independent
# positions are relative, so it can be relocated in memory when loaded
gcc -c main.c
gcc -c -fPIC add.c
gcc -c -fPIC mult.c
# create library
gcc -shared -o libmath.so.1 add.o mult.o
# link
# -L. adds the current dir to the library search path
# you can also use -lmath to link libmath.so
gcc main.o -o main -L. -l:libmath.so.1
When running we need to also specify the library search path (where the loader looks for dynamic libraries):
# Run
> LD_LIBRARY_PATH=. ./main
# Show dynamic library dependency resolution
> LD_LIBRARY_PATH=. ldd main
...
libmath.so.1 => ./libmath.so.1 (0x00007fa023369000)
...
Dynamic loading is a pretty big topic of its own, but it still serves the same purpose of resolving symbols like an object file, just with some magic so we can do that after compile time. Unfortunately, this complicates deploying build artifacts since you need to have the library in place with the final binary.
Printing and libc
We’re going to get a little crazy here and actually output text. This time we’re just going to have main.c, but include stdio.h.
// main.c
#include <stdio.h> // puts
int main() {
    puts("Hello");
}
# compile main.c into an object file, main.o
gcc -c main.c
# link main.o into an executable
gcc main.o -o main
./main
# outputs: Hello
We never defined stdio.h or puts, but everything works fine. Running gcc -E main.c produces an enormous output, but it looks like stdio.h is coming from somewhere. Let’s run nm on the object file and the binary to see the symbols in each.
> nm main.o
0000000000000000 T main
U puts
> nm main
...
U __libc_start_main@@GLIBC_2.2.5
0000000000400526 T main
U puts@@GLIBC_2.2.5
00000000004004a0 t register_tm_clones
...
Looks like puts is referenced but not defined in main.o, and nm main points us to GLIBC.
Libc is the standard library for C, and glibc is the GNU implementation that most Linux distributions ship and that gcc links against by default there. It turns out this gets implicitly dynamically linked on every build.
Running ldd on the gcc output shows us dynamic library dependencies (also called shared objects), confirming gcc is linking more than just our main.o:
# ldd prints "shared object dependencies" (dynamic libraries)
> ldd main
linux-vdso.so.1 => (0x00007fff945b0000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0c8d0f7000)
/lib64/ld-linux-x86-64.so.2 (0x0000556bcd7fb000)
GCC is doing a lot more than just calling the linker with ld main.o -o main. We can see it all with gcc -v main.o -o main … it’s a lot. It seems hard to make ld work directly because of all the libraries we need to link against to actually make a C executable.
So even for 4 lines of code, we’ve got a lot going on. We found out GCC is doing a lot implicitly to build an executable that we glossed over before. Apparently we need vdso (a small shared object the kernel maps into every process so that some system calls, like reading the clock, can run without switching into the kernel), libc (the standard C library) and ld-linux (the dynamic linker/loader).
BUT, it does still fall under our mental model. We build main.c into main.o, which has some undefined references. In order to make an executable, we combine our intermediate object files with libc dynamically (and a loader) and if every symbol is resolved, it works! It’s the same as the dynamic library example, with just implicit stuff happening that probably makes reliable building a headache.
Wrapping up
I’ve learned a lot about C builds from this, and I’m curious to see what other languages do. Thankfully, the mental model of source to object to binary (or library) target is pretty straightforward, even if we ended up doing a lot of work digging into some really simple builds.