Now that we have a good sense of how C builds artifacts, I want to survey some other languages and ecosystems to see the similarities and differences. I’m hoping that by the end I have a good idea of what characterizes builds and artifacts in a language, without prematurely diving into details (like how dynamic library search paths work in C). Here’s what I’m thinking:
- What does compilation look like? What are the targets and intermediates?
- How are external references resolved? What does linking and loading look like?
- How are dependencies included in a build? What work does a build system have to do to manage them?
- Are there any quirks?
Let’s look at some languages I’ve run into: Java, Python, and Go.
Java (and other JVM Languages)
Java (and other JVM languages like Scala and Clojure) compile to Java bytecode, rather than targeting a particular OS/available hardware directly like C. The Java Virtual Machine executes this bytecode since it is not directly executable. The JVM abstracts platform details like system calls, giving Java its famed portability: bytecode can be run anywhere the JVM is available.
Java follows the model of:
- Source files (
Class files (compiled bytecode,
Rather than interpret source code directly, the Java compiler (javac) compiles source into Java bytecode. This handles most of the heavy lifting (e.g. parsing, semantic analysis, optimizations) and produces a simpler output, making it fast for the JVM to stream-read bytecode and execute it.
Class files are analogous to object files in C: they contain the program in compiled form.
JARs (Java ARchive,
With C we saw it’s a pain to manage tons of object files, so we packaged them up into libraries; a static library is literally object files joined into a single archive file.
We have the same problem with class files, so we turn to archives again! A JAR packages class files and metadata together. Like C archives, JARs are really simple: a JAR is just a ZIP! You can even run unzip foo.jar to get back the contents.
The JVM itself (
In C, the final step was to link object files and libraries to create an executable binary (e.g. an ELF file). With Java, we simply distribute class files and JARs and point the JVM at the class files we want to execute.
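To make the class file and JAR pieces above a bit more concrete, here is a minimal Python sketch. It builds a JAR-shaped ZIP in memory and reads back the class file header. The helper names (fake_class_bytes, build_jar, class_versions) are made up for illustration, and the "class" is only the 8-byte header (magic number plus version), not something the JVM could load.

```python
import io
import struct
import zipfile

CLASS_MAGIC = 0xCAFEBABE  # first four bytes of every class file


def fake_class_bytes(major=52, minor=0):
    """Build only the 8-byte class file header (major 52 corresponds to Java 8)."""
    return struct.pack(">IHH", CLASS_MAGIC, minor, major)


def build_jar(entries):
    """Write a {name: bytes} mapping into an in-memory ZIP, i.e. a JAR-shaped archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as jar:
        jar.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
        for name, data in entries.items():
            jar.writestr(name, data)
    return buffer.getvalue()


def class_versions(jar_bytes):
    """Return {entry name: (major, minor)} for each .class entry in the archive."""
    versions = {}
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        for name in jar.namelist():
            if not name.endswith(".class"):
                continue
            magic, minor, major = struct.unpack(">IHH", jar.read(name)[:8])
            if magic != CLASS_MAGIC:
                raise ValueError(f"{name} is not a class file")
            versions[name] = (major, minor)
    return versions


jar_bytes = build_jar({"my/package/Class.class": fake_class_bytes()})
print(class_versions(jar_bytes))
```

Nothing here is Java specific beyond the magic number and the manifest path, which is sort of the point: any ZIP tool can open a JAR.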
In C, linking was the process of resolving external references into the corresponding compiled code.
Static and dynamic libraries let us choose between resolving at compile time vs runtime. In Java, class files aren’t linked into a standalone artifact. Instead, all classes are loaded at runtime (including the Java runtime classes), typically on demand, by classloaders. Classloaders define how to resolve a class name into the respective bytecode (e.g. the system classloader defines where it expects to find classes, like wanting my.package.Class to be at my/package/Class.class somewhere on the classpath). Users can write their own classloaders too.
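The name-to-path resolution a classloader does can be sketched in a few lines of Python. This is only a rough model under my own assumptions: it searches plain directories, ignores JARs entirely, and find_class and class_to_relative_path are hypothetical helpers rather than anything from the JVM.

```python
import os
import tempfile


def class_to_relative_path(class_name):
    """Map a fully qualified class name to its expected relative file path."""
    return os.path.join(*class_name.split(".")) + ".class"


def find_class(class_name, classpath):
    """Return the first matching class file on the classpath, or None."""
    relative = class_to_relative_path(class_name)
    for root in classpath:
        candidate = os.path.join(root, relative)
        if os.path.isfile(candidate):
            return candidate
    return None


# Demo with a throwaway directory standing in for one classpath entry.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "my", "package")
    os.makedirs(target)
    open(os.path.join(target, "Class.class"), "wb").close()

    found = find_class("my.package.Class", ["/nonexistent", root])
    missing = find_class("my.package.Other", [root])
    print(found is not None, missing)
```

The first classpath entry that contains the file wins, which is one reason ordering on the classpath matters so much.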
Java code is typically distributed as a JAR. Managing classloading with JARs is typically a headache: a JAR can define a classpath in its metadata, but cannot reference a JAR nested within itself. Classloading needs to be carefully managed, which has led to build tools that automate it or repackage all JARs into a single fat JAR (like one-jar or maven assembly).
- Interpreted versus native code is a result of implementation not language; a compiler can compile Java to target native code
- Interpreted code is generally (though not necessarily) still compiled to a simpler, optimized bytecode, which plays much the same role as compiled object files.
- Interpreted languages use another executable to evaluate the bytecode.
- There isn’t a clear parallel to linking in C; external references are just loaded dynamically by classloaders.
- Classloaders define how to find bytecode for a class. This seems really similar to searching a system environment for a dynamic library, but classloading tends to be a notorious ability/quirk of Java.
Python follows a very similar model to Java: Source is read by the Python executable, which compiles
it to bytecode (cached as
*.pyc files) and runs bytecode on the Python Virtual Machine.
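Here is a small sketch of the compile-and-cache step using the standard py_compile module. The file name hello.py is just a placeholder; the point is that compiling produces a cached bytecode file without running the source.

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    source_path = os.path.join(workdir, "hello.py")
    with open(source_path, "w") as source:
        source.write('print("Hello!")\n')

    # Compile without running; returns the path of the bytecode file it wrote.
    pyc_path = py_compile.compile(source_path, doraise=True)

    # By default the cache lives at __pycache__/<name>.<interpreter tag>.pyc.
    expected_path = importlib.util.cache_from_source(source_path)
    pyc_exists = os.path.isfile(pyc_path)
    print(os.path.relpath(pyc_path, workdir))
```

The interpreter tag in the file name is what lets several Python versions cache bytecode for the same source side by side.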
Modules are dynamically loaded in Python. import foo will find and load a module on demand, compiling it if needed and running its top-level code. Similar to the Java classloader, Python has its own search rules for resolving a module name into the code for it.
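You can ask the import system where it would resolve a module from without actually importing it. A minimal sketch using importlib.util.find_spec; locate is a hypothetical helper, it is only meant for top-level module names, and bogus_module_for_demo is a name I'm assuming isn't installed.

```python
import importlib.util
import sys


def locate(module_name):
    """Return where the import system would load a top-level module from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


print(locate("json"))                   # a path inside the standard library
print(locate("bogus_module_for_demo"))  # None: nothing on the search path matches
print(sys.path[:3])                     # the first few places Python looks
```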
Python does a lot less validation when compiling bytecode. I think this might be part of the “dynamic” nature of Python. For modules, where C and Java check external references when compiling, in Python you can get away with something like this:
# bad-import.py
def foo():
    import bogus
    return ""

print("Hello!")

# Evaluates despite missing module
> python3 bad-import.py
Hello!
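To see exactly when the missing module is noticed, here is a sketch that splits the same program into its compile, run, and call steps. I renamed the module to bogus_module_for_demo so it can't collide with anything actually installed.

```python
source = '''
def foo():
    import bogus_module_for_demo
    return ""

print("Hello!")
'''

# Compiling succeeds: the import statement is only turned into bytecode here.
code = compile(source, "bad-import.py", "exec")

# Running the module body defines foo and prints, but never executes the import.
namespace = {}
exec(code, namespace)

# The missing module is noticed only when foo() actually runs.
try:
    namespace["foo"]()
    outcome = "no error"
except ModuleNotFoundError as error:
    outcome = f"failed at call time: {error}"
print(outcome)
```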
Python has different archive formats for distributing code and dependencies (e.g. wheels, eggs), but as far as I can tell, external dependencies are included in your program just by having the source or bytecode at one of the module search paths. For example, pip show requests tells me it is installed at ~/Library/Python/2.7/lib/python/site-packages/requests/, which is on the module search path (sys.path). There I see an __init__.py and other source files like session.py. Even though the Python Packaging User Guide is Python specific, it’s also interesting to see what problems exist in packaging (e.g. packaging non-Python files, specifying system/Python compatibility).
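As a rough illustration of "a dependency is just source on the search path", this sketch fakes an install by creating a package directory and putting its parent on sys.path. The package name demo_dependency and its GREETING attribute are invented for the example; real installers do much more (metadata, scripts, compiled extensions).

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as site_dir:
    # Lay out a hypothetical package: a directory containing __init__.py.
    package_dir = os.path.join(site_dir, "demo_dependency")
    os.makedirs(package_dir)
    with open(os.path.join(package_dir, "__init__.py"), "w") as init_file:
        init_file.write('GREETING = "hello from demo_dependency"\n')

    # "Installing" here is nothing more than making the directory visible on sys.path.
    sys.path.insert(0, site_dir)
    importlib.invalidate_caches()
    try:
        demo_dependency = importlib.import_module("demo_dependency")
        greeting = demo_dependency.GREETING
        loaded_from = os.path.dirname(demo_dependency.__file__)
    finally:
        sys.path.remove(site_dir)

print(greeting)
```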
- Python looks like Java, but compiles bytecode at runtime and validates very little.
- Loading a module is like classloading: a module name is resolved into source/bytecode by search rules
- Python defers as much as possible, compiling when loading, resolving on actual execution.
Go looks really similar to C! It compiles to native code.
Source files (
Source files can also be part of a package, which gets compiled into an archive and defines a namespace.
Compiled files (
Go compiles object files (*.o files) and packages them together into a single archive like pkg/<arch>/foo/bar/lib.a in your workspace. This archive (an ar archive, the same container format C static libraries use) acts like a static library in C, but the Go build tooling handles all of this implicitly, so developers don’t really deal with this.
When it comes time to build a Go binary, Go links your program with all of its package dependencies. Even the Go runtime (e.g. garbage collection, memory allocation) is implemented as what seems like a Go package. Once linked, you have an executable binary.
Linking executables looks much like C, but historically Go was statically linked. In more recent versions of Go, executables can be dynamically linked, for example when certain standard packages such as os/user are used or when interfacing with C. You can also compile Go into a dynamic library (shared object file) for use in C programs, which is pretty neat!
- Go builds are conceptually similar to C: source is compiled to object files, packages become static libraries, and static libraries are linked into an executable.
- Go pushed for static linking, making it easier to distribute applications. Nowadays, Go may be dynamically linked as well.
- Go tooling has many opinionated conventions, even just workspace layout. So far, all languages have similar expectations for things like loading search paths, but Go takes it further to try to make the build process smooth
- Go tends to be easy to distribute and cross compile
By the end this started to be repetitive, which is cool because I feel it shows we have a good mental model, even across languages. In each we’ve seen source code, compiled versions (object files, bytecode files), archives (simple container formats like ZIP and ar) for distribution and packaging, and linking and loading (and their search paths) for resolving external references.
There’s still a lot beyond this cursory overview though. I spent way too much time reading about classloaders, JARs and how the Java runtime is bootstrapped. Python has its own rules about loading which make things like virtualenv possible (apparently Python searches from its install location up directories until it finds the Python libraries?). But I think with a decent understanding of where things fit, these are details, and it’s good to leave them until you need them.