Introduction to Reproducible Builds

Reproducible software builds are an increasingly popular topic in the software development community, and for good reason. A reproducible build is one that produces bit-for-bit identical build outputs (executables, libraries, documentation files, etc.) given identical source code.

Reproducibility of software builds seems like an obvious property that all builds should provide by default. Unfortunately, modern compilers and other components of the software toolchain make it very easy to introduce reproducibility problems unwittingly. Consider this minimal C program, simple.c:

int main(int argc, char** argv) {
  return 0;
}

If we compile this simple program with debugging symbols enabled, via clang simple.c -g -o simple.bin, several pieces of irreproducible information get embedded into the resulting binary. For example, clang generates a temporary object file during the compilation, and gives the file a randomly-generated name (like simple-3b81f6.o). This name is placed in the debug symbols that are embedded in the binary. Two consecutive builds with the same source code will yield different binaries. gcc operates the same way, unfortunately, so switching compilers won’t help us.

Even if we run compilation and linking as separate steps ourselves to avoid the temporary object file, the debugging symbols also contain absolute paths to source code files. So if I build my code once in directory A and again in directory B, I will again produce distinct binaries even though the source code is the same.

Software build artifacts often contain all kinds of irreproducible information: auto-generated timestamps indicating when the software was compiled (documentation generators like Doxygen often do this by default), randomness to generate unique names for temporary files, or files that record the processor or OS version used on the build machine.
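The timestamp problem has a standard remedy: the reproducible-builds.org SOURCE_DATE_EPOCH convention, under which tools take their "current time" from an environment variable pinned by the build system rather than from the wall clock. A minimal sketch (the `generate_docs` function here is hypothetical, standing in for a tool like Doxygen):

```python
import os
import time

def generate_docs(body: str) -> str:
    """Toy doc generator: honors SOURCE_DATE_EPOCH per the
    reproducible-builds.org convention, else falls back to the clock."""
    build_time = int(os.environ.get("SOURCE_DATE_EPOCH", time.time()))
    return f"Generated at {build_time}\n{body}"

os.environ["SOURCE_DATE_EPOCH"] = "1500000000"  # pinned by the build system
a = generate_docs("hello")
b = generate_docs("hello")
print(a == b)  # True once the timestamp is pinned
```

Without the environment variable set, two runs seconds apart would embed different timestamps and produce different documentation bytes.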

Why do we need reproducible builds?

Later on we describe some of the more convoluted sources of irreproducibility we’ve run across. But given so many reasons that a build can fail to be reproducible, do we really even need reproducibility in the first place?

One key motivation for reproducible builds is to enable peak efficiency for the build caches used in modern build systems. If an object file generated from a source file is always the same, we can avoid rebuilding that object file when its source hasn't changed, accelerating our build. However, if the object file is always different due to an embedded timestamp or other spurious difference, then we'll waste time and energy building the same source file over and over again. Making our software build reproducibly would force us to fix this timestamp issue, and we would see improved build times as a result.
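To make the caching argument concrete, here is a toy content-addressed cache in front of an "expensive" link step. Both compile functions are invented for illustration; the point is how an embedded timestamp in the object file defeats caching of every downstream step:

```python
import hashlib
import itertools

fake_clock = itertools.count()  # stands in for the wall clock

def compile_with_timestamp(source: str) -> str:
    return f"obj({source}) built-at={next(fake_clock)}"

def compile_reproducible(source: str) -> str:
    return f"obj({source})"

link_cache = {}
link_count = 0

def link(obj: str) -> str:
    """Content-addressed cache: re-link only for never-before-seen objects."""
    global link_count
    key = hashlib.sha256(obj.encode()).hexdigest()
    if key not in link_cache:
        link_count += 1
        link_cache[key] = f"bin({obj})"
    return link_cache[key]

src = "int main() { return 0; }"
link(compile_with_timestamp(src))
link(compile_with_timestamp(src))  # timestamp changed -> spurious cache miss
stamped = link_count               # 2 links for identical source
link(compile_reproducible(src))
link(compile_reproducible(src))    # identical object -> cache hit
print(stamped, link_count - stamped)  # 2 1
```

With the timestamp embedded, every rebuild of unchanged source forces a fresh link; with reproducible object files, the second link is a cache hit.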

Another important motivation for reproducible builds is security against attacks on the software “supply chain”. The core issue is: how do we know that a binary program actually originates from the source files for that program? Chris Lamb from the Debian project gave a nice talk about this at SFScon18. He observes that the vast majority of software (even open-source software) is distributed in binary-only form, and a huge amount of trust is placed in these binaries. Instead of sneaking a backdoor into the source code for, say, nginx, an attacker can compromise the toolchain, or the working copy of the source code, on the machine used to produce the official nginx binary package. No amount of auditing of the official source code will reveal this flaw, since it doesn’t exist there.

If nginx doesn’t build reproducibly, then we expect the official binary to differ from one that I built myself, so I won’t be suspicious of the official binary. If nginx does build reproducibly, however, then we have a way of detecting this attack: lots of people will build nginx and produce the same binary package as each other, but one that differs from the official version. This raises an immediate red flag, prompting removal of the official package and an audit of the machine that was used to build it.
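The detection scheme boils down to comparing digests of independently produced builds and flagging the outlier. A sketch of that comparison (all the digests and rebuilder names below are made up for illustration):

```python
from collections import Counter

# sha256 digests reported by independent rebuilders plus the official
# package (all values here are invented for illustration)
reports = {
    "official-mirror": "d3adb33f",
    "rebuilder-1": "4b825dc6",
    "rebuilder-2": "4b825dc6",
    "rebuilder-3": "4b825dc6",
}

# the digest most rebuilders agree on is the consensus build
consensus, _ = Counter(reports.values()).most_common(1)[0]
outliers = [who for who, digest in reports.items() if digest != consensus]
print(outliers)  # ['official-mirror'] -- red flag: audit that build machine
```

Note that this comparison is only meaningful when the package builds reproducibly; otherwise every digest differs and the outlier tells us nothing.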

In his talk, Chris also discusses how reproducible builds can result in more meaningful binary diffs where small source code changes result in small binary changes. This can also be used to increase our confidence that our compiler is working as expected.

Current reproducible build efforts

Several projects are underway in the open-source community and beyond to achieve these security benefits. The Debian project spearheaded this effort with its Debian Reproducible Builds initiative, which has since been adopted by several other Linux and BSD distributions, and by software like Tor where build integrity is critical. Microsoft has added reproducibility features to its C# and Visual Basic compilers. Modern build systems increasingly integrate dependency management, as with Google’s Bazel, Rust’s cargo, or Java’s Maven, which helps improve reproducibility by ensuring that the same versions of dependencies are used for each build.

How have people achieved reproducible builds in the past?

The first step towards making a build reproducible is to identify the root cause(s) of its irreproducibility. This is typically a laborious process, as builds can be irreproducible for very subtle reasons. For example, we’ve found TeX-based documentation to be a particularly common source of irreproducibility. Some TeX tools record their own installation time, which is later embedded in documentation generated by those tools. In other cases we’ve seen the precise order of entries in TeX’s various databases affect unique identifiers that appear within PDFs generated from TeX. Yikes!

A package with TeX-based documentation that triggers one of these irreproducibility issues will remain irreproducible until the TeX tools are fixed, or the documentation is (somewhat gratuitously) ported to an alternate format that yields reproducible outputs. A similar issue exists with compilers that embed local filesystem paths in debug metadata: special flags had to be added to clang/gcc to suppress this behavior, and software packages need to update their build scripts to use these flags.
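Conceptually, those compiler flags (gcc and clang's -fdebug-prefix-map is one such flag) rewrite a path prefix before the path is recorded in debug info. A toy version of that rewrite, with invented paths:

```python
def remap_prefix(path: str, old: str, new: str) -> str:
    """Mimic -fdebug-prefix-map=OLD=NEW: rewrite the OLD prefix to NEW
    before the path would be embedded in debug metadata."""
    return new + path[len(old):] if path.startswith(old) else path

# Two developers build the same source from different checkout directories...
a = remap_prefix("/home/alice/nginx/src/core/nginx.c", "/home/alice/nginx", ".")
b = remap_prefix("/home/bob/work/nginx/src/core/nginx.c", "/home/bob/work/nginx", ".")
print(a == b)  # True: both binaries record ./src/core/nginx.c
```

After remapping, the embedded paths no longer depend on where each developer happened to check out the source, removing one source of binary differences.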

Even when patches exist to make a package reproducible, they sometimes languish unmerged when package maintainers don’t see value in reproducibility. Ultimately, tackling each irreproducible package individually is like weeding a garden by hand. It’s a painstaking process: years of effort by dozens of developers have resulted in about 93% of Debian packages building reproducibly.

Cloudseal: a container-based approach to reproducibility

Cloudseal’s reproducible container technology instead provides a generic solution to reproducibility for existing software. Instead of weeding by hand, using a Cloudseal container is like moving into a greenhouse where weeds, and many other aspects of the environment, can be precisely controlled. Rather than patching every software package that uses debug flags, using a Cloudseal container for the build forces the build to run reproducibly without any changes to the package’s source code. Our earlier blog posts describe how our container works in more detail. The summary is that the Linux system call interface and x86 instruction set are the only ways that irreproducibility can leak into an execution. While these are broad interfaces, our container intercepts the set of Linux system calls and small number of x86 instructions that can introduce irreproducibility, and gives each a reproducible outcome instead.
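The interception idea can be sketched as wrapping each nondeterministic call and substituting a deterministic, replayable answer. This toy "sandbox" illustrates the principle only; it is not Cloudseal's implementation, and its method names merely echo the Linux calls they stand in for:

```python
class DeterministicSandbox:
    """Toy stand-in for a reproducible container: every source of
    nondeterminism it intercepts gets a fixed, replayable answer."""

    def __init__(self, epoch: int = 0, seed: int = 42):
        self.clock = epoch
        self.rng_state = seed

    def gettimeofday(self) -> int:
        self.clock += 1  # logical time: advances deterministically per query
        return self.clock

    def getrandom(self) -> int:
        # a tiny seeded PRNG in place of the kernel's entropy pool
        self.rng_state = (self.rng_state * 1103515245 + 12345) % 2**31
        return self.rng_state

# Two independent "executions" observe identical values at every step.
s1, s2 = DeterministicSandbox(), DeterministicSandbox()
trace1 = [s1.gettimeofday(), s1.getrandom(), s1.gettimeofday()]
trace2 = [s2.gettimeofday(), s2.getrandom(), s2.gettimeofday()]
print(trace1 == trace2)  # True: identical runs, every time
```

Because every intercepted call returns the same answer in every run, nondeterminism from the clock and the entropy pool can no longer leak into build outputs.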

While we still have work to do to improve our prototype’s support for all of Linux’s system calls, in our initial experiments we’ve achieved 100% reproducibility for over ten thousand unmodified Debian packages. Cloudseal containers provide a foundation of reproducibility for all the software they run, enabling the benefits of reproducibility with minimal developer effort.

Joseph Devietti