Flaky test root causes + introducing Reproducible Containers

(Try Cloudseal here - alpha prerelease)

Any programmer using a psuedorandom number generator knows that random is not the same thing as uncontrolled. Yet, when it comes to our software, we cede quite a bit of control -- allowing the processor and operating system to determine what order our code runs in, and many other small details of execution.

For example, even the trivial Bash commands below demonstrate how easy it is to tap into the underlying nondeterminism seething inside your computer:

$ function prnt { for ((i=0; i<300; i++)); do echo -n $1; done; echo; }
$ (prnt a &); prnt b; wait
$ (prnt a &); prnt b; wait

In this case, the same command printed different output on the second run because of operating system scheduling decisions. Multiply the complexity by a thousand and you have the problem of flaky tests as commonly found in user-interface testing, service testing, or end-to-end (E2E) testing. Here we define a flaky test is one where the same version of the code yields both a passing and failing result across multiple runs. For example, given the two repeated bash commands above, what if the first led to a crash downstream, and not the second? How could we reproduce that unique snowflake of software execution?

Flaky tests: Bemoan, Label, Mitigate

Flaky tests are a "a virulent infection that can completely ruin your entire test suite", and a "a war that never ends". At one point, the Google Go team"completely lost trust in [their] build" due to flaky tests. And flaky tests occur commonly, quoting Dropbox engineers: "Flakiness is the most common problem with E2E tests".

But many discussions of flaky tests essentially amount to tips and tricks for mitigating them. Manual fixes include turning off these tests (see no evil) or repeating them until they pass. In fact, Google's Bazel build system even supports this tactic with an explicit notion of flakiness with the flake=1 attribute, which enables "rerun till it works". Other common tips include avoiding time-based "wait"/sleep in tests, avoiding the system clock entirely, or adding more mocking instead of depending on external services.

Where's the technical solution?

Manual mitigations are useful, but they require constant diligence: developer effort that scales with the number of tests themselves. In this blog, we will argue that the problem of flaky tests is at least partially amenable to a once-and-for-all technical solution. In particular, a technical solution can address the failures of test isolation that permit information to leak unaccountably from the test environment into the test itself.

The idea is that a test with fully controlled inputs should always yield a stable result. For example, a server evaluating arithmetic expressions, when given 2+2 should always respond 4, because 2+2=4 is a declarative fact that will always hold, and whose inputs are fully specified. The challenge is turning the uncontrolled, implicit inputs of realistic programs into controlled ones.

Above we saw how OS scheduling decisions affected the output of our trivial Bash example, and yet those scheduling decisions are not part of the test harness, and they certainly are not version controlled with the code. Even suggesting that OS scheduling be version controlled might seem mildly ridiculous. We software engineers are acclimated to nondeterministic computers, and we expect any test could be influenced by:

  • operating system scheduling
  • system time
  • irreproducible properties of the file system
  • identifiers for processes, users, groups
  • randomized memory management
  • newer processor model

Even if we build mocks for external services, we don't virtualize all of these external factors that affect our test results. But... what if we did? Virtualization technology has been around for a long time — what if we applied virtualization concepts to intercepting and taming all the above sources of influence on our test cases? That's exactly what what we've done in our academic research over the last several years, culminating in something we call a reproducible container.

Reproducible containers

A reproducible container runs existing binary programs (Linux/x86 programs in our case), but it guarantees that their outputs only change if their explicit inputs do. Traditional virtualization and emulators allow us to simulate one OS or processor on another. Just so with reproducible containers, except we simulate a deterministic machine on a nondeterministic one. Thus, building a reproducible container requires defining an alternative deterministic Linux and deterministic x86, and then emulating them efficiently on the machines we actually have. There are many technical methods of doing that. Our particular solution runs in user-space on top of commodity kernels, in contrast with previous academic work that modified Linux, or created a custom OS.

Our version of reproducible containers can work with Docker, but in its simplest form, we can prefix commands with det to isolate them, and run them reproducibly on any Linux system. For example, our previous Bash program:

$ det bash -c 'function prnt { for ((i=0; i<300; i++)); do echo -n $1; done; echo; }; (prnt a &); prnt b; wait'

Every time we run it, we must get the same output, byte-for-byte. The "input" in this case comprises the files read by the process tree, including /bin/bash. In future posts we'll discuss how we can hash all the inputs and outputs of a computation to give a unique "fingerprint" to the computation, and validate that it always produces the same bits of output.

But what deterministic scheduling policy did the above execution actually use? Turning nondeterministic POSIX semantics into deterministic ones requires picking canonical answers among a set of allowable ones. In the above example, the default was biased towards sequential scheduling with low preemption. More advanced invocations of a reproducible container take a container configuration file, which includes, among other things, a seed for psuedorandom scheduler decisions.

When we version control this configuration file, we are literally versioning the scheduler decisions that go along with the tests. We'll get more into the details of container configuration in future posts, but roughly the same story applies to other sources of randomness like the x86-64 rdrnd and rdtsc instructions, or the /dev/urandom file in Linux. That is, they're all "allowed", but they are accounted for in the version-controlled container configuration.

The reason we believe reproducible containers are a promising technology for software testing is that the proposition "my test passed under configuration X" becomes a declarative fact like 2+2=4 — reproducible by anyone, anywhere, at any time. The goal is separating facts from ephemera. Yet there are some challenges and objections to doing this that we'll need to address.

First objection: my program should be robust to these variations

I hear you. At the end of the day, your program must be correct in spite of a large envelope of allowable behavior by external, nondeterministic services, not to mention the OS and processor. It cannot depend one just one behavior, whether or not that is controllable in the test environment. To this end, randomized testing is of course a good approach. Don't run your program once, with a canonical schedule. Run it 100 or 1000 times varying any factors that it must be robust to.

In fact, we believe that reproducible containers actually help with this goal, not hinder it. Again, as we noted at the top, random is not the same thing as uncontrolled. Indeed, in property-based testing, popularized by QuickCheck, we use randomness to test against more and larger inputs, but it is essential that the seeding of this randomness be under our control. When we observe a failure under QuickCheck, we must be able to recreate the conditions that caused it.

So too with the system-level behaviors managed by reproducible containers. We could run our trivial Bash example with 100 different OS schedules, based on different scheduler seeds. If one out of 100 fails, then we want to reproduce that schedule in particular to debug it. The fact that this one execution fails should indeed be a declarative fact about the universe, as timeless as 2+2=4.

Finally, note that explicitly fuzzing schedules is actually a much better tactic for exposing bugs than simply running a program again and again. Repeating uncontrolled executions in a loop may take an enormous number of iterations to expose a particular interleaving, if ever. That's why Mozilla rr Chaos mode is great at exposing bugs, and likewise with Chaos engineering in general.

About Cloudseal

As the inaugural post on a new blog, I should mention who we are. Cloudseal is a new startup in the software testing & bug reproduction space, bringing to market a product based on the academic work we've done on reproducible execution.

Try Cloudseal here - alpha prerelease

Ryan NewtonComment