Deterministic Execution vs Record-and-Replay

One question we get a lot is how deterministic execution (DE) differs from record-and-replay (RnR) technology. In brief, the former is about predictability given the same starting state, and the latter is about doing something unpredictable, but logging every detail. Here, we'll go through some examples that illustrate the differences in usage, though we'll ultimately conclude that the two approaches exist on a spectrum.

RnR tools include Mozilla rr, Intel PinPlay, and several academic systems. For this post, let's consider rr as our example RnR system. Mozilla rr is a practical debugging tool that C/C++ programmers should definitely consider using today. rr traces how application threads and processes are executed by the OS, recording that information in a trace file. You can run it in either record or replay mode:

$ rr record ./myapp.bin 
$ rr replay

The fundamental purpose of RnR is to reliably recreate the internal states of the first execution during the replay phase. This, of course, guarantees nothing about a fresh run or a fresh recording of myapp.bin. In contrast, that cross-run guarantee is exactly what deterministic execution provides:

$ det ./myapp.bin 
$ det ./myapp.bin # same initial state guarantees same output

Here, the guarantee is that two different executions must behave in the same way, as long as the input files remain the same. For example, if we start with the same git checkout, we can guarantee that the compiled binaries have identical hashes, or that the functional tests yield identical results.

That's it; that's the gist of the difference: two different usages with distinct guarantees and purposes. And yet there are many similarities between RnR and DE technologies. Both RnR and DE require similar interception mechanisms to sandbox program executions. They can both be judged on:

  • the overhead of sandboxing,
  • the generality of their support, and
  • the portability of the traces and determinism guarantee, respectively.

How RnR Works

During record, rr intercepts system calls and some nondeterministic instructions, and records their outputs inside a binary trace file, which in rr's case is stored in a Cap'n Proto format. An RnR system must establish a boundary between what's "inside" and assumed to be reproducible, versus what is "outside", where all interactions across the boundary must be recorded. For rr, this boundary is crossed at most system calls, at certain nondeterministic instructions, and at any preemption points where the OS takes control from the application.

In contrast with low-level architectural simulators, rr doesn't bother recording individual instructions executed. A sequence of non-preempted arithmetic instructions and memory accesses is not recorded, because it is already executed deterministically by the processor. Again, only when we cross the boundary into non-reproducible behaviors do we append to the trace file. As a simple example of non-reproducibility, consider this C program that makes a system call to print out the current time:

#include <stdio.h>
#include <time.h>

int main() {
  struct timespec res;
  clock_gettime(CLOCK_REALTIME, &res);
  printf("res: {.tv_sec = %ld, .tv_nsec = %ld}\n", res.tv_sec, res.tv_nsec);
  return 0;
}

rr can dutifully record these calls and play them back.

$ rr record ./clock.bin
res: {.tv_sec = 1557114688, .tv_nsec = 749693899}
$ rr record ./clock.bin
res: {.tv_sec = 1557114840, .tv_nsec = 513101403}
$ rr replay --autopilot
res: {.tv_sec = 1557114840, .tv_nsec = 513101403}

Here, the first two record calls are separate, unrelated executions. They have no relationship from rr's perspective. The replay succeeds in reproducing the last recording. If you're curious, rr leaves behind a directory of seven files totaling about 60 KB for each recording of this small program.

In contrast, a DE system that enforces end-to-end deterministic execution must ensure that a computation has no implicit inputs -- including time -- and thus two calls to the clock program must yield the same result:

$ det ./clock.bin
res: {.tv_sec = 744847200, .tv_nsec = 0}
$ det ./clock.bin
res: {.tv_sec = 744847200, .tv_nsec = 0}

That is, unless the current time is explicitly provided as an input to the computation, for example, via a command-line parameter, container configuration file, etc.

Where'd the I/O go?

In the above example, the program's only I/O was printing to the screen. Thus the rr replay appeared identical to the record. But what if we run a program with side effects on the disk or network? Like so:

$ cat test.sh
#!/bin/bash
echo hello > hello.txt
echo done

When we run the first time, we create the file:

$ rr record ./test.sh
done
$ ls hello.txt
hello.txt
$ rm hello.txt

Then when we replay, we don't create a file at all:

$ rr replay --autopilot
$ ls hello.txt
ls: cannot access 'hello.txt': No such file or directory

That's right. Again, the only reason to replay is to examine the internal states of the program for debugging. Actually re-executing the side effects of the program is not needed and would probably get in the way.

Yet if we're interested not in reproducing a bug but in creating a reproducible computation, then of course we need the outputs. For example, if we're reproducibly building sources into binaries, we need to write the output binaries. Reproducible builds, which we'll discuss in a future post, are a better fit for DE than RnR.

Two kinds of "reproducer"

In spite of the above differences in usage, both RnR and DE offer a way to provide a witness, or reproducer, to a certain software behavior. Whether in reproducible builds, or another scenario, the idea is to verify that a computation applied to a known input produces a specific output. The interaction might go like this:

Person A: I claim "f(x) = y". <Send reproducer>.
Person B: <Consume reproducer>. Validated.

In the DE case, the reproducer can be (f, x, hash(y)), including the computation f and its input x. The computation f could be represented as code, a container, or a VM. Person B simply reruns the computation to validate it, deterministically obtaining the output y, which can be hashed and compared against the expected value.

In the RnR case, the reproducer is (f, x, trace, y), and the receiver must validate that the trace represents a valid execution of f(x), confirming along the way that the streamed output matches y. Even though Mozilla rr is not engineered to replay side effects, it could be modified to verify them, checking, e.g., that each write to a file matches the appropriate piece of the expected output y.

Validating traces is a greater challenge than merely checking output. Validation must ensure that the trace file could have plausibly come from running f(x), i.e. a true recording. We don't want a trace file that records 1 + 1 = 3, causing a naive receiver to replay a faulty, impossible execution that seems to validate the original f(x) = y claim. Furthermore, a robust replay mechanism would need to treat a trace received over the network as untrusted input and make sure that it cannot hijack the validation in some other way.

Finally, perhaps the most critical difference between RnR and DE for providing reproducers is that trace files are not human readable. If the computation f(x) is packaged in a source form, a human can inspect it, and perhaps experiment by running modified versions of it. If they instead receive gigabytes of binary trace file, the opportunity to audit or modify vanishes.

The record-vs-determinize spectrum

Ultimately, the choice between making a behavior a priori repeatable (deterministic) and recording its outputs can be decided at a fine grain. For instance, do we want to record clock timings, or make timings repeatable? Do we want to record network traffic, or create a larger deterministic execution unit that includes an entire cluster of machines? These choices are orthogonal.

If there's any communication permitted beyond the boundary of repeatable computation, then we're pushed into the RnR camp. These foreign interactions mandate that some nonrepeatable information must be recorded in a trace file, at least if we want the result to be self-contained. But even within RnR, the design space remains nuanced, because the amount of irreproducibility can vary wildly, and can be quantified in bits-per-second (or bits per instruction) of nondeterministic information.

As an example of a compute model with nondeterminism isolated to one part of the system, I spent my PhD working on WaveScript. WaveScript is a programming language with a dataflow execution model in which the only nondeterminism comes from a real-time merge operator on event streams. The left-or-right choices at those merge points are the only nondeterminism, and thus the only data that must be logged to make an execution reproducible. (Conveniently, they're even binary choices that can be mapped directly onto a bitstream.)

But in this post we're concerned with unconstrained binary programs; these use the full modern system call and instruction set interfaces, just like the examples above. In executing these programs, there are many small choices between repeatability and recording. For example, if we determinize multi-process scheduling, we remove the need to record OS scheduling decisions. That choice joins many similar ones, such as what to do when the program reads timers or requests random numbers.
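For one such input, a random number, the two options look like this sketch (`record_random` and `det_random` are invented names; xorshift32 stands in for whatever deterministic generator a DE system would substitute):

```c
// Two ways to tame one nondeterministic input, per the
// record-vs-determinize choice: log the live value, or determinize it.
#include <stdint.h>

// Option 1 (RnR): pass the live value through, appending it to a trace
// log so replay can reproduce it.
uint32_t record_random(uint32_t live_value, uint32_t *log, int *n) {
    log[(*n)++] = live_value;   // nondeterminism captured in the trace
    return live_value;
}

// Option 2 (DE): derive the value from a seeded PRNG (xorshift32, with a
// nonzero seed); repeatable by construction, so there is nothing to log.
uint32_t det_random(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}
```

Option 1 grows the trace by 32 bits per call; option 2 grows it by zero, at the cost of intercepting and replacing the program's entropy source.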

The more features we choose to determinize, the less nondeterminism we need to log. Reducing the bits per second recorded of course means using less disk. But it also shrinks the surface area exposed to the validation concerns mentioned above. Whether we record nondeterminism or recast it as deterministically repeatable behavior, the end goal is the same: eliminate the uncontrolled factors that send executions astray, frustrating debugging and attempts to produce and verify a consistent output.

In a future post we'll look at the DE/RnR tradeoff through the lens of other applications. For instance, in distributed algorithms, RnR and DE correspond to different communication patterns. Deterministic execution can guarantee that two replicas compute the same result, given the same starting state and inputs. In contrast, RnR corresponds to recording the behavior of a primary replica (and transmitting it to a secondary), which is the approach used by existing VM replication technologies.

Ryan Newton