Pure Programs: Pure functions all grown up

Several of today's programming languages are well-suited to writing pure functions, but what about scaling to pure programs? Here we define a "pure program" as one whose external outputs (files, network) are a function only of its explicit inputs. This definition implies deterministic behavior; and, in fact, it is much easier to settle on than formal definitions for a purely functional language, though we'll revisit below some of the quibbles about exactly what constitutes the "same" output from two runs of a program.

In this post, I won't reiterate arguments for functional programming, only observing that the usual arguments apply no less at the large scale than the small. The main idea of this article is that we can consider functional semantics as we move along a spectrum of increasing granularities, from the single machine instruction to the entire program execution. At some point on this spectrum, support for enforcing purity/determinism invariably tapers off. We'll look at a couple ways to remedy that, extending purity to entire program executions.

Starting at the smallest granularity, we note that the tiniest primitive functions are almost always pure, across all programming languages. This holds in C++ as much as Haskell, i.e. irrespective of whether a language actively enforces purity. For example, arithmetic operations like plus are pure functions, whether they employ modular, saturating, or unbounded arithmetic:

(+) :: Int -> Int -> Int

Indeed, often such basic building blocks compile to as little as one machine instruction, and the vast majority of machine instructions are themselves pure functions on registers, cpu flags, and memory. But as we increase scale from arithmetic primitives to medium-sized library functions, like data structure operations, then programming languages part company, diverging in their approaches. Functional languages more often emphasize persistent, immutable data structures:

(:) :: a -> [a] -> [a]
insert :: Key -> val -> Map Key v -> Map Key val

Above, when we insert a new entry into a finite map with insert k v m, the result is an updated version of the map, leaving the original value untouched. That makes insert a pure function. But when we scale further --- from manipulating a data structure to running an entire program or web server --- then even in Haskell we inevitably encounter the IO monad:

main :: IO ()
Snap.Http.Server.httpServe :: Config Snap a -> Snap () -> IO ()

Yes, in Haskell we can tell ourselves httpServe remains a pure function; it just happens to return a monadic action. This is cold comfort, however, as the real determinate of program behavior is the abstract machine that executes the IO action. Unfortunately, that machine is somewhat underspecified in Haskell, lacking, for example, a memory model (though sequential consistency may work well in Haskell's case).

Thus Haskell programmers hold on to compiler-aided purity for longer than many languages, but lose it in the end, and certainly well before creating an entire application. Next, we'll look at a couple ways this could be remedied: first, a language-based approach to pure programs, and second, a runtime sandboxing approach.

Option 1: a monad for pure programs

In an OOPSLA 2017 paper, we explored a language-based approach that began with taming the Haskell IO monad, creating a restricted DetIO monad. For example, normally the below Haskell code would be highly nondeterministic:

main = do forkIO (putStrLn "hello ")
          putStrLn "world"

The output, to stdout, is impure. But with DetIO the analogous program is required to produce deterministic output:

{-# LANGUAGE Safe, OverloadedStrings #-}
module HelloWorld (main) where
import Control.Monad.DetIO
import Prelude hiding (putStrLn)

main :: DetIO ()
main = do t <- forkIO $ putStrLn "world!"
          putStrLn "Hello "
          joinThread t

The same printing side effects occur, but they must become observable in a canonical, sequential order, irrespective of their real-time parallel execution. Note that the above HelloWorld module is actually a library. The GHC compiler itself has no direct support for DetIO, and thus a harness program is needed to launch the above code, ensuring that nothing is running in the naked IO monad except a line like this:

-- Trusted code, can be generated by a launcher:
main = runDetIO HelloWorld.main :: IO ()

An interesting exercise is to ask how much functionality we can squeeze inside DetIO while maintaining purity. Naturally, we can use any pure [Safe] Haskell code, which includes the vast majority of preexisting library Haskell code. But we can go further. In the detflow library we built to explore this idea, we also allow dataflow variables for blocking, inter-thread communication. We can also use preexisting libraries that embody deterministic parallel programming models such as data-parallelism. You can find more about the detflow prototype in a talk about it by Ryan Scott.

Extending DetIO with foreign code

If we want to reuse as much existing code as possible inside pure programs, it would be nice to provide a "lift" function of type IO a -> DetIO a. Unfortunately, this is impossible, because we have no good way to make it run deterministically. Likewise we cannot support foreign function calls (FFI), which would execute native code right inside our process, with arbitrary effects on memory.

So what can we do to reuse existing code then? The answer is that we can keep it at arm's length: in child processes. DetIO provides a call to shell out, just like Haskell's System.Process module.

main :: DetIO ()
main = do x <- getLine
          system (“gcc -c ” ++ x)
          putStrLn x

But if we want to shell out to arbitrary binaries we need some kind of sandbox to ensure that those external programs don't violate purity invariants. The way we do this in our detflow prototype is to use a runtime sandbox that enforces file system permissions. We simply ensure that each DetIO thread has a disjoint set of owned file system paths, and that shelling out keeps the foreign process within the same sandboxed space.

As a complete example, the below code will compile two files in parallel, without allowing any interference between the gcc subprocesses.

import Control.Monad.DetIO

compile :: String -> DetIO ()
compile x =
  do system ("gcc -c " ++ x)
     putStrLn ("Done compiling " ++ x)

main :: DetIO ()
main =
  do a <- forkWPerms [R "foo.c", RW "foo.o"]
          ( compile "foo.c")
     compile "bar.c"
     joinThread a

The end result is that the output files on disk must be a pure function of the input files. To guarantee this, the runtime sandbox around gcc also needs to intercept other attempts to bring in impurity, such as timing or random number generation, as we've discussed in previous posts. Child processes are sandboxed, whereas the main process running DetIO need not be, since the type system enforces determinism (at least with -XSafe).

Option 2: Runtime sandboxing with deterministic containers

Now that we've seen above how to extend DetIO with arbitrary parallel, deterministic child-processes, it is natural to also sandbox arbitrary programs without any language-support whatsoever. Just start with the sandbox alone. We call these deterministic containers (or reproducible containers), and we touched on how they work in previous posts.

Here I'll only summarize to say that deterministic containers are the core technology we're using at Cloudseal to provide solutions for de-flaking tests and reproducing bugs.

Addendum: Quibbles on output equivalence classes

The astute reader might have many questions around how deep our definition of purity goes: we certainly don't want to make any claims about the exact state of a disk, or the inodes in a file system. So what is the appropriate level of abstraction?

We define a pure program as one that produces bitwise equivalent outputs on disk and network, however, we consider an abstraction of the raw file system. When running inside our sandboxes, programs only see files in canonical orders within directories, with canonical modtimes, and canonical inodes. Likewise, when considering the outputs of two program executions, we consider this same abstracted view: only the set of output files and contents of those files.

Ryan Newton