(2015-05-29)
It should be possible to “perform in-memory computations on large,
even decentralized clusters in a fault-tolerant manner”, as Apache
Spark does, using Vesta-like build-step isolation, but with the shell
usability of redo, using a git-annex-like blob backend, thus expanding
the applicability of Spark-like computational structures far beyond
the data center.
This requires substantial explanation.
Apache Spark is a system for making high-performance, fault-tolerant clusters easy to use, by generating and managing things Spark calls RDDs:
[A]n RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.…an RDD has enough information about how it was derived from other datasets (its lineage) to compute its partitions from data in stable storage.… in essence, a program cannot reference an RDD that it cannot reconstruct after a failure.
This sounds terribly similar to make
and software source control and
build systems, doesn’t it?
In 2010, Avery Pennarun implemented Dan Bernstein’s redo design, which
is a simpler approach to what make does. If you want to build a file
named foo.o, you run redo foo.o, and redo will run the shell script
foo.o.do if it exists, or otherwise default.o.do if it exists, or
otherwise default.do, searching up the filesystem hierarchy. This .do
file is expected to put the desired contents of foo.o into an output
file whose name is passed to it.
A minimal useful default.o.do
might say
redo-ifchange "$2".c; cc -c "$2".c -o "$3"
The first command there is a recursive invocation of redo, which tells
redo that this output file is going to depend on foo.c, so it had
better make sure foo.c is up-to-date before continuing. This
dependency is recorded for later. (redo-ifchange takes multiple
arguments to build dependencies in parallel.)
A more complete default.o.do
would take #include
dependencies into
account.
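One plausible way to do that, along the lines of the example in
apenwarr’s redo documentation, is to have the compiler itself emit the
list of headers it read, and then declare them as dependencies
(assuming cc understands gcc-style -MD/-MF; the "$2".deps temporary
file name is arbitrary):

    redo-ifchange "$2".c
    cc -MD -MF "$2".deps -c "$2".c -o "$3"  # -MD/-MF: write make-style header deps
    read DEPS <"$2".deps                    # e.g. "foo.o: foo.c foo.h bar.h"
    redo-ifchange ${DEPS#*:}                # everything after the colon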
redo
does have a couple of limitations. One is that it doesn’t
handle multiple output files, like those from yacc
, very well.
Another is that it is purely local; it doesn’t have a cluster mode,
although you may be able to use distcc
to get some of that
yumminess.
ccache is a build accelerator specific to C, C++, and Objective-C. It hashes your source code, include files, compiler (well, typically just its size, mtime, and name), command-line options, etc., with MD4, and stores the compiler’s output (including e.g. warning messages) in an on-disk cache in, normally, your home directory; future recompilation attempts whose inputs haven’t changed will just reuse the previous compiler output.
In theory, this means that you could get most of the benefits of
make
for C and especially C++ programs by just wrapping your
compiler in ccache
inside your build script; rerunning the build
script would rehash all your source files, copy the object files into
your current directory, and then relink the executable. Depending on
what you’re compiling, this might be almost instantaneous, or it might
be very slow.
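As a sketch, such a build script might be nothing more than the
following (the file names are made up; it assumes cc and ccache are on
the path):

    #!/bin/sh -e
    # Recompile everything unconditionally; ccache turns unchanged compiles
    # into cache hits, so after the first run only the final link does real work.
    for f in main.c util.c net.c; do
        ccache cc -c "$f" -o "${f%.c}.o"
    done
    cc -o myprog main.o util.o net.o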
Because Unix provides ccache
with no reliable way to get a secure
hash of the source files from the filesystem, it has to read them in
their entirety to figure out whether they have reverted to an old
version. I tried it just now on my netbook on a tiny three-kilobyte
GLUT program, and it ended up reading about a megabyte of .h
files
in order to figure out that it could safely reuse the 2.7kB .o
file
from a previous compilation, taking 39 milliseconds in all, even
though it made only about 365 system calls.
Suppose I do this:
sqlite> create table foo (x varchar, y varchar);
sqlite> create index foox on foo (x);
sqlite> insert into foo values (3, 4);
sqlite> insert into foo values (4, 5);
foox
is the name of an index, which is a sorted copy of a column x
of table foo
with rowids.
Now, if I do a query on foo like select * from foo where x = 3;, this
query can use the index foox to find the relevant rows. (In fact, in
SQLite3, explain tells me it does, even when there are only two rows
in the table.)
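Concretely, something like the following in the sqlite3 shell shows
the chosen plan; the explain query plan variant is more readable than
raw explain output, and it should name the index foox, though its
exact wording varies between SQLite versions:

    sqlite> explain query plan select * from foo where x = 3;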
For this to work properly, foox
has to be updated every time I
insert a new row into foo
. But this is tricky! If I have a million
rows in foo
and I add a new one whose x
value is close to the
minimum of the x
values, then if foox
is simply stored as a
vector, the database might have to move a million values down by one
in order to make room for the new x
in its proper sorted order.
There are several solutions to this, but the typical one (and what
SQLite does) is to store your index in a B-tree, which allows you to
insert in the middle of it relatively efficiently.
A different approach, and more or less the one Lucene uses, is to
accumulate your updates in a small “side file” until there are a lot
of them, and then apply all of those updates at once to generate a new
sorted foox
. Until foox
has been replaced with the new version,
every query to foox
must also check the side file to see if there
are updates it’s interested in; this can be made more efficient by
sorting the side file, at which point you may begin to desire to have
a side file for the side file.
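As a crude filesystem-level sketch of the same idea (file names
invented, and assuming each index line looks like “value,rowid”): keep
the bulk of the index in one sorted file, append recent insertions to
a small side file, and merge occasionally:

    grep '^3,' foox.main foox.side                    # a query for x = 3 checks both
    sort foox.side | sort -m - foox.main > foox.new   # merge the sorted side file in
    mv foox.new foox.main
    : > foox.side                                     # start a fresh, empty side file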
The database index is data that depends on the table, and it needs to
be possible to incrementally and transparently recompute that data
when the table changes. This is the same kind of automatic
recomputation problem that Spark, make
, and redo
attempt to solve.
Vesta is a source-control system integrated with a build system; it versions your whole build environment, and it runs each build step in an isolated chroot environment where it can only access data via Vesta, which functions as an NFS server. This ensures that the build step is only accessing a particular version of the source file, of the compiler, etc., and that Vesta can correctly record these dependencies.
Unfortunately, Vesta was proprietary for many years before finally being released publicly, at which point its authors stopped maintaining it; it had its own purely-functional programming language that you were required to use to describe the build process for your system; because its build scripts were written in that same language, building it from source required having a running instance of Vesta; and, since access was provided via an NFS-server interface, you had to have root (at a time before it was commonplace and easy to run virtual machines in VirtualBox, QEMU, or EC2) to try it. Vesta was used in production by DEC’s Alpha processor design team for a couple of years, but perhaps because of obstacles like those mentioned above, it never achieved wide usage, even within DEC or Compaq.
git-annex does not do any dependency management. Instead, it manages an efficient, decentralized, redundant immutable blob store with decentralized, replicated, eventually-consistent metadata, built on top of git. It uses symlinks to permission-read-only files to provide normal filesystem access to the immutable blobs without going to the lengths of implementing its own filesystem, as Vesta does; this is usually enough to prevent you from accidentally corrupting your local copy of the blobs, and if you do corrupt one copy of a blob, you can still restore it from a remote repository.
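For concreteness, a typical git-annex session looks roughly like this
(the repository and file names are invented):

    git init survey && cd survey
    git annex init "laptop"
    cp ~/data/ticks-2015.csv.gz .
    git annex add ticks-2015.csv.gz   # replaces the file with a symlink into .git/annex/objects/
    git commit -m 'add 2015 tick data'
    ls -l ticks-2015.csv.gz           # -> a read-only, content-addressed blob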
Spark has the lineage and determinism stuff unnecessarily coupled to a bunch of stuff about keeping Java objects in memory and partitioning and records, which seems like it is somewhat extraneous, although apparently Java does kind of need that in order to run efficiently. (You’d think java.nio.MappedByteBuffer would have largely eliminated this problem, but apparently not.) Spark’s ability to “perform in-memory computations on large clusters in a fault-tolerant manner” does not depend in any way on Spark’s knowledge of the internal structure of partitions (that they contain records) nor of Spark’s knowledge of which partitions are associated together to form an RDD. All Spark needs to know to recreate a chunk of data is really how to redo the computation that created that chunk, and how to recreate the inputs that went into that computation.
However, unlike redo
, in-memory computations on large clusters
definitely need to be able to produce multiple output chunks, or
partitions or whatever. And, if those chunks are stored in
distributed memory on a cluster, it’s probably a good idea to send the
computation to where the data is.
Also, the granularity of the computation is likely larger than what
redo
has to deal with, which means that the system has more latitude
to do computations and heavy-duty setup than in redo
or make
. It
could run an A* search, for example, to choose among possible
alternative plans.
So here’s what I propose. Build steps nest. Build steps run in a contained environment, like what lxc provides, or if that’s not possible, a directory full of symlinks to read-only files from a git-annex-like chunk store. When you launch a build step, which you can do from the Unix command line, you explicitly list its input files, including the script for the build step, which causes them to appear in its contained environment. When it terminates, it leaves behind an output directory. All of the input and output files, as well as the directories containing them, are stored in a distributed chunk store, and their provenance is recorded in a distributed, replicated metadata store; this is very similar to git-annex. Build steps are presumed to be deterministic, and they are isolated from their environment to the extent possible, which reduces their nondeterminism. So, if you invoke a build step for which the system already has the results cached, it will retrieve those results from the chunk store. And the system may invoke the build step on a remote machine, if that's where the data is, and then replicate the results onto your local machine.
Lxc/virt-sandbox might impose some 200ms launch overhead on build steps, but that should probably be okay. If it’s not, Debian has fakechroot, which is an extremely efficient way to use a wrapper library to fake out some system calls to trick programs into thinking they don’t have access to the whole filesystem.
This allows you to write your build script as a shell script, which can invoke other scripts, possibly in a loop, and use their results. Once you have run your build script, which will usually be instantaneous, you can be sure that all of the build results are present.
When a build step invokes a subordinate build step, it names the input files it wants to provide to the subordinate step using the filenames it knows them by, and optionally puts them at a particular place in the child’s namespace; but the system invoking the subordinate step uses their immutable blob hashes in the cache keys, not the filenames known by the parent build step. This means that the same build step, running the same commands identically many times in different contexts, can invoke a sub-build-step that does different things.
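None of this exists yet, but to make the shape concrete, here is what
a top-level build script might look like, where buildstep is a
hypothetical command that hashes the named inputs, consults the
distributed cache, and runs the given command in a sandbox only on a
cache miss:

    #!/bin/sh -e
    # Hypothetical syntax; compile.do and link.do are themselves build-step scripts.
    for f in main.c util.c; do
        buildstep --in compile.do --in "$f" --out obj/ -- sh compile.do "$f"
    done
    buildstep --in link.do --in obj/main.o --in obj/util.o --out bin/ \
              -- sh link.do bin/myprog obj/main.o obj/util.o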
The top-level build script might also snapshot other parts of the filesystem, copying them into the blob repository, in order to make them available to subordinate build steps. For example, substantial parts of /lib and /usr/lib may be necessary.
(It would be nice to use sysdig or strace or something to figure out what files are actually being accessed, without having to write a filesystem.)
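For example, strace can already do this from the outside, with no
filesystem machinery at all; the resulting list of paths could then be
compared against the inputs the build step declared:

    strace -f -e trace=file -o build.trace sh build.sh   # log every path-taking syscall
    grep -o '"[^"]*"' build.trace | sort -u              # rough list of files touched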
Ideally you could concatenate several chunks into a single big output file in a purely virtual way, i.e. without actually copying the data.
I have an 800-gigabyte stock-market data set. If this were split into chunks of some 64 megabytes, it would be about 12500 chunks. If I generate another 200 gigabytes of derived data from that, it will be another 3125 chunks. If each of those derived chunks depended on, say, 400 input chunks, the SHA-1s of those inputs (its lineage) would be 8 kilobytes per derived chunk; all the lineage metadata together would be about 25 megabytes. Incrementally replicating 25 megabytes of lineage data will be no problem.
One of the interesting things about MapReduce is that any algorithm that can be expressed with it can be implemented with a small number of sequential passes over the data, so you can use it to improve locality of reference as well as for fault-tolerant cluster-scalable computation.
Although this approach is not nearly as extreme as MapReduce in that way, it may also have some virtue in that direction, since it encourages you to break up your computation into small pieces that consume a few small inputs and produce a few small outputs; in cases like the Spark reimplementation of the Pregel programming model, this may involve “transposing” an algorithm, separating things that were previously together and bringing together things that were previously separate. If you were to run only one build step at a time, you might be able to improve your in-memory performance dramatically.
Well, it’s been a year and still nobody has done the above, so I guess I should do some work on it. What does a minimal executable version of it look like?
Maybe you have these pieces: a content-addressed chunk store, a replicated metadata store recording each chunk’s lineage, a sandboxed runner for individual build steps, and a command-line tool for invoking build steps and fetching their results.
Do you really need all of those? What does the interface to creating output files look like?
Maybe a first step would be to think about what distributed MapReduce in this context looks like.
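As a very rough sketch using the hypothetical buildstep command from
above: each map task becomes one build step per input chunk, and each
reduce task becomes one build step whose inputs are the map outputs
for its partition:

    #!/bin/sh -e
    # Hypothetical word-count shape: map each chunk, then reduce each partition.
    for chunk in chunks/*; do
        buildstep --in map.do --in "$chunk" --out "mapped/$(basename "$chunk")/" \
                  -- sh map.do "$chunk"
    done
    for p in 0 1 2 3; do
        buildstep --in reduce.do --in mapped/ --out "reduced/$p/" \
                  -- sh reduce.do "$p"
    done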