NOTE: I wrote this based on the same casual tech talk I gave in USTC and SJTU. The talk is recorded in Chinese, thus I wrote this to share the ideas to more people :)
Why am I speaking about these three things? First, they are often paid poor attention in the regular CS education curriculum; second, they are subtly sharing some commonalities, which might not be intuitive at the first sight.
First, what is a build system? It is a tool that transforms and composes the source code and other resources into a single software application. The fundamental reason why we need it is that there are multiple layers of abstraction between source code and instructions run on your computer. Usually, these transformations are performed by other tools called compiler and linker etc. However, building software can be a complex process, which poses significant challenges for the team. For example, in the early days of Window NT kernel development (see book Show Stopper!: The Breakneck Race to Create Windows NT and the Next Generation at Microsoft), windows team needs one (or two) specific person (people) who knows all the tricky steps and specific environment setup in order to produce a complete executable kernel from source code. Since it is a tedious job and compilation become slower and slower when the codebase grows, this specific person might want to construct an automated tool to perform the steps. It would be nice to support dependency tracking, such that things are only built when their dependencies changed.
To give you an example, this is a short snippet for Makefile, a declarative language for describing the dependencies and the steps to produce target file(s) from dependencies. One of the most intuitive ways to present this file is to interpret it as a dependency graph. But, what does this mysterious title, “Non-recursive Make Considered Harmful”, mean here? What you need to know is that Makefile scales poorly, especially for big projects.
Concretely, here is part of the (many)
Makefiles used to build GHC, a compiler for Haskell language.
This is a very complex project (e.g. platform sensitive compilation, hybrid source code language, dynamic code generation etc.), and you can see that the resulted Makefile is hardly readable, not to say being maintainable. This paper tried to use a new build system called Hadrian to rewrite the build system of GHC compiler. The picture on the right is an example of build script in Shake. This is actually Haskell code, and the library uses type system of Haskell to conveniently encode the build system’s semantics correctly & efficiently while permitting you to use any features or libraries already in Haskell.
This migration project is called Hadrian, which I contributed to two years ago and now has been part of GHC and will be used as the default build system soon.
However, build system can be used to build things beyond software applications. For example, if you are working a ML project involving the data pipeline that transforms raw data into feature vectors (this process is called ETL, Extract-Transform-Load) and then used to train a ML model, either online or offline, you might have the needs such as changing some code/infra in the later stages but doesn’t want to rerun everything in the earlier stages from scratch, which can be a long time.
As a mid-talk quiz/question, I want to ask you to think about the neural network as a computation dependency graph, and how we could use knowledge in build system, lazy evolution and incremental computation to make it more efficient.
The transformation from code to computed results is called evaluation, and then there will be rules describing possible transformations. However, constructs that look exactly the same (i.e. with same syntax structure) can be evaluated (interpreted) differently, resulting in different runtime behaviors or even different results. Lazy evaluation is the evaluation strategy that puts off the computation (transformation) of expression as late as possible. Why is it useful? It gives extra power to the syntax language, makes it possible to describe an infinitely long data structure. This extra power enables a more declarative syntax style, which looks like math formula, and thus makes your programs look simpler.
fibs :: [Int] fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
Q: Haskell is a functional language, but most functional languages are interpreted – how is haskell compiled to machine code?
A: Haskell is not like C, which is compiled to machine code while preserving the basic structure of the code. Haskell code will undergo several layers of transformations into STG code (Spinless tagless G-machine), which are graph level operations/reductions. Then, STG is compiled to low-level language (C/ASM/LLVM), and linked to a small RTS (runtime system) compiled from C code. For more details, see https://aosabook.org/en/ghc.html.
To give you another concrete application of this, here is the data flow block graph for Spark, a Big Data computation framework. Spark lets you define the data transformation you want through expression APIs based on map and reduce operators. Then, Spark constructs a dependency graph based on the final expressions. It only performs the necessary computation if the user explicitly requires it.
When we are solving an algorithm problem, we have access to all the input data, and we compute the output based on all of them in a single batch. However, in the real world production scenarios, it is a common need to build a daemon program, which handles incoming requests, non-stop.
In this case, the input changes incrementally, and we can’t afford recomputing the output each time it changes. (Concretely, services at a scale like Google might receive millions of update requests per day, and each request can effectively change the input to your aggregation algorithm, for example, page-rank).
Similarly, another related work is called “data-structure synthesis”. It is a sub-topic in the area of program synthesis, and its goal is to synthesize data structures with state e.g. interface implementation or C++ class. One such work is called Cozy, and it allows user to specify the behavior of intended programs by a logical specification of the query and operator semantics. The value of the tool comes from its ability to synthesize a program that satisfies the specification and also optimize it, in a large sense by automatic incrementalization. Consider the example on the right, a good implementation of this spec would be using an auxiliary data structures for maintaining the maximum value with each update operation, and simply return this value when being called. In the future, this style of work might be used to speed up data-intensive applications automatically in scale.
Thinking about them…
The questions are:
- How are these three concepts related?
- How are they relevant to my daily engineering tasks?
- How will the future of algorithm development look like?
First, I’d like to use a graph-based abstraction to unify three concepts. Each node might be a data or operator, while each edge is a “depends-on” relationship. This graph might be statically determined or changeable throughout the dynamic evaluation process.
Now, in a build system, each node might be an I/O operation or command-line program; in the lazy evaluation, we might be able to do computation over the dependencies, and evaluation order will be important in this case. Finally, the incremental computation can be understood as a shortcut that computes changed output directly based on the change in the input, without the need for recompilation using the full input data).
In fact, our daily engineering tasks might include many such small facilities or tricks already:
In the future, I think there are two trends:
- Decouple of performance optimization and algorithmic specification (business requirements understanding): Programmers will continue diverging into two groups of people, one building better and better infra tools, and another group will focus on mastering a wide range of tools, composing them and using them to express the business logics.
- Advancement of practical, backward-compatible program synthesizers for automatic algorithm development: Previous work in this area focuses on building synthesizers for newly designed, standalone specifications (either in logical expression, I/O examples or predictive, pattern-based systems).