This year I had the pleasure to teach, for the first time, a course on compilers. I have been doing research on compilers for a while now but I suppose I never really asked myself seriously “what is a compiler?”. I could recognize a compiler if I saw one, but this is not good enough for teaching. One paper that gives an excellent conceptual explanation is Ager et. al.’s [PDF] but it might be a little hard to read if you are not familiar with abstract machines (Krivine, SECD) and with Standard ML syntax. In this post I will follow the same methodology, but as it applies to a very simple expression language which has only arithmetic, input, output and variable binding and a very simple abstract machine. This will make it, I hope, clearer what are an **evaluator**, **interpreter** and **compiler**, and how they are connected to the key notions of **abstract **and** virtual machine.**

I should specify that this is not a recipe for writing efficient or realistic compilers. This is an exercise in **understanding** what a compiler is. The second caveat is that the perspective presented here is quite common for functional or other higher-level programming languages but not quite as relevant for C, which is close enough to the concrete machine that an abstract machine is not really necessary. (Unless we consider LLVM-based compilation as using virtual machine.)

The code used in this post is OCaml. Why? Because OCaml is the best programming language bar none for implementing compilers. If you don’t believe me feel free to translate or reimplement my code in your favourite language and see how it goes.

## The object language

Let us consider this simple expression language:

e ::= n | x | e ⊙ e | let x = e in e | print e | read

where *n* stands for numeric constants, *x* for variable identifiers, and ⊙ for arbitrary arithmetic operators. We shall not deal with parsing at all as it is a totally separate topic, but we start from the abstract syntax tree, specified by this OCaml type:

type expression = | Const of int | Var of string | Binary of string * expression * expression | Let of string * expression * expression | Output of expression | Input

For example, the following expression

let z = (let x = read in let y = read in x + y) in print z

has the syntax tree:

Let ("z", Let ("x", Input, (Let ("y", Input, (Binary ("+", Var "x", Var "y"))))), Output (Var "z"))

## The evaluator

The first step in the formal definition of a programming language is giving its evaluation rules in an effective way. There are many ways to do this and they fall under the general topic of **operational semantics**, which can have several flavours (e.g. *big step*, *small step*, etc.) of which you can learn more in Andy Pitts’s excellent tutorial [link]. Without too much further theorising here is an obvious evaluator for our expression language:

let rec evaluate = function | Const n -> n | Binary (op, e, e') -> let v = evaluate e in let v' = evaluate e' in op_of_opcode op v v' | Let (x, e, e') -> let v = evaluate e in let e' = substitute x v e' in evaluate e' | Var x -> failwith ("undefined: " ^ x) | Output e -> let v = evaluate e in print_int v; print_newline (); v | Input -> read_int ()

Let’s discuss the code:

- A constant expression
*Const n*is evaluated to just*n*. - An operator
*Binary*evaluates the two arguments then applies the operator. The helper function*op_of_opcode*decodes the operator symbol into an actual (OCaml) operation:let op_of_opcode = function | "+" -> ( + ) | "*" -> ( * ) | ";" -> (fun x y -> ignore x; y) | (* and so on *)

- The
*Let*binder evaluates the definition*e*of*x*then replaces*x*with it in the body expression*e’*using a helper function to do the substitution:let rec substitute (x : string) (v : int) = function | Const n -> Const n | Binary (op, e1, e2) -> Binary (op, substitute x v e1, substitute x v e2) | Let (y, e1, e2) when x = y -> Let (y, substitute x v e1, e2) | Let (y, e1, e2) -> Let (y, substitute x v e1, substitute x v e2) | Var y when x = y -> Const v | Var y -> Var y | Output e -> Output (substitute x v e) | Input -> Input

Note that substituting in a

*Let*expression must ensure that the defined variable is*never*substituted as it would violate the obvious scoping rules of a local definition, hence the special case. - As a consequence, all
*Var*iables in a program are substituted at the point of definition se we shouldn’t encounter them. If we do, they must have been undefined so we raise a failure. *Output*evaluates its argument, prints it and returns it as a result. Since we are in an expression language we always need to produce some kind of integer result and this seems sensible.*Input*reads an integer and returns it.

This is a reasonable *definition* of our language’s semantics but it is rather impractical for two reasons:

- In order for this evaluator to execute the program it needs to modify the abstract syntax tree during the evaluation. This is clearly a bad idea for several reasons:
- Substitution or any other kind of AST manipulation will be a complicated operation which is unlikely to be easily mapped into smaller, machine-friendly operations.
- Substitution or any other kind of AST manipulation will always be inefficient in the extreme.
- Substitution is incompatible with using libraries, foreign-function interfaces, or runtime services.

- In order for the evaluator to execute a program it relies on the full power of the meta-language, in our case OCaml. Why is this a problem? We know that eventually we want our program to be executed by some machine, and it seems like a bad idea to assume that machine will be able to run OCaml, which is a big and complicated language requiring a sophisticated run-time environment. We need something simpler.

A compositional evaluator

We first remedy the problem of modifying the AST by giving a *compositional* evaluator in which the meaning of a term is defined from the meanings of its subterms. This is the key principle of *denotational semantics* which goes back to Scott and Strachey for programming languages and to Tarski for logics. To implement it we will add an *environment* to our evaluator, which is a dictionary mapping variable identifiers to values:

let rec cevaluate env = function | Const n -> n | Binary (op, e, e') -> let n = cevaluate env e in let n' = cevaluate env e' in op_of_opcode op n n' | Let (x, e, e') -> let env = (x, cevaluate env e) :: env in cevaluate env e' | Var x -> lookup x env | Output e -> let v = cevaluate env e in print_int v; print_newline (); v | Input -> read_int ()

The helper *lookup* function is:

let rec lookup x = function | [] -> failwith ("undefined: " ^ x) | (x', n) :: env when x = x' -> n | _ :: env -> lookup x env

Except for the treatment of variables, which are looked up rather than substituted, the two evaluators are identical. A mathematically-minded reader could *prove* (e.g. in Agda or Coq) some soundness results, e.g.

∀M.∀n. evaluate M = n ⇒ cevaluate [] M = n

Note that this is not entirely a trivial exercise because of the variable scoping rules! Also, keeping track of effects (input and output) presents some further formal challenges.

## Abstract machine and interpretation

The second step in our process is to try to write an evaluator restrained to particular operations. What operations? That depends on what our ultimate target architecture might be. We don’t necessarily want to use at this stage a genuine concrete machine specification (e.g. x86) because it would be extremely complicated, albeit doable — barely. What we want is an intermediate layer of abstraction. What we want is a set of operations that we believe can be refined efficiently into concrete machine operations. What we want is an *abstract machine.*

For functional languages the standard abstract machines are Krivine for call-by-name evaluation and SECD for call-by-value, but other abstract machines exist, for example the *Token abstract machine *used to interpret linear logic [link]. Designing abstract machines is a fascinating and fun exercise, a beautiful mix of engineering and mathematics. They truly are the nexus between *language* and *machine* which make compilation of higher-level languages possible.* *

For purpose of illustration I will use a totally made-up and very simple abstract machine, the PSM (*pure stack machine*). It is a stack equipped with some operations:

*push*a new element onto the stack*pop*the top of the stack*peek*at the n-th element of the stack*apply*an operation to the top two elements of the stack and leave the result instead*swap*the top two elements of the stack*ouput*the top of the stack*input*at the top of the stack.

let stack = ref [] let push x = stack := (x :: !stack) let pop () = stack := List.tl !stack let peek n = stack := (List.nth !stack n :: !stack) let apply op = stack := (function x :: x' :: xs -> (op x x') :: xs) !stack let swap () = stack := (function x :: x' :: xs -> x' :: x :: xs) !stack let output () = print_int (List.hd !stack); print_newline () let input () = stack := (read_int () :: !stack)

When we implement the abstract machine we purposefully move into the *imperative* world because machines are naturally imperative. Machines have a *state* which is changed by its operations. In our case this state is simply a global stack. This presentation can be made even nicer, at the cost of some sophistication, by packaging the stack as a functor which would hide all the implementation details from the user. I leave this as an exercise for the sake of a more elementary description.

We are now in a position to write an *interpreter* for this abstract machine, i.e. a *compositional evaluator* which will only use the abstract machine operations:

let rec interpret symt = function | Const n -> push n | Binary (op, e, e') -> interpret symt e; interpret (inc symt) e'; apply (op_of_opcode op) | Let (x, e, e') -> interpret symt e; interpret ((x, 0) :: (inc symt)) e'; swap (); pop () | Var x -> let n = lookup x symt in peek n | Output e -> interpret symt e; output () | Input -> input ()

Lets discuss the code:

- A constant expression is interpreted by pushing the constant onto the stack.
- A binary expression will interpret the operands and, assuming that the two values are left onto the top of the stack, will apply the operation.
- A let expression will interpret the definition
*e*, which will be left at the top of the stack then it will interpret the body*e’*remembering the offset of the value for*x*. We want the value of the body only to remain onto the stack se we swap the value of*x,*which is on the second position at the top, then we pop it out. - Variables are looked up at their known stack offset.
- Input and output are obvious.

The interpreter uses an extra argument, a symbol table mapping variable identifiers into stack offset, with helper function

let inc = List.map (fun (x, n) -> (x, n+1))

All the “execution” is now done in terms of abstract machine operations, even though we are using a few other operations for managing the symbol table. That is not a serious problem. In fact this mapping between variable identifiers and stack indices could be considered part of the parsing process.

At this point we have our *interpreter!*

The mathematically minded reader could now think about proving some *soundness* results about this interpreter, which even for a small language such as this one is quite a (formal, if not mathematical) effort:

∀M.∀n. evaluate M = n ⇒ run M = [n]

where

let run m = stack := []; interpret [] m; !stack

which means running the interpreter leaves a singleton stack containing the evaluated value.

## Compilation, virtual machines and native code generation

Once we have the interpreter, the compiler is surprisingly easy to implement. The interpreter *runs* the abstract machine instructions as it process the program. A compiler will simply *collect*, in a lazy way, the instructions (and save them to a file)!

let rec compile symt = function | Const n -> "push " ^ (string_of_int n) ^ "\n" | Binary (op, e, e') -> (compile symt e) ^ (compile (inc symt) e') ^ "apply " ^ op ^ "\n" | Let (x, e, e') -> (compile symt e) ^ (compile ((x, 0) :: (inc symt)) e') ^ "swap\n" ^ "pop\n" | Var x -> let n = lookup x symt in "peek " ^ (string_of_int n) ^ "\n" | Output e -> (compile symt e) ^ "output\n" | Input -> "input\n"

The code is strikingly similar. We could (almost) textually replace sequential composition with concatenation and the call to the function implementing the abstract machine instructions to the name of the instructions themselves. I have renamed both functions to *f* and placed the interpreter and the compiler side-by-side:

For illustration let’s look at one test expression, *m = let x = 6 in let y = 3 in x * y*:

# m;; - : expression = Let ("x", Const 6, Let ("y", Const 3, Binary ("*", Var "y", Var "x"))) # evaluate m;; - : int = 18 # cevaluate [] m;; - : int = 18 # run m;; - : int list = [18] # compile [] m |> print_string;; push 6 push 3 peek 0 peek 2 apply * swap pop swap pop - : unit = ()

The very last step is processing the output of the compiler, the *bytecode. *Here we have two choices:

- The bytecode can be interpreted by what is usually called a
*virtual machine*, which runs the bytecode on some concrete machine in the same way that our interpreter ran the original program on the abstract machine. Note that the bytecode is just a simpler programming language, but the principle remains the same. - The bytecode can be itself compiled just by tracing the execution of the
*virtual machine,*in the same way in which we traced the execution of the abstract machine to derive the object language compiler. Again, the principle is the same.

## Conclusion

Compiler development is still, unfortunately, somewhat of a black art. I hope this post was informative, and especially I hope that the methodological importance of the abstract/virtual machine came through. This is indeed the key concept that makes compilers work. A good abstract/virtual machine must be such that the object language can be interpreted/compiled by/to it efficiently and its own instructions, the bytecode, must be interpreted/compiled efficiently by/to the concrete hardware platform targeted.

I also wanted to illustrate the role that operational semantics (the non-compositional evaluator) and denotational semantics (the compositional evaluator) can play in rational compiler development, and how they can be gradually refined into interpreters and compilers. I hope that made sense!