# Compiler Quest

Bootstrapping a compiler is like a role-playing game. From humble beginnings, we painstakingly cobble together a primitive, hard-to-use compiler whose only redeeming quality is that it can build itself.

Because the language supported by our compiler is so horrible, we can only bear to write a few incremental improvements. But once we persevere, we compile the marginally better compiler to gain a level.

Then we iterate. We add more features to our new compiler, which is easier thanks to our most recent changes, then compile again, and so on.

## Parenthetically

Our first compiler converts parenthesized combinatory logic terms to ION assembly. For example:

parse "BS(BB);Y(B(CS)(B(B(C(BB:)))C));"
==  "BSBB;YBCSBBCBB:C;"

We assume the input is valid and contains no numbers in parentheses or square brackets (so it only uses @ to refer to previous terms and # for integer constants).

term acc (h:t)
| h == ')' || h == ';' = (acc, t)
| h == '('             = uncurry (app acc) (term "" t)
| otherwise            = if h == '#' || h == '@'
then app acc (h:head t:"") (tail t)
else app acc (h:"") t
where app acc s = term (if null acc then s else '':acc ++ s)

parse "" = ""
parse s = let (p, t) = term "" s in p ++ ';':parse t

To translate to combinatory logic, we first break our code into more digestible pieces:

pair x y f = f x y;
(||) f g x y = f x (g x y);
(++) xs ys = xs ys (\x xt -> x : (xt ++ ys));
ifNull xs a b = xs a (\_ _ -> b);
add r acc p = r (ifNull acc p ('':(acc ++ p)));
isPre h = h('#'(==)) || h('@'(==));
suffix f h t = isPre h (t undefined (\a b -> pair (a:[]) b)) (pair [] t) (\x y -> f (h:x) y);
atom r h acc t = suffix (add r acc) h t;
sub r acc = uncurry (add r acc) . r "";
closes h = h(';'(==)) || h(')'(==));
if3 h x y z = closes h x (h('('(==)) y z);
switch r a h t = if3 h pair (sub r) (atom r h) a t;
term acc s = s undefined (\h t -> switch term acc h t);
parse s = term "" s (\p t -> ifNull p ";" (p ++ (';':parse t)));

Scott-encoding pairs implies uncurry is the T combinator, and (.) is an infix operator with the same meaning as the B combinator. Thus the above becomes:

BCT;
BSBB;
YBCSBBCBB:C;
BRBKKBB;
CBBBSBS@#BB:#@";
SB@!T##=T#@=;
BSBCCBSCBB@%CT?B@ C:K@ KCBBB:;
BCBB@&@$; SBCBBBBBT@$TK;
SB@!T#;=T#)=;
SBCBBBBBB@)T#(=;
BCSBSBCC@*@ @(@';
YBBCT?@+;
YBSTKBBKBBKBC@,KBCBB@"B:#;;

We added a newline after each semicolon for clarity; these must be removed before feeding.

## Exponentially

Our next compiler rewrites lambda expressions as combinators using the straightforward bracket abstraction algorithm we described earlier.

The following are friendlier names for the combinators I, K, T, C, Y, and K, respectively.

id x = x;
const x _ = x;
(&) x f = f x;
flip f x y = f y x;
fix x = x (fix x);
Nothing x _ = x;

Our program starts with a few classic definitions. P stands for "pair".

Just x f g = g x;
P x y f = f x y;
(||) f g x y = f x (g x y);
(++) xs ys = xs ys (\x xt -> x : (xt ++ ys));

As combinators, we have:

BKT;
BCT;
BS(BB);
Y(B(CS)(B(B(C(BB:)))C));

Parsing is based on functions of type:

type Parser x = [Char] -> Maybe (x, [Char])

A Parser x tries to parse the beginning of a given string for a value of type x. If successful, it returns Just the value along with the unparsed remainder of the input string. Otherwise it returns Nothing.

Values of type Parser x compose in natural ways. See Swierstra, Combinator Parsing: A Short Tutorial.

pure x inp = Just (P x inp);
bind f m = m Nothing (\x -> x f);
(<*>) x y = \inp -> bind (\a t -> bind (\b u -> pure (a b) u) (y t)) (x inp);
(<$>) f x = pure f <*> x; (*>) p q = (\_ x -> x) <$> p <*> q;
(<*) p q = (\x _ -> x) <$> p <*> q; (<|>) x y = \inp -> (x inp) (y inp) Just; These turn into: B(B@ )@!; B(C(TK))T; C(BB(B@%(C(BB(B@%(B@$))))));
B@&@$; B@&(@'(KI)); B@&(@'K); B(B(R@ ))S; Our syntax tree goes into the following data type: data Ast = R [Char] | V Char | A Ast Ast | L Char Ast The R stands for "raw", and its field passes through unchanged during bracket abstraction. Otherwise we have a lambda calculus that supports one-character variable names. Scott-encoding yields: R s = \a b c d -> a s; V v = \a b c d -> b v; A x y = \a b c d -> c x y; L x y = \a b c d -> d x y; which translates to: B(BK)(B(BK)(B(BK)T)); BK(B(BK)(B(BK)T)); B(BK)(B(BK)(B(B(BK))(BCT))); B(BK)(B(BK)(B(BK)(BCT))); The sat parser combinator parses a single character that satisfies a given predicate. The char specializes this to parse a given character, and var accepts any character except the semicolon and closing parenthesis. We use sat (const const) to accept any character, and with that, we can parse a lambda calculus expression such as \x.(\y.Bx): sat f inp = inp Nothing (\h t -> f h (pure h t) Nothing); char c = sat (\x -> x(c(==))); var = sat (\c -> flip (c(';'(==)) || c(')'(==)))); pre = (:) <$> (char '#' <|> char '@') <*> (flip (:) const <$> sat (const const)); atom r = (char '(' *> (r <* char ')')) <|> (char '\\' *> (L <$> var) <*> (char '.' *> r)) <|> (R <$> pre) <|> (V <$> var);
apps r = (((&) <$> atom r) <*> ((\vs v x -> vs (A x v)) <$> apps r)) <|> pure id;
expr = ((&) <$> atom expr) <*> apps expr; As combinators, we have: B(C(TK))(B(B(RK))(C(BS(BB))@$));
B@/(BT(T=));
@/(BC(S(B@"(T(#;=)))(T(#)=))));
@&(@':(@*(@0##)(@0#@)))(@'(C:K)(@/(KK)));
C(B@*(C(B@*(S(B@*(B(@((@0#())(C@)(@0#)))))(B(@&(@((@0#\)(@'@.@1)))(@((@0#.)))))(@'@+@2)))(@'@,@1);

## Term rewriting; bootstrapping

Corrado Bőhm’s PhD thesis (1951) discusses self-compilation and presents a high-level language that can compile itself.

John Tromp lists rewrite rules to further reduce the size of the output combinatory logic term after bracket abstraction.

Chapter 16 of The Implementation of Functional Programming Languages by Simon Peyton Jones gives a comprehensive overview of this strategy.

Ancient philosophers thought about bootstrapping compilers but their progress was paltry. Or poultry, one might say.

CakeML is a formally verified bootstrappable ML compiler.

Ben Lynn blynn@cs.stanford.edu 💡