The Compiler Singularity

Our next compiler is self-hosting: it can read its 100% organic artisanal handmade source and produce the corresponding ION assembly.

------------------------------------------------------------------------
-- Supports comments, spaces, variables consisting of lowercase letters.
------------------------------------------------------------------------

pair x y f = f x y;
just x f g = g x;

pure x s = just (pair x s);
bind f m = m @K (\x -> x f);
ap x y = \s -> bind (\a t -> bind (\b u -> pure (a b) u) (y t)) (x s);
fmap f x = ap (pure f) x;
alt x y = \s -> (x s) (y s) just;
liftaa f x y = ap (fmap f x) y;
many = @Y \r p -> alt (liftaa @: p (r p)) (pure @K);
some p = liftaa @: p (many p);
liftki = liftaa (@K @I);
liftk = liftaa @K;
sat f s = s @K (\h t -> f h (pure h t) @K);
char c = sat (\x -> x(c @=));

lcr s   = \a b c d -> a s;
lcv v   = \a b c d -> b v;
lca x y = \a b c d -> c x y;
lcl x y = \a b c d -> d x y;

com = liftki (char #-) (liftki (char #-) (many (sat (\c -> @C (c(#
@=))))));
sp = many (alt (char # ) (alt (char #
) com));
spc f = liftk f sp;
spch = @B spc char;
var = spc ( some (sat (\x -> (#z(x @L)) (x(#a @L)) (@K @I) )));

foldr = @Y \r c n l -> l n (\h t -> c h(r c n t));
anyone = fmap (@:) (spc (sat (@K @K)));
pre = ap (alt (fmap (@K @I) (char #@)) (fmap (@B @B @:) (char ##))) anyone;
lam r = liftki (spch #\) (liftaa (@C (foldr lcl)) (some var) (liftki (char #-) (liftki (spch #>) r)));
atom r = alt (fmap lcv var) (alt (liftki (spch #() (liftk r (spch #)))) (alt (fmap lcr pre) (lam r)));
apps = @Y \f r -> alt (liftaa @T (atom r) (fmap (\vs v x -> vs (lca x v)) (f r))) (pure @I);
expr = @Y \r -> liftaa @T (atom r) (apps r);
def = liftaa pair var (liftaa (@C (foldr lcl)) (many var) (liftki (spch #=) expr));
program = liftki sp (some (liftk def (spch #;)));

lsteq = @Y \r xs ys a b -> xs (ys a (\u u -> b)) (\x xt -> ys b (\y yt -> x(y @=) (r xt yt a b) b));
rank ds v = foldr (\d t -> lsteq v (d @K) (\n -> @B (@:#@) (@:n)) (@B t \n -> #0(#1 @-)(n @+))) @? ds # ;
shows f = @Y \r t -> t @I f (\x y -> @B (@B (@:#`) (r x)) (r y)) @?;

isfree f = @Y \r t -> t (\x -> @K @I) f (\x y -> (r x) @K (r y)) @?;
unlam f = @Y \r t -> f t (t @? (@K (lcr (@:#I))) (\x y -> lca (lca (lcr (@:#S)) (r x)) (r y)) @?) (lca (lcr (@:#K)) t);
babs = @Y \r t -> t lcr lcv (\x y -> lca (r x) (r y)) (\x y -> unlam (isfree (lsteq x)) (r y));

dump tab = foldr (\h t -> shows (rank tab) (babs (h (@K @I))) (@:#;t)) @K tab;
main s = program s (@:#?@K) (@B dump (@T @K));

As the comment says, our parser now handles comments, whitespace, and variables consisting of lowercase letters. We also added code to look up the index of a top-level definition. Prefixing a character with @ gives us direct access to the primitive combinators. (Not to be confused with the @ operator of ION assembly.)

Our language is now friendly enough that we are willing to work in it with our bare hands. However, bootstrapping to reach this singularity requires more effort. For the sake of our previous compiler, we:

  • Remove spaces and newlines.

  • Remove comments.

  • Rename variables like xs and tab to single letters.

  • Strip the @ tag from combinators like @K.

  • Rewrite foo x y = as \x.\y.

  • Rewrite \x y -> as \x.\y.

  • Replace defined symbols with @ and a character indicating where they appeared: we refer to the nth definition with the character with ASCII code n + 31.

Doing this by hand, while feasible, grows tiresome and error-prone if we often make changes. Thus we automate with sed and awk, which are great tools for the job. For example, we can generate a lookup table, where we assume an equals sign is preceded by a space if and only if it’s the equals sign of a definition.

$ cat singularity | sed  -n '/.* =/{s/ .*//;p}' | awk '{printf "@%c",NR+31; print " " $0}'
@  pair
@! just
@" pure
@# bind
@$ ap
@% fmap
@& alt
@' liftaa
@( many
@) some
@* liftki
@+ liftk
@, sat
@- char
@. lcr
@/ lcv
@0 lca
@1 lcl
@2 com
@3 sp
@4 spc
@5 spch
@6 var
@7 foldr
@8 anyone
@9 pre
@: lam
@; atom
@< apps
@= expr
@> def
@? program
@@ lsteq
@A rank
@B shows
@C isfree
@D unlam
@E babs
@F dump
@G main

Then the following script performs our list of chores.

#!/usr/bin/env bash
cat singularity | sed '
/^-/d
s/[a-z]*\(.* \)=/\\\1->/
s/\<[dxv]s\>/p/g
s/\<ys\>/q/g
s/\<tab\>/a/g
s/\<xt\>/t/g
s/\<yt\>/u/g
s/#@/_/g
s/@//g
s/_/#@/g
'"$(cat singularity | sed  -n '/.* =/{s/ .*//;p}' | awk '{printf "s_\\<" $0 "\\>_@%c_g\n",NR+31 }' | sed 's/&/\\&/g')"'
s/\(\\[a-z ]*\)\([a-z]\) *->/\1->\\\2./g
s/\(\\[a-z ]*\)\([a-z]\) *->/\1->\\\2./g
s/\(\\[a-z ]*\)\([a-z]\) *->/\1->\\\2./g
s/\(\\[a-z ]*\)\([a-z]\) *->/\1->\\\2./g
s/\(\\[a-z ]*\)\([a-z]\) *->/\1->\\\2./g
s/\\ *->//g' | sed -z '
s/# /#_/g
s/@ /@_/g
s/ //g
s/_/ /g
s/\([@#][@#]\) /\1/g
s/#\n/#_/g
s/\n//g
s/_/\n/g
'

Caveats:

  • We assume there are at most 5 variables in a lambda. One regex rewrites \a .. y z -> as \a .. y -> \z., and we simply repeat this another 4 times, followed by a regex that removes empty lambda abstractions.

  • Due to lack of regex lookaround, we temporarily replace patterns like literal spaces and newlines with underscores, which will be restored after certain other patterns are removed. Thus we assume no underscores appear in the original source.

  • We need an extra regex to remove spaces after an escaped escape character. This assumes our source never contains an escaped space after an escaped escape character.

  • The ampersand has a special meaning in regexes that we must escape.

  • We’ve hardwired the variables that need renaming, along with replacements carefully chosen to avoid conflicts with the rest of the code. We could have written the source with single-letter variables, but then we’d miss an opportunity to show off and test new features.

Our previous compiler should accept the result:

\x.\y.\f.fxy;\x.\f.\g.gx;\x.\s.@!(@ xs);\f.\m.mK(\x.xf);\x.\y.\s.@#(\a.\t.@#(\b.\u.@"(ab)u)(yt))(xs);\f.\x.@$(@"f)x;\x.\y.\s.(xs)(ys)@!;\f.\x.\y.@$(@%fx)y;Y\r.\p.@&(@':p(rp))(@"K);\p.@':p(@(p);@'(KI);@'K;\f.\s.sK(\h.\t.fh(@"ht)K);\c.@,(\x.x(c=));\s.\a.\b.\c.\d.as;\v.\a.\b.\c.\d.bv;\x.\y.\a.\b.\c.\d.cxy;\x.\y.\a.\b.\c.\d.dxy;@*(@-#-)(@*(@-#-)(@((@,(\c.C(c(#
=))))));@((@&(@-# )(@&(@-#
)@2));\f.@+f@3;B@4@-;@4(@)(@,(\x.(#z(xL))(x(#aL))(KI))));Y\r.\c.\n.\l.ln(\h.\t.ch(rcnt));@%(:)(@4(@,(KK)));@$(@&(@%(KI)(@-#@))(@%(BB:)(@-##)))@8;\r.@*(@5#\)(@'(C(@7@1))(@)@6)(@*(@-#-)(@*(@5#>)r)));\r.@&(@%@/@6)(@&(@*(@5#()(@+r(@5#))))(@&(@%@.@9)(@:r)));Y\f.\r.@&(@'T(@;r)(@%(\p.\v.\x.p(@0xv))(fr)))(@"I);Y\r.@'T(@;r)(@<r);@'@ @6(@'(C(@7@1))(@(@6)(@*(@5#=)@=));@*@3(@)(@+@>(@5#;)));Y\r.\p.\q.\a.\b.p(qa(\u.\u.b))(\x.\t.qb(\y.\u.x(y=)(rtuab)b));\p.\v.@7(\d.\t.@@v(dK)(\n.B(:#@)(:n))(Bt\n.#0(#1-)(n+)))?p# ;\f.Y\r.\t.tIf(\x.\y.B(B(:#`)(rx))(ry))?;\f.Y\r.\t.t(\x.KI)f(\x.\y.(rx)K(ry))?;\f.Y\r.\t.ft(t?(K(@.(:#I)))(\x.\y.@0(@0(@.(:#S))(rx))(ry))?)(@0(@.(:#K))t);Y\r.\t.t@.@/(\x.\y.@0(rx)(ry))(\x.\y.@D(@C(@@x))(ry));\a.@7(\h.\t.@B(@Aa)(@E(h(KI)))(:#;t))Ka;\s.@?s(:#?K)(B@F(TK));

Ben Lynn blynn@cs.stanford.edu 💡