ION Assembly

By "computer program", we often mean a string intended for human comprehension. In contrast, language specifications usually involve more abstract objects such as syntax trees, where we can ignore details such as which characters are treated as whitespace.

A parser maps a string to a syntax tree, or an error if the input string is not a valid program.

parse :: [Char] -> Either Error Language

Parsing is often viewed as part of the compiler’s job:

parseCompile :: [Char] -> Either Error TargetLanguage
parseCompile s = case parse s of
  Left  err  -> Left  err
  Right prog -> Right (compile prog)

Our TargetLanguage is a sort of assembly language for the ION machine, where we use a string representation of a list of CL terms. (Why a list instead of a single CL? See below.)

type TargetLanguage = [Char]
assemble :: TargetLanguage -> Either Error [CL]

We also need some way to load an ION machine with a [CL]:

load :: [CL] -> VM

We build assemble and load below:

▶ Toggle extensions and imports

Grammar

We define a language that is easy to generate and parse, and that humans can tolerate in small doses:

digit = '0' | ... | '9'
num = digit [num]
con = '#' CHAR | '(' num ')'
idx = '@' CHAR | '[' num ']'
comb = 'S' | 'K' | 'I' | ...
term = comb | '`' term term | con | idx
prog = term ';' [prog]

A program is a sequence of terms terminated by semicolons. The entry point is the last term of the sequence.

The motivation for a list instead of a single term is that rather than:

norm = (\sq x y -> sq x + sq y) (\x -> x * x)

we’d like to write the sequence of definitions:

square x = x * x
norm x y = square x + square y

The semicolon terminators in our grammar are unnecessary, but aid debugging.

The backquote is a prefix binary operator denoting application. For example, we represent the program BS(BB) with:

``BS`BB;

A term is represented by the last element of a sequence of expressions terminated by semicolons. The sequence is 0-indexed, and we may refer to an earlier term by enclosing its index in square brackets.

Thus another way to represent BS(BB) is:

B;S;``[0][1]`[0][0];

An term may refer to an earlier term using the (@) prefix unary operator. Let n be the ASCII code of the character following (@). Then it refers to the term of index n - 32. This offset is chosen so that short programs can be written in printable ASCII without worrying about decimal conversion.

For example, instead of "[0]" we may write "@ " and instead of "[10]" we may write "@*".

If an earlier term is needed multiple times, then we share rather than duplicate. Accordingly, we define combinatory logic terms to be combinators, applications, and indexes of previously defined terms:

data CL = Com Int | CL :@ CL | Nat Int | Idx Int

An 32-bit word constant is represented by a number in parentheses. For example, 42 is represented by (42).

An alternative representation is to use the unary prefix operator (#), in which case the constant is the ASCII code of the following character. Again, this is so we can generate code without worrying about converting to decimal.

For example, instead of (42) we may write #*.

Parser

We use Megaparsec to build a recursive descent parser for ION assembly.

digit = digitChar
num = read <$> some digit
nat = ord <$> (char '#' *> anySingle)
  <|> char '(' *> num <* char ')'
idx = (+(-32)) . ord <$> (char '@' *> anySingle)
  <|> char '[' *> num <* char ']'
comb = Com . ord <$> oneOf "SKIBCTRY:L=+-/*%?"
term = comb
  <|> ((:@) <$> (char '`' *> term) <*> term)
  <|> Nat <$> nat
  <|> Idx <$> idx

prog :: Parsec () [Char] [CL]
prog = some (term <* char ';')

Thus our assemble function is simply:

assemble = parse prog ""

We load terms onto a VM as follows.

fromExpr :: [(Int, Int)] -> CL -> VM -> (Int, VM)
fromExpr defs t vm = case t of
  Com n -> (n, vm)
  x :@ y -> let
    (a, vm1) = fromExpr defs x vm
    (b, vm2) = fromExpr defs y vm1
    in ION.app a b vm2
  Nat n -> ION.app 35 n vm
  Idx n | Just addr <- lookup n defs -> (addr, vm)

load :: [CL] -> VM
load ts = ION.push root vm1 where
  ((_, root):_, vm1) = foldl addDef ([], ION.new) ts
  addDef (defs, m) t = let (addr, m') = fromExpr defs t m
    in ((length defs, addr):defs, m')

A helper turns a given pure CL program into one that runs on standard input and output by mapping a term x to x(0?)(.)(T1):

wrapIO :: CL -> CL
wrapIO x = x :@ (Com 48 :@ Com 63) :@ Com 46 :@ (Com 84 :@ Com 49)

Lastly, a main function which runs the program in the file whose name is supplied as the first argument:

main :: IO ()
main = getArgs >>= \as -> case as of
  []  -> putStrLn "Must provide program."
  [f] -> do
    s <- readFile f
    case assemble s of
      Left err -> putStrLn $ "Parse error: " <> show err
      Right cls -> do
        void $ ION.evalIO $ load $ cls ++ [wrapIO $ Idx $ length cls - 1]
  _   -> putStrLn "One program only."

For example:

$ echo -n 'I;' > id
$ echo "Hello" | ion id
Hello
$ echo -n '``C`T?`KI;' > tail
$ echo "tail" | ion tail
ail

Parsing Pitfalls

Sadly, we largely ignore the topic of parsing because we wish to focus on lambda calculus, combinatory logic, and types. We get by with parser combinators, which we hardly explain.

This is a pity, because my textbooks omitted several fascinating facts, partly due to their vintage:

  • Many recommend Thompson’s construction algorithm for building DFAs from regular expressions. In fact, regular expression derivatives are superior. (In practice, NFAs may be better to because conversion to DFAs may blow up exponentially.) See also Fun with Semirings for lesser-known connections between regular languages and other parts of computer science.

  • Parsing combinators take the tedium out of recursive descent parsers, at least in languages like Haskell.

  • Parsing with derivatives makes it easy to write an efficient context-free parser generator.

  • Parsing expression grammars can be parsed in linear time and can even handle some context-sensitive grammars, though it is unknown whether they can handle all context-free grammars.


Ben Lynn blynn@cs.stanford.edu 💡