Import:
= Parser Combinators =

Suppose we write a function to parse an Integer. What type should it have?

We might try +String -> Integer+:


[ ]:
parseInteger :: String -> Integer
parseInteger = foldl (\n d -> 10*n + fromIntegral (fromDigit d)) 0

fromDigit c = ord c - ord '0'

parseInteger "123" + parseInteger "456"

The read function generalizes this, and it suffices for some
applications. Nevertheless, often we'd prefer a parser that also:

* handles invalid input
* allows trailing characters

Our design should take into account that we may want to parse any trailing
characters with another function.

== The hard way ==

One solution is a function that returns all valid ways to parse an Integer at
the start of a string along with the remainder of the string. In particular, it
returns the empty list on failure.

----------------------------------------------------------------
String -> [(Integer, String)]
----------------------------------------------------------------

The reads function generalizes this, and
https://webdoc.sub.gwdg.de/ebook/serien/ah/UU-CS/2008-044.pdf[Doaitse
Swierstra, _Combinator Parsing: A Short Tutorial_] builds upon this approach.

However, we'll insist on at most one correct way to parse a string.
We also provide a mechanism for reporting an error message:

----------------------------------------------------------------
String -> Either String (Integer, String)
----------------------------------------------------------------

This type accommodates parser functions that take a string and return either an
error message or the successfully parsed integer at the start of the string
along with the remainder of the input.

With this design, we might write a function to parse a single digit as follows:


[ ]:
parseDigit = \case
c:s' | '0' <= c, c <= '9' -> Right (fromDigit c, s')
_ -> Left "want digit"

parseDigit "2b"
parseDigit "b2"

From this we can build a parser for a two-digit number:


[ ]:
parseTwoDigits s = do
(a, s') <- parseDigit s
(b, s'') <- parseDigit s'
pure (10*a + b, s'')

parseTwoDigits "221b"
parseTwoDigits "2b21"

It works, but our design is infested with boilerplate. We must carefully manage
the input string, ensuring that the unparsed remainder of one step is fed to
the next. The pain worsens as we write more complex parsers.

We would prefer to write plumbing code like this once, and reuse it many times.
This is easy in Haskell.

== The easy way ==

We start by defining a data type to hold a parsing function, which we call a
Charser, a sort of portmanteau meant to evoke a parser that handles one Char
at a time. We also generalize so that our parsers can return any desired type.


[ ]:
data Charser a = Charser { unCharser :: String -> Either String (a, String) }

Next, we define some fundamental parsers. The sat parser returns the first
character if it satisfies a given predicate, and fails otherwise. THe eof
parser returns () if the given string is empty, and fails otherwise.


[ ]:
sat f = Charser \case
h:t | f h -> Right (h, t)
_ -> Left "unsat"

eof = Charser \case
[] -> Right ((), "")
_ -> Left "want EOF"

We can easily get at the function via unCharser:


[ ]:
unCharser (sat (== 'x')) "xyz"
unCharser (sat (== 'x')) "abc"
unCharser eof ""
unCharser eof "abc"

Lastly, we define Functor, Applicative, Monad, and Alternative
typeclass instances for the Charser type. The Functor instance is standard
stuff (that GHC can derive automatically) that lets us easily operate on a
parsed value, while the others describe how to compose two parsers. The Monad
and Applicative instances meticulously thread the input string from the first
parser to the second parser, while the Alternative instance builds a parser
that tries the first parser, and on failure tries the second parser.

These correpond to sequencing and the choice operator denoted by a vertical bar
"|" in a https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form[BNF grammar],
though in our framework we iterate through the choices in the order of
appearance and take the first match.


[ ]:
instance Functor Charser where fmap f (Charser x) = Charser $fmap (first f) . x instance Applicative Charser where pure a = Charser$ Right . (a,)
f <*> x = Charser \s -> do
(fun, t) <- unCharser f s
(arg, u) <- unCharser x t
pure (fun arg, u)
instance Monad Charser where
Charser f >>= g = Charser $(uncurry (unCharser . g) =<<) . f return = pure instance Alternative Charser where empty = Charser$ const $Left "" x <|> y = Charser \s -> either (const$ unCharser y s) Right $unCharser x s  And we're done! We have just defined _parser combinators_, which we compose to build complex parsers without needing to juggle bits and pieces of input strings. Our two-digit parser example becomes:  [ ]: digitChar = sat \c -> '0' <= c && c <= '9' twoDigits = do a <- digitChar b <- digitChar pure$ 10*fromDigit a + fromDigit b

unCharser twoDigits "221b"
unCharser twoDigits "2b21"

== Arbitrary-precision calculator ==

We demonstrate a more sophisticated parser that parses an integer arithmetic
expression. Rather than store an expression in a tree data structure, we
evaluate as we parse.


[ ]:
char :: Char -> Charser Char
char = sat . (==)

oneOf :: String -> Charser Char
oneOf s = sat (elem s)

chainl1 :: Charser a -> Charser (a -> a -> a) -> Charser a
chainl1 p infixOp = go =<< p where
go x = (go =<< ($x) <$> infixOp <*> p) <|> pure x

chainr1 :: Charser a -> Charser (a -> a -> a) -> Charser a
chainr1 p infixOp = (&) <$> p <*> (flip <$> infixOp <*> chainr1 p infixOp <|> pure id)

sp :: Charser a -> Charser a
sp p = p <* many (oneOf " \n")  -- Eat trailing spaces.

spch = sp . char

op c f = spch c *> pure f

line = sp (pure ()) *> expr <* eof where
num  = parseInteger <$> sp (some digitChar) atm = num <|> spch '(' *> expr <* spch ')' pow = chainr1 atm$ op '^' (^)
una  = foldr (.) id <$> many (op '+' id <|> op '-' negate) fac = una <*> pow term = chainl1 fac$ op '*' (*) <|> op '/' div <|> op '%' mod
expr = chainl1 term $op '+' (+) <|> op '-' (-) unCharser line "3 + 4 * 5"  Our line parser closely resembles the grammar it parses: ------------------------------------------------------------------------------ num ::= ('0'|..|'9'|'.')+ atm ::= num | '(' expr ')' pow ::= atm ('^' atm )* una ::= ('+'|'-')* fac ::= una pow term ::= fac ( ('*'|'/'|'%') fac )* expr ::= term ( ('+'|'-') term )* ------------------------------------------------------------------------------ In fact, one might say the definition of line is an https://en.wikipedia.org/wiki/Attribute_grammar[attribute grammar]. We omit the rules involving spaces, which are traditionally considered part of a separate phase called lexical analysis. With parser combinators, such divisions are blurrier, and it can be practical to combine the two phases into a single catch-all parsing phase. We embed the parser in a calculator web widget. Type an expression into the textbox and press enter.  [ ]: calc = do s <- jsEval "calcIn.value;" putStr$ case unCharser line s of
Left err -> err
Right (v, _) -> s ++ " = " ++ show v

jsEval [r|
<textarea style="border: solid 4px #999999" id="calcIn" rows="1" cols="40">8888+88 / 8+8 / 8+8 * 8</textarea>
<div id="calcOut"></div>
);

calcIn.addEventListener("keydown", ev => {
if (ev.keyCode == 13) {
const r = repl.run("chat", ["Main"], "calc");
calcOut.innerText = r.out;
calcIn.value = "";
ev.preventDefault();
}
});
|]

== The Fast Way ==

Suppose we have a complex grammar and a long input that fails to parse near the
end. Due to our implementation of (<|>), after the first failure, the parser
will backtrack and explore every alternative as it unwinds, before ultimately
giving up anyway.

Not only is this slow, but backtracking also discards the error thrown by the
original failure, leading to error messages that are difficult to understand.

To address these issues, popular parser combinator libraries change the meaning
of (<|>). If the first alternative fails _and_ no input has been consumed,
only then is the second alternative attempted. That is, if a parser processes
even a single character before failing, then the parser halts immediately with
failure.

Short-circuiting failure is indeed faster, but introduces a problem. At times
we really do want backtracking, that is, we want to undo consuming some of the
input before trying an alternative. Thus parser combinator libraries also
provide a try function: it tries a given parser on the input, and on failure,
acts as if no input has been consumed so that alternatitves can be tried.
Naturally,
http://blog.ezyang.com/2014/05/parsec-try-a-or-b-considered-harmful/[we should
use try sparingly].

One way to implement this feature is to have parsers return a boolean
indicating whether input has been consumed.


[ ]:
data Parser a = Parser
{ unParser :: String -> (Bool, Either String (a, String)) }

instance Functor Parser where
fmap f (Parser x) = Parser $second (fmap$ first f) . x
instance Applicative Parser where
pure a = Parser $(False,) . Right . (a,) f <*> x = Parser \s -> let (c0, mf) = unParser f s in case mf of Left err -> (c0, Left err) Right (fun, t) -> let (c1, mg) = unParser x t in case mg of Left err -> (c0 || c1, Left err) Right (arg, u) -> (c0 || c1, Right (fun arg, u)) instance Monad Parser where Parser f >>= g = Parser \s -> let (c0, mf) = f s in case mf of Left err -> (c0, Left err) Right (x, t) -> let (c1, mg) = unParser (g x) t in case mg of Left err -> (c0 || c1, Left err) r -> (c0 || c1, r) return = pure instance Alternative Parser where empty = Parser$ const (False, Left "")
x <|> y = Parser \s -> let cm@(consumed, m) = unParser x s in case m of
Left err -> if consumed then cm else unParser y s
Right _ -> cm

try p = Parser \s -> let cm@(_, m) = unParser p s in case m of
Left err -> (False, Left err)
_ -> cm

sat f = Parser \case
h:t | f h -> (True, Right (h, t))
_ -> (False, Left "unsat")

(We call this version Parser instead of Charser because my interprefer has
issues with redefining typeclasses.)

Here's some examples showing how the new (<|>) behaves:


[ ]:
char = sat . (==)
string = mapM char

unParser (string "foo" <|> string "bar") "foo"
unParser (string "foo" <|> string "bar") "bar"
unParser (string "foo" <|> string "bar") "baz"
unParser (string "foo" <|> string "bar") "qux"
unParser (string "qux" <|> string "quux") "quux"
unParser (try (string "qux") <|> string "quux") "quux"

== The evolution of parsing ==

As a child, I was baffled by parsers because I only understood loops.
How does the computer handle arbitrarily nested parentheses?!

Parsers might have been fundamentally beyond my reach at the time, as some
research suggests that
https://modeltheory.org/papers/2018children-recursion.pdf[10-year-olds struggle
with the concept of recursion, while 11-year-olds struggle with understanding
recursive programs]. As it happened, I was 12 or so when I read about
https://en.wikipedia.org/wiki/Recursive_descent_parser[recursive descent
parsers]. (For memory, the book was Herbert Schildt, _Advanced Turbo C_.) I was
awestruck by their simplicity, elegance, and power. I felt I had leveled-up in
real life. I now knew everything about parsers.

I was awestuck again on encountering
https://en.wikipedia.org/wiki/Compilers:_Principles,_Techniques,_and_Tools[the
"red dragon book"] at university.
Forget those childish recursive descent parsers! Instead, the computer should
do the hard work; the human should just write a handful of regular expressions
and a BNF grammar. Not only is it simple and elegant, but there are
strong guarantees: the parser generator produces efficient code and detects
ambiguities. I felt I knew everything about parsers. This time for real.

I was awestruck yet again years later when I stumbled upon
https://en.wikipedia.org/wiki/Parser_combinator[parser combinators].
Forget those cumbersome parser generators! I had thought combinators were a
mathematical curiosity; a fun alternative to Turing machines for studying
computability. But parsing functions that return a parsing function in addition
to performing other duties lead to amazingly succinct recursive descent
parsers.

But the shocks didn't stop. I stumbled upon
http://bford.info/pub/lang/peg.pdf[parsing expression grammars], or PEGs, which
can be parsed in linear time and are unambiguous. They can parse some
non-context-free languages, and it is unknown if they can parse all
context-free langauges.

And then there's
link:pwd.html[parsing with derivatives], which
https://arxiv.org/pdf/1604.04695.pdf[parse any context-free grammar in cubic
time] (on par with Earley's algorithm or CYK). In brief, "even though
link:re.html[Brzozowski defined the derivative with regular languages] in mind,
it works unaltered for context-free languages if you toss in laziness,
memoization and fixed points."

I now feel I know little about parsers. In practice, I let expedience be my
guide. If I must have features guaranteed by the theory, or if parser
combinators are unavailable, then I'll resort to a parser generator. Otherwise,
I'll choose a parser combinator library such as