Parser Combinators

The above demo expects expressions conforming to a simple calculator grammar (the numbers and symbols have their usual meaning):

num    ::= ('0'|..|'9'|'.')+
factor ::= ('+'|'-')* ( num | '(' expr ')' )
term   ::= factor ( ('*'|'/') factor )*
expr   ::= term ( ('+'|'-') term )*
line   ::= expr EOF

We’ll see how to parse and evaluate expressions with code that takes about the same space as the grammar. But first, we need a few imports so we can show off our interpreter on this webpage:

import Control.Monad
import Haste.DOM
import Haste.Events

The interpreter itself closely follows the grammar. The main difference is one extra line for unary operators:

-- I recommend Megaparsec over Parsec.
-- I'm forced to use Parsec here because of Haste.
import Text.Parsec

line :: Parsec String () Double
line = expr <* eof where
  num  = read <$> many1 (digit <|> char '.')
  una  = product <$> many ((char '+' >> pure 1) <|> (char '-' >> pure (-1)))
  fac  = (*) <$> una <*> num <|> between (char '(') (char ')') expr
  term = fac  `chainl1` ((char '*' >> pure (*)) <|> (char '/' >> pure (/)))
  expr = term `chainl1` ((char '+' >> pure (+)) <|> (char '-' >> pure (-)))

Lastly, we need code that pulls from the input text area when Enter is hit and writes the corresponding result to the output text area then scrolls down. For convenience, we strip out spaces from the input string.

main :: IO ()
main = withElems ["input", "output"] $ \[iEl, oEl] ->
  void $ iEl `onEvent` KeyDown $ \key -> when (key == mkKeyData 13) $ do
    s <- filter (/= ' ') <$> getProp iEl "value"
    setProp iEl "value" ""
    hist <- (++ ("> " ++ s ++ "\n")) <$> getProp oEl "value"
    setProp oEl "value" $ (hist ++) $ case parse line "" s of
      Left e -> "Error: " ++ show e ++ "\n"
      Right d -> show d ++ "\n"
    getProp oEl "scrollHeight" >>= setProp oEl "scrollTop"

To build this demo yourself, install Haste and AsciiDoc, and then type:

$ haste-cabal install parsec
$ wget
$ hastec parse.lhs
$ sed 's/^\\.*{code}$/-----/' parse.lhs | asciidoc -o - - > parse.html

Open parse.html in a browser to see it action.

Patching our partial parser

Our use of the partial function read should be viewed with suspicion, and indeed, we find inputs such as "1.2.3" cause exceptions. We can fix this with readMaybe from Text.Read:

num = do
  s <- many1 (digit <|> char '.')
  case readMaybe s of
    Just x -> pure x
    Nothing -> fail "bad float" <?> "number"

Using readFloat from Numeric instead is similar.

Unfortunately, a Haste bug means these solutions fail. To work around the bug, we can roll our own floating-point reading routine. If we’re taking the trouble to do this, we may as well also modify the grammar to accept at most one decimal point, which produces friendlier error messages:

num = do
  s <- many1 digit
  t <- option "" $ char '.' >> many1 digit
  pure $ fromIntegral (read $ s ++ t) / fromIntegral (10 ^ length t)

Alternatively, we could use the Parsec library’s routines:

import Text.Parsec.Language
import Text.Parsec.Token

num = do
  e <- naturalOrFloat $ makeTokenParser emptyDef
  pure $ case e of
    Left  n -> fromIntegral n
    Right d -> d

This also adds support for all numeric literals defined in the Haskell report such as "1.23e-4" or "0xface".

It turns out Parsec’s Expr module is specifically designed for expression grammars like ours, and can build a parser from a supplied table of operators. However, it hides the interesting part of the library, namely the combinators. Also, using the module hardly saves any lines of code in our case.

Do or do not

Use try sparingly to prevent excessive backtracking and confusing error messages. If try must be used, then sometimes lookahead may help clear up the error messages by also rewinding the position on failure.

I read about using breadth-first search in conjunction with parser combinators to avoid try, but haven’t tried it yet. I also read about a parser combinator library focusing on diagnostics, but haven’t tried that either.

I’ve seen dubious claims that lexing and parsing should always be combined when using parser combinators. While great for some grammars, blindly following this rule can clutter the code and make try harder to eliminate.

It can be better to tokenize the input stream with parser combinators and output a list of lexemes (which is really a stream), then parse this list of lexemes with another set of parser combinators. In Parsec, we can write a thin wrapper to pass position information from the tokenizer to the parser so that error reporting is unaffected.

Indeed, we can go further and trivially add more layers beyond a parser and a lexer. Doing this with parser generators and dedicated lexers is cumbersome because it involves multiple languages.

The evolution of parsing

As a child, I was baffled by parsers because I only pursued iterative approaches. How does the computer handle arbitrarily nested parentheses?!

I was awestruck when I learned about recursive descent parsers. (For memory, it was Advanced Turbo C by Herbert Schildt.) Simple, elegant, yet so powerful. I felt I knew everything about parsers.

I was awestuck again on encountering the "red dragon book" at university. Forget those childish recursive descent parsers! Instead, the computer should do the hard work; the human should just write a handful of regular expressions and a BNF grammar. Not only is it simple and elegant, but there are strong guarantees: the parser generator produces efficient code and detects ambiguities. I felt I knew everything about parsers. This time for real.

I was awestruck yet again years later when I stumbled upon parser combinators. Forget those cumbersome parser generators! I had thought combinators were a mathematical curiosity; a fun alternative to Turing machines for studying computability. But parsing functions that return a parsing function in addition to performing other duties lead to amazingly succinct recursive descent parsers.

More recently, I heard about parsing expression grammars, or PEGs, which can be parsed in linear time and are unambiguous. They can parse some non-context-free languages, and it is unknown if they can parse all context-free langauges.

And then there’s parsing with derivatives, which parse any context-free grammar in cubic time (on par with Earley’s algorithm or CYK). In brief, "even though Brzozowski defined the derivative with regular languages in mind, it works unaltered for context-free languages if you toss in laziness, memoization and fixed points."

I now feel I know hardly anything about parsers.

Knuth intends to write about parsers in Volume 5 of The Art of Computer Programming. I once thought it would be almost pointless; why not just read the dragon book?

But now I’d love a comprehensive reference that contained parser combinators, PEGs, and parsing with derivatives. Moreover, in order to teach parser combinators and parsing with derivatives, it seems one would have to teach combinators, lambda calculus, lazy evaluation, fixed points, type theory, and so on. It would be great for these topics to gain wider exposure, and great to see them neatly explained in a single series of books.

These days, I let expedience be my guide: if I must have features guaranteed by the theory, or if parser combinators are unavailable, then I’ll resort to a parser generator. Otherwise, I’ll choose parser combinators.

Ben Lynn 💡