# Parsing With Derivatives

Define a context-free grammar with start symbol "S", enter an input string, then hit the "Parse" button. We randomly pick one tree from the parse forest to display; clicking the button again rerolls the dice.

Concatenation is right-associative, e.g. "abc" means "a(bc)".

Large inputs fail because of relatively low browser limits, a relatively immature transpiler, and our prototype-quality code.

context-free grammar:

input string:

See:

## The source code

We define a Pe data structure to represent parsing expressions, that is, the right-hand side of the production rules of a grammar.

{-# LANGUAGE CPP #-}
#ifdef __HASTE__
{-# LANGUAGE PackageImports #-}
#endif
import Control.Arrow
#ifdef __HASTE__
import Haste.DOM
import Haste.Events
#else
#endif
import Data.Maybe
import qualified Data.Map as M
import qualified Data.Set as S
import System.Random
import Text.Parsec hiding (State)

-- NT = non-terminal. (:.) = concatenation.
data Pe = NT String | Eps Char | Nul | Ch Char | Or [Pe] | Pe :. Pe | Del Pe

Although it represents the empty string, the Eps (for epsilon) expression holds a character that winds up in the abstract syntax trees (AST) returned by the parser. Similarly, the Del (for delta) expression, which is only generated internally, holds an expression which later helps build ASTs.

A context-free grammar maps non-terminal symbols to parsing expressions:

type Grammar = M.Map String Pe

Our ASTs are full binary trees whose leaf nodes are characters (the free magma on the alphabet). The tree structure captures the order the production rules are applied.

We store a parse forest, that is, all possible ASTs. There may be exponentially many, but they are related in such a way that we can store them compactly.

-- Uni stands for "union".
data Ast = Lf Char | Ast :@ Ast | Uni [Ast] deriving Show

A fancier parser might return a parse forest, that is, all possible parse trees for a given input. Ours will simply pick one possible parse tree.

## Derivative work

To parse an input string, we first take successive derivatives of the start symbol with respect to each character of the input, taking care to leave bread crumbs in the Eps and Del expressions to record consumed characters. (The Del constructor is named for the delta symbol from the paper, but I also think of it as "deleted", because it remembers what has just been deleted from the input.)

Then the string is accepted if and only if the resulting expression is nullable, that is, accepts the empty string. As we traverse the expression to determine nullability, we also build ASTs to return.

We memoize derivatives by adding entries to a state of type Grammar. Initially, this cache contains only the input grammar, mapping nonterminal symbols to Pe values. Later, we place a derivative at the key formed by concatenating the characters involved in the derivative with the nonterminal symbol being derived.

For example, if S is a nonterminal symbol in the input grammar, then the key abS maps to derive 'a' (derive 'b' (NT "S")). We assume no nonterminal symbol in the input grammar is a suffix of any other nonterminal symbol, which is fine for a prototype.

It may help to imagine the grammar growing over time, gaining new production rules as we process input characters. Indeed, we consider nonterminals to refer to both nonterminals in the input grammar as well as their derivatives.

pwd :: Grammar -> String -> String -> Maybe Ast
pwd g start s = evalState (parseNull $NT$ reverse s ++ start) g

## Nothing matters

Computing nullability requires finding a least fixed point. I found this the toughest part of the algorithm, partly because they never taught fixed point theory when I was in school. For some reason, the method reminds me of Hopcroft’s algorithm to minimize a DFA, where we repeately refine a partition until we reach a stable answer.

We initially guess that each nonterminal is not nullable, which means it corresponds to Nothing. On encountering a nonterminal, if we’ve already seen it, then return our guess for that nonterminal. Otherwise, it’s the first time we’ve seen it and instead of guessing, we recursively traverse its corresponding expression. In doing so, we may discover our guess is wrong, so we correct it if necessary before returning ASTs.

We repeat until our guesses have stabilized. Guesses never change from something to Nothing, and the map of all guesses only changes if one of them changes from Nothing to something. We exploit these facts to simplify our code slightly.

parseNull :: Pe -> State Grammar (Maybe Ast)
parseNull pe = leastFix M.empty where
leastFix guessed = do
(b, (_, guessed')) <- runStateT (visit pe) (S.empty, guessed)
if M.size guessed == M.size guessed' then pure b else leastFix guessed'

visit :: Pe
-> StateT (S.Set String, M.Map String Ast) (State Grammar) (Maybe Ast)
visit pe = case pe of
Eps x  -> pure $Just$ Lf x
Del x  -> visit x
Nul    -> pure $Nothing Ch _ -> pure$ Nothing
Or xs  -> uni <$> mapM visit xs x :. y -> mul <$> visit x <*> visit y
NT s -> do
(seen, guessed) <- get
case () of
() | Just x <- M.lookup s guessed -> pure $Just x | S.member s seen -> pure Nothing | otherwise -> do modify$ first $S.insert s b <- visit =<< lift (memoDerive s) maybe (pure ()) (modify . second . M.insert s) b pure b mul :: Maybe Ast -> Maybe Ast -> Maybe Ast mul x y = (:@) <$> x <*> y

uni :: [Maybe Ast] -> Maybe Ast
uni xs | null xs'   = Nothing
| [x] <- xs' = Just x
| otherwise  = Just $Uni xs' where xs' = concatMap deUni$ catMaybes xs
deUni (Uni a) = a
deUni x       = [x]

We add a lightweight variant of the above that returns a boolean instead of a parse forest.

nullable :: Pe -> State Grammar Bool
nullable pe = leastFix S.empty where
leastFix guessed = do
(b, (_, guessed')) <- runStateT (popIn pe) (S.empty, guessed)
if S.size guessed == S.size guessed' then pure b else leastFix guessed'
-- Like visit, but faster.
popIn :: Pe -> StateT (S.Set String, S.Set String) (State Grammar) Bool
popIn e = case e of
Eps _  -> pure True
Del _  -> pure True
Nul    -> pure False
Ch _   -> pure False
Or xs  -> or <$> mapM popIn xs x :. y -> (&&) <$> popIn x <*> popIn y
NT s -> do
(seen, guessed) <- get
case () of
() | S.member s guessed -> pure True
| S.member s seen -> pure False
| otherwise -> do
modify $first$ S.insert s
b <- popIn =<< lift (memoDerive s)
when b $modify$ second $S.insert s pure b ## You must remember this Memoized derivatives are straightforward. For computing derivatives, we translate the rules given in the paper, and for memoization, on discovering a missing entry, we insert a knot-tying value before recursing, and replace it with the result of the recursion afteward. memoDerive :: String -> State Grammar Pe memoDerive cs@(c:s) = do m <- get unless (M.member cs m)$ do
modify $M.insert cs$ NT cs
d <- derive c =<< memoDerive s
modify $M.insert cs d gets (M.! cs) memoDerive _ = error "unreachable" derive :: Char -> Pe -> State Grammar Pe derive c pe = case pe of NT s -> pure$ NT $c:s Ch x | x == c -> pure$ Eps x
Or xs            -> Or <$> mapM (derive c) xs Del x :. y -> (Del x :.) <$> derive c y
x :. y           -> do
b <- nullable x
dx <- derive c x
if not b then pure $dx :. y else do dy <- derive c y pure$ Or [dx :. y, Del x :. dy]
_                -> pure Nul

## A grammar for grammars

I should be eating my own dogfood and using parsing with derivatives to read the definition of a context-free grammar, but it’ll have to wait until I add more features. For now, we use parser combinators.

cfg :: Parsec String () Grammar
cfg = M.fromList <$> between filler eof (many1 rule) where rule = (,) <$> sym <*> between (want "=") (want ";") expr
expr = Or <$> cat sepBy1 want "|" cat = foldr1 (:.) <$> many1 atm
atm = str <|> (NT <$> sym) sym = many1 alphaNum <* filler str = do s <- between (char '"') (want "\"")$ many $noneOf "\"" pure$ if null s then Eps '\949' else foldr1 (:.) $Ch <$> s
want :: String -> Parsec String () ()
want s = string s >> filler
filler = skipMany $com <|> void space com = try (string "--") >> skipMany (noneOf "\n") ## One of a kind Our Ast data structure holds a parse forest, which in general may be too large to show. Hence this function that shows a tree at random. showOne :: (String -> String) -> StdGen -> Ast -> String showOne paren g ast = case ast of Uni xs -> let (n, g1) = randomR (0, length xs - 1) g in showOne paren g1$ xs !! n
Lf c -> [c]
x :@ y -> let (gx, gy) = split g
in paren $showOne addParen gx x ++ showOne id gy y where addParen s = concat ["(", s, ")"] ## Frontend View the HTML source to see the hidden textareas that we harvest below. #ifdef __HASTE__ main :: IO () main = withElems ["grammar", "str", "out", "parse"]$
\[gEl, sEl, oEl, parseB] -> do
let
handle demo = do
Just b <- elemById $demo ++ "B" void$ b onEvent Click $const$ preset demo
preset demo = do
Just g <- elemById $demo ++ "G" Just s <- elemById$ demo ++ "S"
setProp oEl "value" ""
setProp gEl "value" =<< getProp g "value"
setProp sEl "value" =<< getProp s "value"
handle "1+1"
handle "par"
handle "pal"
preset "1+1"
void $parseB onEvent Click$ const $do setProp oEl "value" "" mg <- parse cfg "" <$> getProp gEl "value"
case mg of
Left e -> setProp oEl "value" $"error: " ++ show e Right g -> do s <- getProp sEl "value" rnd <- newStdGen setProp oEl "value"$ if M.member "S" g
then maybe "[parse failed]" (showOne id rnd) $pwd g "S" s else "missing start symbol S" #else main :: IO () main = print$ pwd g "S" $concat (replicate 39 "1+") ++ "+1" where Right g = parse cfg ""$ unlines
[ "S = T;"
, "T = T \"+\" T | N;"
, "N = \"1\";"
]
#endif

Ben Lynn blynn@cs.stanford.edu 💡