# Parsing With Derivatives

Hit the "Parse" button to parse a given input with a given context-free grammar with start symbol "S". We randomly pick one tree from the parse forest to display. Clicking again rerolls the dice.

We also show the entire parse forest, using sharing to avoid exponential output. Try something like "1+1+1+1" for a more readable example.

context-free grammar:

input:

parse forest:

See:

## Derivative work

We define a Pe data structure to represent parsing expressions, that is, the right-hand side of the production rules of a grammar.

{-# LANGUAGE CPP #-}
#ifdef __HASTE__
{-# LANGUAGE PackageImports #-}
import Haste.DOM
import Haste.Events
#else
#endif
import Data.List (intersperse)
import Data.Maybe
import qualified Data.Map as M
import System.Random
import Text.Parsec hiding (State)

-- NT = non-terminal. (:.) = concatenation.
data Pe = NT String | Eps Char | Nul | Ch Char | Or [Pe] | Pe :. Pe | Del Pe


Although it represents the empty string, the Eps (for epsilon) expression holds a character that winds up in the abstract syntax trees (AST) returned by the parser. Similarly, the Del (for delta) expression, which is only generated internally, holds an expression which later helps build ASTs.

Our ASTs are full binary trees whose leaf nodes are characters (the free magma on the alphabet). The tree structure captures the order the production rules are applied.

We compute a parse forest, that is, all possible ASTs. It may contain exponentially many trees, but we represent it compactly by sharing. For this reason, our AST supports indirection nodes, and we keep around a map from strings to ASTs.

We also maintain another similar map to help compute nullability.

Lastly, we turn the grammar into yet another map to which we add more entries as we parse.

data Ast = Lf Char | Ast :@ Ast | Union [Ast] | Ind String deriving Show

data PWD = PWD
{ nullMemo :: M.Map String Bool
, astMemo :: M.Map String Ast
, grammar :: M.Map String Pe
}


To parse an input string, we repeatedly derive the start symbol with respect to each character of the input, taking care to leave bread crumbs in the Eps and Del expressions to record consumed characters. (The Del constructor is named for the delta symbol from the paper, but I also think of it as "deleted", because it remembers what has just been deleted from the input.)

Then the string is accepted if and only if the resulting expression is nullable, that is, accepts the empty string. We traverse the final derivative to recover the parse forest.

We memoize derivatives by adding entries to a state of type PWD. Initially, this cache contains only the input grammar, mapping nonterminal symbols to Pe values. Later, we place a derivative at the key formed by concatenating the characters involved in the derivative with the nonterminal symbol being derived.

For example, if S is a nonterminal symbol in the input grammar, then the key abS maps to derive 'a' (derive 'b' (NT "S")). We assume no nonterminal symbol in the input grammar is a suffix of any other nonterminal symbol, which is fine for a prototype.

It may help to imagine the grammar growing over time, gaining new production rules as we process input characters.

pwd :: PWD -> String -> String -> Maybe (String, M.Map String Ast)
pwd st start inp = if b then Just (s, astMemo st') else Nothing where
s = reverse inp ++ start
(b, st') = runState (parseForest s) st


## Nothing matters

The paper mentions using Kleene’s fixed point theorem to compute nullability. I believe for context-free grammars, rather than iterating a function until it stabilizes, we can specialize to the following simple algorithm.

We maintain a map from nonterminals to booleans indicating nullability. On encountering a nonterminal T, if already present in the map, then we use its memoized nullability. Otherwise, we temporarily say T is not nullable and insert False for its value in the map, before recursively computing nullability on the definition of T. Then we update the map with the returned nullability boolean.

For example, consider the rule:

T = T "x" | ""

Then to compute the nullability of T:

1. Assume T is non-nullable. We insert T -> False into our map.

2. Recurse on T "x" | "":

1. The first branch is non-nullable because T is non-nullable according to the map.

2. The second branch is a nullable terminal, so the whole expression is nullable.

3. We update the map with the result of the recursion, that is, we insert T -> True.

nullable :: Pe -> State PWD Bool
nullable pe = case pe of
Eps _  -> pure True
Del _  -> pure True
Nul    -> pure False
Ch _   -> pure False
Or xs  -> or <$> mapM nullable xs x :. y -> (&&) <$> nullable x <*> nullable y
NT s -> maybe (update s) pure . M.lookup s =<< gets nullMemo
where
update s = do
modify $\st -> st { nullMemo = M.insert s False$ nullMemo st }
b <- nullable =<< memoDerive s
modify $\st -> st { nullMemo = M.insert s b$ nullMemo st }
pure b


We wave our hands to prove this works. Since we start by assuming a nonterminal T is not nullable, the only possible pitfall is falsely concluding that a nullable nonterminal is non-nullable.

If the production rule for T contains an expression whose leaf nodes are all nullable terminals in one of its branches, then let’s say T is 1-nullable. The above correctly determines T is nullable. Otherwise if T can reduce to an expression whose leaf nodes at most $$k$$-nullable, then by inductive assumption our algorithm correctly computes the nullability of each of these nodes, and will therefore correctly infer that the $$(k+1)$$-nullable T is also nullable.

## You must remember this

To compute derivatives, we follow the rules given in the paper, memoizing as we go. There is no danger of an infinite loop because our nullability algorithm avoids nodes that have already been seen.

memoDerive :: String -> State PWD Pe
memoDerive cs@(c:s) = maybe update pure =<< M.lookup cs <$> gets grammar where update = do b <- derive c =<< memoDerive s modify$ \st -> st { grammar = M.insert cs b $grammar st } pure b memoDerive _ = error "unreachable" derive :: Char -> Pe -> State PWD Pe derive c pe = case pe of NT s -> pure$ NT $c:s Ch x | x == c -> pure$ Eps x
Or xs            -> Or <$> mapM (derive c) xs Del x :. y -> (Del x :.) <$> derive c y
x :. y           -> do
b <- nullable x
dx <- derive c x
if not b then pure $dx :. y else do dy <- derive c y pure$ Or [dx :. y, Del x :. dy]
_                -> pure Nul


Some algebra might make this more efficient. For example, a Nul branch can be removed, which in turn may simplify an Or expression. On the other hand, laziness could already be doing this for us. We’ll leave it as is, as we want our prototype to be simple.

## See the forest and the trees

The paper’s version of parseNull again computes a least fixed point. I again believe for context-free grammars we can specialize to a simpler algorithm. Our parseForest function memoizes nullable nonterminals to avoid infinite loops: the first time we encounter a nullable nonterminal, we insert a dummy map entry before recursing on its definition. Afterwards we update the map with the correct result.

If we encounter a nonterminal already present in the map, we just return its name without examining its value.

parseForest :: String -> State PWD Bool
parseForest s = do
m <- gets astMemo
case M.lookup s m of
Nothing -> do
pe <- memoDerive s
b <- nullable pe
when b $do modify$ \st -> st { astMemo = M.insert s (Ind s) $astMemo st } t <- maybe (error "nullable bug") id <$> parseNull pe
modify $\st -> st { astMemo = M.insert s t$ astMemo st }
pure b
Just _ -> pure True

parseNull :: Pe -> State PWD (Maybe Ast)
parseNull pe = case pe of
Eps x  -> pure $Just$ Lf x
Del x  -> parseNull x
Nul    -> pure Nothing
Ch _   -> pure Nothing
Or xs  -> uni <$> mapM parseNull xs x :. y -> liftM2 (:@) <$> parseNull x <*> parseNull y
NT s   -> parseForest s >>= \b -> pure $if b then Just$ Ind s
else Nothing

uni :: [Maybe Ast] -> Maybe Ast
uni xs = case concatMap deUni $catMaybes xs of [] -> Nothing [x] -> Just x t -> Just$ Union t
where
deUni (Union a) = a
deUni x         = [x]


The following prints the parse forest in a somewhat comprehensible form. We show the start symbol, followed by a table of definitions for each nullable nonterminal, where each Eps and Del has been expanded to their corresponding expressions. We surround nonterminals with braces to denote shared expressions.

We could clean up a little by inlining, at the cost of obscuring what the algorithm does.

showsForest :: String -> M.Map String Ast -> ShowS
showsForest s m = linkify (s++) . (" where\n"++) . foldr (.) id (line <$> M.assocs m) where line (k, v) = (k++) . (" = "++) . go v . ('\n':) go v = case v of Ind t -> linkify (t++) Union xs -> foldr1 (.)$ intersperse (" | "++) $(go <$> xs)
x :@ y -> go x . (' ':) . go y
Lf c -> (c:)
linkify f = ('{':) . f . ('}':)


We gratuitously show off tying the knot to dispense with the map. In other words, we replace the Ind nodes with native pointers to the Ast nodes they point to. We use the tied version in a routine that selects a random parse tree from the parse forest.

tie :: String -> M.Map String Ast -> Ast
tie start m = aux M.! start where
aux = go <$> m go e = case e of Ind t -> aux M.! t Union xs -> Union$ go <$> xs x :@ y -> go x :@ go y Lf _ -> e showsOne :: (ShowS -> ShowS) -> StdGen -> Ast -> ShowS showsOne paren g ast = case ast of Union xs -> let (n, g1) = randomR (0, length xs - 1) g in showsOne paren g1$ xs !! n
Lf c -> (c:)
x :@ y -> let (gx, gy) = split g
in paren $showsOne id gx x . showsOne addParen gy y _ -> error "knot-tying bug" where addParen s = ('(':) . s . (')':)  ## A grammar for grammars I should be eating my own dogfood and using parsing with derivatives to read the definition of a context-free grammar, but it’ll have to wait until I add more features. For now, we use parser combinators. cfg :: Parsec String () PWD cfg = PWD M.empty M.empty . M.fromList <$> between filler eof (many1 rule)
where
rule = (,) <$> sym <*> between (want "=") (want ";") expr expr = Or <$> cat sepBy1 want "|"
cat = foldl1 (:.) <$> many1 atm atm = str <|> NT <$> sym
sym = many1 alphaNum <* filler
str = strGram <$> between (char '"') (want "\"") (many$ noneOf "\"")
want s = string s <* filler
filler = skipMany $com <|> void space com = try (string "--") >> skipMany (noneOf "\n") strGram :: String -> Pe strGram "" = Eps '\949' strGram s = foldl1 (:.)$ Ch <$> s  For the empty string, we use a custom Eps so it shows up as a curly epsilon in the parse tree. This is cute, but bad for grammars that include curly epsilons! ## Frontend View the HTML source to see the hidden textareas that we harvest below. #ifdef __HASTE__ main :: IO () main = withElems ["grammar", "str", "out", "parse", "forest"]$
\[gEl, sEl, oEl, parseB, fEl] -> do
let
handle demo = do
Just b <- elemById $demo ++ "B" void$ b onEvent Click $const$ preset demo
preset demo = do
Just g <- elemById $demo ++ "G" Just s <- elemById$ demo ++ "S"
setProp oEl "value" ""
setProp fEl "value" ""
setProp gEl "value" =<< getProp g "value"
setProp sEl "value" =<< getProp s "value"
dump (m, s) = do
rnd <- newStdGen
setProp oEl "value" $showsOne id rnd (tie m s) "" setProp fEl "value"$ showsForest m s ""
handle "1+1"
handle "par"
handle "pal"
preset "1+1"
void $parseB onEvent Click$ const $do setProp oEl "value" "" setProp fEl "value" "" mg <- parse cfg "" <$> getProp gEl "value"
case mg of
Left e -> setProp oEl "value" $"error: " ++ show e Right g -> do s <- getProp sEl "value" if M.member "S"$ grammar g
then maybe (setProp oEl "value" "[parse failed]") dump $pwd g "S" s else setProp oEl "value" "missing start symbol S" #else main :: IO () main = print$ pwd g "S" $concat (replicate 39 "1+") ++ "+1" where Right g = parse cfg ""$ unlines
[ "S = T;"
, "T = T \"+\" T | N;"
, "N = \"1\";"
]
#endif


Ben Lynn blynn@cs.stanford.edu 💡