Parsing With Derivatives

Recall a context-free grammar is a set of production rules where each rule describes how a variable (also called a non-terminal) can be replaced with a sequence of variables and strings (also called terminals). For example, the following is a context-free grammar for binary arithmetic expressions:

expr -> expr op expr
expr -> "-" number
expr -> number
expr -> "(" expr ")"
op -> "+"
op -> "-"
op -> "*"
op -> "/"
number -> bit number
number -> bit ""
bit -> "0"
bit -> "1"
In practice, rules for the same variable are collected together and written as a single rule, with a vertical bar (|) separating the possible expansions. Denoting subexpressions with parentheses saves even more space. For example, our above grammar can be written:
expr -> expr op expr | number | "-" number | "(" expr ")"
op -> "+" | "-" | "*" | "/"
number -> bit (number | "")
bit -> "0" | "1"
It is well known that determining whether a string can be produced by a given context-free grammar (the parsing problem) can be solved in cubic time with Earley’s algorithm or the GLR algorithm.
But it is less well-known that it can also be done via derivatives:
- Yacc is dead: a brief explanation of parsing with derivatives.
- Parsing with Derivatives functional pearl: an improved version of the original paper.
- On the Complexity and Performance of Parsing with Derivatives: proves the algorithm has cubic time complexity.
In fact, derivatives are so obscure that some veteran computer scientists deny these facts. Russ Cox claims it’s "easy construct [sic] plausible examples that take exponential time to reject an input":
S -> T
T -> T + T | N
N -> 1
Supposedly, parsing with derivatives takes exponential time to reject inputs of the form "1+1+1+…1+1+1+1".
Is he right? Let’s find out with a little code.
Parsing parsing expressions; a grammar for grammars
We define a data type to represent the right-hand side of a production rule, which we call Pe for parsing expression. Non-empty strings are represented as concatenations of Char values. The empty string is represented by Empty.
data Pe = Var String | Ch Char | Empty | Pe :. Pe | Pe :| Pe
  | Nul | Eps Char | Del Pe
  deriving Show
The purpose of the mysterious data constructors Nul, Eps, and Del will become clear later. They are not needed for representing a grammar.
We whip up a parser for grammars like our example with parser combinators. We store the grammar in a map, keyed by the variable on the left-hand side of a production rule.
jsEval "curl_module('../compiler/Charser.ob')"
jsEval "curl_module('../compiler/Map.ob')"
import Charser
import Map

cfg :: Charser (Map String Pe)
cfg = filler *> (fromList <$> some rule) <* eof where
  rule = (,) <$> sym <*> (want "=" *> expr <* want ";")
  expr = foldl1 (:|) <$> cat `sepBy1` want "|"
  cat = foldl1 (:.) <$> some atm
  atm = str <|> Var <$> sym <|> want "(" *> expr <* want ")"
  sym = some alphaNumChar <* filler
  str = char '"' *> (ngram <$> many (noneOf "\"")) <* want "\""
  want s = string s <* filler
  filler = many $ com <|> oneOf " \n" *> pure ()
  com = string "--" *> some (noneOf "\n") *> pure ()
  ngram = \case
    "" -> Empty
    s -> foldr1 (:.) $ Ch <$> s
For example:
Right calc = parse cfg "" [r|
expr = expr op expr | number | "-" number | "(" expr ")";
op = "+" | "-" | "*" | "/";
number = bit (number | "");
bit = "0" | "1";
|]
The least fixpoint
Now for a brief interlude: abstract mathematics that turns out to have concrete applications.
Suppose we have a bunch of simultaneous Boolean equations with a single variable on the left:
A = A ∨ (B ∧ C) ∨ 0
B = 1 ∨ D
C = C ∨ B
D = E
E = D
Suppose each variable can take one of three values: 0 (false), 1 (true), or ⊥ (undefined). Then we can find a solution as follows:
1. Set all variables to ⊥.
2. For each equation, evaluate the right-hand side, using rules such as 1 ∨ ⊥ = 1 and ⊥ ∧ 0 = 0, along with the latest values of the variables.
3. Did this change any values of the variables? If a 0 changed to 1 or vice versa, then report a contradiction and halt. If ⊥ became 0 or 1 then go back to step 2. Otherwise halt.
This works because the only transitions for each variable assignment that keep the algorithm going are from ⊥ to 0 or from ⊥ to 1. So if there are \(N\) variables, then we halt after at most \(N\) steps.
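To make this concrete, here is a minimal standalone sketch of the naive iteration, run on the example equations. All the names here (or3, and3, naiveFix, step) are hypothetical scaffolding for illustration, not part of the parser we build below: we encode the three values as Maybe Bool with Nothing playing the role of ⊥, and we omit contradiction detection since these equations produce none.

or3 x y
  | x == Just True || y == Just True = Just True
  | x == Just False = y
  | y == Just False = x
  | otherwise = Nothing

and3 x y
  | x == Just False || y == Just False = Just False
  | x == Just True = y
  | y == Just True = x
  | otherwise = Nothing

-- Iterate until nothing changes; each pass re-evaluates every equation.
naiveFix step xs
  | ys == xs = xs
  | otherwise = naiveFix step ys
  where ys = step xs

-- The five example equations, as a function of the current assignment.
demo = naiveFix step $ replicate 5 Nothing where
  step [a, b, c, d, e] =
    [ a `or3` (b `and3` c) `or3` Just False
    , Just True `or3` d
    , c `or3` b
    , e
    , d
    ]

Evaluating demo yields [Just True, Just True, Just True, Nothing, Nothing], matching the least fixpoint described below.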
In mathematical terms, we can define a partial order:
\[\{0, 1, \bot\} \quad\text{with}\quad \bot \le 0,\ \bot \le 1\]
which induces a partial order on the set of possible assignments \( \{0,1,\bot\}^N \); an assignment is less than or equal to another exactly when this is true coordinate-wise. We can think of a lesser assignment as being less defined.
In particular, the \(N\)-tuple where each coordinate is \(\bot\) is less than or equal to every other assignment; it is the bottom of our partially ordered set of assignments.
We represent the special case of contradiction with \(\top\) and define this element to be greater than all other elements.
Our algorithm can be viewed as repeatedly feeding the output of a function \(f\) to itself until there are no more valid changes. In other words, it finds a fixed point of \(f\).
This function \(f\) is monotonic, namely \(x \le y \implies f(x) \le f(y)\), and we start from the least element \(\bot\), so we wind up with a least fixpoint of \(f\).
The element ⊤ is always a fixpoint of \(f\). In general there may exist multiple nontrivial fixpoints. In our above example, the least fixpoint is \((1, 1, 1, \bot, \bot)\), but there are two other nontrivial fixpoints: \((1, 1, 1, 0, 0)\) and \((1, 1, 1, 1, 1)\).
See the Kleene Fixed-Point Theorem. Least fixpoints feature in denotational semantics, though we’ll only need them to solve Boolean equations.
Nothing matters
A parsing expression is nullable if it accepts the empty string.
The following converts a parsing expression to a Boolean formula that evaluates to true exactly when it is nullable. The nullability of an expression can depend on the nullability of the nonterminals it contains, so the Boolean formulas may contain Boolean variables.
data BFor = BCon Bool | BVar String | BFor :\/ BFor | BFor :/\ BFor deriving Show

bfor = \case
  Nul -> BCon False
  Eps _ -> BCon True
  Del _ -> BCon True
  Empty -> BCon True
  Ch _ -> BCon False
  x :| y -> case bfor x of
    BCon True -> BCon True
    BCon False -> bfor y
    _ -> case bfor y of
      BCon True -> BCon True
      BCon False -> bfor x
      _ -> bfor x :\/ bfor y
  x :. y -> case bfor x of
    BCon True -> bfor y
    BCon False -> BCon False
    _ -> case bfor y of
      BCon True -> bfor x
      BCon False -> BCon False
      _ -> bfor x :/\ bfor y
  Var s -> BVar s
Thanks to constant folding, an expression is either true, false, or a bunch of variables connected with logical AND and OR operations. From the rules, we see there is never a need for NOT.
example = [r|
A = "X";
B = "" | "" | "";
C = C | "" | A;
D = E E F;
E = F "X" F | F F;
F = D F | D "" E | E F;
|]

(bfor <$>) <$> parse cfg "" example
We solve the equations by finding the least fixpoint using the algorithm described earlier, but with some optimizations. For each variable, we record the formulas it appears in. When the truth value of a variable has been determined, we only re-evaluate the formulas in which it appears, rather than all formulas.
Re-evaluating a formula is simply constant folding and constant propagation:
reval solns = go where
  go = \case
    BCon b -> BCon b
    BVar v -> case mlookup v solns of
      Nothing -> BVar v
      Just b -> BCon b
    x :\/ y -> case go x of
      BCon True -> BCon True
      BCon False -> go y
      _ -> case go y of
        BCon True -> BCon True
        BCon False -> go x
        _ -> go x :\/ go y
    x :/\ y -> case go x of
      BCon True -> go y
      BCon False -> BCon False
      _ -> case go y of
        BCon True -> go x
        BCon False -> BCon False
        _ -> go x :/\ go y
For example:
reval (fromList [("A", True), ("B", False)]) $ BVar "A" :/\ (BVar "B" :\/ BVar "A")
Our implementation is crude. The deps map associates a Boolean variable v with a plain list of other variables whose nullability formulas contain v somewhere, hence the uniqcons helper for preventing duplicates in a list.
mustLookup k m = maybe (error "mustLookup") id $ mlookup k m

uniqcons x xs
  | elem x xs = xs
  | otherwise = x:xs

leastfix solns bots deps = \case
  [] -> foldr (flip insert False) solns $ keys bots
  v:todo -> case mlookup v solns of
    Just _ -> leastfix solns bots deps todo
    Nothing -> case reval solns $ mustLookup v bots of
      BCon b -> leastfix (insert v b solns) (delete v bots) deps
        $ foldr uniqcons todo $ maybe [] id $ mlookup v deps
      sol -> leastfix solns (insert v sol bots) deps todo
We bundle a few data structures together in the PWD data type:
- The nonterminals whose nullabilities are known.
- A map of nonterminals to the parsing expressions they expand to.
- A list of nonterminals whose nullability we need to figure out.
data PWD = PWD
  { nulls :: Map String Bool
  , derivs :: Map String Pe
  , inbox :: [String]
  } deriving Show
The following computes the nullability of each variable in a given context-free grammar. After computing the least fixpoint, any variables that remain undefined are deemed non-nullable: they correspond to production rules that never terminate, except along branches that lead to non-nullable terminals.
bvars = \case
  BCon _ -> []
  BVar s -> [s]
  x :\/ y -> bvars x `union` bvars y
  x :/\ y -> bvars x `union` bvars y

solve pwd = pwd
  { nulls = leastfix (nulls pwd) bots deps $ inbox pwd
  , inbox = []
  } where
  bots = foldr (\v -> insert v $ bfor $ mustLookup v $ derivs pwd) Tip $ inbox pwd
  deps = foldr addDeps Tip $ toAscList bots
  addDeps (v, t) m = foldr ($) m $ flip (insertWith uniqcat1) [v] <$> bvars t
  uniqcat1 [x] xs = uniqcons x xs

initNulls m = solve $ PWD
  { nulls = Tip
  , derivs = m
  , inbox = keys m
  }
In our next example, only D is nullable:
nulls . initNulls <$> parse cfg "" [r| A = A | "x"; B = C; C = B; D = D | ""; |]
We can also completely solve our example from above:
nulls . initNulls <$> parse cfg "" example
Derivative Work
We write \(D_c f\) for the derivative of a parsing expression \(f\) with respect to a character \(c.\)
We provisionally define \(D_c f\) to be a parsing expression \(f'\) that accepts a string \(s\) if and only if \(f\) accepts \(c : s\).
For example, since the empty string rejects any character, for all \(c\) we have \(D_c \varepsilon = \emptyset\), where \(\varepsilon\) denotes the empty string and \(\emptyset\) denotes a parsing expression that accepts no strings, which is Nul in our code.
Naturally, we get:
\[ D_c (f | g) = D_c f | D_c g \]
What about \(D_c x\) for a character \(x\)? If \(c \ne x\) then obviously \(D_c x = \emptyset\).
But if \(c = x\), although \(D_c x = \varepsilon\) is a reasonable answer, in practice, solving the parsing problem is not enough; we want to return at least one parse tree, and ideally the entire parse forest.
Thus we tag each \(\varepsilon\) with the character that just got eaten, which will later allow us to construct parse trees:
\[ D_c c = \varepsilon_c \]
We represent this with the Eps constructor in our code.
Similarly, if \(f\) is nullable, we have:
\[D_c (f . g) = D_c f . g | \delta_f D_c g \]
where \(\delta_f\) is like \(\varepsilon\) except it is tagged with a parsing expression, which again later allows us to construct parse trees. We represent this with the Del constructor in our code.
If \(f\) is non-nullable then we simply have:
\[D_c (f . g) = D_c f . g \]
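These rules are straightforward to transcribe for parsing expressions that contain no variables. The following standalone sketch is hypothetical and only for illustration; the real derive function below must additionally handle recursion through nonterminals, which is where the complications begin.

-- Nullability for the Var-free fragment of Pe.
nullablePe = \case
  Empty -> True
  Eps _ -> True
  Del _ -> True
  x :| y -> nullablePe x || nullablePe y
  x :. y -> nullablePe x && nullablePe y
  _ -> False

-- The derivative rules above, for the Var-free fragment of Pe.
dc c = \case
  Ch x | x == c -> Eps x
  x :| y -> dc c x :| dc c y
  x :. y
    | nullablePe x -> (dc c x :. y) :| (Del x :. dc c y)
    | otherwise -> dc c x :. y
  _ -> Nul

For example, dc '0' (Ch '0' :. Ch '1') yields Eps '0' :. Ch '1'.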
The rules for computing derivatives are conceptually simple, but consider a production rule like:
S = S | "x";
Then:
\[ D_c S = D_c S | D_c \text{x} \]
and we see the derivative \(D_c S\) depends on itself. In general, there could be circular dependencies among multiple derivatives.
We maintain a map of all derivatives we have computed. Initially, it holds a context-free grammar: the keys are variable names and the values are the parsing expressions they expand to. For a key s and a character c, we store the derivative of its parsing expression with respect to c at the key c:s, which is convenient but can cause collisions. We carefully write our demos to avoid them. For example, we might stick to capital letters for variables and never use them as constants on the right-hand side.
When computing a derivative of a variable f, if f is already in the map, then we simply return f. Otherwise, we insert a placeholder undefined value for f and add f to the inbox list before traversing the parsing expression corresponding to f. Eventually, this finds all dependencies of the derivatives of f without getting caught in an infinite loop.
Next, we determine the nullabilities of f and its dependencies, so that we can derive the derivatives we just computed.
We can probably simplify the code. Rather than repeated solve-derive cycles, perhaps we could mix it all up in code that resembles dynamic programming with memoization. But my gut feeling is it’s faster to solve nullability in waves.
derive :: Char -> String -> State PWD Pe
derive c s = do
  pwd <- get
  case mlookup (c:s) $ derivs pwd of
    Just _ -> pure ()
    Nothing -> do
      put $ pwd
        { derivs = insert (c:s) undefined $ derivs pwd
        , inbox = (c:s) : inbox pwd
        }
      d <- go $ mustLookup s $ derivs pwd
      pwd <- get
      put $ pwd { derivs = insert (c:s) d $ derivs pwd }
  pure $ Var $ c:s
  where
  nullable pe = do
    pwd <- get
    case reval (nulls pwd) $ bfor pe of
      BCon b -> pure b
      _ -> error "unreachable"
  go = \case
    Var t -> derive c t
    Ch x | x == c -> pure $ Eps x
    x :| y -> (#|) <$> go x <*> go y
    Del x :. y -> (Del x #.) <$> go y
    x :. y -> do
      b <- nullable x
      dx <- go x
      if not b then pure $ dx #. y else do
        dy <- go y
        pure $ (dx #. y) #| (Del x #. dy)
    _ -> pure Nul

(#.) = \cases
  Nul y -> Nul
  x Nul -> Nul
  x y -> x :. y

(#|) = \cases
  Nul y -> y
  x Nul -> x
  x y -> x :| y
To determine if a grammar accepts a given string, we repeatedly derive the start symbol with respect to each successive character of the string. The resulting parsing expression is nullable if and only if the grammar accepts the string.
Let’s call its corresponding nonterminal the "final" symbol.
deriveString m start s = foldl go (start, initNulls m) s where
  go (key, pwd) c = (c:key, solve $ snd $ runState (derive c key) pwd)
For example, the following grammar accepts palindromic strings of bits of even length (a language beyond the capabilities of DPDAs).
Right epal = parse cfg "" [r|S = "0" S "0" | "1" S "1" | "";|]

deriveString epal "S" "1001"
It checks out. The nullability map takes "1001S" to True. Recall this string is how we represent \(D_1 D_0 D_0 D_1 S\), and since this parsing expression is nullable, we know the grammar accepts "1001". More generally, we see all palindromic strings are accepted and non-palindromic strings are rejected.
Its parse forest is encoded in the messy-looking parsing expressions in the second map.
Trimming trees
We write a function that preserves only the nullable parts of the final parsing expression:
vars = \case
  Var v -> [v]
  x :. y -> vars x ++ vars y
  x :| y -> vars x ++ vars y
  _ -> []

filterNull pwd final = mgo final Tip where
  mgo k m = case mlookup k m of
    Nothing -> case go $ mustLookup k $ derivs pwd of
      Nothing -> m
      Just ast -> foldr mgo (insert k ast m) $ vars ast
    _ -> m
  go = \case
    Nul -> Nothing
    Ch _ -> Nothing
    Empty -> Just Empty
    Eps c -> Just $ Eps c
    Del x -> go x
    x :| y -> go x `mkOr` go y
    x :. y -> (:.) <$> go x <*> go y
    Var s -> if mustLookup s $ nulls pwd then Just $ Var s else Nothing
  mkOr = \cases
    x Nothing -> x
    Nothing y -> y
    (Just x) (Just y) -> Just $ x :| y
We write a pretty printer for the filtered expressions:
pretty p = \case
  Eps c -> (c:)
  Empty -> ("\xce\xb5"++)
  x :. y -> pretty 1 x . pretty 1 y
  x :| y -> showParen (p == 1) $ pretty 0 x . (" | "++) . pretty 0 y
  Var s -> ('<':) . (s++) . ('>':)

prettyDemo g start s = putStr $ case toAscList $ filterNull pwd final of
  [] -> "parse failed\n"
  ds -> "<" ++ final ++ ">\nwhere:\n" ++ unlines (showDefn <$> ds)
 where
  showDefn (k, v) = "<" ++ k ++ "> = " ++ pretty 0 v ""
  (final, pwd) = deriveString g start s
For example:
prettyDemo epal "S" "11011011"
For ambiguous grammars, we print all possible values of all variables:
let Right g = parse cfg "" [r|S = S "+" S | "1";|] in prettyDemo g "S" "1+1+1+1"
We enumerate all members of a parse forest by going through all combinations. To distinguish between the parse trees, we parenthesize nontrivial variable expansions:
forest m k = go 0 $ mustLookup k m where
  go p = \case
    x :. y -> (.) <$> go 0 x <*> go 0 y
    x :| y -> go 0 x ++ go 0 y
    Var s -> showParen (p == 0) <$> go 1 (mustLookup s m)
    Eps c -> pure (c:)
    Empty -> pure ("\xce\xb5"++)

let Right g = parse cfg "" [r|S = S "+" S | "1";|]
    (final, pwd) = deriveString g "S" "1+1+1+1"
in putStr $ unlines $ ($"") <$> forest (filterNull pwd final) final
We write a function that chooses a random parse tree. The distribution is not uniform: we just flip a coin each time we encounter an (:|) node while traversing the parsing expression.
parseFrac = uncurry (/) . foldl (\(a, b) d -> (a*10 + fromIntegral (ord d - ord '0'), b*10)) (0.0, 1)

jsrandom = parseFrac . drop 2 <$> jsEval "Math.random()"

sample g start s = putStrLn . ($"") =<< go 0 (mustLookup final m) where
  go p = \case
    Eps c -> pure (c:)
    Empty -> pure ("\xce\xb5"++)
    Var s -> showParen (p == 0) <$> go 1 (mustLookup s m)
    x :. y -> (.) <$> go 0 x <*> go 0 y
    x :| y -> do
      b <- (<1) . (2*) <$> jsrandom
      go 0 (if b then x else y)
  (final, pwd) = deriveString g start s
  m = filterNull pwd final
For example, here’s a sample parse tree from our first example grammar:
sample calc "expr" "11*(10+100)/1"
Our next example is a grammar that accepts matching pairs of parentheses (a language beyond the capabilities of DFAs). The sample should be unique in this case:
Right catalan = parse cfg "" [r|S = "[" S "]" S | "";|]

sample catalan "S" "[[[]][][]][]"
Believe it or not
The moment of truth has arrived. We start with the following grammar:
Right coxEx = parse cfg "" [r|
S = T;
T = T "+" T | N;
N = "1";
|]
Russ Cox alleges that parsing with derivatives takes exponential time to reject the following input:
prettyDemo coxEx "S" "1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1++1"
He also alleges it takes a long time to go far beyond the first few parse trees:
sample coxEx "S" "1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1"
Despite the questionable quality of our code (and our compiler), we still manage to reject the first input and print a random parse tree for the second.
Why Russ Cox Is Wrong
Consider the following grammar with start symbol A:
A = ("0" | "1") B
B = ("0" | "1") C
C = ("0" | "1")
Even though there are only 3 rules, this grammar accepts \(2^3 = 8\) different strings.
More generally, we can construct similar grammars with \(N\) rules that accept \(2^N\) different strings. In other words, a data structure whose size is polynomial in \(N\) can represent a parse forest whose size is exponential in \(N\).
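To make the construction concrete, here is a hypothetical helper (expGrammar is our name; it is not used elsewhere on this page) that emits such a grammar for any \(N\). For \(N = 3\) it reproduces the grammar above, and its output can be fed straight to parse cfg "".

-- Build N rules, each consuming one bit, so the grammar accepts 2^N strings.
expGrammar n = unlines $ zipWith rule vs (tail vs ++ [""]) where
  vs = [[v] | v <- take n ['A'..]]
  rule v "" = v ++ " = \"0\" | \"1\";"
  rule v w = v ++ " = (\"0\" | \"1\") " ++ w ++ ";"

For example, parse cfg "" (expGrammar 26) builds a grammar accepting \(2^{26}\) strings from just 26 rules.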
Parsing with derivatives exhibits this phenomenon, which appears frequently in computer science under many different names: sharing, hash consing, lazy evaluation, automatic differentiation, …
We view the algorithm as adding rules to the grammar as we parse the input. The final parsing expression may represent a parse forest of exponential size, but the expression itself remains polynomial:
prettyDemo coxEx "S" "1+1+1+1+1"
If the input is rejected, our final symbol expands to Nul, that is, it accepts the empty language. For some reason, Cox falsely accuses the algorithm of backtracking here.
Returning a parse result is also efficient. We simply commit to one of the alternatives whenever a choice is offered while traversing the final parsing expression. Again, no backtracking.
Cox also fusses about ambiguity. This indeed is an important issue in language design, so it is strange that Cox condemns derivatives on these grounds. Firstly, parsing with derivatives trivially detects ambiguity: are there "|" operators in the final parsing expression? Secondly, common resolution strategies are also trivial. Want ambiguous infix operators to be left-associative? Then pick the first alternative each time.
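As a sketch of that last strategy, here is a hypothetical leftmost variant of our sample function that always commits to the first alternative of each (:|) node; it assumes the parse succeeded.

-- Deterministic cousin of `sample`: always pick the first alternative.
leftmost g start s = putStrLn $ go 0 (mustLookup final m) "" where
  go p = \case
    Eps c -> (c:)
    Empty -> ("\xce\xb5"++)
    Var v -> showParen (p == 0) $ go 1 $ mustLookup v m
    x :. y -> go 0 x . go 0 y
    x :| _ -> go 0 x
  (final, pwd) = deriveString g start s
  m = filterNull pwd final

For example, leftmost coxEx "S" "1+1+1+1+1" prints one parse tree with no coin flips.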
At a higher level, Cox could have avoided hallucinating by tinkering with code like we just did, instead of penning a post riddled with errors. Could it be that Cox shied away from implementing parsing with derivatives because he was unfamiliar with languages in which this algorithm is easy to express? If so, it’s concerning that Cox was in charge of designing a programming language for a while. Does this mean entire classes of algorithms are difficult for many computer scientists to implement, or even comprehend?
Why I Was Wrong
An earlier version of this webpage claimed that we could avoid finding the least fixpoint when determining nullability. This happened to be true for the example grammars on this page, but my statement was false in general. A grammar with sufficient mutual recursion would cause my original implementation to fail. Realizing this compelled me to roll up my sleeves and write a routine to find a least fixpoint.
Of course, I have no idea if there are any remaining bugs!