Parsing With Derivatives

Define a context-free grammar with start symbol "S", enter an input string, then hit the "Parse" button. We randomly pick one tree from the parse forest to display; clicking the button again rerolls the dice.

Concatenation is right-associative, e.g. "abc" means "a(bc)".

Large inputs fail because of relatively low browser limits, a relatively immature transpiler, and our prototype-quality code.

context-free grammar:

input string:

See:

The source code

We define a Pe data structure to represent parsing expressions, that is, the right-hand side of the production rules of a grammar.

{-# LANGUAGE CPP #-}
#ifdef __HASTE__
{-# LANGUAGE PackageImports #-}
#endif
import Control.Arrow
#ifdef __HASTE__
import "mtl" Control.Monad.State
import Haste.DOM
import Haste.Events
#else
import Control.Monad.State
#endif
import Data.Maybe
import qualified Data.Map as M
import qualified Data.Set as S
import System.Random
import Text.Parsec hiding (State)

-- NT = non-terminal. (:.) = concatenation.
data Pe = NT String | Eps Char | Nul | Ch Char | Or [Pe] | Pe :. Pe | Del Pe

Although it represents the empty string, the Eps (for epsilon) expression holds a character that winds up in the abstract syntax trees (AST) returned by the parser. Similarly, the Del (for delta) expression, which is only generated internally, holds an expression which later helps build ASTs.

A context-free grammar maps non-terminal symbols to parsing expressions:

type Grammar = M.Map String Pe

Our ASTs are full binary trees whose leaf nodes are characters (the free magma on the alphabet). The tree structure captures the order the production rules are applied.

We store a parse forest, that is, all possible ASTs. There may be exponentially many, but they are related in such a way that we can store them compactly.

-- `Uni` stands for "union".
data Ast = Lf Char | Ast :@ Ast | Uni [Ast] deriving Show

A fancier parser might return a parse forest, that is, all possible parse trees for a given input. Ours will simply pick one possible parse tree.

Derivative work

To parse an input string, we first take successive derivatives of the start symbol with respect to each character of the input, taking care to leave bread crumbs in the Eps and Del expressions to record consumed characters. (The Del constructor is named for the delta symbol from the paper, but I also think of it as "deleted", because it remembers what has just been deleted from the input.)

Then the string is accepted if and only if the resulting expression is nullable, that is, accepts the empty string. As we traverse the expression to determine nullability, we also build ASTs to return.

We memoize derivatives by adding entries to a state of type Grammar. Initially, this cache contains only the input grammar, mapping nonterminal symbols to Pe values. Later, we place a derivative at the key formed by concatenating the characters involved in the derivative with the nonterminal symbol being derived.

For example, if S is a nonterminal symbol in the input grammar, then the key abS maps to derive 'a' (derive 'b' (NT "S")). We assume no nonterminal symbol in the input grammar is a suffix of any other nonterminal symbol, which is fine for a prototype.

It may help to imagine the grammar growing over time, gaining new production rules as we process input characters. Indeed, we consider nonterminals to refer to both nonterminals in the input grammar as well as their derivatives.

pwd :: Grammar -> String -> String -> Maybe Ast
pwd g start s = evalState (parseNull $ NT $ reverse s ++ start) g

Nothing matters

Computing nullability requires finding a least fixed point. I found this the toughest part of the algorithm, partly because they never taught fixed point theory when I was in school. For some reason, the method reminds me of Hopcroft’s algorithm to minimize a DFA, where we repeately refine a partition until we reach a stable answer.

We initially guess that each nonterminal is not nullable, which means it corresponds to Nothing. On encountering a nonterminal, if we’ve already seen it, then return our guess for that nonterminal. Otherwise, it’s the first time we’ve seen it and instead of guessing, we recursively traverse its corresponding expression. In doing so, we may discover our guess is wrong, so we correct it if necessary before returning ASTs.

We repeat until our guesses have stabilized. Guesses never change from something to Nothing, and the map of all guesses only changes if one of them changes from Nothing to something. We exploit these facts to simplify our code slightly.

parseNull :: Pe -> State Grammar (Maybe Ast)
parseNull pe = leastFix M.empty where
  leastFix guessed = do
    (b, (_, guessed')) <- runStateT (visit pe) (S.empty, guessed)
    if M.size guessed == M.size guessed' then pure b else leastFix guessed'

visit :: Pe
      -> StateT (S.Set String, M.Map String Ast) (State Grammar) (Maybe Ast)
visit pe = case pe of
  Eps x  -> pure $ Just $ Lf x
  Del x  -> visit x
  Nul    -> pure $ Nothing
  Ch _   -> pure $ Nothing
  Or xs  -> uni <$> mapM visit xs
  x :. y -> mul <$> visit x <*> visit y
  NT s -> do
    (seen, guessed) <- get
    case () of
      () | Just x <- M.lookup s guessed -> pure $ Just x
         | S.member s seen -> pure Nothing
         | otherwise -> do
           modify $ first $ S.insert s
           b <- visit =<< lift (memoDerive s)
           maybe (pure ()) (modify . second . M.insert s) b
           pure b

mul :: Maybe Ast -> Maybe Ast -> Maybe Ast
mul x y = (:@) <$> x <*> y

uni :: [Maybe Ast] -> Maybe Ast
uni xs | null xs'   = Nothing
       | [x] <- xs' = Just x
       | otherwise  = Just $ Uni xs'
       where xs' = concatMap deUni $ catMaybes xs
             deUni (Uni a) = a
             deUni x       = [x]

We add a lightweight variant of the above that returns a boolean instead of a parse forest.

nullable :: Pe -> State Grammar Bool
nullable pe = leastFix S.empty where
  leastFix guessed = do
    (b, (_, guessed')) <- runStateT (popIn pe) (S.empty, guessed)
    if S.size guessed == S.size guessed' then pure b else leastFix guessed'
  -- Like `visit`, but faster.
  popIn :: Pe -> StateT (S.Set String, S.Set String) (State Grammar) Bool
  popIn e = case e of
    Eps _  -> pure True
    Del _  -> pure True
    Nul    -> pure False
    Ch _   -> pure False
    Or xs  -> or <$> mapM popIn xs
    x :. y -> (&&) <$> popIn x <*> popIn y
    NT s -> do
      (seen, guessed) <- get
      case () of
        () | S.member s guessed -> pure True
           | S.member s seen -> pure False
           | otherwise -> do
             modify $ first $ S.insert s
             b <- popIn =<< lift (memoDerive s)
             when b $ modify $ second $ S.insert s
             pure b

You must remember this

Memoized derivatives are straightforward. For computing derivatives, we translate the rules given in the paper, and for memoization, on discovering a missing entry, we insert a knot-tying value before recursing, and replace it with the result of the recursion afteward.

memoDerive :: String -> State Grammar Pe
memoDerive cs@(c:s) = do
  m <- get
  unless (M.member cs m) $ do
    modify $ M.insert cs $ NT cs
    d <- derive c =<< memoDerive s
    modify $ M.insert cs d
  gets (M.! cs)
memoDerive _ = error "unreachable"

derive :: Char -> Pe -> State Grammar Pe
derive c pe = case pe of
  NT s             -> pure $ NT $ c:s
  Ch x | x == c    -> pure $ Eps x
  Or xs            -> Or <$> mapM (derive c) xs
  Del x :. y       -> (Del x :.) <$> derive c y
  x :. y           -> do
    b <- nullable x
    dx <- derive c x
    if not b then pure $ dx :. y else do
      dy <- derive c y
      pure $ Or [dx :. y, Del x :. dy]
  _                -> pure Nul

A grammar for grammars

I should be eating my own dogfood and using parsing with derivatives to read the definition of a context-free grammar, but it’ll have to wait until I add more features. For now, we use parser combinators.

cfg :: Parsec String () Grammar
cfg = M.fromList <$> between filler eof (many1 rule) where
  rule = (,) <$> sym <*> between (want "=") (want ";") expr
  expr = Or <$> cat `sepBy1` want "|"
  cat = foldr1 (:.) <$> many1 atm
  atm = str <|> (NT <$> sym)
  sym = many1 alphaNum <* filler
  str = do
    s <- between (char '"') (want "\"") $ many $ noneOf "\""
    pure $ if null s then Eps '\949' else foldr1 (:.) $ Ch <$> s
  want :: String -> Parsec String () ()
  want s = string s >> filler
  filler = skipMany $ com <|> void space
  com = try (string "--") >> skipMany (noneOf "\n")

One of a kind

Our Ast data structure holds a parse forest, which in general may be too large to show. Hence this function that shows a tree at random.

showOne :: (String -> String) -> StdGen -> Ast -> String
showOne paren g ast = case ast of
  Uni xs -> let (n, g1) = randomR (0, length xs - 1) g
    in showOne paren g1 $ xs !! n
  Lf c -> [c]
  x :@ y -> let (gx, gy) = split g
    in paren $ showOne addParen gx x ++ showOne id gy y
  where addParen s = concat ["(", s, ")"]

Frontend

View the HTML source to see the hidden textareas that we harvest below.

#ifdef __HASTE__
main :: IO ()
main = withElems ["grammar", "str", "out", "parse"] $
    \[gEl, sEl, oEl, parseB] -> do
  let
    handle demo = do
      Just b <- elemById $ demo ++ "B"
      void $ b `onEvent` Click $ const $ preset demo
    preset demo = do
      Just g <- elemById $ demo ++ "G"
      Just s <- elemById $ demo ++ "S"
      setProp oEl "value" ""
      setProp gEl "value" =<< getProp g "value"
      setProp sEl "value" =<< getProp s "value"
  handle "1+1"
  handle "par"
  handle "pal"
  preset "1+1"
  void $ parseB `onEvent` Click $ const $ do
    setProp oEl "value" ""
    mg <- parse cfg "" <$> getProp gEl "value"
    case mg of
      Left e -> setProp oEl "value" $ "error: " ++ show e
      Right g -> do
        s <- getProp sEl "value"
        rnd <- newStdGen
        setProp oEl "value" $ if M.member "S" g
          then maybe "[parse failed]" (showOne id rnd) $ pwd g "S" s
          else "missing start symbol S"
#else
main :: IO ()
main = print $ pwd g "S" $ concat (replicate 39 "1+") ++ "+1" where
  Right g = parse cfg "" $ unlines
    [ "S = T;"
    , "T = T \"+\" T | N;"
    , "N = \"1\";"
    ]
#endif

Ben Lynn blynn@cs.stanford.edu 💡