{-# LANGUAGE CPP #-}
#ifdef __HASTE__
{-# LANGUAGE PackageImports #-}
#endif
import Control.Arrow
#ifdef __HASTE__
import "mtl" Control.Monad.State
import Haste.DOM
import Haste.Events
#else
import Control.Monad.State
#endif
import Data.Maybe
import qualified Data.Map as M
import qualified Data.Set as S
import System.Random
import Text.Parsec hiding (State)
-- NT = non-terminal. (:.) = concatenation.
data Pe = NT String | Eps Char | Nul | Ch Char | Or [Pe] | Pe :. Pe | Del Pe
Parsing With Derivatives
Define a context-free grammar with start symbol "S", enter an input string, then hit the "Parse" button. We randomly pick one tree from the parse forest to display; clicking the button again rerolls the dice.
Concatenation is right-associative, e.g. "abc" means "a(bc)".
Large inputs fail because of relatively low browser limits, a relatively immature transpiler, and our prototype-quality code.
context-free grammar:
input string:
See:
-
Yacc is dead: a brief explanation of parsing with derivatives.
-
Parsing with Derivatives functional pearl: improved version of original paper.
-
On the Complexity and Performance of Parsing with Derivatives: proves the algorithm has cubic time complexity.
The source code
We define a Pe data structure to represent parsing expressions, that is, the right-hand side of the production rules of a grammar.
Although it represents the empty string, the Eps (for epsilon) expression holds a character that winds up in the abstract syntax trees (AST) returned by the parser. Similarly, the Del (for delta) expression, which is only generated internally, holds an expression which later helps build ASTs.
A context-free grammar maps non-terminal symbols to parsing expressions:
type Grammar = M.Map String Pe
Our ASTs are full binary trees whose leaf nodes are characters (the free magma on the alphabet). The tree structure captures the order the production rules are applied.
We store a parse forest, that is, all possible ASTs. There may be exponentially many, but they are related in such a way that we can store them compactly.
-- `Uni` stands for "union".
data Ast = Lf Char | Ast :@ Ast | Uni [Ast] deriving Show
A fancier parser might return a parse forest, that is, all possible parse trees for a given input. Ours will simply pick one possible parse tree.
Derivative work
To parse an input string, we first take successive derivatives of the start symbol with respect to each character of the input, taking care to leave bread crumbs in the Eps and Del expressions to record consumed characters. (The Del constructor is named for the delta symbol from the paper, but I also think of it as "deleted", because it remembers what has just been deleted from the input.)
Then the string is accepted if and only if the resulting expression is nullable, that is, accepts the empty string. As we traverse the expression to determine nullability, we also build ASTs to return.
We memoize derivatives by adding entries to a state of type Grammar. Initially, this cache contains only the input grammar, mapping nonterminal symbols to Pe values. Later, we place a derivative at the key formed by concatenating the characters involved in the derivative with the nonterminal symbol being derived.
For example, if S is a nonterminal symbol in the input grammar, then the key abS maps to derive 'a' (derive 'b' (NT "S")). We assume no nonterminal symbol in the input grammar is a suffix of any other nonterminal symbol, which is fine for a prototype.
It may help to imagine the grammar growing over time, gaining new production rules as we process input characters. Indeed, we consider nonterminals to refer to both nonterminals in the input grammar as well as their derivatives.
pwd :: Grammar -> String -> String -> Maybe Ast
pwd g start s = evalState (parseNull $ NT $ reverse s ++ start) g
Nothing matters
Computing nullability requires finding a least fixed point. I found this the toughest part of the algorithm, partly because they never taught fixed point theory when I was in school. For some reason, the method reminds me of Hopcroft’s algorithm to minimize a DFA, where we repeately refine a partition until we reach a stable answer.
We initially guess that each nonterminal is not nullable, which means it corresponds to Nothing. On encountering a nonterminal, if we’ve already seen it, then return our guess for that nonterminal. Otherwise, it’s the first time we’ve seen it and instead of guessing, we recursively traverse its corresponding expression. In doing so, we may discover our guess is wrong, so we correct it if necessary before returning ASTs.
We repeat until our guesses have stabilized. Guesses never change from something to Nothing, and the map of all guesses only changes if one of them changes from Nothing to something. We exploit these facts to simplify our code slightly.
parseNull :: Pe -> State Grammar (Maybe Ast)
parseNull pe = leastFix M.empty where
leastFix guessed = do
(b, (_, guessed')) <- runStateT (visit pe) (S.empty, guessed)
if M.size guessed == M.size guessed' then pure b else leastFix guessed'
visit :: Pe
-> StateT (S.Set String, M.Map String Ast) (State Grammar) (Maybe Ast)
visit pe = case pe of
Eps x -> pure $ Just $ Lf x
Del x -> visit x
Nul -> pure $ Nothing
Ch _ -> pure $ Nothing
Or xs -> uni <$> mapM visit xs
x :. y -> mul <$> visit x <*> visit y
NT s -> do
(seen, guessed) <- get
case () of
() | Just x <- M.lookup s guessed -> pure $ Just x
| S.member s seen -> pure Nothing
| otherwise -> do
modify $ first $ S.insert s
b <- visit =<< lift (memoDerive s)
maybe (pure ()) (modify . second . M.insert s) b
pure b
mul :: Maybe Ast -> Maybe Ast -> Maybe Ast
mul x y = (:@) <$> x <*> y
uni :: [Maybe Ast] -> Maybe Ast
uni xs | null xs' = Nothing
| [x] <- xs' = Just x
| otherwise = Just $ Uni xs'
where xs' = concatMap deUni $ catMaybes xs
deUni (Uni a) = a
deUni x = [x]
We add a lightweight variant of the above that returns a boolean instead of a parse forest.
nullable :: Pe -> State Grammar Bool
nullable pe = leastFix S.empty where
leastFix guessed = do
(b, (_, guessed')) <- runStateT (popIn pe) (S.empty, guessed)
if S.size guessed == S.size guessed' then pure b else leastFix guessed'
-- Like `visit`, but faster.
popIn :: Pe -> StateT (S.Set String, S.Set String) (State Grammar) Bool
popIn e = case e of
Eps _ -> pure True
Del _ -> pure True
Nul -> pure False
Ch _ -> pure False
Or xs -> or <$> mapM popIn xs
x :. y -> (&&) <$> popIn x <*> popIn y
NT s -> do
(seen, guessed) <- get
case () of
() | S.member s guessed -> pure True
| S.member s seen -> pure False
| otherwise -> do
modify $ first $ S.insert s
b <- popIn =<< lift (memoDerive s)
when b $ modify $ second $ S.insert s
pure b
You must remember this
Memoized derivatives are straightforward. For computing derivatives, we translate the rules given in the paper, and for memoization, on discovering a missing entry, we insert a knot-tying value before recursing, and replace it with the result of the recursion afteward.
memoDerive :: String -> State Grammar Pe
memoDerive cs@(c:s) = do
m <- get
unless (M.member cs m) $ do
modify $ M.insert cs $ NT cs
d <- derive c =<< memoDerive s
modify $ M.insert cs d
gets (M.! cs)
memoDerive _ = error "unreachable"
derive :: Char -> Pe -> State Grammar Pe
derive c pe = case pe of
NT s -> pure $ NT $ c:s
Ch x | x == c -> pure $ Eps x
Or xs -> Or <$> mapM (derive c) xs
Del x :. y -> (Del x :.) <$> derive c y
x :. y -> do
b <- nullable x
dx <- derive c x
if not b then pure $ dx :. y else do
dy <- derive c y
pure $ Or [dx :. y, Del x :. dy]
_ -> pure Nul
A grammar for grammars
I should be eating my own dogfood and using parsing with derivatives to read the definition of a context-free grammar, but it’ll have to wait until I add more features. For now, we use parser combinators.
cfg :: Parsec String () Grammar
cfg = M.fromList <$> between filler eof (many1 rule) where
rule = (,) <$> sym <*> between (want "=") (want ";") expr
expr = Or <$> cat `sepBy1` want "|"
cat = foldr1 (:.) <$> many1 atm
atm = str <|> (NT <$> sym)
sym = many1 alphaNum <* filler
str = do
s <- between (char '"') (want "\"") $ many $ noneOf "\""
pure $ if null s then Eps '\949' else foldr1 (:.) $ Ch <$> s
want :: String -> Parsec String () ()
want s = string s >> filler
filler = skipMany $ com <|> void space
com = try (string "--") >> skipMany (noneOf "\n")
One of a kind
Our Ast data structure holds a parse forest, which in general may be too large to show. Hence this function that shows a tree at random.
showOne :: (String -> String) -> StdGen -> Ast -> String
showOne paren g ast = case ast of
Uni xs -> let (n, g1) = randomR (0, length xs - 1) g
in showOne paren g1 $ xs !! n
Lf c -> [c]
x :@ y -> let (gx, gy) = split g
in paren $ showOne addParen gx x ++ showOne id gy y
where addParen s = concat ["(", s, ")"]
Frontend
View the HTML source to see the hidden textareas that we harvest below.
#ifdef __HASTE__
main :: IO ()
main = withElems ["grammar", "str", "out", "parse"] $
\[gEl, sEl, oEl, parseB] -> do
let
handle demo = do
Just b <- elemById $ demo ++ "B"
void $ b `onEvent` Click $ const $ preset demo
preset demo = do
Just g <- elemById $ demo ++ "G"
Just s <- elemById $ demo ++ "S"
setProp oEl "value" ""
setProp gEl "value" =<< getProp g "value"
setProp sEl "value" =<< getProp s "value"
handle "1+1"
handle "par"
handle "pal"
preset "1+1"
void $ parseB `onEvent` Click $ const $ do
setProp oEl "value" ""
mg <- parse cfg "" <$> getProp gEl "value"
case mg of
Left e -> setProp oEl "value" $ "error: " ++ show e
Right g -> do
s <- getProp sEl "value"
rnd <- newStdGen
setProp oEl "value" $ if M.member "S" g
then maybe "[parse failed]" (showOne id rnd) $ pwd g "S" s
else "missing start symbol S"
#else
main :: IO ()
main = print $ pwd g "S" $ concat (replicate 39 "1+") ++ "+1" where
Right g = parse cfg "" $ unlines
[ "S = T;"
, "T = T \"+\" T | N;"
, "N = \"1\";"
]
#endif