Parsing With Derivatives

Recall a context-free grammar is a set of production rules where each rule describes how a variable (also called a non-terminal) can be replaced with a sequence of variables and strings (also called terminals). For example, the following is a context-free grammar for binary arithmetic expressions:

expr -> expr op expr
expr -> "-" number
expr -> number
expr -> "(" expr ")"
op -> "+"
op -> "-"
op -> "*"
op -> "/"
number -> bit number
number -> bit ""
bit -> "0"
bit -> "1"
In practice, rules for the same variable are collected together and written as a single rule, with a vertical bar (|) separating the possible expansions. Denoting subexpressions with parentheses saves even more space. For example, our above grammar can be written:
expr -> expr op expr | number | "-" number | "(" expr ")"
op -> "+" | "-" | "*" | "/"
number -> bit (number | "")
bit -> "0" | "1"
It is well known that determining whether a string can be produced by a given context-free grammar (the parsing problem) can be solved in cubic time with Earley’s algorithm or the GLR algorithm.
But it is less well-known that it can also be done via derivatives:
- Yacc is dead: a brief explanation of parsing with derivatives.
- Parsing with Derivatives functional pearl: an improved version of the original paper.
- On the Complexity and Performance of Parsing with Derivatives: proves the algorithm has cubic time complexity.
In fact, derivatives are so obscure that some veteran computer scientists deny these facts. Russ Cox claims it’s "easy construct [sic] plausible examples that take exponential time to reject an input":
S -> T
T -> T + T | N
N -> 1
Supposedly, parsing with derivatives takes exponential time to reject inputs of the form "1+1+1+…1+1+1+1".
Is he right? Let’s find out with a little code.
Parsing parsing expressions; a grammar for grammars
We define a data type to represent the right-hand side of a production rule, which we call Pe for parsing expression. Non-empty strings are represented as concatenations of Char values. The empty string is represented by Empty.
data Pe = Var String | Ch Char | Empty | Pe :. Pe | Pe :| Pe
  | Nul | Eps Char | Del Pe
  deriving Show
The purpose of the mysterious data constructors Nul, Eps, and Del will become clear later. They are not needed for representing a grammar.
We whip up a parser for grammars like our example with parser combinators. We store the grammar in a map, keyed by the variable on the left-hand side of a production rule.
jsEval "curl_module('../compiler/Charser.ob')"
jsEval "curl_module('../compiler/Map.ob')"
import Charser
import Map

cfg :: Charser (Map String Pe)
cfg = filler *> (fromList <$> some rule) <* eof where
  rule = (,) <$> sym <*> (want "=" *> expr <* want ";")
  expr = foldl1 (:|) <$> cat `sepBy1` want "|"
  cat = foldl1 (:.) <$> some atm
  atm = str <|> Var <$> sym <|> want "(" *> expr <* want ")"
  sym = some alphaNumChar <* filler
  str = char '"' *> (ngram <$> many (noneOf "\"")) <* want "\""
  want s = string s <* filler
  filler = many $ com <|> oneOf " \n" *> pure ()
  com = string "--" *> some (noneOf "\n") *> pure ()
  ngram = \case
    "" -> Empty
    s -> foldr1 (:.) $ Ch <$> s
For example:
Right calc = parse cfg "" [r|
expr = expr op expr | number | "-" number | "(" expr ")";
op = "+" | "-" | "*" | "/";
number = bit (number | "");
bit = "0" | "1";
|]
The least fixpoint
Now for a brief interlude: abstract mathematics that turns out to have concrete applications.
Suppose we have a bunch of simultaneous Boolean equations with a single variable on the left:
A = A ∨ (B ∧ C) ∨ 0
B = 1 ∨ D
C = C ∨ B
D = E
E = D
Suppose each variable can take one of three values: 0 (false), 1 (true), or ⊥ (undefined). Then we can find a solution as follows:
1. Set all variables to ⊥.
2. For each equation, evaluate the right-hand side, using rules such as 1 ∨ ⊥ = 1 and ⊥ ∧ 0 = 0, along with the latest values of the variables.
3. Did this change any values of the variables? If a 0 changed to 1 or vice versa, then report a contradiction and halt. If ⊥ became 0 or 1 then go back to step 2. Otherwise halt.
This works because the only transitions for each variable assignment that keep the algorithm going are from ⊥ to 0 or from ⊥ to 1. So if there are \(N\) variables, then we halt after at most \(N\) steps.
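To make this concrete, here is a minimal standalone sketch of the naive iteration, run on the example equations. All the names here (or3, and3, naiveFix, step) are hypothetical scaffolding for illustration, not part of the parser we build below: we encode the three values as Maybe Bool with Nothing playing the role of ⊥, and we omit contradiction detection since these equations produce none.

or3 x y
  | x == Just True || y == Just True = Just True
  | x == Just False = y
  | y == Just False = x
  | otherwise = Nothing

and3 x y
  | x == Just False || y == Just False = Just False
  | x == Just True = y
  | y == Just True = x
  | otherwise = Nothing

-- Iterate until nothing changes; each pass re-evaluates every equation.
naiveFix step xs
  | ys == xs = xs
  | otherwise = naiveFix step ys
  where ys = step xs

-- The five example equations, as a function of the current assignment.
demo = naiveFix step $ replicate 5 Nothing where
  step [a, b, c, d, e] =
    [ a `or3` (b `and3` c) `or3` Just False
    , Just True `or3` d
    , c `or3` b
    , e
    , d
    ]

Evaluating demo yields [Just True, Just True, Just True, Nothing, Nothing], matching the least fixpoint described below.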
In mathematical terms, we can define a partial order:
\[\{0, 1, \bot\} \quad\text{with}\quad \bot \le 0,\ \bot \le 1\]
which induces a partial order on the set of possible assignments \( \{0,1,\bot\}^N \); an assignment is less than or equal to another exactly when this is true coordinate-wise. We can think of a lesser assignment as being less defined.
In particular, the \(N\)-tuple where each coordinate is \(\bot\) is less than or equal to every other assignment; it is the bottom of our partially ordered set of assignments.
We represent the special case of contradiction with \(\top\) and define this element to be greater than all other elements.
Our algorithm can be viewed as repeatedly feeding the output of a function \(f\) to itself until there are no more valid changes. In other words, it finds a fixed point of \(f\).
This function \(f\) is monotonic, namely \(x \le y \implies f(x) \le f(y)\), and we start from the least element \(\bot\), so we wind up with a least fixpoint of \(f\).
The element ⊤ is always a fixpoint of \(f\). In general there may exist multiple nontrivial fixpoints. In our above example, the least fixpoint is \((1, 1, 1, \bot, \bot)\), but there are two other nontrivial fixpoints: \((1, 1, 1, 0, 0)\) and \((1, 1, 1, 1, 1)\).
See the Kleene Fixed-Point Theorem. Least fixpoints feature in denotational semantics, though we’ll only need them to solve Boolean equations.
Nothing matters
A parsing expression is nullable if it accepts the empty string.
The following converts a parsing expression to a Boolean formula that evaluates to true exactly when it is nullable. The nullability of an expression can depend on the nullability of the nonterminals it contains, so the Boolean formulas may contain Boolean variables.
data BFor = BCon Bool | BVar String | BFor :\/ BFor | BFor :/\ BFor deriving Show

bfor = \case
  Nul -> BCon False
  Eps _ -> BCon True
  Del _ -> BCon True
  Empty -> BCon True
  Ch _ -> BCon False
  x :| y -> case bfor x of
    BCon True -> BCon True
    BCon False -> bfor y
    _ -> case bfor y of
      BCon True -> BCon True
      BCon False -> bfor x
      _ -> bfor x :\/ bfor y
  x :. y -> case bfor x of
    BCon True -> bfor y
    BCon False -> BCon False
    _ -> case bfor y of
      BCon True -> bfor x
      BCon False -> BCon False
      _ -> bfor x :/\ bfor y
  Var s -> BVar s
Thanks to constant folding, an expression is either true, false, or a bunch of variables connected with logical AND and OR operations. From the rules, we see there is never a need for NOT.
example = [r|
A = "X";
B = "" | "" | "";
C = C | "" | A;
D = E E F;
E = F "X" F | F F;
F = D F | D "" E | E F;
|]

(bfor <$>) <$> parse cfg "" example
We solve the equations by finding the least fixpoint using the algorithm described earlier, but with some optimizations. For each variable, we record the formulas it appears in. When the truth value of a variable has been determined, we only re-evaluate the formulas in which it appears, rather than all formulas.
Re-evaluating a formula is simply constant folding and constant propagation:
reval solns = go where
  go = \case
    BCon b -> BCon b
    BVar v -> case mlookup v solns of
      Nothing -> BVar v
      Just b -> BCon b
    x :\/ y -> case go x of
      BCon True -> BCon True
      BCon False -> go y
      _ -> case go y of
        BCon True -> BCon True
        BCon False -> go x
        _ -> go x :\/ go y
    x :/\ y -> case go x of
      BCon True -> go y
      BCon False -> BCon False
      _ -> case go y of
        BCon True -> go x
        BCon False -> BCon False
        _ -> go x :/\ go y
For example:
reval (fromList [("A", True), ("B", False)]) $ BVar "A" :/\ (BVar "B" :\/ BVar "A")
Our implementation is crude. The deps map associates a Boolean variable v with a plain list of other variables whose nullability formulas contain v somewhere, hence the uniqcons helper for preventing duplicates in a list.
mustLookup k m = maybe (error "mustLookup") id $ mlookup k m

uniqcons x xs
  | elem x xs = xs
  | otherwise = x:xs

leastfix solns bots deps = \case
  [] -> foldr (flip insert False) solns $ keys bots
  v:todo -> case mlookup v solns of
    Just _ -> leastfix solns bots deps todo
    Nothing -> case reval solns $ mustLookup v bots of
      BCon b -> leastfix (insert v b solns) (delete v bots) deps
        $ foldr uniqcons todo $ maybe [] id $ mlookup v deps
      sol -> leastfix solns (insert v sol bots) deps todo
We bundle a few data structures together in the PWD data type:
- The nonterminals whose nullabilities are known.
- A map of nonterminals to the parsing expressions they expand to.
- A list of nonterminals whose nullability we need to figure out.
data PWD = PWD
  { nulls :: Map String Bool
  , derivs :: Map String Pe
  , inbox :: [String]
  } deriving Show
The following computes the nullability of each variable in a given context-free grammar. After computing the least fixpoint, any variables that remain undefined are deemed non-nullable: they correspond to production rules that never terminate, except along branches that lead to non-nullable terminals.
bvars = \case
  BCon _ -> []
  BVar s -> [s]
  x :\/ y -> bvars x `union` bvars y
  x :/\ y -> bvars x `union` bvars y

solve pwd = pwd
  { nulls = leastfix (nulls pwd) bots deps $ inbox pwd
  , inbox = []
  } where
  bots = foldr (\v -> insert v $ bfor $ mustLookup v $ derivs pwd) Tip $ inbox pwd
  deps = foldr addDeps Tip $ toAscList bots
  addDeps (v, t) m = foldr ($) m $ flip (insertWith uniqcat1) [v] <$> bvars t
  uniqcat1 [x] xs = uniqcons x xs

initNulls m = solve $ PWD
  { nulls = Tip
  , derivs = m
  , inbox = keys m
  }
In our next example, only D is nullable:
nulls . initNulls <$> parse cfg "" [r| A = A | "x"; B = C; C = B; D = D | ""; |]
We can also completely solve our example from above:
nulls . initNulls <$> parse cfg "" example
Derivative Work
We write \(D_c f\) for the derivative of a parsing expression \(f\) with respect to a character \(c.\)
We provisionally define \(D_c f\) to be a parsing expression \(f'\) that accepts a string \(s\) if and only if \(f\) accepts \(c : s\).
For example, since the empty string rejects any character, for all \(c\) we have \(D_c \varepsilon = \emptyset\), where \(\varepsilon\) denotes the empty string and \(\emptyset\) denotes a parsing expression that accepts no strings, which is Nul in our code.
Naturally, we get:
\[ D_c (f | g) = D_c f | D_c g \]
What about \(D_c x\) for a character \(x\)? If \(c \ne x\) then obviously \(D_c x = \emptyset\).
But if \(c = x\), although \(D_c x = \varepsilon\) is a reasonable answer, in practice, solving the parsing problem is not enough; we want to return at least one parse tree, and ideally the entire parse forest.
Thus we tag each \(\varepsilon\) with the character that just got eaten, which will later allow us to construct parse trees:
\[ D_c c = \varepsilon_c \]
We represent this with the Eps constructor in our code.
Similarly, if \(f\) is nullable, we have:
\[D_c (f . g) = D_c f . g | \delta_f D_c g \]
where \(\delta_f\) is like \(\varepsilon\) except it is tagged with a parsing expression, which again later allows us to construct parse trees. We represent this with the Del constructor in our code.
If \(f\) is non-nullable then we simply have:
\[D_c (f . g) = D_c f . g \]
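These rules are straightforward to transcribe for parsing expressions that contain no variables. The following standalone sketch is hypothetical and only for illustration; the real derive function below must additionally handle recursion through nonterminals, which is where the complications begin.

-- Nullability for the Var-free fragment of Pe.
nullablePe = \case
  Empty -> True
  Eps _ -> True
  Del _ -> True
  x :| y -> nullablePe x || nullablePe y
  x :. y -> nullablePe x && nullablePe y
  _ -> False

-- The derivative rules above, for the Var-free fragment of Pe.
dc c = \case
  Ch x | x == c -> Eps x
  x :| y -> dc c x :| dc c y
  x :. y
    | nullablePe x -> (dc c x :. y) :| (Del x :. dc c y)
    | otherwise -> dc c x :. y
  _ -> Nul

For example, dc '0' (Ch '0' :. Ch '1') yields Eps '0' :. Ch '1'.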
The rules for computing derivatives are conceptually simple, but consider a production rule like:
S = S | "x";
Then:
\[ D_c S = D_c S | D_c \text{x} \]
and we see the derivative \(D_c S\) depends on itself. In general, there could be circular dependencies among multiple derivatives.
We maintain a map of all derivatives we have computed. Initially, it holds a context-free grammar: the keys are variable names and the values are the parsing expressions they expand to. For a key s and a character c, we store the derivative of its parsing expression with respect to c at the key c:s, which is convenient but can cause collisions. We carefully write our demos to avoid them. For example, we might stick to capital letters for variables and never use them as constants on the right-hand side.
When computing a derivative of a variable f, if f is already in the map, then we simply return f. Otherwise, we insert a placeholder undefined value for f and add f to the inbox list before traversing the parsing expression corresponding to f. Eventually, this finds all dependencies of the derivatives of f without getting caught in an infinite loop.
Next, we determine the nullabilities of f and its dependencies, so that we can derive the derivatives we just computed.
We can probably simplify the code. Rather than repeated solve-derive cycles, perhaps we could mix it all up in code that resembles dynamic programming with memoization. But my gut feeling is it’s faster to solve nullability in waves.
derive :: Char -> String -> State PWD Pe
derive c s = do
  pwd <- get
  case mlookup (c:s) $ derivs pwd of
    Just _ -> pure ()
    Nothing -> do
      put $ pwd
        { derivs = insert (c:s) undefined $ derivs pwd
        , inbox = (c:s) : inbox pwd
        }
      d <- go $ mustLookup s $ derivs pwd
      pwd <- get
      put $ pwd { derivs = insert (c:s) d $ derivs pwd }
  pure $ Var $ c:s
  where
  nullable pe = do
    pwd <- get
    case reval (nulls pwd) $ bfor pe of
      BCon b -> pure b
      _ -> error "unreachable"
  go = \case
    Var t -> derive c t
    Ch x | x == c -> pure $ Eps x
    x :| y -> (#|) <$> go x <*> go y
    Del x :. y -> (Del x #.) <$> go y
    x :. y -> do
      b <- nullable x
      dx <- go x
      if not b then pure $ dx #. y else do
        dy <- go y
        pure $ (dx #. y) #| (Del x #. dy)
    _ -> pure Nul

(#.) = \cases
  Nul y -> Nul
  x Nul -> Nul
  x y -> x :. y

(#|) = \cases
  Nul y -> y
  x Nul -> x
  x y -> x :| y
To determine if a grammar accepts a given string, we repeatedly derive the start symbol with respect to each successive character of the string. The resulting parsing expression is nullable if and only if the grammar accepts the string.
Let’s call its corresponding nonterminal the "final" symbol.
deriveString m start s = foldl go (start, initNulls m) s where
  go (key, pwd) c = (c:key, solve $ snd $ runState (derive c key) pwd)
For example, the following grammar accepts palindromic strings of bits of even length (a language beyond the capabilities of DPDAs).
Right epal = parse cfg "" [r|S = "0" S "0" | "1" S "1" | "";|]

deriveString epal "S" "1001"
It checks out. The nullability map takes "1001S" to True. Recall this string is how we represent \(D_1 D_0 D_0 D_1 S\), and since this parsing expression is nullable, we know the grammar accepts "1001". More generally, we see all palindromic strings are accepted and non-palindromic strings are rejected.
Its parse forest is encoded in the messy-looking parsing expressions in the second map.
Trimming trees
We write a function that preserves only the nullable parts of the final parsing expression:
vars = \case
  Var v -> [v]
  x :. y -> vars x ++ vars y
  x :| y -> vars x ++ vars y
  _ -> []

filterNull pwd final = mgo final Tip where
  mgo k m = case mlookup k m of
    Nothing -> case go $ mustLookup k $ derivs pwd of
      Nothing -> m
      Just ast -> foldr mgo (insert k ast m) $ vars ast
    _ -> m
  go = \case
    Nul -> Nothing
    Ch _ -> Nothing
    Empty -> Just Empty
    Eps c -> Just $ Eps c
    Del x -> go x
    x :| y -> go x `mkOr` go y
    x :. y -> (:.) <$> go x <*> go y
    Var s -> if mustLookup s $ nulls pwd then Just $ Var s else Nothing
  mkOr = \cases
    x Nothing -> x
    Nothing y -> y
    (Just x) (Just y) -> Just $ x :| y
We write a pretty printer for the filtered expressions:
pretty p = \case
  Eps c -> (c:)
  Empty -> ("\xce\xb5"++)
  x :. y -> pretty 1 x . pretty 1 y
  x :| y -> showParen (p == 1) $ pretty 0 x . (" | "++) . pretty 0 y
  Var s -> ('<':) . (s++) . ('>':)

prettyDemo g start s = putStr $ case toAscList $ filterNull pwd final of
  [] -> "parse failed\n"
  ds -> "<" ++ final ++ ">\nwhere:\n" ++ unlines (showDefn <$> ds)
 where
  showDefn (k, v) = "<" ++ k ++ "> = " ++ pretty 0 v ""
  (final, pwd) = deriveString g start s
For example:
prettyDemo epal "S" "11011011"
For ambiguous grammars, we print all possible values of all variables:
let Right g = parse cfg "" [r|S = S "+" S | "1";|] in prettyDemo g "S" "1+1+1+1"
We enumerate all members of a parse forest by going through all combinations. To distinguish between the parse trees, we parenthesize nontrivial variable expansions:
forest m k = go 0 $ mustLookup k m where
  go p = \case
    x :. y -> (.) <$> go 0 x <*> go 0 y
    x :| y -> go 0 x ++ go 0 y
    Var s -> showParen (p == 0) <$> go 1 (mustLookup s m)
    Eps c -> pure (c:)
    Empty -> pure ("\xce\xb5"++)

let Right g = parse cfg "" [r|S = S "+" S | "1";|]
    (final, pwd) = deriveString g "S" "1+1+1+1"
in putStr $ unlines $ ($"") <$> forest (filterNull pwd final) final
We write a function that chooses a random parse tree. The distribution is not uniform: we just flip a coin each time we encounter an (:|) node while traversing the parsing expression.
parseFrac = uncurry (/) . foldl (\(a, b) d -> (a*10 + fromIntegral (ord d - ord '0'), b*10)) (0.0, 1)

jsrandom = parseFrac . drop 2 <$> jsEval "Math.random()"

sample g start s = putStrLn . ($"") =<< go 0 (mustLookup final m) where
  go p = \case
    Eps c -> pure (c:)
    Empty -> pure ("\xce\xb5"++)
    Var s -> showParen (p == 0) <$> go 1 (mustLookup s m)
    x :. y -> (.) <$> go 0 x <*> go 0 y
    x :| y -> do
      b <- (<1) . (2*) <$> jsrandom
      go 0 (if b then x else y)
  (final, pwd) = deriveString g start s
  m = filterNull pwd final
For example, here’s a sample parse tree from our first example grammar:
sample calc "expr" "11*(10+100)/1"
Our next example is a grammar that accepts matching pairs of parentheses (a language beyond the capabilities of DFAs). The sample should be unique in this case:
Right catalan = parse cfg "" [r|S = "[" S "]" S | "";|]

sample catalan "S" "[[[]][][]][]"
Believe it or not
The moment of truth has arrived. We start with the following grammar:
Right coxEx = parse cfg "" [r|
S = T;
T = T "+" T | N;
N = "1";
|]
Russ Cox alleges that parsing with derivatives takes exponential time to reject the following input:
prettyDemo coxEx "S" "1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1++1"
He also alleges it takes a long time to go far beyond the first few parse trees:
sample coxEx "S" "1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1+1"
Despite the questionable quality of our code (and our compiler), we still manage to reject the first input and print a random parse tree for the second.
Why Russ Cox Is Wrong
Consider the following grammar with start symbol A:
A = ("0" | "1") B
B = ("0" | "1") C
C = ("0" | "1")
Even though there are only 3 rules, this grammar accepts \(2^3 = 8\) different strings.
More generally, we can construct similar grammars with \(N\) rules that accept \(2^N\) different strings. In other words, a data structure whose size is polynomial in \(N\) can represent a parse forest whose size is exponential in \(N\).
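To make the construction concrete, here is a hypothetical helper (expGrammar is our name; it is not used elsewhere on this page) that emits such a grammar for any \(N\). For \(N = 3\) it reproduces the grammar above, and its output can be fed straight to parse cfg "".

-- Build N rules, each consuming one bit, so the grammar accepts 2^N strings.
expGrammar n = unlines $ zipWith rule vs (tail vs ++ [""]) where
  vs = [[v] | v <- take n ['A'..]]
  rule v "" = v ++ " = \"0\" | \"1\";"
  rule v w = v ++ " = (\"0\" | \"1\") " ++ w ++ ";"

For example, parse cfg "" (expGrammar 26) builds a grammar accepting \(2^{26}\) strings from just 26 rules.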
Parsing with derivatives exhibits this phenomenon, which appears frequently in computer science under many different names: sharing, hash consing, lazy evaluation, automatic differentiation, …
We view the algorithm as adding rules to the grammar as we parse the input. The final parsing expression may represent a parse forest of exponential size, but the expression itself remains polynomial:
prettyDemo coxEx "S" "1+1+1+1+1"
If the input is rejected, our final symbol expands to Nul, that is, it accepts the empty language. For some reason, Cox falsely accuses the algorithm of backtracking here.
Returning a parse result is also efficient. We simply commit to one of the alternatives whenever a choice is offered while traversing the final parsing expression. Again, no backtracking.
Cox also fusses about ambiguity. This indeed is an important issue in language design, so it is strange that Cox condemns derivatives on these grounds. Firstly, parsing with derivatives trivially detects ambiguity: are there "|" operators in the final parsing expression? Secondly, common resolution strategies are also trivial. Want ambiguous infix operators to be left-associative? Then pick the first alternative each time.
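As a sketch of that last strategy, here is a hypothetical leftmost variant of our sample function that always commits to the first alternative of each (:|) node; it assumes the parse succeeded.

-- Deterministic cousin of `sample`: always pick the first alternative.
leftmost g start s = putStrLn $ go 0 (mustLookup final m) "" where
  go p = \case
    Eps c -> (c:)
    Empty -> ("\xce\xb5"++)
    Var v -> showParen (p == 0) $ go 1 $ mustLookup v m
    x :. y -> go 0 x . go 0 y
    x :| _ -> go 0 x
  (final, pwd) = deriveString g start s
  m = filterNull pwd final

For example, leftmost coxEx "S" "1+1+1+1+1" prints one parse tree with no coin flips.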
At a higher level, Cox could have avoided hallucinating by tinkering with code like we just did, instead of penning a post riddled with errors. Could it be that Cox shied away from implementing parsing with derivatives because he was unfamiliar with languages in which this algorithm is easy to express? If so, it’s concerning that Cox was in charge of designing a programming language for a while. Does this mean entire classes of algorithms are difficult for many computer scientists to implement, or even comprehend?
Why I Was Wrong
An earlier version of this webpage claimed that we could avoid finding the least fixpoint when determining nullability. This happened to be true for the example grammars on this page, but my statement was false in general. A grammar with sufficient mutual recursion would cause my original implementation to fail. Realizing this compelled me to roll up my sleeves and write a routine to find a least fixpoint.
Of course, I have no idea if there are any remaining bugs!