Regex Calculus

Over half a century ago, Brzozowski published Derivatives of Regular Expressions, which describes how to:

  • Directly convert a regular expression to a deterministic finite automaton (DFA). No NFAs needed.

  • Support rich regular expressions. In particular, we have logical AND and NOT, e.g. [a-z]*&!(()|do|for|if|while).

  • Instantly obtain small and often minimal DFAs in typical applications.

This elegant and powerful method was almost forgotten, and inferior constructions prevail today. Happily, it was rescued from obscurity by Owens, Reppy, and Turon, in Regular-expression derivatives reexamined. May regex derivatives one day earn their rightful place in computer science.

Regexes, irregularly

We tweak the traditional definition of regular expressions. For us, there is only one kind of constant:

  • \([c_0 c_1 ...]\): a character set or character class; accepts one of the characters in the given set.

Some notational conveniences: we may omit the square brackets for singleton sets, and we use hyphens to denote ranges of characters. For example, x means [x] and [a-d5-9] means [abcd56789].

Observe the empty character class \([]\) rejects all strings (even the empty string).

All other regexes are built from other regexes \(r\) and \(s\):

  • \(rs\): concatenation.

  • \(r\mid s\): logical or (alternation); the union of the two languages.

  • \(r\mbox{*}\): Kleene closure; zero or more strings of \(r\) concatenated together.

  • \(r\& s\): logical and; the intersection of the two languages.

  • \(!r\): logical not (complement); accepts a string if and only if \(r\) rejects it.

Let \(() = []\mbox{*}\), that is, the Kleene closure of the empty language. Observe \(()\) accepts only the empty string.

Textbooks often assume a small alphabet, so they can, for instance, cheaply iterate over each symbol of the alphabet. This is unrealistic for Unicode, and arguably for ASCII too. Instead of demanding a small alphabet, we ask only that certain operations on character classes be efficient.

Our code represents a character class either as Pos [Char] where the list holds the characters of the class, or as Neg [Char] where the list holds the characters outside the class. The function elemCC tests if a given character is in a given character class.

For example, Pos "abc" represents the character class [abc] while Neg "abc" represents the all characters except those in [abc], which we denote [^abc].

This scheme is easy to work with and suffices for common applications, which stick to narrow ranges such as [a-z] or [^0-9]. Since we store all characters in a range rather than just the endpoints, we’d choose a more sophisticated data structure if we expected unusually large ranges.

data CharClass = Pos String | Neg String deriving (Eq, Ord)

elemCC :: Char -> CharClass -> Bool
elemCC c (Pos cs) = c `elem` cs
elemCC c (Neg cs) = c `notElem` cs

We use lists for the arguments of logical operations. We later explain why.

data Re = OneOf CharClass | Re :. Re | Kleene Re
  | ReOr [Re] | ReAnd [Re] | ReNot Re
  deriving (Eq, Ord)

Useful regex constants:

-- | The regex `()`. The language containing only the empty string.
eps :: Re
eps = Kleene noGood

-- | The regex `[]`. The empty language.
noGood :: Re
noGood = OneOf $ Pos []

-- | The regex `.*`. The language containing everything.
allGood :: Re
allGood = Kleene $ OneOf $ Neg []

Regex exercises

Let \(f\) be a regex.

1. Does \(f\) accept the empty string?

If so, we say \(f\) is nullable.

Character sets never accept the empty string. The Kleene closure always accepts the empty string. Concatenation does if and only if \(r\) and \(s\) do. The logical operations commute with nullability.

nullable :: Re -> Bool
nullable re = case re of
  OneOf _  -> False
  Kleene _ -> True
  r :. s   -> nullable r && nullable s
  ReOr  rs -> any nullable rs
  ReAnd rs -> all nullable rs
  ReNot r  -> not $ nullable r

2. What regex do we get after feeding the character \(c\) to \(f\)? In other words, what regex accepts a string \(s\) precisely when \(f\) accepts \(c\) followed by \(s\)?

We call this regex the derivative of the regex \(f\) with respect to the character \(c\), and write \(\partial_c f\). For example, \(\partial_a\) ab*c|d*e*f|g*ah = b*c|h.

For a character set, the answer is \(()\) if \(c\) is a member. Otherwise it’s \([]\). For example, \(\partial_a\) [abc] = () and \(\partial_a\) [xyz] = [].

For the other cases, let us follow Lagrange and use a prime mark to denote a derivative (with respect to \(c\)). We find \(r'\) and \(s'\), that is, we recursively answer the question for \(r\) and \(s\). Then the logical operations commute with taking derivatives:

  • \((r\mid s)' = r' \mid s'\)

  • \((r\& s)' = r' \& s'\)

  • \((!r)' = !r'\)

For the Kleene closure, we find:

  • \((r\mbox{*})' = r'r\mbox{*}\)

The trickiest is concatenation: \(rs\). First, we answer the first question, that is, determine if \(r\) accepts the empty string. If so, the answer is \(r's\mid s'\). If not, the answer is just \(r's\).

In the following code, for now, ignore mk and pretend (#.) = (:.). We define them later.

derive :: Char -> Re -> Re
derive c re = case re of
  OneOf cc | elemCC c cc -> eps
           | otherwise   -> noGood
  Kleene r               -> derive c r #. mkKleene r
  r :. s   | nullable r  -> mkOr [derive c r #. s, derive c s]
           | otherwise   -> derive c r #. s
  ReAnd rs               -> mkAnd $ derive c <$> rs
  ReOr  rs               -> mkOr  $ derive c <$> rs
  ReNot r                -> mkNot $ derive c r

A regex operation is completely defined by its nullability and its derivative. The following function determines if a given regex accepts a given string:

accepts :: Re -> String -> Bool
accepts re ""    = nullable re
accepts re (c:s) = accepts (derive c re) s

Classy Regexes

Regexes are generally content with a blurry view of the alphabet. Particular states may only care about digits, or whitespace, or certain letters. Indeed, the error state is an extreme case, acting identically on all symbols. Especially with alphabets such as Unicode, it pays to have a single arrow cover vast swathes of symbols, rather than build one arrow per symbol.

Thus we desire an algorithm that, for a given regex, partitions the alphabet into as few character classes as possible so the regex behaves correctly on the first input character even if members of the same class are indistinguishable. This is infeasible, but we get by with an imperfect algorithm. It may chop up the alphabet too finely, but it’ll do.

(Why do we only care about the first input character? All will become clear in the next section!)

If the regex is just a character class, then we can achieve perfection. Divide the alphabet into the "haves" and the "have-nots": those characters within the class, and those without.

Otherwise, we recursively construct our partition of the alphabet. For the Kleene star and logical not, we use the partition of the underlying regex. The other operations have two regex arguments, and we can stay out of trouble by taking all pairs of intersections of character classes of their partitions.

For concatenation, we can do better when the first regex rejects the empty string: in this case, we use the partition of the first regex and ignore that of the second.

classy :: Re -> [CharClass]
classy re = case re of
  OneOf (Pos cs) -> [Pos cs, Neg cs]
  OneOf (Neg cs) -> [Pos cs, Neg cs]
  Kleene r       -> classy r
  ReNot  r       -> classy r
  ReOr   rs      -> foldl1' allPairs $ classy <$> rs
  ReAnd  rs      -> foldl1' allPairs $ classy <$> rs
  r :. s | nullable r -> classy r `allPairs` classy s
         | otherwise  -> classy r
  where allPairs r s = nub $ intersectCC <$> r <*> s

intersectCC :: CharClass -> CharClass -> CharClass
intersectCC (Pos xs) (Pos ys) = Pos $ intersect xs ys
intersectCC (Pos xs) (Neg ys) = Pos $ xs \\ ys
intersectCC (Neg xs) (Pos ys) = Pos $ ys \\ xs
intersectCC (Neg xs) (Neg ys) = Neg $ union xs ys

unionCC :: CharClass -> CharClass -> CharClass
unionCC (Pos xs) (Pos ys) = Pos $ union xs ys
unionCC (Pos xs) (Neg ys) = Neg $ ys \\ xs
unionCC (Neg xs) (Pos ys) = Neg $ xs \\ ys
unionCC (Neg xs) (Neg ys) = Neg $ intersect xs ys

The function repCC returns a member of a character class. For Pos, we pick the first character in the list. This always succeeds because our code never calls repCC with Pos []. For Neg, we search for the smallest character not in the list. We assume Neg never applies to the whole alphabet.

repCC :: CharClass -> Char
repCC (Pos (h:_)) = h
repCC (Pos [])    = error "BUG! Pos [] should be filtered out."
repCC (Neg cs) | Just c <- find (`notElem` cs) [minBound..] = c
               | otherwise = error "Neg with entire alphabet."

Regexes restated

We can now directly construct a DFA for any regex \(f\). We view a regex as a state of a DFA.

The start state is the input regex \(f\). For each character class \(C\) in a sufficiently fine partition for \(f\) (i.e. classy f), pick any representative \(c\), create the state \(\partial_c f\) if it doesn’t already exist, then draw an arrow labeled \(C\) from \(f\) to \(\partial_c f\).

b*|daab*|c*ad

Repeat on all freshly created states. The accepting states are those which accept the empty string. Done!

We map states to integers to simplify our interface; users of our engine need only deal with integers. We retain a map of integers to regexes in case the caller seeks a deeper understanding of our DFA.

mkDfa :: Re -> ([(Int, Re)], Int, [Int], [((Int, Int), CharClass)])
mkDfa r = (swap <$> M.assocs states, states!r, as, M.assocs collated) where
  collated = M.fromListWith unionCC edges
  (states, edges) = explore (M.singleton r 0, []) r
  as = snd <$> filter (nullable . fst) (M.assocs states)
  explore gr q = foldl' (goto q) gr $ filter (/= Pos []) $ classy q
  goto q (qs, ds) cc
    | Just w <- M.lookup qc qs = (qs, mkEdge w)
    | otherwise = explore (M.insert qc sz qs, mkEdge sz) qc
    where qc = derive (repCC cc) q
          sz = M.size qs
          mkEdge dst = ((qs!q, dst), cc):ds

The above always terminates so long as we’re mindful that the logical or operation is:

  • idempotent: \(r\mid r = r\)

  • commutative: \(r\mid s = s\mid r\)

  • associative: \((r\mid s)\mid t = r\mid (s\mid t)\)

This makes sense intuitively, because taking a derivative usually yields a simpler regex. The glaring exception is the Kleene star, but on further inspection, we ought to repeat ourselves eventually after taking enough derivatives so long as we can cope with the proliferating logical ors.

In practice, we apply more algebraic identities before comparing regexes to get smaller DFAs. Ideally, we’d like to tell if two given regexes are equivalent so we could generate the minimal DFA every time, but this is too costly.

We represent arguments of (&) and (|) with lists so we can call nub to notice idempotence and sort to sort out commutativity. We capture associativity by flattening lists, except for concatenation where we use pattern matching.

(#.) :: Re -> Re -> Re
r #. s
  | r == noGood || s == noGood = noGood
  | r == eps    = s
  | s == eps    = r
  | x :. y <- r = x #. (y #. s)
  | otherwise   = r :. s

mkOr :: [Re] -> Re
mkOr xs
  | allGood `elem` zs = allGood
  | null zs           = noGood
  | [z] <- zs         = z
  | otherwise         = ReOr zs
  where
  zs = nub $ sort $ filter (/= noGood) flat
  flat           = concatMap deOr xs
  deOr (ReOr rs) = rs
  deOr r         = [r]

mkAnd :: [Re] -> Re
mkAnd xs
  | noGood `elem` zs = noGood
  | null zs          = allGood
  | [z] <- zs        = z
  | otherwise        = ReAnd zs
  where
  zs = nub $ sort $ filter (/= allGood) flat
  flat             = concatMap deAnd xs
  deAnd (ReAnd rs) = rs
  deAnd r          = [r]

mkKleene :: Re -> Re
mkKleene (Kleene s) = mkKleene s
mkKleene r          = Kleene r

mkNot :: Re -> Re
mkNot (OneOf (Pos [])) = allGood
mkNot (ReNot s)        = s
mkNot r                = ReNot r

This completes our regex engine.

Reading Regexes

We employ parser combinators to parse regex patterns. We deviate from conventional syntax slightly. We add the metacharacters & and ! for logical and and logical not. We lack + and ?, but:

  • r+ is equivalent to !()|r*

  • r? is equivalent to ()|r

type Parser = Parsec String ()
regex :: Parser Re
regex = mkOr <$> ands `sepBy` char '|' where
  ands = mkAnd <$> cats `sepBy` char '&'
  cats = foldr (#.) eps <$> many nots
  nots = (char '!' >> mkNot <$> nots) <|> (atm >>= kle)
  atm = chCl
    <|> (const (OneOf $ Neg []) <$> char '.')
    <|> between (char '(') (char ')') regex
  kle :: Re -> Parser Re
  kle r = (char '*' >> kle (mkKleene r)) <|> pure r
  chCl = fmap OneOf $ (Pos . (:[]) <$> single)
     <|> between (char '[') (char ']') parity
  parity = option Pos (const Neg <$> char '^') <*>
    (nub . sort . concat <$> many rng)
  rng = alphaNum >>= \lo -> hiEnd lo <|> pure [lo]
  single = (char '\\' >> oneOf meta) <|> noneOf meta
  hiEnd :: Char -> Parser String
  hiEnd lo = do
    void $ char '-'
    hi <- alphaNum
    when (hi < lo) $ fail "invalid range end"
    pure [lo..hi]

meta :: String
meta = "\\|&!*.[]()"

Rendering Regexes

Tedious code to print character classes and regexes:

instance Show CharClass where
  show cc = case cc of
    Pos []  -> "[]"
    Neg []  -> "."
    Pos [c] -> showSingle c
    Neg [c] -> concat ["[^", showSingle c, "]"]
    Pos s   -> concat ["[" , f $ sort s, "]"]
    Neg s   -> concat ["[^", f $ sort s, "]"]
    where
    showSingle c | c `elem` meta = '\\':[c]
                 | otherwise     = [c]
    f ""    = ""
    f [c]   = [c]
    f (c:t) = rangeFinder c c t
    rangeFinder lo hi (h:t) | h == succ hi = rangeFinder lo h t
    rangeFinder lo hi t
      |            lo  == hi = lo:f t
      |       succ lo  == hi = lo:hi:f t
      | succ (succ lo) == hi = [lo..hi] ++ f t
      | otherwise            = lo:'-':hi:f t

instance Show Re where
  show = show' (0 :: Int) where
    show' p re = case re of
      OneOf s -> show s
      Kleene (OneOf (Pos [])) -> "()"
      Kleene r -> show' 4 r ++ "*"
      ReNot r  -> paren (p > 3) $ ('!':) $ show' 3 r
      r :. s   -> paren (p > 2) $ show' 2 r ++ show' 2 s
      ReAnd rs -> paren (p > 1) $ intercalate "&" $ show' 1 <$> rs
      ReOr rs  -> paren (p > 0) $ intercalate "|" $ show' 0 <$> rs
    paren True  s = '(':s ++ ")"
    paren False s = s

We place DFA nodes with a force-directed layout algorithm. We’ll explain it some other time.

Here, we bump into a downside of lazy evaluation: the function iterate is lazy, leading to stack issues if we iterate much more. We ought to fix this.

forceDirect :: (Int -> Int -> Bool) -> [(Int, [Double])] -> [(Int, [Double])]
forceDirect isEdge vs = iterate (step isEdge) vs!!128

step :: (Int -> Int -> Bool) -> [(Int, [Double])] -> [(Int, [Double])]
step isEdge m = nudge <$> m where
  nudge (a, v) = (,) a $ foldl' (zipWith (+)) v $ force <$> m where
    force (b, w)
      | b == a    = [0, 0]
      | otherwise = ((0.1 * f) *) <$> ba
      where
      ba = zipWith (-) v w
      d = sqrt . sum $ (^(2::Int)) <$> ba
      f = 300 / (d*d) - bool 0 1 (isEdge a b)

Lastly, we have messy code that draws the DFA with SVG, and sets up the above UI. It has nothing to do with regexes.

Submatching

I hope to implement submatching soon. I believe it should be straightforward. We extend the Re type with capturing groups:

data Re = ... | Group Int Re

Both nullable and derive commute with Group.

We annotate each DFA state with group numbers as follows. Let r be the regex associated with the state. If r is a character class or Kleene closure, then return the empty list. If r is a concatenation then return the group number of the first argument. If r is a group, then return its group number concatenated with any group numbers of the underlying regex. This leaves logical operations, in which case we concatentate the lists of group numbers of the arguments.

Each group number is associated with a start offset and an end offset. Then when running the DFA, as we transition from state to state, we also pay attention to the associated group numbers. The first time we see a group number \(n\), we set its start and end offsets to the current offset.

As we travel further, we update the end offset of group \(n\) until we reach a state whose set of group numbers does not contain \(n\). Afterwards the offsets of group \(n\) are fixed; they are never modified again.

Some applications may prefer a streaming API for regex matching, in which case submatching would update buffers instead of offsets.


Ben Lynn blynn@cs.stanford.edu 💡