Outcoding UNIX geniuses

Static types prevent disasters by catching bugs at compile time. Yet many languages have self-defeating types. Type annotation can be so laborious that weary programmers give up and switch to unsafe languages.

Adding insult to injury, some of these prolix type systems are simultaneously inexpressive. Some lack parametric polymorphism, forcing programmers to duplicate code or subvert static typing with casts. A purist might escape this dilemma by writing code to generate code, but this leads to another set of problems.

Type theory shows how to avoid these pitfalls, but many programmers seem unaware:

Popular authors Bruce Eckel and Robert C. Martin mistakenly believe strong typing implies verbosity, and worse still, testing conquers all. Tests are undoubtedly invaluable, but at best they "prove" by example. Testing can be used "…to show the presence of bugs, but never to show their absence". One could even argue a test-heavy approach helps attackers find exploits, as untested cases may hint at overlooked bugs. Instead of testing, we need strong static types so we can prove correctness of programs.
The designers of the Go language, including famed former Bell Labs researchers, have been stumped by polymorphism for years. To be fair, parametric polymorphism is only half the story. The other half is ad hoc polymorphism, which Haskell researchers only neatly solved in the late 1980s with type classes. Practical Haskell compilers also need more type trickery for unboxing.

Why is this so? Perhaps people think the theory is arcane, dry, and impractical?

By working through some programming interview questions, we’ll find the relevant theory is surprisingly accessible. We quickly arrive at a simple type inference or type reconstruction algorithm that seems too good to be true: it powers strongly typed languages that support parametric polymorphism without requiring any type declarations.

We’ll write code that figures out the function:

\x y z -> x z(y z)

has type:

(a -> b -> c) -> (a -> b) -> a -> c

And given that (+) has type Int -> Int -> Int, our code determines that the function:

\x -> (+) (x 42)

has type:

(Int -> Int) -> Int -> Int

1. Identifying twins

Determine if two binary trees storing integers in their leaf nodes are equal.

Solution: Two words: deriving Eq.

data Tree a = Leaf a | Branch (Tree a) (Tree a) deriving Eq

Haskell’s derived instance feature automatically works on any algebraic data type built on any type for which equality makes sense. It even works for mutually recursive data types (see Data.Tree):

data Tree a = Node a (Forest a) deriving Eq
data Forest a = Forest [Tree a] deriving Eq

Perhaps my interviewer would ask me to explain deriving Eq does. Roughly speaking, it generates code like the following, saving the programmer from stating the obvious:

data Tree a = Leaf a | Branch (Tree a) (Tree a)

eq (Leaf x)       (Leaf y)       = x == y
eq (Branch xl xr) (Branch yl yr) = eq xl yl && eq xr yr
eq _              _              = False

2. On assignment

This time, the first tree may contain variables in place of integers. Can we unify the trees, that is, assign integers to all variables so the trees are equal? The same variable may appear more than once.

Solution: We extend our data structure to hold variables:

data Tree a = Var String | Leaf a | Tree a :. Tree a deriving Show

As before, we traverse both trees and look for nodes that differ in value or type. If the first is a variable, then we record a constraint, that is, a variable assignment that is required for the trees to be equal such as a = 4 or b = 2. If there are conflicting values for the same variable, then we indicate failure by returning Nothing. Otherwise we return Just the list of assignments found.

solve1 = go [] where
  go as = \cases
    (xl :. xr) (yl :. yr)           -> do
      as0 <- go as xl yl
      go as0 xr yr
    (Leaf x)   (Leaf y) | x == y    -> Just as
                        | otherwise -> Nothing
    (Var v)    (Leaf x)             -> case lookup v as of
      Nothing             -> Just $ (v, x):as
      Just x' | x == x'   -> Just as
              | otherwise -> Nothing
    _          _                    -> Nothing

solve1 (Leaf 1) (Leaf 2)
solve1 (Leaf 3) (Leaf 3)
solve1 (Leaf 4) (Var "x")
solve1 (Var "x" :. Var "y") (Leaf 5 :. Leaf 6)
solve1 (Var "x" :. Var "x") (Leaf 5 :. Leaf 6)

3. Both Sides, Now

Now suppose leaf nodes in both trees can hold integer variables. Can two trees be made equal by assigning certain integers to the variables? If so, find the most general solution.

Solution: We proceed as before, but now we may encounter constraints such as a = b, which equate two variables. To handle such a constraint, we pick one of the variables, such as a, and replace all occurrences of a with the other side, which is b in our example. This eliminates a from all constraints. Eventually, all our constraints have an integer on at least one side, which we check for consistency.

We write a function that substitutes a given tree for a given variable. This is more general than necessary since we only need to substitute variables for variables, but we’ll soon need its full power.

sub x t = \case
  Var x' | x == x' -> t
  l :. r           -> sub x t l :. sub x t r
  a                -> a

sub "x" (Leaf 1 :. Leaf 2)
  ((Var "x" :. (Leaf 3 :. Var "y")) :. Var "x")

We discard redundant constraints where the same variable appears on both sides, such as a = a. Thus a variable may wind up with no integer assigned to it, which means if a solution exists, it can take any value.

Our code is inefficient. Rather than repeat the same chain of substitutions to reach a representative variable, we should use a more efficient disjoint-set data structure.

foldSub = foldr $ uncurry sub
solve2 = go [] where
  go as = \cases
    (Leaf x)   (Leaf y) | x == y    -> Just as
                        | otherwise -> Nothing
    (xl :. xr) (yl :. yr)           -> go as xl yl >>=
      \as0 -> go as0 (foldSub xr as0) (foldSub yr as0)
    (Var _)    (_ :. _)             -> Nothing
    (Var v)    (Var w)  | v == w    -> Just as
    (Var v)    t                    -> Just $ (v, t):as
    x          y@(Var _)            -> go as y x
    _          _                    -> Nothing

solve2 ((Var "x" :. Leaf 1) :. Var "z")  ((Var "y" :. Var "y") :. Leaf 2)
solve2 ((Var "x" :. Leaf 1) :. Var "z")  ((Var "y" :. Var "z") :. Leaf 2)
solve2 ((Var "x" :. Leaf 1) :. Var "x")  ((Var "y" :. Var "z") :. Leaf 2)

4. Once more, with subtrees

What if variables can represent subtrees?

Solution: Although we’ve significantly generalized the problem, our answer almost remains the same.

When unifying a variable v against a tree t, to avoid infinite recursion, we perform an occurs check: the variable v must not occur in t, except in the trivial case when v == t. If we pass this check, then we record that in future, we should replace v with the tree resulting from applying all substitutions found so far to t.

treeUnify :: (Show a, Eq a) => Tree a -> Tree a -> Maybe [(String, Tree a)]
treeUnify = go [] where
  go as = \cases
    (Leaf x)   (Leaf y) | x == y    -> Just as
                        | otherwise -> Nothing
    (xl :. xr) (yl :. yr)           -> do
      as0 <- go as xl yl
      go as0 (foldSub xr as0) (foldSub yr as0)
    (Var v)    (Var w)  | v == w    -> Just as
    (Var v)    t
      | occurs t  -> Nothing
      | otherwise -> Just $ (v, foldSub t as):as
      where
      occurs = \case
        l :. r         -> occurs l || occurs r
        Var w | v == w -> True
        _              -> False
    x          y@(Var _)            -> go as y x
    _          _                    -> Nothing

treeUnify (Var "x")
          (Leaf 5)
treeUnify (Var "x" :. Leaf 5)
          (Leaf 4  :. Var "y")
treeUnify ((Var "x"            :. Var "x")              :. Var "w")
          (((Leaf 3 :. Leaf 5) :. (Var "y" :. Var "z")) :. Var "w")

5. Type inference!

Design a language based on lambda calculus where integers and strings are primitve types, and where we can deduce whether a given expression is typable, that is, whether types can be assigned to the untyped bindings so that the expression is well-typed. If so, find the most general type.

For example, the expression \f -> f 2 which takes its first argument f and applies to the integer 2 must have type (Int -> u) -> u. Here, u is a type variable, that is, we can substitute u with any type. This is known as parametric polymorphism.

More precisely the inferred type is most general, or principal if:

Substituting types such as Int or (Int -> Int) -> Int (sometimes called type constants for clarity) for all the type variables results in a well-typed closed term.
There are no other ways of typing the given expression.

Solution: We define a tree to represent types. Like our previous trees, leaf nodes can be variables (TV for type variable) or constants (TC for type constant). The (:->) constructor plays the same role as (:.) above, but this time it also represents functions.

infixr 5 :->
data Type = TC String | Type :-> Type | TV String
instance Show Type where
  showsPrec d = \case
    TC s -> (s++)
    TV s -> (s++)
    a :-> b -> showParen (d > 0) $ showsPrec 1 a . (" -> "++) . shows b

We write a function tsub that substitutes a given tree for a given variable, analogous to sub above:

tsub (x, t) = \case
  TV x' | x == x' -> t
  a :-> b         -> tsub (x, t) a :-> tsub (x, t) b
  y               -> y

Tree unification is almost the same, though this time we switch from Maybe to Either so we can add error messages:

unify :: Type -> Type -> Either String [(String, Type)]
unify  = go [] where
  go as = \cases
    (TC x)      (TC y) | x == y    -> Right as
    (xl :-> xr) (yl :-> yr)        -> do
      as0 <- go as xl yl
      go as0 (foldr tsub xr as0) (foldr tsub yr as0)
    (TV v)      (TV w)  | v == w   -> Right as
    (TV v)      t
      | occurs t  -> Left "occurs check"
      | otherwise -> Right $ (v, foldr tsub t as):as
      where
      occurs = \case
        l :-> r       -> occurs l || occurs r
        TV w | v == w -> True
        _             -> False
    x           y@(TV _)           -> go as y x
    x           y                  -> Left $ show x ++ " versus " ++ show y

Next, we define a data structure to hold an expression in our language. Applications, lambda abstractions, variables, integers, and strings:

infixl 5 :@
data Expr = Expr :@ Expr | Lam String Expr | V String | I Int | S String
instance Show Expr where
  showsPrec d = \case
    a :@ b -> showParen (d > 0) $ shows a . (' ':) . showsPrec 1 b
    Lam v t -> ('\\':) . (v++) . (" -> "++) . shows t
    V s -> showParen (s == "+") $ (s++)
    I i -> shows i
    S s -> shows s

Unfortunately my base library lacks the state monad transformer, so we swipe some code from GHC:

data StateT s m a = StateT { runStateT :: s -> m (a,s) }
instance (Monad m) => Monad (StateT s m) where
    return a = StateT $ \ s -> return (a, s)
    m >>= k  = StateT $ \ s -> do
        (a, s') <- runStateT m s
        runStateT (k a) s'
instance (Functor m) => Functor (StateT s m) where
    fmap f m = StateT $ \ s ->
        fmap (\(a, s') -> (f a, s')) $ runStateT m s
instance (Functor m, Monad m) => Applicative (StateT s m) where
    pure a = StateT $ \ s -> return (a, s)
    StateT mf <*> StateT mx = StateT $ \ s -> do
        (f, s') <- mf s
        (x, s'') <- mx s'
        return (f x, s'')
    m *> k = m >>= \_ -> k
state f = StateT (return . f)
getT = state $ \ s -> (s, s)
putT s = state $ \ _ -> ((), s)
lift m = StateT $ \ s -> do
    a <- m
    return (a, s)

To finish off, we traverse the expression and unify types along the way to enforce the rules:

The type of a primitive value is its corresponding type; for example, 5 has type TC "Int" and "Hello, World" has type TC "String".
For an application f x, we recursively determine the type tf of f and tx of x, possibly requiring new assignments. We return a new type variable tfx, and unify tf with tx :-> tfx after applying any new assignments.
For a lambda abstraction \x.t, we generate a new type variable tx to represent the type of x. Then we recursively find the type tt of t during which we assign the type tx to any free occurrence of x, then return the type tx :-> tt.

Let’s walk through an example. The expression \f -> f 2 would be represented as the Expr tree Lam "f" (V "f" :@ I 2). To type this expression:

At the top level we have a lambda so begin by generating a new type variable t0.
Recursively type the lambda body, with the additional context that the symbol f has type t0.
1. Generate a new type variable t1.
2. Recursively type the left child of the (:@) node:
  1. We have the symbol f, which has type t0 from the context.
3. Recursively type the right child of the (:@) node:
  1. We have the integer constant 2, so we return the type TC "Int".
4. Unify t0 with TC "Int" -> t1, which assigns the latter to the former.
5. Return the type t1.
Return the type t0 -> t1.

Thus \f -> f 2 has type t0 -> t1 which expands to (Int -> t1) -> t1 if we apply all the assignments we found.

We use StateT to help with bookkeeping. To guarantee a unique name for each type variable, we maintain a counter which we increment for each new name, which is stored along with the type variable assignments collected so far.

The environment gamma holds types of any predefined functions, and also records the types of variables in lambda abstractions.

We ultimately return the type of an expression, or an error message explaining why it is ill-typed. We name the function gather because it accumulates more and more type variable assignments over time, but also because "to gather" can mean "to infer".

gather :: [(String, Type)] -> Expr ->
  StateT ([(String, Type)], Int) (Either String) Type
gather gamma = \case
  I _ -> pure $ TC "Int"
  S _ -> pure $ TC "String"
  f :@ x -> do
    tf <- gather gamma f
    tx <- gather gamma x
    tfx <- newTV
    (cs, i) <- getT
    let rewrite t = foldr tsub t cs
    cs' <- lift $ unify (rewrite tf) (rewrite $ tx :-> tfx)
    putT ((second rewrite <$> cs') ++ cs, i)
    pure tfx
  V s -> lift $ maybe (Left $ "missing " ++ s) Right $ lookup s gamma
  Lam x t -> do
    tx <- newTV
    (tx :->) <$> gather ((x, tx):gamma) t
  where
  newTV = do
    (cs, i) <- getT
    putT (cs, i + 1)
    pure $ TV $ 't':show i

This algorithm is the heart of the Hindley-Milner type system, or HM for short. See Mark P. Jones, Typing Haskell in Haskell.

The infer wrapper shows the type returned by gather with all assignments applied, or returns an error message:

infer gamma x = case runStateT (gather gamma x) ([], 0) of
  Right (ty, (cs, _)) -> show $ foldr tsub ty cs
  Left e -> "FAIL: " ++ e
infer [] (Lam "x" (V "x" :@ I 2))

We build a demo with a couple of predefined functions with certain types:

prelude :: [(String, Type)]
prelude = [
  ("+", TC "Int" :-> TC "Int" :-> TC "Int"),
  ("length", TC "String" :-> TC "Int")]

demo expr = do
  putStrLn $ "type of `" ++ show expr ++ "`:"
  putStrLn $ "  " ++ infer prelude expr ++ "\n"

demo $ V "length"
demo $ V "length" :@ S "hello"
demo $ V "length" :@ V "length"
demo $ V "+" :@ I 1 :@ I 1
demo $ V "+" :@ I 1 :@ S "hello"
demo $ Lam "x" (V "x" :@ I 2)
demo $ Lam "x" (V "x" :@ V "x")
demo $ Lam "x" ((V "+" :@ V "x") :@ I 42)
demo $ Lam "x" (V "+" :@ (V "x" :@ I 42))
demo $ Lam "x" (V "x")
demo $ Lam "x" (Lam "y" (V "x"))
demo $ Lam "x" (Lam "y" (Lam "z" ((V "x" :@ V "z") :@ (V "y" :@ V "z"))))

Ben Lynn blynn@cs.stanford.edu 💡