Symbol Codes

{-# LANGUAGE CPP #-}
#ifdef __HASTE__
import Haste.DOM
import Haste.Events
import Control.Monad (void)
#endif
import Data.List (isPrefixOf, group, sort, sortOn, genericLength)
import Data.Maybe (catMaybes)
import Data.Tree
import System.Random

How much can we compress data losslessly?

Unconditional lossless compression is impossible. An invertible function that shortens some inputs must lengthen others. We therefore strive for the next best thing: a high probability of compression.

One strategy is to map each symbol to some nonempty string of bits \(\{0, 1\}^+\) and assign short codes to probable symbols and longer codes to improbable symbols, as Morse code does. Such a scheme is known as a symbol code:

type Code = [(Char, [Bool])]
type RandomVariable a = [(a, Double)]
codeword c x = maybe (error "bad symbol") id $ lookup x c

As before \(\mathcal{A}_X\) denotes the set of outcomes of \(X\).

We encode a message in \(\mathcal{A}_X^+\) with \(C\) via the extended code \(C^+\), which concatenates the codewords of the symbols.

extendedCode c s = concatMap (codeword c) s

A code is uniquely decodable if this function is injective. One way to ensure a code is uniquely decodable is to stipulate that no codeword is the prefix of another codeword, that is, the code is a prefix code. This also simplifies decoding.

isPrefixCode c = not $ or
  [s `isPrefixOf` t | (x, s) <- c, (y, t) <- c, x /= y]
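For instance, here is a small prefix code over three symbols; the name exampleCode and the codewords are ours, chosen purely for illustration:

exampleCode :: Code
exampleCode = [('a', [False]), ('b', [True, False]), ('c', [True, True])]
-- ghci> isPrefixCode exampleCode
-- True
-- ghci> map fromEnum $ extendedCode exampleCode "abca"
-- [0,1,0,1,1,0]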

The expected length \(L(C,X)\) of a symbol code \(C\) for a random variable \(X\) is the average encoding length:

\[ L(C,X) = \sum P(x) l(x) [x \in \mathcal{A}_X] \]

where \(l(x)\) is the length of \(x\) encoded using \(C\).

mean :: (a -> Double -> Double) -> RandomVariable a -> Double
mean f rv = sum [p * f x p | (x, p) <- rv]
expectedLength c rv = mean (\x _ -> genericLength $ codeword c x) rv
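For instance, pairing exampleCode from above with an invented distribution where a is the most probable symbol:

exampleRV :: RandomVariable Char
exampleRV = [('a', 0.5), ('b', 0.25), ('c', 0.25)]
-- ghci> expectedLength exampleCode exampleRV
-- 1.5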

We seek to minimize the expected length.

Kraft-McMillan Inequality

Unique decodability imposes stringent limits on the lengths of our codewords. We can only shorten some codewords at the expense of lengthening others, a balancing act quantified by the following.

The Kraft-McMillan inequality: if \(C\) is a uniquely decodable code for \(X\) then

\[ \sum 2^{-l(x)} [x \in \mathcal{A}_X] \le 1 \]

lhsKraft ls = sum [2^^(-l) | l <- ls]
isKraft ls = lhsKraft ls <= 1
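For example, with a couple of arbitrarily chosen length lists: lengths 1, 2, 2 exactly saturate the bound, while four 1-bit codewords are impossible.

-- ghci> lhsKraft [1,2,2]
-- 1.0
-- ghci> isKraft [1,1,1,1]
-- False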

Moreover, given a set of lengths satisfying this inequality, we can construct a corresponding prefix code.

Proof: Define \(z(X) = \sum 2^{-l(x)} [x \in \mathcal{A}_X]\). We wish to show \(z(X) \le 1\).

Let \(N\) be a constant natural. Consider:

\[ z(X^N) = \sum 2^{-l(x)} [x \in \mathcal{A}_{X^N}] \]

We count this in two different ways. Firstly, since the length of an encoded string is the sum of the lengths of its symbols, the sum over \(\mathcal{A}_{X^N}\) factors into \(N\) copies of the sum over \(\mathcal{A}_X\), so \(z(X^N) = z(X)^N\).

Secondly, if \(a(N, k)\) denotes the number of strings of length \(N\) with a \(k\)-bit encoding, then:

\[ z(X^N) = \sum 2^{-k} a(N, k) \]

Since \(C\) is uniquely decodable, \(a(N, k) \le 2^k\) for all \(k\). Furthermore, \(a(N, k) = 0\) for \(k > N l_\max\) where

\[ l_\max = \max \{ l(x) : x \in \mathcal{A}_X \} \]

Since each term \(2^{-k} a(N, k)\) is at most 1 and there are at most \(N l_\max\) nonzero terms:

\[ \sum 2^{-k} a(N, k) \le N l_\max \]

Combining the two:

\[ z(X)^N = z(X^N) \le N l_\max \]

The right-hand side grows linearly with \(N\). Hence \(z(X) \le 1\), otherwise the left-hand side grows exponentially with \(N\) and must eventually exceed the right-hand side.

Conversely, suppose we are given lengths satisfying the Kraft inequality.

Sort them in increasing order and group together identical lengths, so that we have \(a_k\) copies of the length \(l_k\) and \(l_0 < \dots < l_n\).

Reset a binary counter of length \(l_0\). Pick its first \(a_0\) values to be codewords.

For \(0 < k \le n\), suppose we have already generated codewords of lengths \(l_i\) for all \(i < k\), and that we have a binary counter showing some value. Increment the counter, append \(l_k - l_{k-1}\) zero bits to the counter, then pick the next \(a_k\) values of the counter to be codewords.

fromKraft lengths = map reverse $ go 0 [] $ group $ sort lengths where
  -- The counter is kept least significant bit first; the final `map reverse`
  -- flips each value into a codeword read left to right.
  go _ _ [] = []
  go k ctr (ls@(l:_):rest) = xs ++ go l ctr' rest where
    -- Extend the counter by l - k zero bits, take the next (length ls)
    -- values as codewords, and hand the following value to the next group.
    (xs, ctr':_) = splitAt (length ls)
      $ iterate inc $ replicate (l - k) False ++ ctr
    inc [] = []
    inc (False:bs) = True:bs
    inc (True:bs) = False:inc bs
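As a quick check on arbitrarily chosen lengths, asking for codewords of lengths 1, 2, 2 yields 0, 10, 11:

-- ghci> map (map fromEnum) $ fromKraft [1,2,2]
-- [[0],[1,0],[1,1]]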

The binary counter ensures our code is a prefix code, and the Kraft inequality guarantees the counter never overflows. To see the latter, write the inequality in terms of the grouped lengths:

\[ a_0 2^{-l_0} + a_1 2^{-l_1} + \dots + a_n 2^{-l_n} \le 1 \]

Multiplying both sides by \(2^{l_0}\) yields:

\[ a_0 + a_1 2^{l_0 - l_1} + \dots + a_n 2^{l_0 - l_n} \le 2^{l_0} \]

Thus an \(l_0\)-bit counter will provide enough distinct values to get started. Multiplying further by \(2^{l_1 - l_0}\) and moving the \(a_0\) term to the right-hand side:

\[ a_1 + a_2 2^{l_1 - l_2} + \dots + a_n 2^{l_1 - l_n} \le 2^{l_1} - a_0 2^{l_1 - l_0} \]

which means an \(l_1\)-bit counter can supply enough values for \(a_1\) codewords of length \(l_1\) even after accounting for those values forbidden by the \(a_0\) codewords of length \(l_0\). We induct to show this for all \(k\).

If we have equality then \(C\) is called a complete code.

Gibbs’ inequality

Given two probability distributions \(P(x), Q(x)\) over the same alphabet \(\mathcal{A}_X\), the relative entropy or Kullback-Leibler divergence between them is:

\[ D_{KL}(P || Q) = \sum P(x) \lg \frac{P(x)}{Q(x)} \]

lg = logBase 2
prob rv x = maybe (error "bad symbol") id $ lookup x rv
relativeEntropy rvP rvQ = mean (\x p -> lg $ p / prob rvQ x) rvP
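For example, take two invented coin distributions (the names fairCoin and foulCoin are ours):

fairCoin, foulCoin :: RandomVariable Char
fairCoin = [('h', 0.5), ('t', 0.5)]
foulCoin = [('h', 0.9), ('t', 0.1)]
-- relativeEntropy foulCoin fairCoin is about 0.531 bits;
-- relativeEntropy fairCoin foulCoin is about 0.737 bits.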

Relative entropy is not necessarily symmetric. But it does satisfy:

Gibbs’ inequality:

\[ D_{KL}(P || Q) \ge 0 \]

with equality if and only if \(P = Q\).

Proof: Apply Jensen’s inequality (stated below) to the convex function \(f(t) = -\lg t\) and the ratio \(Q(x)/P(x)\), taking expectations with respect to \(P\):

\[ D_{KL}(P || Q) = E\left[-\lg \frac{Q(x)}{P(x)}\right] \ge -\lg E\left[\frac{Q(x)}{P(x)}\right] = -\lg \sum Q(x) = 0 \]

Equality requires \(Q(x)/P(x)\) to be constant, which forces \(P = Q\).

Jensen’s inequality: Let \(f\) be a convex function. Then \(E[f(X)] \ge f(E[X])\).

Limits of compression

Let \(z = \sum 2^{-l(x)} [x \in \mathcal{A}_X]\). The implicit probabilities defined by codelengths are given by \[ Q(x) = 2^{-l(x)} / z \] for all \(x \in \mathcal{A}_X\).

Then:

\[ L(C,X) = \sum P(x) l(x) = \sum -P(x) \lg Q(x) - \lg z \]

By Gibbs’ inequality and the Kraft inequality:

\[ L(C,X) \ge \sum -P(x) \lg P(x) - \lg z \ge H(X) \]

and \(L(C, X) = H(X)\) if and only if \(z = 1\) and \(l(x) = \lg 1/P(x) = h(x)\).

In other words, the expected length of any code is bounded below by \(H(X)\). We attain \(H(X)\) when \(P(x) = Q(x)\) for all \(x\).
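For instance, the implicit probabilities of exampleCode from earlier are \(1/2\) for a and \(1/4\) each for b and c, with \(z = 1\); these match exampleRV exactly, so that code attains the entropy \(H(X) = 1.5\) bits on that distribution.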

Theorem: For any random variable \(X\) there exists a prefix code \(C\) such that:

\[ H(X) \le L(C,X) < H(X) + 1 \]

Proof: Set \(l(x) = \lceil h(x) \rceil\). Then

\[ \sum 2^{-l(x)} \le \sum 2^{-h(x)} = \sum P(x) = 1\]

Since the Kraft inequality is satisfied, a prefix code with these lengths exists. Also:

\[ L(C,X) = \sum P(x) \lceil h(x) \rceil < \sum P(x) (h(x) + 1) = H(X) + 1 \]

-- fromKraft returns codewords in order of increasing length, so pair them
-- with the symbols in order of decreasing probability.
ceilingCode rv = zip (reverse $ fst <$> sortOn snd rv) ws where
  ws = fromKraft $ map (\(_, p) -> ceiling $ shannon p) rv

This code may not be optimal. For example, suppose the symbols A, B, C each have a 1/3 chance of occurring. Since \(\lceil \lg 3 \rceil = 2\), we’d assign a 2-bit codeword to each symbol, but we can easily do better by changing one of them to a 1-bit codeword.
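We can check this in code; the distribution thirds below is our stand-in for this example:

thirds :: RandomVariable Char
thirds = [('A', 1/3), ('B', 1/3), ('C', 1/3)]
-- Every symbol receives a ceiling (lg 3) = 2-bit codeword:
-- ghci> expectedLength (ceilingCode thirds) thirds
-- 2.0
-- An optimal code such as A -> 0, B -> 10, C -> 11 averages 5/3 bits per
-- symbol, while the entropy lg 3 is about 1.58 bits.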

We’ve seen that for lossy compression, we can achieve \(H(X)\) bits per symbol and no better. Lossless compression turns out to be similar: we can do no better than \(H(X)\) bits per symbol, and we can always do better than \(H(X) + 1\) bits per symbol.

Unlike lossy compression, these bounds aren’t just theory. The above algorithm is practical, and in fact, there is a better algorithm that always produces an optimal symbol code.

Huffman

What can we say about an optimal prefix code for \(X\), that is, a prefix code \(C\) that minimizes \(L(C,X)\)?

If there is only one symbol then finding an optimal code is trivial, so assume at least two symbols. Take the least probable symbol \(x_0\), breaking any ties arbitrarily. To minimize \(L(C,X)\), we must give \(x_0\) one of the longest codewords, that is, the codeword \(c\) for \(x_0\) must have length:

\[ l_\max = \max \{ l(x) : x \in \mathcal{A}_X \} \]

If we remove the last bit of \(c\), the result must be a prefix of some other codeword of \(C\), otherwise we could drop the last bit to get a better code than \(C\); a contradiction. Since \(c\) has length \(l_\max\), this other codeword must be \(c\) with its last bit flipped.

Label these two codewords \(c_0, c_1\), and let \(c'\) be their common prefix. Take the second least probable symbol \(x_1\), again breaking any ties arbitrarily. If \(x_0, x_1\) do not already correspond to \(c_0, c_1\), we swap codewords so they do. This preserves \(L(C,X)\) because \(C\) is optimal; all such swaps must involve codewords of length \(l_\max\).

Consider the random variable \(X'\) that is identical to \(X\) except that instead of the symbols \(x_0\) and \(x_1\) with probabilities \(p_0\) and \(p_1\), we have a single symbol \(x'\) with probability \(p_0 + p_1\). We may encode \(X'\) with a code \(C'\) that is identical to \(C\) except we use \(c'\) to encode \(x'\). Then \(L(C,X) = L(C',X') + p_0 + p_1\).

The code \(C'\) is also optimal. For suppose \(L(D',X') < L(C',X')\) for some code \(D'\). Then let \(d'\) be the encoding of \(x'\) under \(D'\). If we map \(x_0\) and \(x_1\) to \(d'\) followed by one bit, we have a code \(D\) for \(X\) satisfying:

\[ L(D,X) = L(D',X') + p_0 + p_1 < L(C',X') + p_0 + p_1 = L(C,X) \]

which is impossible since \(C\) is optimal.

These observations lead to Huffman coding, an algorithm that produces an optimal symbol code. Starting with the symbols of \(X\), repeatedly take the two least probable symbols, breaking ties arbitrarily, label them 0 and 1, then treat them as a single symbol with their combined probability.

Eventually, only one symbol remains. The codeword for a symbol of \(X\) can be recovered by reading the labels as we retrace our footsteps. This algorithm is usually described as a bottom-up construction of a binary tree, which is reflected in our Haskell.

huffman rv = go $ map (\(x, p) -> (Node x [], p)) rv where
  -- Repeatedly merge the two least probable trees until one remains.
  go [(root, _)] = walk id root
  go ts = go $ (Node undefined [a, b], p + q):rest where
    ([(a, p), (b, q)], rest) = splitAt 2 $ sortOn snd ts
  -- Read off codewords; `pre` accumulates the path bits as a difference list.
  walk pre (Node x []) = [(x, pre [])]
  walk pre (Node _ [a, b]) = walk (pre . (False:)) a
                          ++ walk (pre . (True:))  b
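Trying it on the invented exampleRV from earlier, the least probable symbols b and c receive 2-bit codewords and a receives a 1-bit codeword (which bits go to which symbol depends on tie-breaking), and the expected length meets the entropy because the probabilities are powers of two:

-- ghci> map (fmap length) $ huffman exampleRV
-- [('b',2),('c',2),('a',1)]
-- ghci> expectedLength (huffman exampleRV) exampleRV
-- 1.5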

Examples

David MacKay, Information Theory, Inference, and Learning Algorithms, lists several drawbacks of symbol codes, which can be explored in our demo above. We reuse much of our code from before.

english = unlines  -- From Wikipedia.
  [ "a 8167"
  , "b 1492"
  , "c 2782"
  , "d 4253"
  , "e 12702"
  , "f 2228"
  , "g 2015"
  , "h 6094"
  , "i 6966"
  , "j 0153"
  , "k 0772"
  , "l 4025"
  , "m 2406"
  , "n 6749"
  , "o 7507"
  , "p 1929"
  , "q 0095"
  , "r 5987"
  , "s 6327"
  , "t 9056"
  , "u 2758"
  , "v 0978"
  , "w 2360"
  , "x 0150"
  , "y 1974"
  , "z 0077"
  ]

twoDice = unlines
  [ "2 1"
  , "3 2"
  , "4 3"
  , "5 4"
  , "6 5"
  , "7 6"
  , "8 5"
  , "9 4"
  , "a 3"
  , "b 2"
  , "c 1"
  ]

parse :: String -> RandomVariable Char
parse s = (\(c, f) -> (c, fromIntegral f / fromIntegral total)) <$> tab
  where
  tab = catMaybes $ charFreq . filter (/= ' ') <$> lines s
  total = sum $ snd <$> tab
  charFreq "" = Nothing
  charFreq (c:n) = Just (c, read n)

entropy :: RandomVariable a -> Double
entropy rv = mean (\_ p -> shannon p) rv

shannon p = - lg p
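A couple of sanity checks with the coin distributions defined earlier: a fair coin carries exactly one bit per toss, the 90/10 coin far less.

-- ghci> entropy fairCoin
-- 1.0
-- entropy foulCoin is about 0.469 bits.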

-- Cumulative probabilities, for sampling by inversion.
cumu rv = scanl1 (\(_,p) (b,q) -> (b,p + q)) $ sortOn snd rv

-- Draw n symbols: for each, pick a uniform r between 0 and 1 and take the
-- first symbol whose cumulative probability reaches r.
roll rv n g
  | n == 0 = []
  | otherwise = x:roll rv (n - 1) g'
  where
  (r, g') = random g
  x = fst $ head $ dropWhile ((< r) . snd) $ cumu rv

showCodeword = concatMap $ show . fromEnum

showHuffman :: [(Char, [Bool])] -> String
showHuffman tab = unlines $ map (\(c, w) -> c:' ':showCodeword w) tab

go code g s = let
  rv = parse s
  x = roll rv 100 g
  cws = code rv
  enc = concatMap (\c -> maybe undefined showCodeword $ lookup c cws) x
  in unlines
  [ "H(X) = " ++ show (entropy rv)
  , "expected length of code: " ++ show (expectedLength cws rv)
  , "sample X^100: " ++ x
  , "length in bits: " ++  show (length enc)
  , enc
  , "code:"
  , showHuffman cws
  ]

#ifdef __HASTE__
main :: IO ()
main = withElems ["in", "out", "shannon", "huffman"] $ \[iEl, oEl, shannonB, huffmanB] -> do
  let
    setInp s = setProp iEl "value" s >> setProp oEl "value" ""
    handle buttonId s = do
      Just b <- elemById buttonId
      void $ b `onEvent` Click $ const $ setInp s
  setInp english
  handle "english" english
  handle "fair" "h 1\nt 1\n"
  handle "foul" "h 9\nt 1\n"
  handle "foul2" "H 81\nh 9\nt 9\nT 1\n"
  handle "2d6" twoDice
  handle "roshambo" "R 1\nP 1\nS 1\n"
  let
    showOff code = do
      g <- newStdGen
      setProp oEl "value" . go code g =<< getProp iEl "value"
  void $ huffmanB `onEvent` Click $ const $ showOff huffman
  void $ shannonB `onEvent` Click $ const $ showOff ceilingCode
#endif

Ben Lynn blynn@cs.stanford.edu 💡