Symbol Codes

Having explored lossy data compression, we turn our attention to lossless data compression. What’s the best we can do?

Unconditional lossless compression is impossible. An invertible function that shortens some inputs must lengthen others. We therefore strive for the next best thing: a high probability of compression.

Again, we follow David Mackay, Information Theory, Inference and Learning Algorithms, and reuse code from a previous page, as well as some definitions cribbed from GHC libraries:

▶ Toggle reused code

acoth k = sum $ take 16 $ zipWith (*) denoms $ iterate ((r*r)*) r
  where
  r = 1/k
  denoms = recip <$> iterate (2+) 1.0

log2 = 18*acoth 26 - 2*acoth 4801 + 8*acoth 8749

lg x
  | x == 1 = 0
  | x < 1  = lg (x*2) - 1
  | x >= 2 = lg (x/2) + 1
  | otherwise = 2 * acoth ((x + 1)/(x - 1)) / log2

shannon p = -(lg p)

type RandomVariable a = [(a, Double)]
mean :: (Double -> Double) -> RandomVariable a -> Double
mean f rv = sum [p * f p | (_, p) <- rv]
entropy :: RandomVariable a -> Double
entropy rv = mean shannon rv
size = fromIntegral . length
rawBitContent rv = lg $ size rv

englishSrc = unlines
  [ "a 8167"
  , "b 1492"
  , "c 2782"
  , "d 4253"
  , "e 12702"
  , "f 2228"
  , "g 2015"
  , "h 6094"
  , "i 6966"
  , "j 0153"
  , "k 0772"
  , "l 4025"
  , "m 2406"
  , "n 6749"
  , "o 7507"
  , "p 1929"
  , "q 0095"
  , "r 5987"
  , "s 6327"
  , "t 9056"
  , "u 2758"
  , "v 0978"
  , "w 2360"
  , "x 0150"
  , "y 1974"
  , "z 0077"
  ]

parse :: String -> RandomVariable Char
parse s = second divTotal <$> tab where
  tab = [(c, fromIntegral $ readInteger s) | [[c], s] <- words <$> lines s]
  divTotal = (/ sum (snd <$> tab))

english = parse englishSrc

cumu rv = scanl1 (\(_,p) (b,q) -> (b,p + q)) $ sortOn snd rv
sample rv r = fst $ head $ dropWhile ((< r) . snd) $ cumu rv
parseFrac = uncurry (/) . foldl (\(a, b) d
  -> (a*10 + fromIntegral (ord d - ord '0'), b*10)) (0.0, 1)
jsrandom = parseFrac . drop 2 <$> jsEval "Math.random()"

data Tree a = Node { rootLabel :: a, subForest :: [Tree a] }

sortOn f = \case
  [] -> []
  x:xt -> sortOn f (filter ((<= fx) . f) xt)
    ++ [x] ++ sortOn f (filter ((> fx) . f) xt) where fx = f x
sort = sortOn id

isPrefixOf s t = and $ zipWith (==) s t

groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy p' = \case
  [] -> []
  x':xs' -> (x' : ys') : zs' where
    (ys',zs') = go p' x' xs'
    go p z = \case
      [] -> ([], [])
      x:xs
        | p z x -> (x : ys, zs)
        | otherwise -> ([], (x : ys) : zs)
        where (ys, zs) = go p x xs
group = groupBy (==)

jsEval "hideshow('reuse');"

A simple strategy for lossless compression is to map each symbol to some nonempty string of bits \(\{0, 1\}^+\) and assign short codes to probable symbols and longer codes to improbable symbols, as Morse code does. Such a scheme is known as a symbol code:

codeword c x = maybe (error "bad symbol") id $ lookup x c

As before \(\mathcal{A}_X\) denotes the set of outcomes of \(X\).

We encode a message in \(\mathcal{A}_X^+\) with \(C\) via the extended code \(C^+\) which concatenates the codes of each symbol.

extendedCode c s = concatMap (codeword c) s

Unlike Morse code, there are no gaps between encoded symbols, so we must worry about ambiguity when there are multiple symbols.

A code is uniquely decodable if extendedCode is injective. One way to ensure a code is uniquely decodable is to stipulate that no codeword is the prefix of another codeword, that is, the code is a prefix code. This also simplifies decoding.

isPrefixCode c = not $ or
  [s `isPrefixOf` t | (x, s) <- c, (y, t) <- c, x /= y]

-- This code is uniquely decodable, but not prefix:
isPrefixCode [('a', "0"), ('b', "01")]
extendedCode [('a', "0"), ('b', "01")] "abba"

The expected length \(L(C,X)\) for a symbol code \(C\) for a random variable \(X\) is the average encoding length:

\[ L(C,X) = \sum P(x) l(x) [x \in \mathcal{A}_X] \]

where \(l(x)\) is the length of \(x\) encoded using \(C\).

meanX :: (a -> Double -> Double) -> RandomVariable a -> Double
meanX f rv = sum [p * f x p | (x, p) <- rv]
expectedLength c rv = meanX (\x _ -> size $ codeword c x) rv

We seek to minimize the expected length.

Kraft-McMillan Inequality

Unique decodability imposes stringent limits on the lengths of our codewords. We can only shorten some codewords at the expense of lengthening others, a balancing act quantified by the following.

The Kraft-McMillan inequality: if \(C\) is a uniquely decodable code for \(X\) then

\[ \sum 2^{-l(x)} [x \in \mathcal{A}_X] \le 1 \]

kraftSum = sum . map (0.5^)

kraftSum [2,3,1,3]
kraftSum [2,3,1,2]  -- No uniquely decodable code can have these lengths.

Moreover, given a set of lengths satisfying this inequality, we can construct a corresponding prefix code.

Proof: Define \(z(X) = \sum 2^{-l(x)} [x \in \mathcal{A}_X]\). We wish to show \(z(X) \le 1\).

Let \(N\) be a constant natural. Consider:

\[ z(X^N) = \sum 2^{-l(x)} [x \in \mathcal{A}_{X^N}] \]

We count this in two different ways. Firstly, by basic combinatorics, \(z(X^N) = z(X)^N\).

Secondly, if \(a(N, k)\) denotes the number of strings of length \(N\) with a \(k\)-bit encoding, then:

\[ z(X^N) = \sum 2^{-k} a(N, k) \]

Since \(C\) is uniquely decodable, \(a(N, k) \le 2^k\) for all \(k\). Furthermore, \(a(N, k) = 0\) for \(k > N l_\max\) where

\[ l_\max = \max \{ l(x) : x \in \mathcal{A}_X \} \]

Thus:

\[ \sum 2^{-k} a(N, k) \le N l_\max \]

Combining the two:

\[ z(X)^N = z(X^N) \le N l_\max \]

The right-hand side grows linearly with \(N\). Hence \(z(X) \le 1\), otherwise the left-hand side grows exponentially with \(N\) and must eventually exceed the right-hand side.

Conversely, suppose we are given lengths satisfying the Kraft inequality.

Sort them in increasing order and group together identical lengths, so that we have \(a_k\) copies of the length \(l_k\) and \(l_0 < … < l_n\).

Reset a binary counter of length \(l_0\). Pick its first \(a_0\) values to be codewords.

For \(0 < k \le n\), suppose we have already generated codewords of lengths \(l_i\) for all \(i < k\), and that we have a binary counter showing some value. Increment the counter, append \(l_k - l_{k-1}\) zero bits to the counter, then pick the next \(a_k\) values of the counter to be codewords.

fromKraft lengths = map reverse $ go 0 [] $ group $ sort lengths where
  go _ _ [] = []
  go k ctr (ls@(l:_):rest) = xs ++ go l ctr' rest where
    (xs, ctr':_) = splitAt (length ls)
      $ iterate inc $ replicate (l - k) '0' ++ ctr
    inc [] = []
    inc ('0':bs) = '1':bs
    inc ('1':bs) = '0':inc bs

fromKraft [1,1]
fromKraft [1,2,2]
fromKraft [2,2,3,3,3,4,4]

The binary counter ensures our code is prefix, and the Kraft inequality guarantees the binary counter never overflows. For example:

\[ a_0 2^{-l_0} + a_1 2^{-l_1} + … + a_n 2^{-l_n} \le 1 \]

Mulitplying both sides by \(2^{l_0}\) yields:

\[ a_0 + a_1 2^{l_0 - l_1} + … + a_n 2^{l_0 - l_n} \le 2^{l_0} \]

Thus an \(l_0\)-bit counter will provide enough distinct values to get started. Mulitplying further by \(2^{l_1 - l_0}\):

\[ a_1 + a_2 2^{l_1-l_2} … + a_n 2^{l_1 - l_n} \le 2^{l_1} - a_0 2^{l_1 - l_0} \]

which means an \(l_1\)-bit counter can supply enough values for \(a_1\) codewords of length \(l_1\) even after accounting for those values forbidden by the \(a_0\) codewords of length \(l_0\). We induct to show this for all \(k\).

If we have equality then \(C\) is called a complete code.

Gibb’s inequality

Given two probability distributions \(P(x), Q(x)\) over the same alphabet \(\mathcal{A}_X\), the relative entropy or Kullback-Leibler divergence between them is:

\[ D_{KL}(P || Q) = \sum P(x) \lg \frac{P(x)}{Q(x)} \]

pr rv x = maybe (error "bad symbol") id $ lookup x rv
relativeEntropy rvP rvQ = meanX (\x p -> lg $ p / pr rvQ x) rvP

Relative entropy is not necessarily symmetric. But it does satisfy:

Gibb’s inequality:

\[ D_{KL}(P || Q) \ge 0 \]

with equality if and only if \(P = Q\).

Proof: Apply Jensen’s inequality using the function \(f(u) = - \lg u\) where \(u = Q(x)/P(x)\).

Jensen’s inequality: Let \(f\) be a convex function. Then \(E[f(X)] \ge f(E[X])\).

Limits of compression

Let \(z = \sum 2^{-l(x)} [x \in \mathcal{A}_X]\). The implicit probabilities defined by codelengths are given by \[ Q(x) = 2^{-l(x)} / z \] for all \(x \in \mathcal{A}_X\).

Then:

\[ L(C,X) = \sum P(x) l(x) = \sum -P(x) \lg Q(x) - \lg z \]

By Gibb’s inequality and Kraft’s inequality:

\[ L(C,X) \ge \sum -P(x) \lg P(x) - \lg z \ge H(X) \]

and \(L(C, X) = H(X)\) if and only if \(z = 1\) and \(l(x) = \lg 1/P(x) = h(x)\).

In other words, the expected length of any code is bounded below by \(H(X)\). We attain \(H(X)\) when \(P(x) = Q(x)\) for all \(x\).

Theorem: For any random variable \(X\) there exists a prefix code \(C\) such that:

\[ H(X) \le L(C,X) < H(X) + 1 \]

Proof: Set \(l(x) = \lceil h(x) \rceil\). Then

\[ \sum 2^{-l(x)} \le \sum 2^{-h(x)} = \sum P(x) = 1\]

Since the Kraft inequality is satisfied, a prefix code with these lengths exists. Also:

\[ L(C,X) = \sum P(x) \lceil h(x) \rceil < \sum P(x) (h(x) + 1) = H(X) + 1 \]

ceilingCode rv = zip (reverse $ fst <$> sortOn snd rv) ws where
  ws = fromKraft $ map (\(_, p) -> ceiling $ shannon p) rv

ceilingCode english
isPrefixCode $ ceilingCode english
extendedCode (ceilingCode english) "someenglishwords"

We’ve seen that for lossy compression, we can achieve \(H(X)\) bits per symbol and no better. Lossless compression turns out to be similar: we can do no better than \(H(X)\) bits per symbol, and we can always do better than \(H(X) + 1\) bits per symbol.

Unlikely lossy compression, these bounds aren’t just theory. This algorithm, due to Shannon, is practical. However, it may not be optimal. For example, suppose the symbols A, B, C each have a 1/3 chance of occurring. Since \(\lceil \lg 3 \rceil = 2\), we’d assign a 2-bit codeword to each symbol, but we can easily do better by changing one of them to a 1-bit codeword. See also Fano’s method, which is similarly flawed.

uniform3 = zip "ABC" $ repeat (1/3)

ceilingCode uniform3

Happily there is a better algorithm that always produces an optimal symbol code.

Huffman

What can we say about an optimal prefix code for \(X\)? Namely, a prefix code \(C\) that minimizes \(L(C,X)\)?

If there is only one symbol then finding an optimal code is trivial, so assume at least two symbols. Take the least probable symbol \(x_0\), breaking any ties arbitrarily. To minimize \(L(C,X)\), we must give \(x_0\) one of the longest codewords, that is, the codeword \(c\) for \(x_0\) must have length:

\[ l_\max = \max \{ l(x) : x \in \mathcal{A}_X \} \]

If we remove the last bit of \(c\), we must have the prefix of some other codeword of \(C\), otherwise we could do without the last bit to get a better code than \(C\), a contradiction. Since \(c\) has length \(l_\max\), this other codeword must be \(c\) with its last bit flipped.

Label these two codewords \(c_0, c_1\), and let \(c'\) be their common prefix. Take the second least probable symbol \(x_1\), again breaking any ties arbitrarily. If \(x_0, x_1\) do not already correspond to \(c_0, c_1\), we swap codewords so they do. This preserves \(L(C,X)\) because \(C\) is optimal; all such swaps must involve codewords of length \(l_\max\).

Consider the random variable \(X'\) that is identical to \(X\) except that instead of the symbols \(x_0\) and \(x_1\) with probabilities \(p_0\) and \(p_1\), we have a single symbol \(x'\) with probability \(p_0 + p_1\). We may encode \(X'\) with a code \(C'\) that is identical to \(C\) except we use \(c'\) to encode \(x'\). Then \(L(C,X) = L(C',X') + p_0 + p_1\).

The code \(C'\) is also optimal. For suppose \(L(D',X') < L(C',X')\) for some code \(D'\). Then let \(d'\) be the encoding of \(x'\) under \(D'\). If we map \(x_0\) and \(x_1\) to \(d'\) followed by one bit, we have a code \(D\) for \(X\) satisfying:

\[ L(D,X) = L(D',X') + p_0 + p_1 < L(C',X') + p_0 + p_1 = L(C,X) \]

which is impossible since \(C\) is optimal.

These observations lead to Huffman coding: an algorithm that produces an optimal coding. Starting with the symbols of \(X\), repeatedly take the two most improbable symbols, breaking ties arbitrarily, label them 0 and 1, then treat them as a single symbol with their combined probability.

Eventually, only one symbol remains. The codeword for a symbol of \(X\) can be recovered by reading the labels as we retrace our footsteps. This algorithm is usually described as a bottom-up construction of a binary tree, which is reflected in our Haskell.

huffman rv = go $ map (\(x, p) -> (Node x [], p)) rv where
  go [(root, _)] = walk id root
  go ts = go $ (Node undefined [a, b], p + q):rest where
    ([(a, p), (b, q)], rest) = splitAt 2 $ sortOn snd ts
  walk pre (Node x []) = [(x, pre [])]
  walk pre (Node _ [a, b]) = walk (pre . ('0':)) a
                          ++ walk (pre . ('1':)) b

huffman uniform3
huffman english

If we insist that all codewords have the same length, then the Latin alphabet needs 5 bits per symbol; see the Baudot code and Bacon’s cipher. For the english distribution, both Shannon and Huffman codes have less than 5 expected bits per symbol, though Huffman gets much closer to the entropy, which is the theoretical limit.

expectedLength (ceilingCode english) english
expectedLength (huffman english) english
entropy english

Symbol codes fare poorly on lopsided distributions like the following. The expected length is about 90 times larger than the entropy!

lop = zip "ab" [0.001,0.999]
huffman lop
expectedLength (huffman lop) lop
entropy lop

Schemes like run-length encoding would perform well here, but are forbidden when each symbol must be encoded individually.

Nonetheless, Huffman coding is easy to implement and often effective in practice. We demonstrate on some randomly generated english strings:

demo n rv showouts = do
  x <- replicateM n $ sample rv <$> jsrandom
  putStrLn $ "message: " ++ x
  let
    outS = extendedCode (ceilingCode rv) x
    outH = extendedCode (huffman rv) x
    perN = (/ fromIntegral n :: Double)
  putStrLn $ "Shannon: bits per symbol = " ++ show (perN $ size outS)
  when showouts $ putStrLn outS
  putStrLn $ "Huffman: bits per symbol = " ++ show (perN $ size outH)
  when showouts $ putStrLn outH
  putStrLn ""

demo 50 english True
replicateM 3 $ demo 50 english False

Ben Lynn blynn@cs.stanford.edu 💡

Symbol Codes

Kraft-McMillan Inequality

Gibb’s inequality

Limits of compression

Huffman

Contents