-- lg is the base-2 logarithm, built from an inverse hyperbolic cotangent series.
acoth k = sum $ take 16 $ zipWith (*) denoms $ iterate ((r*r)*) r where
  r = 1/k
  denoms = recip <$> iterate (2+) 1.0
log2 = 18*acoth 26 - 2*acoth 4801 + 8*acoth 8749
lg x
  | x == 1 = 0
  | x < 1 = lg (x*2) - 1
  | x >= 2 = lg (x/2) + 1
  | otherwise = 2 * acoth ((x + 1)/(x - 1)) / log2
-- The information content, in bits, of an outcome with probability p.
shannon p = -(lg p)

-- A random variable is a list of (outcome, probability) pairs.
type RandomVariable a = [(a, Double)]

-- Expected value of f applied to the outcome probabilities.
mean :: (Double -> Double) -> RandomVariable a -> Double
mean f rv = sum [p * f p | (_, p) <- rv]

entropy :: RandomVariable a -> Double
entropy rv = mean shannon rv

size = fromIntegral . length
-- Raw bit content: lg of the number of outcomes.
rawBitContent rv = lg $ size rv

-- Approximate relative frequencies of letters in English text.
englishSrc = unlines
  [ "a 8167"
  , "b 1492"
  , "c 2782"
  , "d 4253"
  , "e 12702"
  , "f 2228"
  , "g 2015"
  , "h 6094"
  , "i 6966"
  , "j 0153"
  , "k 0772"
  , "l 4025"
  , "m 2406"
  , "n 6749"
  , "o 7507"
  , "p 1929"
  , "q 0095"
  , "r 5987"
  , "s 6327"
  , "t 9056"
  , "u 2758"
  , "v 0978"
  , "w 2360"
  , "x 0150"
  , "y 1974"
  , "z 0077"
  ]

parse :: String -> RandomVariable Char
parse s = second divTotal <$> tab where
  tab = [(c, fromIntegral $ readInteger s) | [[c], s] <- words <$> lines s]
  divTotal = (/ sum (snd <$> tab))
english = parse englishSrc

cumu rv = scanl1 (\(_,p) (b,q) -> (b,p + q)) $ sortOn snd rv
sample rv r = fst $ head $ dropWhile ((< r) . snd) $ cumu rv

-- A uniform random number in [0, 1), read from the browser’s Math.random.
parseFrac = uncurry (/) . foldl (\(a, b) d -> (a*10 + fromIntegral (ord d - ord '0'), b*10)) (0.0, 1)
jsrandom = parseFrac . drop 2 <$> jsEval "Math.random()"

-- Definitions cribbed from GHC libraries.
data Tree a = Node { rootLabel :: a, subForest :: [Tree a] }

sortOn f = \case
  [] -> []
  x:xt -> sortOn f (filter ((<= fx) . f) xt) ++ [x] ++ sortOn f (filter ((> fx) . f) xt) where fx = f x
sort = sortOn id

isPrefixOf [] _ = True
isPrefixOf _ [] = False
isPrefixOf (x:xs) (y:ys) = x == y && isPrefixOf xs ys

groupBy :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy p' = \case
  [] -> []
  x':xs' -> (x' : ys') : zs' where
    (ys',zs') = go p' x' xs'
    go p z = \case
      [] -> ([], [])
      x:xs
        | p z x -> (x : ys, zs)
        | otherwise -> ([], (x : ys) : zs)
        where (ys, zs) = go p x xs
group = groupBy (==)

jsEval "hideshow('reuse');"
Symbol Codes
Having explored lossy data compression, we turn our attention to lossless data compression. What’s the best we can do?
Unconditional lossless compression is impossible. An invertible function that shortens some inputs must lengthen others. We therefore strive for the next best thing: a high probability of compression.
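A quick tally of our own makes the pigeonhole argument concrete: there are \(2^n\) bit strings of length \(n\), but only \(2^n - 1\) strictly shorter ones, even counting the empty string, so no injection can shrink every length-\(n\) input.

[(n, 2^n, sum [2^k | k <- [0..n-1]]) | n <- [1..6]]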
Again, we follow David MacKay, Information Theory, Inference, and Learning Algorithms, and reuse code from a previous page, as well as some definitions cribbed from GHC libraries; see the block at the top of this page.
A simple strategy for lossless compression is to map each symbol to some nonempty string of bits \(\{0, 1\}^+\) and assign short codes to probable symbols and longer codes to improbable symbols, as Morse code does. Such a scheme is known as a symbol code:
codeword c x = maybe (error "bad symbol") id $ lookup x c
As before \(\mathcal{A}_X\) denotes the set of outcomes of \(X\).
We encode a message in \(\mathcal{A}_X^+\) with \(C\) via the extended code \(C^+\) which concatenates the codes of each symbol.
extendedCode c s = concatMap (codeword c) s
Unlike Morse code, our encoding leaves no gaps between symbols, so we must worry about ambiguity when a message contains multiple symbols.
A code is uniquely decodable if extendedCode
is injective.
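For instance, the following code of our own devising is not uniquely decodable: the distinct messages "ba" and "ac" encode to the same bit string.

extendedCode [('a', "0"), ('b', "01"), ('c', "10")] "ba"
extendedCode [('a', "0"), ('b', "01"), ('c', "10")] "ac"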
One way to ensure a code is uniquely decodable is to stipulate that no codeword
is the prefix of another codeword, that is, the code is a prefix code. This
also simplifies decoding.
isPrefixCode c = not $ or [s `isPrefixOf` t | (x, s) <- c, (y, t) <- c, x /= y]

-- This code is uniquely decodable, but not prefix:
isPrefixCode [('a', "0"), ('b', "01")]
extendedCode [('a', "0"), ('b', "01")] "abba"
The expected length \(L(C,X)\) for a symbol code \(C\) for a random variable \(X\) is the average encoding length:
\[ L(C,X) = \sum P(x) l(x) [x \in \mathcal{A}_X] \]
where \(l(x)\) is the length of \(x\) encoded using \(C\).
meanX :: (a -> Double -> Double) -> RandomVariable a -> Double
meanX f rv = sum [p * f x p | (x, p) <- rv]

expectedLength c rv = meanX (\x _ -> size $ codeword c x) rv
We seek to minimize the expected length.
Kraft-McMillan Inequality
Unique decodability imposes stringent limits on the lengths of our codewords. We can only shorten some codewords at the expense of lengthening others, a balancing act quantified by the following.
The Kraft-McMillan inequality: if \(C\) is a uniquely decodable code for \(X\) then
\[ \sum 2^{-l(x)} [x \in \mathcal{A}_X] \le 1 \]
kraftSum = sum . map (0.5^)

kraftSum [2,3,1,3]
kraftSum [2,3,1,2]  -- No uniquely decodable code can have these lengths.
Moreover, given a set of lengths satisfying this inequality, we can construct a corresponding prefix code.
Proof: Define \(z(X) = \sum 2^{-l(x)} [x \in \mathcal{A}_X]\). We wish to show \(z(X) \le 1\).
Let \(N\) be a fixed natural number. Consider:
\[ z(X^N) = \sum 2^{-l(x)} [x \in \mathcal{A}_{X^N}] \]
We count this in two different ways. Firstly, by basic combinatorics, \(z(X^N) = z(X)^N\).
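The extended code adds symbol lengths, so the sum over \(\mathcal{A}_{X^N}\) factors into a product of \(N\) copies of \(z(X)\). Here is a quick numeric check for \(N = 2\), using example lengths of our own:

(kraftSum [a + b | a <- [2,3,1], b <- [2,3,1]], kraftSum [2,3,1] ^ 2)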
Secondly, if \(a(N, k)\) denotes the number of strings of length \(N\) with a \(k\)-bit encoding, then:
\[ z(X^N) = \sum 2^{-k} a(N, k) \]
Since \(C\) is uniquely decodable, \(a(N, k) \le 2^k\) for all \(k\). Furthermore, \(a(N, k) = 0\) for \(k > N l_\max\) where
\[ l_\max = \max \{ l(x) : x \in \mathcal{A}_X \} \]
Thus:
\[ \sum 2^{-k} a(N, k) \le N l_\max \]
Combining the two:
\[ z(X)^N = z(X^N) \le N l_\max \]
The right-hand side grows linearly with \(N\). Hence \(z(X) \le 1\), otherwise the left-hand side grows exponentially with \(N\) and must eventually exceed the right-hand side.
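To make the growth rates concrete, take our own example lengths [2,3,1,2], whose Kraft sum is 1.125 and whose longest length is 3; the exponential left-hand side soon overtakes \(N l_\max\):

[(n, kraftSum [2,3,1,2] ^ n, n * 3) | n <- [1, 25, 50, 75]]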
Conversely, suppose we are given lengths satisfying the Kraft inequality.
Sort them in increasing order and group together identical lengths, so that we have \(a_k\) copies of the length \(l_k\) and \(l_0 < … < l_n\).
Reset a binary counter of length \(l_0\). Pick its first \(a_0\) values to be codewords.
For \(0 < k \le n\), suppose we have already generated codewords of lengths \(l_i\) for all \(i < k\), and that we have a binary counter showing some value. Increment the counter, append \(l_k - l_{k-1}\) zero bits to the counter, then pick the next \(a_k\) values of the counter to be codewords.
-- Bit strings are built least-significant-bit first; `map reverse` restores
-- the usual order. `inc` increments such a little-endian counter.
fromKraft lengths = map reverse $ go 0 [] $ group $ sort lengths where
  go _ _ [] = []
  go k ctr (ls@(l:_):rest) = xs ++ go l ctr' rest where
    (xs, ctr':_) = splitAt (length ls) $ iterate inc $ replicate (l - k) '0' ++ ctr
  inc [] = []
  inc ('0':bs) = '1':bs
  inc ('1':bs) = '0':inc bs

fromKraft [1,1]
fromKraft [1,2,2]
fromKraft [2,2,3,3,3,4,4]
The binary counter ensures our code is prefix, and the Kraft inequality guarantees the binary counter never overflows. To see why, write out the Kraft inequality:
\[ a_0 2^{-l_0} + a_1 2^{-l_1} + … + a_n 2^{-l_n} \le 1 \]
Multiplying both sides by \(2^{l_0}\) yields:
\[ a_0 + a_1 2^{l_0 - l_1} + … + a_n 2^{l_0 - l_n} \le 2^{l_0} \]
Thus an \(l_0\)-bit counter will provide enough distinct values to get started. Multiplying further by \(2^{l_1 - l_0}\) and moving the \(a_0\) term to the right-hand side:
\[ a_1 + a_2 2^{l_1-l_2} + … + a_n 2^{l_1 - l_n} \le 2^{l_1} - a_0 2^{l_1 - l_0} \]
which means an \(l_1\)-bit counter can supply enough values for \(a_1\) codewords of length \(l_1\) even after accounting for those values forbidden by the \(a_0\) codewords of length \(l_0\). We induct to show this for all \(k\).
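We can spot-check the construction: the generated codewords should have exactly the requested lengths, and no codeword should be a prefix of another. (The integer labels below are dummies so we can reuse isPrefixCode.)

let ws = fromKraft [2,2,3,3,3,4,4] in (sort $ length <$> ws, isPrefixCode $ zip [0..] ws)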
If we have equality then \(C\) is called a complete code.
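For instance, the lengths [1,2,2] used above saturate the Kraft inequality, so the corresponding prefix code is complete:

kraftSum [1,2,2]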
Gibbs’ inequality
Given two probability distributions \(P(x), Q(x)\) over the same alphabet \(\mathcal{A}_X\), the relative entropy or Kullback-Leibler divergence between them is:
\[ D_{KL}(P || Q) = \sum P(x) \lg \frac{P(x)}{Q(x)} \]
pr rv x = maybe (error "bad symbol") id $ lookup x rv

relativeEntropy rvP rvQ = meanX (\x p -> lg $ p / pr rvQ x) rvP
Relative entropy is not necessarily symmetric. But it does satisfy:
Gibbs’ inequality:
\[ D_{KL}(P || Q) \ge 0 \]
with equality if and only if \(P = Q\).
Proof: Apply Jensen’s inequality with the convex function \(f(t) = -\lg t\) to the ratio \(Q(x)/P(x)\), taking expectations with respect to \(P\):

\[ D_{KL}(P || Q) = \sum P(x) \left(-\lg \frac{Q(x)}{P(x)}\right) \ge -\lg \sum P(x) \frac{Q(x)}{P(x)} = -\lg \sum Q(x) \ge -\lg 1 = 0 \]

Equality requires the ratio \(Q(x)/P(x)\) to be constant, and since both distributions sum to 1, that constant is 1, that is, \(P = Q\).
Jensen’s inequality: Let \(f\) be a convex function. Then \(E[f(X)] \ge f(E[X])\).
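To illustrate both claims, here is a comparison of our own between english and a hypothetical uniform distribution uniformAZ over the same 26 letters: the divergence is positive in both directions, but the two directions differ.

uniformAZ = zip ['a'..'z'] $ repeat (1/26)

relativeEntropy english uniformAZ
relativeEntropy uniformAZ english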
Limits of compression
Let \(z = \sum 2^{-l(x)} [x \in \mathcal{A}_X]\). The implicit probabilities defined by codelengths are given by \[ Q(x) = 2^{-l(x)} / z \] for all \(x \in \mathcal{A}_X\).
Then:
\[ L(C,X) = \sum P(x) l(x) = \sum -P(x) \lg Q(x) - \lg z \]
By Gibbs’ inequality and the Kraft inequality:
\[ L(C,X) \ge \sum -P(x) \lg P(x) - \lg z \ge H(X) \]
and \(L(C, X) = H(X)\) if and only if \(z = 1\) and \(l(x) = \lg 1/P(x) = h(x)\).
In other words, the expected length of any code is bounded below by \(H(X)\). We attain \(H(X)\) when \(P(x) = Q(x)\) for all \(x\).
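We can verify this decomposition numerically: with \(Q\) as above, \(L(C,X) = H(X) + D_{KL}(P || Q) - \lg z\). The helpers codeQ and checkGap below are our own, and the example is an arbitrary prefix code for a three-symbol distribution; the two components of the result should agree.

-- codeQ recovers the implicit distribution Q of a code; checkGap compares
-- L(C,X) with H(X) + D(P||Q) - lg z.
codeQ c = [(x, p / z) | (x, p) <- raw] where
  raw = [(x, 0.5 ^ length w) | (x, w) <- c]
  z = sum $ snd <$> raw
checkGap c rv = (expectedLength c rv,
  entropy rv + relativeEntropy rv (codeQ c) - lg (kraftSum $ length . snd <$> c))

checkGap [('a', "0"), ('b', "10"), ('c', "110")] (zip "abc" [0.5, 0.3, 0.2])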
Theorem: For any random variable \(X\) there exists a prefix code \(C\) such that:
\[ H(X) \le L(C,X) < H(X) + 1 \]
Proof: Set \(l(x) = \lceil h(x) \rceil\). Then
\[ \sum 2^{-l(x)} \le \sum 2^{-h(x)} = \sum P(x) = 1\]
Since the Kraft inequality is satisfied, a prefix code with these lengths exists. Also:
\[ L(C,X) = \sum P(x) \lceil h(x) \rceil < \sum P(x) (h(x) + 1) = H(X) + 1 \]
ceilingCode rv = zip (reverse $ fst <$> sortOn snd rv) ws where
  ws = fromKraft $ map (\(_, p) -> ceiling $ shannon p) rv

ceilingCode english
isPrefixCode $ ceilingCode english
extendedCode (ceilingCode english) "someenglishwords"
We’ve seen that for lossy compression, we can achieve \(H(X)\) bits per symbol and no better. Lossless compression turns out to be similar: we can do no better than \(H(X)\) bits per symbol, and we can always do better than \(H(X) + 1\) bits per symbol.
Unlike lossy compression, these bounds aren’t just theory: this algorithm, due to Shannon, is practical. However, it may not be optimal. For example, suppose the symbols A, B, C each have a 1/3 chance of occurring. Since \(\lceil \lg 3 \rceil = 2\), we’d assign a 2-bit codeword to each symbol, but we can easily do better by changing one of them to a 1-bit codeword. See also Fano’s method, which suffers from a similar affliction.
uniform3 = zip "ABC" $ repeat (1/3)

ceilingCode uniform3
Happily there is a better algorithm that always produces an optimal symbol code.
Huffman
What can we say about an optimal prefix code for \(X\), that is, a prefix code \(C\) that minimizes \(L(C,X)\)?
If there is only one symbol then finding an optimal code is trivial, so assume at least two symbols. Take the least probable symbol \(x_0\), breaking any ties arbitrarily. To minimize \(L(C,X)\), we must give \(x_0\) one of the longest codewords, that is, the codeword \(c\) for \(x_0\) must have length:
\[ l_\max = \max \{ l(x) : x \in \mathcal{A}_X \} \]
If we remove the last bit of \(c\), we must have the prefix of some other codeword of \(C\), otherwise we could do without the last bit to get a better code than \(C\); a contradiction. Since \(c\) has length \(l_\max\), this other codeword must be \(c\) with its last bit flipped.
Label these two codewords \(c_0, c_1\), and let \(c'\) be their common prefix. Take the second least probable symbol \(x_1\), again breaking any ties arbitrarily. If \(x_0, x_1\) do not already correspond to \(c_0, c_1\), we swap codewords so they do. This preserves \(L(C,X)\) because \(C\) is optimal; all such swaps must involve codewords of length \(l_\max\).
Consider the random variable \(X'\) that is identical to \(X\) except that instead of the symbols \(x_0\) and \(x_1\) with probabilities \(p_0\) and \(p_1\), we have a single symbol \(x'\) with probability \(p_0 + p_1\). We may encode \(X'\) with a code \(C'\) that is identical to \(C\) except we use \(c'\) to encode \(x'\). Then \(L(C,X) = L(C',X') + p_0 + p_1\).
The code \(C'\) is also optimal. For suppose \(L(D',X') < L(C',X')\) for some code \(D'\). Then let \(d'\) be the encoding of \(x'\) under \(D'\). If we map \(x_0\) and \(x_1\) to \(d'\) followed by one bit, we have a code \(D\) for \(X\) satisfying:
\[ L(D,X) = L(D',X') + p_0 + p_1 < L(C',X') + p_0 + p_1 = L(C,X) \]
which is impossible since \(C\) is optimal.
These observations lead to Huffman coding: an algorithm that produces an optimal coding. Starting with the symbols of \(X\), repeatedly take the two most improbable symbols, breaking ties arbitrarily, label them 0 and 1, then treat them as a single symbol with their combined probability.
Eventually, only one symbol remains. The codeword for a symbol of \(X\) can be recovered by reading the labels as we retrace our footsteps. This algorithm is usually described as a bottom-up construction of a binary tree, which is reflected in our Haskell.
-- Repeatedly merge the two least probable trees; once a single tree remains,
-- read codewords off it: 0 for the left branch, 1 for the right.
huffman rv = go $ map (\(x, p) -> (Node x [], p)) rv where
  go [(root, _)] = walk id root
  go ts = go $ (Node undefined [a, b], p + q):rest where
    ([(a, p), (b, q)], rest) = splitAt 2 $ sortOn snd ts
  walk pre (Node x []) = [(x, pre [])]
  walk pre (Node _ [a, b]) = walk (pre . ('0':)) a ++ walk (pre . ('1':)) b

huffman uniform3
huffman english
If we insist that all codewords have the same length, then the Latin alphabet
needs 5 bits per symbol; see the Baudot code and Bacon’s cipher.
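Indeed, the raw bit content of the 26-letter alphabet is just under 5 bits:

rawBitContent english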
For the english
distribution, both Shannon and Huffman codes have less than 5
expected bits per symbol, though Huffman gets much closer to the entropy, which
is the theoretical limit.
expectedLength (ceilingCode english) english
expectedLength (huffman english) english
entropy english
Symbol codes fare poorly on lopsided distributions like the following. The expected length is about 90 times larger than the entropy!
lop = zip "ab" [0.001,0.999]

huffman lop
expectedLength (huffman lop) lop
entropy lop
Schemes like run-length encoding would perform well here, but are forbidden when each symbol must be encoded individually.
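One way around the restriction, sketched below with a pair-of-symbols alphabet of our own, is to Huffman-code blocks of two symbols at a time; even this crude blocking roughly halves the cost per original symbol:

lop2 = [((a, b), p * q) | (a, p) <- lop, (b, q) <- lop]

expectedLength (huffman lop2) lop2 / 2  -- bits per original symbol
entropy lop2 / 2                        -- entropy per original symbol, unchanged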
Nonetheless, Huffman coding is easy to implement and often effective in
practice. We demonstrate on some randomly generated english
strings:
demo n rv showouts = do
  x <- replicateM n $ sample rv <$> jsrandom
  putStrLn $ "message: " ++ x
  let outS = extendedCode (ceilingCode rv) x
      outH = extendedCode (huffman rv) x
      perN = (/ fromIntegral n :: Double)
  putStrLn $ "Shannon: bits per symbol = " ++ show (perN $ size outS)
  when showouts $ putStrLn outS
  putStrLn $ "Huffman: bits per symbol = " ++ show (perN $ size outH)
  when showouts $ putStrLn outH
  putStrLn ""

demo 50 english True
replicateM 3 $ demo 50 english False