Stream Codes
Symbol codes are tyrannized by the powers of 2. Huffman codes approach optimal compression only when the symbol probabilities happen to be close to negative powers of 2, that is, when the Shannon information contents are close to whole numbers of bits.
Symbol codes can be viewed as partitioning the unit interval into intervals whose sizes are all negative powers of 2 where the lower endpoint is closed and the higher endpoint is open. We assign intervals to symbols. The more probable a symbol, the larger its interval should be, and the fewer the bits needed to represent it. Furthermore, after decoding a symbol, we can view the next bits as partitioning the symbol’s interval in the same proportions that divide the unit interval.
If we imagine prefixing "0." to an encoded message, then we have the lower endpoint of the relevant interval in binary, and the number of significant figures determines the size of the interval. For example, consider the symbol code:
A 0 B 10 C 11
Then we prefix "0." to the codes to find the intervals they represent:
A [0.0, 0.1) B [0.10, 0.11) C [0.11, 1.00)
As decimal fractions, these are [0, 1/2), [1/2,3/4), and [3/4,1).
Continuing our example, "AA" encodes to "00" and "BC" encodes to "1011". These correspond to binary numbers "0.00" and "0.1011". The first represents the interval [0.00, 0.01), and the second is [0.1011, 0.1100).
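We can verify the second by hand: \(0.1011_2 = 11/16\), and scaling C’s interval \([3/4, 1)\) into B’s interval \([1/2, 3/4)\) gives \([1/2 + \tfrac{1}{4}\cdot\tfrac{3}{4},\, 1/2 + \tfrac{1}{4}\cdot 1) = [11/16, 3/4)\), which is exactly \([0.1011, 0.1100)\).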
Let’s call intervals like these Huffman intervals. (I don’t know of any standard names for them.)
Beyond binary
What if we generalize to arbitrary intervals? Then we get arithmetic coding, which turns out to yield (near-)optimal stream codes for any probability distribution.
For recording a sequence of moves in rock-paper-scissors, assuming a uniform distribution, it would be optimal to use base 3, namely divide the unit interval into thirds, then divide each of those into thirds, and so on.
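For instance, mapping rock to \([0, 1/3)\), paper to \([1/3, 2/3)\), and scissors to \([2/3, 1)\), the sequence rock-paper-scissors zooms to \([0, 1/3)\), then \([1/9, 2/9)\), then \([5/27, 6/27)\): after \(k\) moves the interval has width \(1/3^k\), exactly one trit of information per move.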
More generally, the probability distribution of an information source would describe exactly how to partition the unit interval, and lead to an optimal encoding.
This time, we represent probabilities with arbitrary-precision rationals.
type RandomVariable a = [(a, Rational)]
We represent an interval with its lower endpoint and its length. The following converts probabilities to intervals.
intervals :: RandomVariable a -> [(a, (Rational, Rational))]
intervals = tail . scanl accum (undefined, (0, 0)) where
  accum (_, (lo, d)) (x, p) = (x, (lo + d, p))

uniform3 = zip "ABC" $ repeat (1/3)
intervals uniform3
Let’s code! Roughly speaking, we start with the unit interval, and as we encode symbols, we zoom in on sub-intervals of our current interval. See the Dasher zooming interface for a visualization.
zoom :: Eq a => RandomVariable a -> [a] -> (Rational, Rational)
zoom rv = go (0, 1) where
  tab = intervals rv
  go (lo, d) = \case
    sym : rest | Just (lo', d') <- lookup sym tab -> go (lo + d * lo', d * d') rest
    [] -> (lo, d)

zoom uniform3 "A"
zoom uniform3 "AA"
zoom uniform3 "AB"
zoom uniform3 "ABBACACABBA"
Mission accomplished? No, it’s hardly a stream code if we analyze the entire message to encode it. Also, for practical reasons, we wish to output 0s and 1s (though see range encoding).
Since we ought to produce output as we go, we convert what we can to binary on the fly. When both endpoints lie in the same half of the interval, we output one bit then double everything.
We finish by sending enough bits to describe a Huffman interval lying within the final interval.
bitChar b = chr $ ord '0' + b

encRat tab (lo, d) syms
  -- Both endpoints lie in the same half: ship a bit, then double everything.
  | b == 1 || lo + d < 1%2 = bitChar b : encRat tab (2*lo - fromIntegral b, 2*d) syms
  -- Otherwise consume a symbol and zoom into its sub-interval.
  | sym : rest <- syms, Just (lo', d') <- lookup sym tab = encRat tab (lo + d * lo', d * d') rest
  -- The interval is still the whole unit interval: nothing to send.
  | d == 1 = []
  -- Finish by describing a Huffman interval lying within the final interval.
  | otherwise = finish 0
  where
  b = fromEnum $ lo >= 1%2
  finish k
    | 1%2 + 1%2^(k + 1) <= lo + d = '1' : replicate k '0'
    | 1%2 - 1%2^(k + 1) >= lo = '0' : replicate k '1'
    | otherwise = finish $ k + 1

encodeRat :: Eq a => RandomVariable a -> [a] -> String
encodeRat rv = encRat (intervals rv) (0, 1)

encodeRat uniform3 "A"
encodeRat uniform3 "ABBACACABBA"
Decoding inverts this process.
decodeRat :: RandomVariable a -> String -> [a]
decodeRat rv = go (0, 1) where
  tab = intervals rv
  go (lo, d) bs
    -- The current interval fits inside some symbol's interval: emit that
    -- symbol and rescale so its interval becomes the unit interval.
    | Just (c, (lo', d')) <- find f tab = c : go ((lo - lo')/d', d/d') bs
    -- Otherwise read another bit to narrow the interval to one half.
    | b : rest <- bs = go (if b == '1' then lo + d/2 else lo, d/2) rest
    | otherwise = []
    where f (_, (lo', d')) = lo' <= lo && lo + d <= lo' + d'

decodeRat uniform3 "00"
decodeRat uniform3 "0010100001001001000"
Mission accomplished for real? Almost. There are a few loose ends:
- Suppose an interval only takes a few ternary digits (trits) to represent. Does this always imply it only takes a few bits to represent a Huffman interval lying within? More generally, given an interval represented by rationals, do we waste space when converting to binary to describe some Huffman interval it contains?
- We represent the final interval with a Huffman interval within. But what if this Huffman interval is so small that it also lies within some subinterval that could be interpreted as the final interval? It could be unclear when the message ends.
- The precision of our computations grows with the message. Can we limit the precision?
Let’s start with the good news. Let \(I\) be the interval \([0, m)\). Then \(I\) contains \([0, 1/2^n)\) where \(n = \lceil \lg 1/m \rceil\). If we translate \(I\), sometimes we find it contains a Huffman interval of the form \([k/2^n, (k+1)/2^n)\) which only takes \(n\) bits to represent.
But sometimes not, because it straddles a point \(j/2^n\) for some \(j\). Still, at least one of the half-intervals on either side lies in \(I\), that is, the interval \(I\) contains at least one of \([2j / 2^{n+1}, (2j + 1) / 2^{n+1})\) or \([(2j - 1)/2^{n+1}, 2j / 2^{n+1})\), otherwise \(1/2^n\) would exceed \(m\), a contradiction.
In other words, for any interval of length \(m\), we can find a Huffman interval within that we can represent with at most \(\lceil \lg 1/m \rceil + 1\) bits. Thus, the entire encoded message has under 2 bits of overhead.
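We can spot-check this bound with the rational encoder above. (A sketch: checkOverhead is a name we introduce here; for "ABBACACABBA" the final interval has length \((1/3)^{11}\), so the bound works out to \(\lceil 11 \lg 3 \rceil + 1 = 19\) bits.)

-- Compare the length of the encoded message against the bound
-- ceiling(lg(1/m)) + 1, where m is the length of the final interval.
checkOverhead msg = (length $ encodeRat uniform3 msg, bound) where
  m = snd $ zoom uniform3 msg
  bound = ceiling (logBase 2 (fromRational (1/m) :: Double)) + 1

checkOverhead "ABBACACABBA"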
Now the bad news. In general, our decoder cannot determine when to stop decoding. Consider a biased coin that shows heads 9 times out of 10.
foulCoin = [('H', 9 % 10), ('T', 1 % 10)]
intervals foulCoin
We encode heads with [0, 9/10) and tails with [9/10,1), which is optimal. But what do we make of the encoded message "0"? It means \([0,1/2)\), which is contained within:
- \([0,9/10)\), which decodes to "H"
- \([0,(9/10)^2)\), which decodes to "HH"
- …
- \([0,(9/10)^6)\), which decodes to "HHHHHH".
There is no way around this. We must send the message length separately, or add a special end-of-message symbol and assign it some probability.
Our decoder above always chooses the longest possibility.
decodeRat foulCoin "0"
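For example, here is a sketch of the second fix: foulCoinEOM and the '$' end marker are hypothetical additions, with a little probability shaved off the coin faces to make room for the end symbol.

-- Hypothetical distribution with an end-of-message symbol '$'.
foulCoinEOM = [('H', 87 % 100), ('T', 10 % 100), ('$', 3 % 100)]

-- Decode, then stop at the end marker.
takeWhile (/= '$') $ decodeRat foulCoinEOM $ encodeRat foulCoinEOM "HHHHHH$"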
Adaptive encoding
Let’s tackle the third concern: arbitrary precision.
We pick some power of two \(w\), and round our probabilities to the nearest rational whose denominator is \(w\). This is harmless for sufficiently large \(w\), as a tiny difference in probabilities has a negligible effect on compression, and we can now represent endpoints with integers. They are the numerators of fractions with denominator \(w\).
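For instance, with \(w = 2^{16}\), the probability \(1/3\) becomes \(21845/65536\), which differs from \(1/3\) by less than \(2^{-17}\).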
We’ve been sending off the most-significant bits when we can. What if we also lop off least-significant bits? That is, during encoding, we simply drop the low bits so that numbers representing the endpoints stay under \(w\)?
This amounts to perturbing the probability distribution as we go; we can view fixed precision as nudging probabilities so numbers happen to line up with certain powers of two, so we only ever drop zero bits. Again, for large enough \(w\), the tiny changes to the endpoints are harmless provided the interval isn’t too small.
Changing probabilities on the fly is known as adaptive encoding. Apart from letting us get away with fixed precision, adaptive encoding can also improve compression by incorporating more facts about our information source.
For example, in English, "q" is almost always followed by "u", so after encoding a "q" we ought to give "u" a heavy weight in our probability distribution for the next letter. Naturally, the decoder must make the same assumptions.
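As a sketch of what such a model might look like (the toy alphabet and nextDist are hypothetical names; they are not used elsewhere on this page):

-- Conditional distribution over a toy alphabet: after a 'q', give 'u' most
-- of the probability mass. Encoder and decoder both consult this function
-- with the previously processed symbol, so they stay in lockstep.
nextDist :: Maybe Char -> RandomVariable Char
nextDist (Just 'q') = [('q', 1 % 20), ('u', 18 % 20), ('x', 1 % 20)]
nextDist _          = [('q', 6 % 20), ('u', 4 % 20), ('x', 10 % 20)]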
We can also apply adaptive encoding to symbol codes, but the gains are more modest.
Meet Me Halfway
The last of our loose ends has a loose end. We said that we’re fine so long as "the interval isn’t too small". But just how small can intervals get? Do they get so small that dropping the least-significant bits would cause disaster?
Unfortunately, yes. Consider the stream code:
A [0, 1/4) B [1/4, 3/4) C [3/4, 1)
A long string of "B"s leads to an interval that shrinks around 1/2. If eventually followed by an "A", then the encoding should be 0111…, that is, something just below 1/2. On the other hand, if eventually followed by a "C", then the encoding should be 1000…, that is, something just above 1/2.
dilemma = [('A', 1%4), ('B', 1%2), ('C', 1%4)]
encodeRat dilemma "BBBBBBBBBBBBBBBA"
encodeRat dilemma "BBBBBBBBBBBBBBBC"
Until then, the encoder must stay silent, yet as the run of "B"s grows, the straddling interval shrinks, seemingly dooming us to arbitrary precision. See the table-maker’s dilemma, a related problem.
Fortunately, hugging 1/2 is the sole pitfall, for whenever the interval lies to one side, we can send off the most significant bit. Even more fortunately, there is a simple fix. If a straddling interval lies in [1/4, 3/4), we widen it by doubling each endpoint’s distance from 1/2 and increment a counter. When the interval finally chooses a side, we output a 0 followed by as many 1s as the counter, or a 1 followed by as many 0s, accordingly. We then reset the counter.
Thus the interval is always greater than 1/4 in size, and we avoid underflow if we compute with just 2 more bits than the precision used to describe probabilities in the distribution.
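In rational arithmetic, the widening step and the eventual flush might look like the following sketch (expandMiddle and flushPending are illustrative helpers, not used by the fixed-precision code below):

-- Double each endpoint's distance from 1/2; the caller increments the
-- counter of pending bits.
expandMiddle :: (Rational, Rational) -> (Rational, Rational)
expandMiddle (lo, d) = (2*lo - 1%2, 2*d)

-- Once the interval settles into the lower half (b = 0) or upper half
-- (b = 1), emit the decisive bit followed by ctr copies of its opposite.
flushPending :: Int -> Int -> String
flushPending ctr b = bitChar b : replicate ctr (bitChar (1 - b))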
(Symbol codes sidestep this problem because their intervals never straddle 1/2.)
Fixed-precision arithmetic coding
We demonstrate with 16-bit precision, though for clarity we use arithmetic operations rather than bitwise operations. Since the product of two 16-bit values can need up to 32 bits, our implementation actually involves 32-bit operations.
wordSz = 16
We define constants representing the points that divide the unit interval into quarters:
q1, q2, q3, q4 :: Word
q1 = 2^(wordSz - 2)
q2 = 2^(wordSz - 1)
q3 = q1 + q2
q4 = 2^wordSz
We quantize the probability intervals to the form \([a/2^{16}, b/2^{16})\). As above, we store them as the lower endpoint and interval size. Since the denominator is always q4, we just store \((a, b - a)\).
quantize :: [(a, (Rational, Rational))] -> [(a, (Word, Word))]
quantize tab = map (\(c, (x, y)) -> (c, (circa x, underCk $ circa y))) tab where
  circa r = floorRational (fromIntegral q4 * r)
  floorRational q = fromIntegral $ numerator q `div` denominator q
  underCk y
    | y < 4 = error "potential underflow"
    | otherwise = y
We’ve conservatively taken the floor to approximate rationals, which means we may leave a sliver of the unit interval unused, and the approximations may be a touch worse than they ought to be. We could do better by calling round instead of floor and forcing the last interval to (a, q4 - a).
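A sketch of that improvement (quantize' is a name we introduce here; the rest of this page keeps the floor-based version):

-- Round the cumulative endpoints and force the final endpoint to q4, so no
-- sliver of the unit interval goes unused. Very small probabilities could
-- still collapse to empty intervals, so a check like underCk remains wise.
quantize' :: [(a, (Rational, Rational))] -> [(a, (Word, Word))]
quantize' tab = zipWith reassemble tab (zip los (tail los ++ [q4])) where
  reassemble (c, _) (a, b) = (c, (a, b - a))
  los = [round (fromIntegral q4 * x) | (_, (x, _)) <- tab]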
The encoder resembles our arbitrary precision version, except for a counter and an extra case for intervals contained in [1/4, 3/4):
enc :: Eq a => Int -> [(a, (Word, Word))] -> (Word, Word) -> [a] -> String
enc ctr tab (lo, d) syms
  -- Interval lies within [1/4, 3/4): widen about 1/2 and note a pending bit.
  | lo >= q1 && lo + d < q3 = enc (ctr + 1) tab (2*lo - q2, d*2) syms
  -- Interval lies in one half: ship the bit plus any pending opposite bits.
  | b == 1 || lo + d < q2 = bitChar b : replicate ctr (bitChar $ 1 - b) ++ enc 0 tab (2*(lo - fromIntegral b*q2), 2*d) syms
  -- Consume a symbol and zoom into its quantized sub-interval.
  | sym : rest <- syms, Just (lo', d') <- lookup sym tab = enc ctr tab (lo + lo' * d `div` q4, d' * d `div` q4) rest
  -- Whole unit interval remains: flush any pending bits and stop.
  | d == q4 = if ctr == 0 then [] else '0' : replicate ctr '1'
  -- Describe a Huffman interval within the final interval.
  | otherwise = finish 0
  where
  b = fromEnum $ lo >= q2
  finish k
    | q2 + 2^(wordSz - k - 1) <= lo + d = '1' : replicate (ctr + k) '0'
    | q2 - 2^(wordSz - k - 1) >= lo = '0' : replicate (ctr + k) '1'
    | otherwise = finish $ k + 1

encode :: Eq a => RandomVariable a -> [a] -> String
encode rv = enc 0 (quantize $ intervals rv) (0, q4)
We take pains to ensure the decoder recreates the encoder’s state of mind exactly, changing the probability distribution in lockstep. An earlier version confused a strict inequality for one that wasn’t, resulting in erroneous decodings, but only rarely!
The tuple (a, p) represents the interval [a, a + 2^p), and the lower p bits of a are always zero.
dec :: [(a, (Word, Word))] -> (Word, Word) -> (Word, Int) -> String -> [a]
dec tab (lo, d) (a, p) bs
  -- Mirror the encoder's widening about 1/2.
  | lo >= q1 && lo + d < q3 = dec tab (2*lo - q2, 2*d) ((q2 + 2*a) `mod` q4, p + 1) bs
  -- Mirror the encoder's rescaling when the interval lies in one half.
  | b == 1 || lo + d < q2 = dec tab (2*(lo - fromIntegral b*q2), 2*d) (2*a `mod` q4, p + 1) bs
  -- The bit interval [a, a + 2^p) pins down a symbol: emit it and zoom in.
  | Just (c, (lo', d')) <- asum $ try <$> tab = c : dec tab (lo', d') (a, p) bs
  -- Otherwise read another bit to halve the bit interval.
  | h : rest <- bs = dec tab (lo, d) (if h == '1' then a + 2^(p - 1) else a, p - 1) rest
  | otherwise = []
  where
  b = fromEnum $ lo >= q2
  try (c, (loX, dX)) = let
    lo' = lo + loX * d `div` q4
    d' = dX * d `div` q4
    in if lo' <= a && a + 2^p <= lo' + d' then Just (c, (lo', d')) else Nothing

decode :: RandomVariable a -> String -> [a]
decode rv = dec (quantize $ intervals rv) (0, q4) (0, wordSz)
Mission accomplished. Our arithmetic coding demo attains the optimal expected length \(H N\) save for under 2 bits of overhead for binary encoding and the cost of either transmitting the message length or adding an end-of-message symbol with a nonzero probability.
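To see this concretely, we can compare an output length against the Shannon limit (a sketch; entropy is a helper we define here, and the exact output length depends on the quantization and final bits):

-- Entropy in bits per symbol.
entropy :: RandomVariable a -> Double
entropy rv = sum [negate p * logBase 2 p | (_, q) <- rv, let p = fromRational q]

-- About 0.47 bits per symbol, so roughly 47 bits for 100 coin flips; the
-- encoder's output for a typical such message should land close to that.
entropy foulCoin
length $ encode foulCoin $ replicate 90 'H' ++ replicate 10 'T'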
Lempel-Ziv
We explore a stream code of a different nature.
We need a function that splits a list into two in all possible ways:
-- splits = zip <$> inits <*> tails
splits xs = case xs of
  [] -> [([], [])]
  x:xt -> ([], xs) : map (first (x:)) (splits xt)

splits ['a'..'g']
We chunk the input by repeatedly breaking off the shortest prefix we haven’t seen yet.
uPre seen xs = case dropWhile ((`elem` seen) . fst) $ splits xs of
  (as, bs) : _ -> as : uPre (as : seen) bs
  [] -> []

twister = "peterpiperpickedapeckofpickledpeppers"
uPre [] twister
We index these unique prefixes:
putStr $ unlines $ zipWith (\n s -> shows n $ ' ':s) [0..] $ uPre [] twister
How does this help? If we replace each chunk with its index, we get [0,1,2,…], which is clearly impossible to decode!
But what if for each chunk, we drop the last symbol before looking up its index, and also send the symbol that was dropped?
The first chunk is always empty so we skip it.
lzIdea xs = send <$> tail us where
  send u = (maybe undefined id $ lookup (init u) tab, last u)
  us = uPre [] xs
  tab = zip us [0..]

lzIdea twister
This works! A decoder can build the same table as it proceeds: by inductive assumption our table contains all chunks seen so far, thus we can always correctly update the table from an index and a new symbol.
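Here is a sketch of such a decoder (unLzIdea is a name we introduce; it relies on laziness to look up earlier chunks while the table is still being built):

-- Rebuild the chunks from (index, dropped symbol) pairs. Each pair refers
-- only to the empty string or to an earlier chunk, so the knot-tying below
-- is well-founded.
unLzIdea :: [(Int, Char)] -> String
unLzIdea ps = concat chunks where
  chunks = [tab !! i ++ [c] | (i, c) <- ps]
  tab = "" : chunks

unLzIdea (lzIdea twister)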
It remains to specify how the indexes are encoded in binary. Since the sender and receiver have the same table, they both know the minimum number of bits \(m\) required to represent the current largest index. Altogether, we have:
lempelZiv s = go 0 [] 0 [] s [] where
  go m tab i acc = \case
    b:bt -> case lookup s tab of
        Just j -> go m tab j s bt
        Nothing -> showsBin m i . (b:) . go m' ((s, k):tab) 0 [] bt
      where
      s = b:acc  -- Represent a prefix with its reversal.
      k = length tab + 1
      m' = m + fromEnum (2^m == k)
    [] | i == 0 -> id
       | otherwise -> showsBin m i
  showsBin m k
    | m == 0 = id
    | otherwise = showsBin (m - 1) q . (bitChar r:)
    where (q, r) = divMod k 2

unlempelZiv s = go 0 [(0, "")] s [] where
  go m tab = \case
    [] -> id
    xs -> let
        (bs, t) = splitAt m xs
        v = maybe undefined (++) $ lookup (unBin bs) tab
      in case t of
        new:rest -> v . (new:) . go m' tab' rest where
          k = length tab
          m' = m + fromEnum (2^m == k)
          tab' = (k, v [new]):tab
        [] -> v
  unBin = foldl (\n b -> 2*n + fromEnum (b == '1')) 0

lempelZiv "1011010100010"
unlempelZiv $ lempelZiv "1011010100010"
-- MacKay, Exercises 6.5 and 6.6.
lempelZiv "000000000000100000000000"
unlempelZiv "00101011101100100100011010101000011"
The examples are chosen so the last string is absent from the table. We can force this to happen by using a special symbol to mark the end of the message. Or, when the final chunk has already been seen, we can simply finish with the index of the previously seen string that matches it. Our code does the latter.
Lempel-Ziv compression is asymptotically optimal, but only on a galactic scale (for any ergodic source, we eventually approach the entropy of the source, but typically not before we run out of space and time). Indeed, we see that Lempel-Ziv compression often produces output longer than the input on small examples, though we can easily optimize slightly:
- After adding all possibilities to a prefix, the original prefix is never needed again, so its index can be freed. (Taking advantage of this may require tweaking how we deal with the end of the message.)
- In the case of bits, the second time we append a bit to the same prefix, there is no need to send it because its identity is known.
Nonetheless, in practice, Lempel-Ziv is effective, because real data often contains many copies of the same sequence. Also, Lempel-Ziv is easy to implement: it just works. There’s no need to model the data; no need to find relative symbol frequencies; no need to think about how earlier symbols affect the probabilities of future symbols. Fire and forget.
With Huffman or arithmetic coding, we can automate modeling by, say, updating a probability distribution as we encode. But it’s more work.
In my younger days, Lempel-Ziv dominated compression. Many DOS executables I ran were compressed with LZEXE, so they’d decompress themselves in memory before running. GIF files and other widespread media formats relied on a variant of Lempel-Ziv. So did ZIP archives and their competitors like ARC. On Unix, the compress and gzip tools also use Lempel-Ziv.
Then in the 1990s, the Burrows-Wheeler transform was discovered, which can be viewed as a stream code if we think of the alphabet as consisting of giant blocks. As the popularity of bzip2 attests, this algorithm seems as good as Lempel-Ziv for detecting patterns like the same string appearing several times in a block, though other algorithms such as arithmetic coding are required to actually compress the data.
Showdown
We reuse code from previous pages:
englishSrc = unlines
  [ "a 8167", "b 1492", "c 2782", "d 4253", "e 12702", "f 2228", "g 2015"
  , "h 6094", "i 6966", "j 0153", "k 0772", "l 4025", "m 2406", "n 6749"
  , "o 7507", "p 1929", "q 0095", "r 5987", "s 6327", "t 9056", "u 2758"
  , "v 0978", "w 2360", "x 0150", "y 1974", "z 0077" ]

parse :: String -> RandomVariable Char
parse s = second divTotal <$> tab where
  tab = [(c, fromIntegral $ readInteger s) | [[c], s] <- words <$> lines s]
  divTotal = (/ sum (snd <$> tab))

english = parse englishSrc

cumu rv = scanl1 (\(_, p) (b, q) -> (b, p + q)) $ sortOn snd rv
sample rv r = fst $ head $ dropWhile ((< r) . fromRational . snd) $ cumu rv

parseFrac = uncurry (/) . foldl (\(a, b) d -> (a*10 + fromIntegral (ord d - ord '0'), b*10)) (0.0, 1)
jsrandom = parseFrac . drop 2 <$> jsEval "Math.random()"

sortOn f = \case
  [] -> []
  x:xt -> sortOn f (filter ((<= fx) . f) xt) ++ [x] ++ sortOn f (filter ((> fx) . f) xt)
    where fx = f x
sort = sortOn id

data Tree a = Node { rootLabel :: a, subForest :: [Tree a] }

huffman rv = go $ map (\(x, p) -> (Node x [], p)) rv where
  go [(root, _)] = walk id root
  go ts = go $ (Node undefined [a, b], p + q):rest where
    ([(a, p), (b, q)], rest) = splitAt 2 $ sortOn snd ts
  walk pre (Node x []) = [(x, pre [])]
  walk pre (Node _ [a, b]) = walk (pre . ('0':)) a ++ walk (pre . ('1':)) b

codeword c x = maybe (error "bad symbol") id $ lookup x c
extendedCode c s = concatMap (codeword c) s
Huffman coding already works well on our running english example, and can beat arithmetic coding on particular strings by chance. For the latter, we show the decoding, as it may show too many symbols because we have not correctly handled the end of the message.
Lempel-Ziv does poorly because repeated substrings are unlikely, and because we replace each letter with 5 bits (like a Baudot code, or Bacon’s cipher) to imitate how compression tools are often run on ASCII files. But we would see savings on realistic text, and we stress that we can do without a model of the data source.
bps s = fromIntegral (length s) / 50.0
fiver c = replicateM 5 "01" !! (ord c - ord 'a')

do
  let rv = english
  x <- replicateM 50 $ sample rv <$> jsrandom
  let outH = extendedCode (huffman rv) x
      outA = encode rv x
      outLZ = lempelZiv $ fiver =<< x
  putStrLn x
  putStrLn $ "\nHuffman: " ++ show (bps outH) ++ " bits per symbol"
  putStrLn outH
  putStrLn $ "\narithmetic: " ++ show (bps outA) ++ " bits per symbol"
  putStrLn outA
  putStrLn $ "decoded:\n" ++ decode rv outA
  putStrLn $ "\nLempel-Ziv: " ++ show (bps outLZ) ++ " bits per symbol"
  putStrLn $ lempelZiv x
  putStrLn outLZ
Arithmetic coding shines with lopsided distributions, as any symbol code requires 1 bit per symbol at minimum. Lempel-Ziv also usually beats symbol codes, because it can take advantage of repetitions.
Again, the comparison is a little unfair because our arithmetic coding implementation may mess up the end of the message. But the longer the message, the less this matters.
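For the \(19/20\) coin below, the entropy is about \(0.29\) bits per symbol, so even the best symbol code, stuck at 1 bit per symbol, costs more than three times the optimum.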
significoin = [('H', 19/20), ('T', 1/20)]

do
  let rv = significoin
  x <- replicateM 50 $ sample significoin <$> jsrandom
  let outH = extendedCode (huffman rv) x
      outA = encode rv x
      outLZ = lempelZiv x
  putStrLn x
  putStrLn $ "\nHuffman: " ++ show (bps outH) ++ " bits per symbol"
  putStrLn outH
  putStrLn $ "\narithmetic: " ++ show (bps outA) ++ " bits per symbol"
  putStrLn outA
  putStrLn $ "decoded:\n" ++ decode rv outA
  putStrLn $ "\nLempel-Ziv: " ++ show (bps outLZ) ++ " bits per symbol"
  putStrLn outLZ