Stream Codes
Symbol codes are tyrannized by the powers of 2. Huffman codes approach optimal compression only when the symbol probabilities happen to be close to negative powers of 2, that is, when the Shannon information contents are close to whole numbers of bits.
Symbol codes can be viewed as partitioning the unit interval into intervals whose sizes are all negative powers of 2 where the lower endpoint is closed and the higher endpoint is open. We assign intervals to symbols. The more probable a symbol, the larger its interval should be, and the fewer the bits needed to represent it. Furthermore, after decoding a symbol, we can view the next bits as partitioning the symbol’s interval in the same proportions that divide the unit interval.
If we imagine prefixing "0." to an encoded message, then we have the lower endpoint of the relevant interval in binary, and the number of significant figures determines the size of the interval. For example, consider the symbol code:
A 0 B 10 C 11
Then we prefix "0." to the codes to find the intervals they represent:
A [0.0, 0.1) B [0.10, 0.11) C [0.11, 1.00)
As decimal fractions, these are [0, 1/2), [1/2,3/4), and [3/4,1).
Continuing our example, "AA" encodes to "00" and "BC" encodes to "1011". These correspond to binary numbers "0.00" and "0.1011". The first represents the interval [0.00, 0.01), and the second is [0.1011, 0.1100).
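We can verify the second by hand: \(0.1011_2 = 11/16\), and scaling C’s interval \([3/4, 1)\) into B’s interval \([1/2, 3/4)\) gives \([1/2 + \tfrac{1}{4}\cdot\tfrac{3}{4},\, 1/2 + \tfrac{1}{4}\cdot 1) = [11/16, 3/4)\), which is exactly \([0.1011, 0.1100)\).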
Let’s call intervals like these Huffman intervals. (I don’t know of any standard names for them.)
Beyond binary
What if we generalize to arbitrary intervals? Then we get arithmetic coding, which turns out to yield (near-)optimal stream codes for any probability distribution.
For recording a sequence of moves in rock-paper-scissors, assuming a uniform distribution, it would be optimal to use base 3, namely divide the unit interval into thirds, then divide each of those into thirds, and so on.
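For instance, mapping rock to \([0, 1/3)\), paper to \([1/3, 2/3)\), and scissors to \([2/3, 1)\), the sequence rock-paper-scissors zooms to \([0, 1/3)\), then \([1/9, 2/9)\), then \([5/27, 6/27)\): after \(k\) moves the interval has width \(1/3^k\), exactly one trit of information per move.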
More generally, the probability distribution of an information source would describe exactly how to partition the unit interval, and lead to an optimal encoding.
This time, we represent probabilities with arbitrary-precision rationals.
type RandomVariable a = [(a, Rational)]
We represent an interval with its lower endpoint and its length. The following converts probabilities to intervals.
intervals :: RandomVariable a -> [(a, (Rational, Rational))]
intervals = tail . scanl accum (undefined, (0, 0)) where
  accum (_, (lo, d)) (x, p) = (x, (lo + d, p))

uniform3 = zip "ABC" $ repeat (1/3)
intervals uniform3
Let’s code! Roughly speaking, we start with the unit interval, and as we encode symbols, we zoom in on sub-intervals of our current interval. See the Dasher zooming interface for a visualization.
zoom :: Eq a => RandomVariable a -> [a] -> (Rational, Rational)
zoom rv = go (0, 1) where
  tab = intervals rv
  go (lo, d) = \case
    sym : rest | Just (lo', d') <- lookup sym tab -> go (lo + d * lo', d * d') rest
    [] -> (lo, d)

zoom uniform3 "A"
zoom uniform3 "AA"
zoom uniform3 "AB"
zoom uniform3 "ABBACACABBA"
Mission accomplished? No, it’s hardly a stream code if we analyze the entire message to encode it. Also, for practical reasons, we wish to output 0s and 1s (though see range encoding).
Since we ought to produce output as we go, we convert what we can to binary on the fly. When both endpoints lie in the same half of the interval, we output one bit then double everything.
We finish by sending enough bits to describe a Huffman interval lying within the final interval.
bitChar b = chr $ ord '0' + b

encRat tab (lo, d) syms
  -- Both endpoints lie in the same half: ship a bit, then double everything.
  | b == 1 || lo + d < 1%2 = bitChar b : encRat tab (2*lo - fromIntegral b, 2*d) syms
  -- Otherwise consume a symbol and zoom into its sub-interval.
  | sym : rest <- syms, Just (lo', d') <- lookup sym tab = encRat tab (lo + d * lo', d * d') rest
  -- The interval is still the whole unit interval: nothing to send.
  | d == 1 = []
  -- Finish by describing a Huffman interval lying within the final interval.
  | otherwise = finish 0
  where
  b = fromEnum $ lo >= 1%2
  finish k
    | 1%2 + 1%2^(k + 1) <= lo + d = '1' : replicate k '0'
    | 1%2 - 1%2^(k + 1) >= lo = '0' : replicate k '1'
    | otherwise = finish $ k + 1

encodeRat :: Eq a => RandomVariable a -> [a] -> String
encodeRat rv = encRat (intervals rv) (0, 1)

encodeRat uniform3 "A"
encodeRat uniform3 "ABBACACABBA"
Decoding inverts this process.
decodeRat :: RandomVariable a -> String -> [a]
decodeRat rv = go (0, 1) where
  tab = intervals rv
  go (lo, d) bs
    -- The current interval fits inside some symbol's interval: emit that
    -- symbol and rescale so its interval becomes the unit interval.
    | Just (c, (lo', d')) <- find f tab = c : go ((lo - lo')/d', d/d') bs
    -- Otherwise read another bit to narrow the interval to one half.
    | b : rest <- bs = go (if b == '1' then lo + d/2 else lo, d/2) rest
    | otherwise = []
    where f (_, (lo', d')) = lo' <= lo && lo + d <= lo' + d'

decodeRat uniform3 "00"
decodeRat uniform3 "0010100001001001000"
Mission accomplished for real? Almost. There are a few loose ends:
- Suppose an interval only takes a few ternary digits (trits) to represent. Does this always imply it only takes a few bits to represent a Huffman interval lying within? More generally, given an interval represented by rationals, do we waste space when converting to binary to describe some Huffman interval it contains?
- We represent the final interval with a Huffman interval within. But what if this Huffman interval is so small that it also lies within some subinterval that could be interpreted as the final interval? It could be unclear when the message ends.
- The precision of our computations grows with the message. Can we limit the precision?
Let’s start with the good news. Let \(I\) be the interval \([0, m)\). Then \(I\) contains \([0, 1/2^n)\) where \(n = \lceil \lg 1/m \rceil\). If we translate \(I\), sometimes we find it contains a Huffman interval of the form \([k/2^n, (k+1)/2^n)\) which only takes \(n\) bits to represent.
But sometimes not, because it straddles a point \(j/2^n\) for some \(j\). Still, at least one of the half-intervals on either side lies in \(I\), that is, the interval \(I\) contains at least one of \([2j / 2^{n+1}, (2j + 1) / 2^{n+1})\) or \([(2j - 1)/2^{n+1}, 2j / 2^{n+1})\), otherwise \(1/2^n\) would exceed \(m\), a contradiction.
In other words, for any interval of length \(m\), we can find a Huffman interval within that we can represent with at most \(\lceil \lg 1/m \rceil + 1\) bits. Thus, the entire encoded message has under 2 bits of overhead.
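We can spot-check this bound with the rational encoder above. (A sketch: checkOverhead is a name we introduce here; for "ABBACACABBA" the final interval has length \((1/3)^{11}\), so the bound works out to \(\lceil 11 \lg 3 \rceil + 1 = 19\) bits.)

-- Compare the length of the encoded message against the bound
-- ceiling(lg(1/m)) + 1, where m is the length of the final interval.
checkOverhead msg = (length $ encodeRat uniform3 msg, bound) where
  m = snd $ zoom uniform3 msg
  bound = ceiling (logBase 2 (fromRational (1/m) :: Double)) + 1

checkOverhead "ABBACACABBA"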
Now the bad news. In general, our decoder cannot determine when to stop decoding. Consider a biased coin that shows heads 9 times out of 10.
foulCoin = [('H', 9 % 10), ('T', 1 % 10)]
intervals foulCoin
We encode heads with [0, 9/10) and tails with [9/10,1), which is optimal. But what do we make of the encoded message "0"? It means \([0,1/2)\), which is contained within:
- \([0,9/10)\), which decodes to "H"
- \([0,(9/10)^2)\), which decodes to "HH"
- …
- \([0,(9/10)^6)\), which decodes to "HHHHHH".
There is no way around this. We must send the message length separately, or add a special end-of-message symbol and assign it some probability.
Our decoder above always chooses the longest possibility.
decodeRat foulCoin "0"
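For example, here is a sketch of the second fix: foulCoinEOM and the '$' end marker are hypothetical additions, with a little probability shaved off the coin faces to make room for the end symbol.

-- Hypothetical distribution with an end-of-message symbol '$'.
foulCoinEOM = [('H', 87 % 100), ('T', 10 % 100), ('$', 3 % 100)]

-- Decode, then stop at the end marker.
takeWhile (/= '$') $ decodeRat foulCoinEOM $ encodeRat foulCoinEOM "HHHHHH$"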
Adaptive encoding
Let’s tackle the third concern: arbitrary precision.
We pick some power of two \(w\), and round our probabilities to the nearest rational whose denominator is \(w\). This is harmless for sufficiently large \(w\), as a tiny difference in probabilities has a negligible effect on compression, and we can now represent endpoints with integers. They are the numerators of fractions with denominator \(w\).
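For instance, with \(w = 2^{16}\), the probability \(1/3\) becomes \(21845/65536\), which differs from \(1/3\) by less than \(2^{-17}\).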
We’ve been sending off the most-significant bits when we can. What if we also lop off least-significant bits? That is, during encoding, we simply drop the low bits so that numbers representing the endpoints stay under \(w\)?
This amounts to perturbing the probability distribution as we go; we can view fixed precision as nudging probabilities so numbers happen to line up with certain powers of two, so we only ever drop zero bits. Again, for large enough \(w\), the tiny changes to the endpoints are harmless provided the interval isn’t too small.
Changing probabilities on the fly is known as adaptive encoding. Apart from letting us get away with fixed precision, adaptive encoding can also improve compression by incorporating more facts about our information source.
For example, in English, "q" is almost always followed by "u", so after encoding a "q" we ought to give "u" a heavy weight in our probability distribution for the next letter. Naturally, the decoder must make the same assumptions.
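As a sketch of what such a model might look like (the toy alphabet and nextDist are hypothetical names; they are not used elsewhere on this page):

-- Conditional distribution over a toy alphabet: after a 'q', give 'u' most
-- of the probability mass. Encoder and decoder both consult this function
-- with the previously processed symbol, so they stay in lockstep.
nextDist :: Maybe Char -> RandomVariable Char
nextDist (Just 'q') = [('q', 1 % 20), ('u', 18 % 20), ('x', 1 % 20)]
nextDist _          = [('q', 6 % 20), ('u', 4 % 20), ('x', 10 % 20)]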
We can also apply adaptive encoding to symbol codes, but the gains are more modest.
Meet Me Halfway
The last of our loose ends has a loose end. We said that we’re fine so long as "the interval isn’t too small". But just how small can intervals get? Do they get so small that dropping the least-significant bits would cause disaster?
Unfortunately, yes. Consider the stream code:
A [0, 1/4) B [1/4, 3/4) C [3/4, 1)
A long string of "B"s leads to an interval that shrinks around 1/2. If eventually followed by an "A", then the encoding should be 0111…, that is, something just below 1/2. On the other hand, if eventually followed by a "C", then the encoding should be 1000…, that is, something just above 1/2.
dilemma = [('A', 1%4), ('B', 1%2), ('C', 1%4)]
encodeRat dilemma "BBBBBBBBBBBBBBBA"
encodeRat dilemma "BBBBBBBBBBBBBBBC"
Until then, the encoder must stay silent, yet as the run of "B"s grows, the straddling interval shrinks, seemingly dooming us to arbitrary precision. See the table-maker’s dilemma, a related problem.
Fortunately, hugging 1/2 is the sole pitfall, for whenever the interval lies to one side, we can send off the most significant bit. Even more fortunately, there is a simple fix. If a straddling interval lies in [1/4, 3/4), we widen it by doubling each endpoint’s distance from 1/2 and increment a counter. When the interval finally chooses a side, we output a 0 followed by as many 1s as the counter, or a 1 followed by as many 0s, accordingly. We then reset the counter.
Thus the interval is always greater than 1/4 in size, and we avoid underflow if we compute with just 2 more bits than the precision used to describe probabilities in the distribution.
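In rational arithmetic, the widening step and the eventual flush might look like the following sketch (expandMiddle and flushPending are illustrative helpers, not used by the fixed-precision code below):

-- Double each endpoint's distance from 1/2; the caller increments the
-- counter of pending bits.
expandMiddle :: (Rational, Rational) -> (Rational, Rational)
expandMiddle (lo, d) = (2*lo - 1%2, 2*d)

-- Once the interval settles into the lower half (b = 0) or upper half
-- (b = 1), emit the decisive bit followed by ctr copies of its opposite.
flushPending :: Int -> Int -> String
flushPending ctr b = bitChar b : replicate ctr (bitChar (1 - b))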
(Symbol codes sidestep this problem because their intervals never straddle 1/2.)
Fixed-precision arithmetic coding
We demonstrate with 16-bit precision, though for clarity we use arithmetic operations rather than bitwise operations. Since the product of two 16-bit values can need up to 32 bits, our implementation actually involves 32-bit operations.
wordSz = 16
We define constants representing the points that divide the unit interval into quarters:
q1, q2, q3, q4 :: Word
q1 = 2^(wordSz - 2)
q2 = 2^(wordSz - 1)
q3 = q1 + q2
q4 = 2^wordSz
We quantize the probability intervals to the form \([a/2^{16}, b/2^{16})\). As above, we store them as the lower endpoint and interval size. Since the denominator is always q4, we just store \((a, b - a)\).
quantize :: [(a, (Rational, Rational))] -> [(a, (Word, Word))]
quantize tab = map (\(c, (x, y)) -> (c, (circa x, underCk $ circa y))) tab where
  circa r = floorRational (fromIntegral q4 * r)
  floorRational q = fromIntegral $ numerator q `div` denominator q
  underCk y
    | y < 4 = error "potential underflow"
    | otherwise = y
We’ve conservatively taken the floor to approximate rationals, which means we may leave a sliver of the unit interval unused, and the approximations may be a touch worse than they ought to be. We could do better by calling round instead of floor and forcing the last interval to (a, q4 - a).
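A sketch of that improvement (quantize' is a name we introduce here; the rest of this page keeps the floor-based version):

-- Round the cumulative endpoints and force the final endpoint to q4, so no
-- sliver of the unit interval goes unused. Very small probabilities could
-- still collapse to empty intervals, so a check like underCk remains wise.
quantize' :: [(a, (Rational, Rational))] -> [(a, (Word, Word))]
quantize' tab = zipWith reassemble tab (zip los (tail los ++ [q4])) where
  reassemble (c, _) (a, b) = (c, (a, b - a))
  los = [round (fromIntegral q4 * x) | (_, (x, _)) <- tab]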
The encoder resembles our arbitrary precision version, except for a counter and an extra case for intervals contained in [1/4, 3/4):
enc :: Eq a => Int -> [(a, (Word, Word))] -> (Word, Word) -> [a] -> String
enc ctr tab (lo, d) syms
  -- Interval lies within [1/4, 3/4): widen about 1/2 and note a pending bit.
  | lo >= q1 && lo + d < q3 = enc (ctr + 1) tab (2*lo - q2, d*2) syms
  -- Interval lies in one half: ship the bit plus any pending opposite bits.
  | b == 1 || lo + d < q2 = bitChar b : replicate ctr (bitChar $ 1 - b) ++ enc 0 tab (2*(lo - fromIntegral b*q2), 2*d) syms
  -- Consume a symbol and zoom into its quantized sub-interval.
  | sym : rest <- syms, Just (lo', d') <- lookup sym tab = enc ctr tab (lo + lo' * d `div` q4, d' * d `div` q4) rest
  -- Whole unit interval remains: flush any pending bits and stop.
  | d == q4 = if ctr == 0 then [] else '0' : replicate ctr '1'
  -- Describe a Huffman interval within the final interval.
  | otherwise = finish 0
  where
  b = fromEnum $ lo >= q2
  finish k
    | q2 + 2^(wordSz - k - 1) <= lo + d = '1' : replicate (ctr + k) '0'
    | q2 - 2^(wordSz - k - 1) >= lo = '0' : replicate (ctr + k) '1'
    | otherwise = finish $ k + 1

encode :: Eq a => RandomVariable a -> [a] -> String
encode rv = enc 0 (quantize $ intervals rv) (0, q4)
We take pains to ensure the decoder recreates the encoder’s state of mind exactly, changing the probability distribution in lockstep. An earlier version confused a strict inequality for one that wasn’t, resulting in erroneous decodings, but only rarely!
The tuple (a, p) represents the interval [a, a + 2^p), and the lower p bits of a are always zero.
dec :: [(a, (Word, Word))] -> (Word, Word) -> (Word, Int) -> String -> [a]
dec tab (lo, d) (a, p) bs
  -- Mirror the encoder's widening about 1/2.
  | lo >= q1 && lo + d < q3 = dec tab (2*lo - q2, 2*d) ((q2 + 2*a) `mod` q4, p + 1) bs
  -- Mirror the encoder's rescaling when the interval lies in one half.
  | b == 1 || lo + d < q2 = dec tab (2*(lo - fromIntegral b*q2), 2*d) (2*a `mod` q4, p + 1) bs
  -- The bit interval [a, a + 2^p) pins down a symbol: emit it and zoom in.
  | Just (c, (lo', d')) <- asum $ try <$> tab = c : dec tab (lo', d') (a, p) bs
  -- Otherwise read another bit to halve the bit interval.
  | h : rest <- bs = dec tab (lo, d) (if h == '1' then a + 2^(p - 1) else a, p - 1) rest
  | otherwise = []
  where
  b = fromEnum $ lo >= q2
  try (c, (loX, dX)) = let
    lo' = lo + loX * d `div` q4
    d' = dX * d `div` q4
    in if lo' <= a && a + 2^p <= lo' + d' then Just (c, (lo', d')) else Nothing

decode :: RandomVariable a -> String -> [a]
decode rv = dec (quantize $ intervals rv) (0, q4) (0, wordSz)
Mission accomplished. Our arithmetic coding demo attains the optimal expected length \(H N\) save for under 2 bits of overhead for binary encoding and the cost of either transmitting the message length or adding an end-of-message symbol with a nonzero probability.
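To see this concretely, we can compare an output length against the Shannon limit (a sketch; entropy is a helper we define here, and the exact output length depends on the quantization and final bits):

-- Entropy in bits per symbol.
entropy :: RandomVariable a -> Double
entropy rv = sum [negate p * logBase 2 p | (_, q) <- rv, let p = fromRational q]

-- About 0.47 bits per symbol, so roughly 47 bits for 100 coin flips; the
-- encoder's output for a typical such message should land close to that.
entropy foulCoin
length $ encode foulCoin $ replicate 90 'H' ++ replicate 10 'T'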
Lempel-Ziv
We explore a stream code of a different nature.
We need a function that splits a list into two in all possible ways:
-- splits = zip <$> inits <*> tails
splits xs = case xs of
  [] -> [([], [])]
  x:xt -> ([], xs) : map (first (x:)) (splits xt)

splits ['a'..'g']
We chunk the input by repeatedly breaking off the shortest prefix we haven’t seen yet.
uPre seen xs = case dropWhile ((`elem` seen) . fst) $ splits xs of
  (as, bs) : _ -> as : uPre (as : seen) bs
  [] -> []

twister = "peterpiperpickedapeckofpickledpeppers"
uPre [] twister
We index these unique prefixes:
putStr $ unlines $ zipWith (\n s -> shows n $ ' ':s) [0..] $ uPre [] twister
How does this help? If we replace each chunk with its index, we get [0,1,2,…], which is clearly impossible to decode!
But what if for each chunk, we drop the last symbol before looking up its index, and also send the symbol that was dropped?
The first chunk is always empty so we skip it.
lzIdea xs = send <$> tail us where
  send u = (maybe undefined id $ lookup (init u) tab, last u)
  us = uPre [] xs
  tab = zip us [0..]

lzIdea twister
This works! A decoder can build the same table as it proceeds: by inductive assumption our table contains all chunks seen so far, thus we can always correctly update the table from an index and a new symbol.
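Here is a sketch of such a decoder (unLzIdea is a name we introduce; it relies on laziness to look up earlier chunks while the table is still being built):

-- Rebuild the chunks from (index, dropped symbol) pairs. Each pair refers
-- only to the empty string or to an earlier chunk, so the knot-tying below
-- is well-founded.
unLzIdea :: [(Int, Char)] -> String
unLzIdea ps = concat chunks where
  chunks = [tab !! i ++ [c] | (i, c) <- ps]
  tab = "" : chunks

unLzIdea (lzIdea twister)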
It remains to specify how the indexes are encoded in binary. Since the sender and receiver have the same table, they both know the minimum number of bits \(m\) required to represent the current largest index. Altogether, we have:
lempelZiv s = go 0 [] 0 [] s [] where
  go m tab i acc = \case
    b:bt -> case lookup s tab of
        Just j -> go m tab j s bt
        Nothing -> showsBin m i . (b:) . go m' ((s, k):tab) 0 [] bt
      where
      s = b:acc  -- Represent a prefix with its reversal.
      k = length tab + 1
      m' = m + fromEnum (2^m == k)
    [] | i == 0 -> id
       | otherwise -> showsBin m i
  showsBin m k
    | m == 0 = id
    | otherwise = showsBin (m - 1) q . (bitChar r:)
    where (q, r) = divMod k 2

unlempelZiv s = go 0 [(0, "")] s [] where
  go m tab = \case
    [] -> id
    xs -> let
        (bs, t) = splitAt m xs
        v = maybe undefined (++) $ lookup (unBin bs) tab
      in case t of
        new:rest -> v . (new:) . go m' tab' rest where
          k = length tab
          m' = m + fromEnum (2^m == k)
          tab' = (k, v [new]):tab
        [] -> v
  unBin = foldl (\n b -> 2*n + fromEnum (b == '1')) 0

lempelZiv "1011010100010"
unlempelZiv $ lempelZiv "1011010100010"
-- MacKay, Exercises 6.5 and 6.6.
lempelZiv "000000000000100000000000"
unlempelZiv "00101011101100100100011010101000011"
The examples are chosen so the last string is absent from the table. We can force this to happen by using a special symbol to mark the end of the message. Or, when the final chunk has already been seen, we can simply finish with the index of the previously seen string that matches it. Our code does the latter.
Lempel-Ziv compression is asymptotically optimal, but only on a galactic scale (for any ergodic source, we eventually approach the entropy of the source, but typically not before we run out of space and time). Indeed, we see that Lempel-Ziv compression often produces output longer than the input on small examples, though we can easily optimize slightly:
- After adding all possibilities to a prefix, the original prefix is never needed again, so its index can be freed. (Taking advantage of this may require tweaking how we deal with the end of the message.)
- In the case of bits, the second time we append a bit to the same prefix, there is no need to send it because its identity is known.
Nonetheless, in practice, Lempel-Ziv is effective, because real data often contains many copies of the same sequence. Also, Lempel-Ziv is easy to implement: it just works. There’s no need to model the data; no need to find relative symbol frequencies; no need to think about how earlier symbols affect the probabilities of future symbols. Fire and forget.
With Huffman or arithmetic coding, we can automate modeling by, say, updating a probability distribution as we encode. But it’s more work.
In my younger days, Lempel-Ziv dominated compression. Many DOS executables I ran were compressed with LZEXE, so they’d decompress themselves in memory before running. GIF files and other widespread media formats relied on a variant of Lempel-Ziv. So did ZIP archives and their competitors like ARC. On Unix, the compress and gzip tools also use Lempel-Ziv.
Then in the 1990s, the Burrows-Wheeler transform was discovered, which can be viewed as a stream code if we think of the alphabet as consisting of giant blocks. As the popularity of bzip2 attests, this algorithm seems as good as Lempel-Ziv for detecting patterns like the same string appearing several times in a block, though other algorithms such as arithmetic coding are required to actually compress the data.
Showdown
We reuse code from previous pages:
englishSrc = unlines
  [ "a 8167", "b 1492", "c 2782", "d 4253", "e 12702", "f 2228", "g 2015"
  , "h 6094", "i 6966", "j 0153", "k 0772", "l 4025", "m 2406", "n 6749"
  , "o 7507", "p 1929", "q 0095", "r 5987", "s 6327", "t 9056", "u 2758"
  , "v 0978", "w 2360", "x 0150", "y 1974", "z 0077" ]

parse :: String -> RandomVariable Char
parse s = second divTotal <$> tab where
  tab = [(c, fromIntegral $ readInteger s) | [[c], s] <- words <$> lines s]
  divTotal = (/ sum (snd <$> tab))

english = parse englishSrc

cumu rv = scanl1 (\(_, p) (b, q) -> (b, p + q)) $ sortOn snd rv
sample rv r = fst $ head $ dropWhile ((< r) . fromRational . snd) $ cumu rv

parseFrac = uncurry (/) . foldl (\(a, b) d -> (a*10 + fromIntegral (ord d - ord '0'), b*10)) (0.0, 1)
jsrandom = parseFrac . drop 2 <$> jsEval "Math.random()"

sortOn f = \case
  [] -> []
  x:xt -> sortOn f (filter ((<= fx) . f) xt) ++ [x] ++ sortOn f (filter ((> fx) . f) xt)
    where fx = f x
sort = sortOn id

data Tree a = Node { rootLabel :: a, subForest :: [Tree a] }

huffman rv = go $ map (\(x, p) -> (Node x [], p)) rv where
  go [(root, _)] = walk id root
  go ts = go $ (Node undefined [a, b], p + q):rest where
    ([(a, p), (b, q)], rest) = splitAt 2 $ sortOn snd ts
  walk pre (Node x []) = [(x, pre [])]
  walk pre (Node _ [a, b]) = walk (pre . ('0':)) a ++ walk (pre . ('1':)) b

codeword c x = maybe (error "bad symbol") id $ lookup x c
extendedCode c s = concatMap (codeword c) s
Huffman coding already works well on our running english example, and can beat arithmetic coding on particular strings by chance. For the latter, we show the decoding, as it may show too many symbols because we have not correctly handled the end of the message.
Lempel-Ziv does poorly because repeated substrings are unlikely, and because we replace each letter with 5 bits (like a Baudot code, or Bacon’s cipher) to imitate how compression tools are often run on ASCII files. But we would see savings on realistic text, and we stress that we can do without a model of the data source.
bps s = fromIntegral (length s) / 50.0
fiver c = replicateM 5 "01" !! (ord c - ord 'a')

do
  let rv = english
  x <- replicateM 50 $ sample rv <$> jsrandom
  let outH = extendedCode (huffman rv) x
      outA = encode rv x
      outLZ = lempelZiv $ fiver =<< x
  putStrLn x
  putStrLn $ "\nHuffman: " ++ show (bps outH) ++ " bits per symbol"
  putStrLn outH
  putStrLn $ "\narithmetic: " ++ show (bps outA) ++ " bits per symbol"
  putStrLn outA
  putStrLn $ "decoded:\n" ++ decode rv outA
  putStrLn $ "\nLempel-Ziv: " ++ show (bps outLZ) ++ " bits per symbol"
  putStrLn $ lempelZiv x
  putStrLn outLZ
Arithmetic coding shines with lopsided distributions, as any symbol code requires 1 bit per symbol at minimum. Lempel-Ziv also usually beats symbol codes, because it can take advantage of repetitions.
Again, the comparison is a little unfair because our arithmetic coding implementation may mess up the end of the message. But the longer the message, the less this matters.
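For the \(19/20\) coin below, the entropy is about \(0.29\) bits per symbol, so even the best symbol code, stuck at 1 bit per symbol, costs more than three times the optimum.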
significoin = [('H', 19/20), ('T', 1/20)]

do
  let rv = significoin
  x <- replicateM 50 $ sample significoin <$> jsrandom
  let outH = extendedCode (huffman rv) x
      outA = encode rv x
      outLZ = lempelZiv x
  putStrLn x
  putStrLn $ "\nHuffman: " ++ show (bps outH) ++ " bits per symbol"
  putStrLn outH
  putStrLn $ "\narithmetic: " ++ show (bps outA) ++ " bits per symbol"
  putStrLn outA
  putStrLn $ "decoded:\n" ++ decode rv outA
  putStrLn $ "\nLempel-Ziv: " ++ show (bps outLZ) ++ " bits per symbol"
  putStrLn outLZ