{# LANGUAGE CPP #}
#ifdef __HASTE__
import Haste.DOM
import Haste.Events
import Control.Monad (void)
#endif
import Data.Foldable (asum)
import Data.List (genericLength, find)
import Data.Maybe (catMaybes)
import Data.Ratio
import System.Random
Stream Codes
Symbol codes are tyrannized by the powers of 2. Huffman codes approach optimal compression only when the Shannon information contents of each symbol happen to be close to powers of 2.
Arithmetic coding liberates us from binary, and for any probability distribution, leads to optimal stream codes.
Symbol codes can be viewed as partitioning the unit interval into intervals whose sizes are all negative powers of 2 where the lower endpoint is closed and the higher endpoint is open. We assign intervals to symbols. The more probable a symbol, the larger its interval should be, and the fewer the bits needed to represent it. Furthermore, after decoding a symbol, we can view the next bits as partitioning the symbol’s interval in the same proportions we used on the unit interval.
If we imagine prefixing "0." to an encoded message, then we have the lower endpoint of the relevant interval in binary, and the number of significant figures determines the size of the interval.
For example, consider the symbol code:
A 0 B 10 C 11
Then we prefix "0." to the codes to find the intervals they represent:
A [0.0, 0.1) B [0.10, 0.11) C [0.11, 1.00)
As decimal fractions, these are [0, 1/2), [1/2,3/4), and [3/4,1).
Continuing our example, "AA" encodes to "00" and "BC" encodes to "1011". These correspond to binary numbers "0.00" and "0.1011". The first represents the interval [0.00, 0.01), and the second is [0.1011, 0.1100).
Beyond binary
What if we allow intervals of arbitrary sizes? For recording a sequence of moves in rockpaperscissors, assuming a uniform distribution, it would be optimal to use base 3, namely divide the unit interval into thirds, then divide each of those into thirds, and so on.
More generally, the probability distribution of an information source would describe exactly how to partition the unit interval, and lead to an optimal encoding.
We represent probabilities with arbitrary precision rationals.
type RandomVariable a = [(a, Rational)]
We represent an interval with its lower endpoint and its length. The following converts probabilities to intervals.
intervals :: RandomVariable a > [(a, (Rational, Rational))]
intervals rv = tail $ scanl accum (undefined, (0, 0)) rv where
accum (_, (lo, d)) (x, p) = (x, (lo + d, p))
Let’s code! Roughly speaking, we start with the unit interval, and as we encode symbols, we zoom in on subintervals of our current interval.
zoom :: Eq a => [(a, (Rational, Rational))] > [a] > (Rational, Rational)
zoom tab = go (0, 1) where
go (lo, d) syms
 sym : rest < syms, Just (lo', d') < lookup sym tab =
go (lo + d * lo', d * d') rest
 otherwise = (lo, d)
Mission accomplished? No, it’s hardly a stream code if we analyze the entire message to encode it. Also, for practical reasons, we wish to output 0s and 1s (though see range encoding).
Since we ought to produce output as we go, we convert what we can to binary on the fly. When both endpoints lie in the same half of the interval, we output one bit then double everything.
We finish by sending enough bits to describe an interval lying within the final interval.
encRat tab (lo, d) syms
 b  lo + d <= 1%2 = b :
encRat tab (if b then 2*lo  1 else 2*lo, 2*d) syms
 sym : rest < syms, Just (lo', d') < lookup sym tab =
encRat tab (lo + d * lo', d * d') rest
 d == 1 = []
 otherwise = finish 0
where
b = lo >= 1%2
finish k
 1%2 + 1%2^(k + 1) <= lo + d = True : replicate k False
 1%2  1%2^(k + 1) >= lo = False : replicate k True
 otherwise = finish $ k + 1
encodeRat :: Eq a => [(a, (Rational, Rational))] > [a] > [Bool]
encodeRat tab = encRat tab (0, 1)
Decoding inverts this process.
decodeRat :: [(a, (Rational, Rational))] > [Bool] > [a]
decodeRat tab = go (0, 1) where
go (lo, d) bs
 Just (c, (lo', d')) < find f tab = c : go ((lo  lo')/d', d/d') bs
 b : rest < bs = go (if b then lo + d/2 else lo, d/2) rest
 otherwise = []
where f (_, (lo', d')) = lo' <= lo && lo + d <= lo' + d'
This seems to be an elegant generalization of symbol codes, but on closer inspection, there is cause for concern:

Suppose an interval only takes a few ternary digits (trits) to represent. Does approximating the interval in binary require far more bits than we’d like? What about other bases?

With symbol codes, the intervals are exact so the decoder knows when the message ends. However, our scheme uses approximations. Given an approximation, how do we know if it’s a coarse approximation to a big interval, or a finer approximation of some subinterval within?

The precision of our computations grows with the message. Can we limit the precision?
Let’s start with the good news. Let \(I\) be the interval \([0, m)\). Then \(I\) contains \([0, 1/2^n)\) where \(n = \lceil \lg 1/m \rceil\). If we translate \(I\), sometimes we find it contains an approximation of the form \([k/2^n, (k+1)/2^n)\), but sometimes not, because it straddles a point \(j/2^n\) for some \(j\).
But then at least one of the halfintervals on either side lies in \(I\), that is, the interval \(I\) contains at least one of \([2j / 2^{n+1}, (2j + 1) / 2^{n+1})\) or \([(2j  1)/2^{n+1}, 2j / 2^{n+1})\), otherwise \(1/2^n\) would exceed \(m\), a contradiction.
In other words, we can approximate any interval of length \(m\) with at most \(\lceil \lg 1/m \rceil + 1\) bits. Thus, the entire encoded message has under 2 bits of overhead.
Now the bad news. In general, our decoder cannot determine when to stop decoding. Consider a biased coin that shows heads 9 times out of 10. We encode heads with [0, 9/10) and tails with [9/10,1), which is optimal. But what do we make of the encoded message "0"? It means \([0,1/2)\), which is contained within:

\([0,9/10)\), which decodes to "H"

\([0,(9/10)^2)\), which decodes to "HH"
…

\([0,(9/10)^6)\), which decodes to "HHHHHH".
There is no way to recover the original message. We must send its length separately, or add a special endofmessage symbol and assign it some probability.
Our decoder above always chooses the longest possibility.
Adaptive encoding
Let’s tackle arbitrary precision. We pick \(w\), some fixed power of 2, and round our probabilities to the nearest rational whose denominator is \(w\). This is harmless for sufficiently large \(w\), as a tiny difference in probabilities has a negligible effect on compression, and we can now represent endpoints with integers. They are the numerators of fractions with denominator \(w\).
We’ve been sending off the mostsignificant bits when we can. What if we also lop off leastsignificant bits? That is, during encoding, we simply drop the low bits so that numbers representing the endpoints can stay under \(w\)?
This amounts to perturbing the probability distribution as we go; we can view fixed precision as nudging probabilities so numbers happen to line up with certain powers of 2, so we only ever drop zero bits. Again, for large enough \(w\), the changes to the probabilities are tiny, so have a negligible effect on compression.
Changing probabilities on the fly is known as adaptive encoding. Apart from letting us get away with fixed precision, adaptive encoding can also improve compression by incorporating more facts about our information source.
For example, in English,"q" is almost always followed by "u", so after encoding a "q" we ought to give "u" a heavy weight in our probability distribution for the next letter. Naturally, the decoder must make the same assumptions.
We can also apply adaptive encoding to symbol codes, but the gains are more modest.
Halfway there
However, fixed precision raises a new concern. When dropping the leastsignificant bits, can we drop too much and lose information?
Unfortunately, we can. Consider the stream code:
A [0, 1/4) B [1/4, 3/4) C [3/4, 1)
A long string of "B"s leads to an interval that shrinks around 1/2. If eventually followed by an "A", then the encoding should be 0111…, that is, something just below 1/2. On the other hand, if eventually followed by a "C", then the encoding should be 1000…, that is, something just above 1/2. Until then, a naive encoder must stay silent, and if there are too many "B"s, something will go wrong. See the tablemaker’s dilemma, a related problem.
Hugging 1/2 is the sole pitfall, for whenever the interval lies to one side, we can send off the most significant bit. Fortunately there is a simple fix. If the interval lies in [1/4, 3/4), we double it (more precisely, we widen it by doubling each endpoint’s distance from 1/2) and increment a counter. When the interval finally chooses a side, we output 011..1 or 100..0 accordingly, where the number of bits is equal to the counter. We then reset the counter.
Thus the interval is always greater than 1/4 in size, and we avoid underflow if we compute with just 2 more bits than the precision used to describe probabilities in the distribution.
(Symbol codes sidestep this problem by construction: their intervals never straddle 1/2.)
Fixedprecision arithmetic coding
We demonstrate with 16bit precision, though for clarity we use mathematical operations rather than bitwise operations. Our implementation actually involves 32bit operations.
wordSz = 16
We define constants representing the points that divide the unit interval into quarters:
q1, q2, q3, q4 :: Word
q1 = 2^(wordSz  2)
q2 = 2^(wordSz  1)
q3 = q1 + q2
q4 = 2^wordSz
We convert the probability intervals to the form \([a/2^{16}, b/2^{16})\). As above, we store them as the lower endpoint and interval size. Since the denominator is always q4, we just store \((a, b  a)\).
digitize :: [(a, (Rational, Rational))] > [(a, (Word, Word))]
digitize tab = map (\(c, (x, y)) > (c, (circa x, underCk $ circa y))) tab
where
circa r = floor (fromIntegral q4 * r)
underCk y
 y < 4 = error "potential underflow"
 otherwise = y
We’ve conservatively taken the floor to approximate rationals, which means we may leave a sliver of the unit interval unused, and the approximations may be a touch worse than they ought to be. We could do better by calling round instead of floor and forcing the last interval to (a, q4  a).
The encoder resembles our arbitrary precision version, except for a counter and an extra case for intervals contained in [1/4, 3/4):
enc :: Eq a => Int > [(a, (Word, Word))] > (Word, Word) > [a] > [Bool]
enc ctr tab (lo, d) syms
 lo >= q1 && lo + d <= q3 = enc (ctr + 1) tab (2*lo  q2, d*2) syms
 b  lo + d <= q2 = b : replicate ctr (not b)
++ enc 0 tab (if b then 2*(lo  q2) else 2*lo, 2*d) syms
 sym : rest < syms, Just (lo', d') < lookup sym tab =
enc ctr tab (lo + lo' * d `div` q4, d' * d `div` q4) rest
 d == q4 = if ctr == 0 then [] else False : replicate ctr True
 otherwise = finish 0
where
b = lo >= q2
finish k
 q2 + 2^(wordSz  k  1) <= lo + d = True : replicate (ctr + k) False
 q2  2^(wordSz  k  1) >= lo = False : replicate (ctr + k) True
 otherwise = finish $ k + 1
encode :: Eq a => [(a, (Rational, Rational))] > [a] > [Bool]
encode tab = enc 0 (digitize tab) (0, q4)
We take pains to ensure the decoder recreates the encoder’s state of mind exactly, changing the probability distribution in lockstep.
The tuple (a, p) represents the interval [a, a + 2^p) and the lower p bits of a are always zero.
dec :: [(a, (Word, Word))] > (Word, Word) > (Word, Int) > [Bool] > [a]
dec tab (lo, d) (a, p) bs
 lo >= q1 && lo + d < q3 =
dec tab (2*lo  q2, 2*d) ((q2 + 2*a) `mod` q4, p + 1) bs
 b  lo + d <= q2 =
dec tab (if b then 2*(lo  q2) else 2*lo, 2*d) (2*a `mod` q4, p + 1) bs
 Just (c, (lo', d')) < asum $ try <$> tab = c : dec tab (lo', d') (a, p) bs
 h : rest < bs = dec tab (lo, d) (if h then a + 2^(p  1) else a, p  1) rest
 otherwise = []
where
b = lo >= q2
try (c, (loX, dX)) = let
lo' = lo + loX * d `div` q4
d' = dX * d `div` q4
in if lo' <= a && a + 2^p <= lo' + d'
then Just (c, (lo', d')) else Nothing
decode :: [(a, (Rational, Rational))] > [Bool] > [a]
decode tab = dec (digitize tab) (0, q4) (0, wordSz)
Thus arithmetic coding attains the optimal expected length \(H N\) save for under 2 bits of overhead for binary encoding and the cost of either transmitting the message length or adding an endofmessage symbol with a nonzero probability.
Examples
We reuse examples and code from our notes on symbol codes.
english = unlines  From Wikipedia.
[ "a 8167"
, "b 1492"
, "c 2782"
, "d 4253"
, "e 12702"
, "f 2228"
, "g 2015"
, "h 6094"
, "i 6966"
, "j 0153"
, "k 0772"
, "l 4025"
, "m 2406"
, "n 6749"
, "o 7507"
, "p 1929"
, "q 0095"
, "r 5987"
, "s 6327"
, "t 9056"
, "u 2758"
, "v 0978"
, "w 2360"
, "x 0150"
, "y 1974"
, "z 0077"
]
parse :: String > RandomVariable Char
parse s = (\(c, f) > (c, f % total)) <$> tab
where
tab = catMaybes $ charFreq . filter (/= ' ') <$> lines s
total = sum $ snd <$> tab
charFreq "" = Nothing
charFreq (c:n) = Just (c, read n)
entropy :: RandomVariable a > Double
entropy rv = mean (\_ p > shannon p) $ float rv
mean f rv = sum [p * f x p  (x, p) < rv]
lg = logBase 2
shannon p =  lg p
float = map (\(c, p) > (c, fromRational p))
roll rv n g
 n == 0 = []
 otherwise = x : roll rv (n  1) g'
where
(r, g') = random g
x = fst $ head $ dropWhile ((< r) . snd) cu
cu = scanl1 (\(_,p) (b,q) > (b,p + q))
$ map (\(c, q) > (c, fromRational q :: Double)) rv
demo x s = let
rv = parse s
tab = intervals rv
e = encode tab x
in unlines
[ "H(X) = " ++ show (entropy rv)
, "bits per char: " ++ show (genericLength e / genericLength x)
, "encoded: " ++ concatMap (show . fromEnum) e
, "decoded: " ++ decode tab e
, "arbitrary precision interval (endpoint, length): " ++ show (zoom tab x)
]
#ifdef __HASTE__
main :: IO ()
main = withElems ["in", "msg", "out", "rand", "arith"] $ \[iEl, mEl, oEl, randB, arithB] > do
let
setInp s = setProp iEl "value" s
setMsg s = setProp mEl "value" s
setOut s = setProp oEl "value" s
handle buttonId s m = do
Just b < elemById buttonId
void $ b `onEvent` Click $ const $ setInp s >> setMsg m >> setOut ""
sentence = "thequickbrownfoxjumpsoverthelazydog"
setInp english
setMsg sentence
handle "english" english sentence
handle "fair" "h 1\nt 1\n" "hhh"
handle "foul" "h 9\nt 1\n" "hhh"
handle "roshambo" "R 1\nP 1\nS 1\n" $ concat $ replicate 32 "RPS"
handle "straddle" "A 1\nB 2\nC 1\n" $ replicate 50 'B' ++ "A"
let
showOff = do
m < getProp mEl "value"
setProp oEl "value" . demo m =<< getProp iEl "value"
void $ arithB `onEvent` Click $ const $ showOff
void $ randB `onEvent` Click $ const $ do
s < getProp iEl "value"
g < newStdGen
setMsg $ roll (parse s) 64 g
showOff
#endif