Information Theory

{-# LANGUAGE CPP #-}
#ifdef __HASTE__
import Haste.DOM
import Haste.Events
import Control.Monad (void)
#endif
import Control.Monad (replicateM)
import Data.Maybe (catMaybes)
import Data.List (sortOn, genericLength)
import System.Random

What is information? When answering this question, game theory or economics may come to mind. We might imagine a trader who receives important news, then immediately acts upon it for financial gain.

We wrote "important". Why is one item of news more valuable than another? For one, the more improbable an event, the greater its significance. Think of typical responses when news travels fast: "No way!"; "Incredible!"; "I don’t believe it!" There’s more to news than this, but we focus on probability alone so we can tame information mathematically.

Information ought to relate to knowledge in general, not just current events, but we can handle existing facts by acting as if we were unaware of them, as in the expression: "that’s news to me". For example, the information content of your birthday should involve the probability that a stranger would guess it correctly, namely 1/365 (ignoring leap years).

As a first stab, we might say knowing someone’s birthday is worth 365 points, while only knowing their zodiac sign is worth 12 points. If we guess someone’s birthday correctly the first time, then we get 365 points, which raises a question: could the information content of a guess simply be the inverse of the probability of its outcome?

\[ \frac{1}{p} \]

Then an incorrect first guess is worth 365/364 points because there is a 364/365 chance of it being wrong, and similarly a second wrong guess is 364/363 points. If our third guess is right, then it’s worth 363 points. Multiplying these 3 scores yields 365 points, matching the case when our first guess was right.

However, multiplication is awkward. Mathematicians first studied probability because of questions involving gambling, so the theory deals with payoffs that are additive, just like money. We could rework the theory to deal with multiplicative rewards, but it is better to rework our definition to fit existing infrastructure. Taking logarithms lets us add instead of multiply. Being computer scientists, we use base 2 to get answers in bits.

lg = logBase 2

The Shannon information content of an outcome \(x\) is:

\[ h(x) = \lg \frac{1}{P(x)} \]
shannon p = - lg p
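
A few values for the earlier examples (the fair coin flip is our own addition, anticipating the demo below):

-- shannon (1/365)  -- about 8.51 bits: a stranger's birthday
-- shannon (1/12)   -- about 3.58 bits: their zodiac sign
-- shannon (1/2)    -- exactly 1 bit: a fair coin flip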

Let \(X\) be a random variable and \(\mathcal{A}_X\) the set of possible values of \(X\). We call \(\mathcal{A}_X\) the alphabet of \(X\), though in general this term may be a bit misleading as \(\mathcal{A}_X\) could be, for example, the set of all 5-letter strings.

The entropy of \(X\) is the average information content:

\[ H(X) = \sum P(x) h(x) [x \in \mathcal{A}_X] \]
-- Association list of symbols and their probabilities.
type RandomVariable a = [(a, Double)]
mean :: (Double -> Double) -> RandomVariable a -> Double
mean f rv = sum [p * f p | (_, p) <- rv]
variance :: (Double -> Double) -> RandomVariable a -> Double
variance f rv = sum [p*(f p - mu)^2 | (_, p) <- rv] where mu = mean f rv
entropy :: RandomVariable a -> Double
entropy rv = mean shannon rv

The raw bit content of \(X\) is:

\[ H_0(X) = \lg |\mathcal{A}_X| \]
size = genericLength
rawBitContent rv = lg $ size rv

This measures the room we need to record one symbol of \(X\).
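
For instance, here are the demo's fair and foul coins as explicit tables; the names are ours, and we reuse them in a few asides below:

fairCoin, foulCoin :: RandomVariable Char
fairCoin = [('h', 0.5), ('t', 0.5)]
foulCoin = [('h', 0.9), ('t', 0.1)]
-- entropy fairCoin        -- exactly 1 bit, matching its raw bit content
-- entropy foulCoin        -- about 0.47 bits, well below lg 2 = 1
-- rawBitContent foulCoin  -- 1.0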

Exercise: Show \( 0 \le H(X) \le H_0(X) \). When do we have equality on either side?

The answers follow directly from the mathematics, but we can interpret them qualitatively. Consider a lopsided distribution where one symbol almost always appears while the others are rare. Then the average symbol will be the common one, which gives us almost no information. In the extreme case, the other symbols are impossible so we learn nothing from the only symbol that exists. In contrast, if the distribution is uniform, we have no idea what symbol to expect next (assuming at least two symbols).

Lossy Compression

Ever filled out a form asking you to write one letter per box? This is perhaps the simplest lossy compression scheme. It works when long names are rare. In a fixed amount of space, the form accurately records everything except for some improbable cases.

What if we reject any improbable message, long or short? Then a small amount of space may hold a long message, provided it is not too improbable. For example, suppose a one-million-bit message is unlikely to contain more than a few 1s; perhaps it describes the inhabitants of a city of population one million who have won a lottery jackpot. Then by storing the positions of any 1s, perhaps five 20-bit numbers may suffice for describing almost any one-million-bit message.

We generalize this idea. The smallest \(\delta\)-sufficient subset \(S_\delta\) is the smallest subset of \(\mathcal{A}_X\) such that:

\[ P(x \in S_\delta) \ge 1 - \delta \]

We can compute this by sorting the symbols in order of increasing probability, then discarding symbols until their cumulative probability exceeds \(\delta\). In our lottery example, if we consider each possible one-million-bit string to be a symbol, then the only strings left standing are those containing a few 1s at most.

-- Pair each symbol with the running total of probabilities, least likely first.
cumu rv = scanl1 (\(_,p) (b,q) -> (b,p + q)) $ sortOn snd rv
-- Discard the least likely symbols while they account for at most delta.
sufficient delta rv = map fst $ dropWhile ((<= delta) . snd) $ cumu rv

The essential bit content of \(X\) is:

\[ H_\delta(X) = \lg |S_\delta| \]
essential delta rv = lg $ size $ sufficient delta rv
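
With the foul coin from earlier and \(\delta = 0.1\), we may discard the rare tails outcome entirely (expected results shown as comments):

-- sufficient 0.1 foulCoin  -- "h": only heads survives
-- essential 0.1 foulCoin   -- lg 1 = 0 bits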

That is, if we are willing to risk failure with probability \(\delta\), then we only need \(H_\delta(X)\) bits.

Typicality

Let \(H = H(X)\) be the entropy of \(X\). Let \(X^N\) be the random variable whose outcome is \(N\) independent outcomes of \(X\). How much can we compress the messages of \(X^N\) with at most a probability \(\delta\) of failure?

By definition, the best answer is \(H_\delta(X^N)\), but reasoning about this quantity directly is difficult. We take an indirect approach. We will show a message is most likely a member of a certain typical set, and furthermore, that bounds on the size of this typical set squeeze \(H_\delta(X^N)/N\) towards \(H\) from both sides.

Define the typical set \(T_{N\beta}\) by:

\[ T_{N \beta} = \left\{ x \in \mathcal{A}_{X^N} : \left| \frac{h(x)}{N} - H \right| < \beta \right\} \]
typical rv n beta = [s | s <- replicateM n $ map fst rv,
  abs (info rv s/fromIntegral n - h) < beta]
  where h = entropy rv

We compute \(h(x)\) for an outcome \(x\) of \(X^N\) by summing the Shannon information contents of each of the \(N\) independent outcomes of \(X\):

pr rv sym = maybe (error "bad symbol") id $ lookup sym rv
info rv s = sum $ shannon . pr rv <$> s

We may think of the typical set as the set of all strings whose information content is close to the expected information content \(N H\).

It turns out the typical set is a Goldilocks set. There may be messages that are more likely than those in the typical set, but they are so few in number that they are seldom seen. There may be many messages less likely than those in the typical set, but they are so improbable that they are also seldom seen.
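
For a brute-force check on a tiny alphabet (each call below enumerates all \(2^N\) strings, so keep \(N\) small): with the fair coin every string carries exactly \(N\) bits and hence is typical, while with the foul coin even a loose tolerance whittles the strings down to the all-heads one.

-- length (typical fairCoin 4 0.1)  -- 16: every string is typical
-- length (typical foulCoin 4 0.4)  -- 1: only "hhhh" qualifies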

We now see why it pays to design definitions to fit with standard probability theory. For an outcome \(x\) of \(X^N\), the weak law of large numbers gives:

\[ P(x \in T_{N\beta}) \ge 1 - \frac{\sigma^2 }{\beta^2 N} \]

where \(\sigma^2\) is the variance of the Shannon information content of \(X\). As claimed, a string most likely lies in the typical set.
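
Spelling out the step: \(h(x)/N\) is the sample mean of \(N\) i.i.d. copies of \(h(X)\), which has mean \(H\) and variance \(\sigma^2\), so Chebyshev's inequality, the workhorse behind the weak law, gives:

\[ P\left( \left| \frac{h(x)}{N} - H \right| \ge \beta \right) \le \frac{\sigma^2}{\beta^2 N} \]

Taking complements yields the bound above.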

We undo logarithms to express the typical set’s defining inequality in terms of probabilities:

\[ 2^{-N(H+\beta)} < P(x) < 2^{-N(H-\beta)} \]

The sum of all probabilities must be 1, so by considering a lower bound for each probability in the typical set:

\[ |T_{N\beta}| 2^{-N(H+\beta)} < 1 \]

so:

\[ |T_{N\beta}| < 2^{N(H+\beta)} \]

Asymptotic equipartition principle: For sufficiently large \(N\), the outcome \(x\) of \(N\) i.i.d. random variables \(X^N = (X_1, \ldots, X_N)\) almost certainly belongs to a certain set of size \(2^{N H}\), each member with a probability "close" to \(2^{-N H}\).

More formally, given \(\epsilon > 0\) and a probability \(\delta > 0\), choose \(\beta = \epsilon\) and \(N_0\) so that \(P(x \in T_{N\beta}) \ge 1 - \delta\) when \(N \ge N_0\), that is, \(\frac{\sigma^2}{\epsilon^2 N_0} \le \delta\). Then:

\[ \lg |T_{N\beta}| < N(H + \epsilon) \]

We place scare quotes around "close" because the probabilities of two elements in the typical set can differ by a factor that is exponential in our parameters.
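
To get a feel for the scale: the foul coin has \(H \approx 0.47\), so for \(N = 1000\) the typical set holds roughly \(2^{470}\) of the \(2^{1000}\) possible strings, a vanishingly small fraction that nonetheless carries almost all of the probability.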

Shannon’s source coding theorem

The bound for \(|T_{N\beta}|\) immediately yields an upper bound for the size of the best possible set for lossy compression:

\[ H_\delta(X^N) \le \lg |T_{N\beta}| < N(H + \epsilon) \]

How about the other side? Can \(S_\delta\), or any other set, be significantly smaller than the typical set but still compress just as well?

Let \(S\) be a set such that \(|S| \le 2^{N(H - 2\beta)}\). Then

\[ \begin{aligned} P(x\in S) &= P(x\in S \cap T_{N\beta}) + P(x\in S\cap \overline{T_{N\beta}}) \\ &\le |S| \max \{ P(x) : x \in T_{N\beta} \} + P(x \notin T_{N\beta}) \\ &\le 2^{N(H-2\beta)} 2^{-N(H-\beta)} + \frac{\sigma^2}{\beta^2 N} \\ &= 2^{-N\beta} + \frac{\sigma^2}{\beta^2 N} \end{aligned} \]

Hence for any given \(\delta\) and \(\beta\), we can choose \(N_0\) so that \(P(x \in S) < 1 - \delta\) for all \(N \ge N_0\); thus \(S\) fails to compress as well as \(S_\delta\) or the typical set, and so \(H_\delta(X^N) > N(H - 2\beta)\).

Summing up, we have Shannon’s source coding theorem: Given \(\epsilon > 0\) and \(0 < \delta < 1\) there exists a positive integer \(N_0\) such that for all \(N > N_0\):

\[ \left| \frac{H_\delta (X^N)}{N} - H \right| < \epsilon \]

In other words, with a negligible risk of information loss, we can compress a string of \(N\) symbols to about \(N H\) bits, and no better.
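
We can watch the theorem in action on a toy scale. The helper below is our own and is not used by the demo; it builds \(X^N\) by brute force, so it is only feasible for tiny alphabets and small \(N\):

-- The product random variable X^N: every length-n string of symbols paired
-- with the product of its symbols' probabilities.
power :: RandomVariable a -> Int -> RandomVariable [a]
power rv n = [(s, product ps) | sps <- replicateM n rv, let (s, ps) = unzip sps]
-- For example, essential 0.1 (power foulCoin 10) / 10 should land between
-- the entropy (about 0.47) and the raw bit content (1 bit per symbol), and
-- by the theorem it approaches the entropy as the exponent grows.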

Examples

See David MacKay, Information Theory, Inference, and Learning Algorithms, which explains all of the above in detail.

Our interactive demo above gives concrete numbers.

We expect symbols (single characters) and their frequencies, which we read with a primitive parser:

parse :: String -> RandomVariable Char
parse s = (\(c, f) -> (c, fromIntegral f / fromIntegral total)) <$> tab
  where
  tab = catMaybes $ charFreq . filter (/= ' ') <$> lines s
  total = sum $ snd <$> tab
  charFreq "" = Nothing
  charFreq (c:n) = Just (c, read n)
english = unlines
  [ "a 8167"
  , "b 1492"
  , "c 2782"
  , "d 4253"
  , "e 12702"
  , "f 2228"
  , "g 2015"
  , "h 6094"
  , "i 6966"
  , "j 0153"
  , "k 0772"
  , "l 4025"
  , "m 2406"
  , "n 6749"
  , "o 7507"
  , "p 1929"
  , "q 0095"
  , "r 5987"
  , "s 6327"
  , "t 9056"
  , "u 2758"
  , "v 0978"
  , "w 2360"
  , "x 0150"
  , "y 1974"
  , "z 0077"
  ]
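
A quick sanity check of the parser on the fair-coin input used by the demo (the value shown is what we expect):

-- parse "h 1\nt 1\n"  -- [('h',0.5),('t',0.5)]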

Another example is rolling two six-sided dice, with a, b, c standing in for totals of 10, 11, 12 since our symbols are single characters:

twoDice = unlines
  [ "2 1"
  , "3 2"
  , "4 3"
  , "5 4"
  , "6 5"
  , "7 6"
  , "8 5"
  , "9 4"
  , "a 3"
  , "b 2"
  , "c 1"
  ]
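
Worked out by hand (so treat the figures as approximate), this distribution has an entropy of about 3.27 bits against a raw bit content of \(\lg 11 \approx 3.46\) bits.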

We show the entropy and related statistics of the corresponding random variable. Next, we randomly generate a message \(x\) of length 100 according to the table, and compute how far its Shannon information content per symbol deviates from the entropy, that is, the criterion used to define the typical set. We call this quantity \(t\).

We also compute a bound that \(t\) exceeds with at most 20% probability in a message of length 100. Experimentally, the bound seems loose, which is perhaps unsurprising as the bound derives from the Weak Law of Large Numbers. Our main goal is to establish a theoretical limit for lossy compression, rather than produce a practical algorithm. Indeed, it’s unclear to me how to efficiently encode or decode a member of a typical set \(T\) so it takes about \(\lg |T|\) bits.

-- Sample n symbols by inverse transform sampling: draw r uniformly from
-- [0, 1) and pick the first symbol whose cumulative probability reaches r.
roll rv n g
  | n == 0 = []
  | otherwise = x:roll rv (n - 1) g'
  where
  (r, g') = random g
  x = fst $ head $ dropWhile ((< r) . snd) $ cumu rv

go g s = let
  h = entropy rv
  sigma2 = variance shannon rv
  x = roll rv 100 g
  hx100 = info rv x / 100
  rv = parse s in
    showStat "raw bit content" (rawBitContent rv)
  . showStat "H = entropy" h
  . showStat "variance" sigma2
  . showStat "x <- sample X^100" x
  . showStat "h(x) / 100" hx100
  . showStat "t = |h(x) / 100 - H|" (abs $ hx100 - h)
  . showStat "with at most 20% probability, t exceeds" (sqrt $ sigma2 / (0.2 * 100))
  $ ""

showStat s f = (s++) . (": "++) . shows f . ('\n':)

#ifdef __HASTE__
main :: IO ()
main = withElems ["in", "out", "go"] $ \[iEl, oEl, goB] -> do
  let
    setInp s = setProp iEl "value" s >> setProp oEl "value" ""
    handle buttonId s = do
      Just b <- elemById buttonId
      void $ b `onEvent` Click $ const $ setInp s
  setInp english
  handle "english" english
  handle "fair" "h 1\nt 1\n"
  handle "foul" "h 9\nt 1\n"
  handle "2d6" twoDice
  void $ goB `onEvent` Click $ const $ do
    g <- newStdGen
    setProp oEl "value" . go g =<< getProp iEl "value"
#endif

Without spaces, English text takes about 4.7 bits per character, that is, if we interpret an N-letter string as a big base-26 number and convert it to binary, we wind up with about 4.7N bits.

Going by just the letter frequencies, we find the entropy is around 4.2 bits, so given enough text, we can save about 0.5 bits per character (over 10%) with negligible probability of data loss.
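
These are the first two statistics the demo prints when fed the table above: rawBitContent (parse english), which is \(\lg 26 \approx 4.70\), and entropy (parse english), which comes to roughly 4.2.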


Ben Lynn blynn@cs.stanford.edu 💡