Information Theory

What is information?

One example might be knowing your birthday. If you viewed it as a sort of game, how many points would you award someone guessing your birthday right away? With apologies to those born on February 29th, a reasonable answer might be 365 points.

How about guessing just the month? We might say this is worth 12 points, but this is slightly unfair since some months are longer than others. Thus guessing December correctly, for example, ought to be worth a touch less: 365/31 points.

Suppose we know the birthday is in December. How many points should be awarded for guessing the day? Surely 31 points. So if someone guesses December correctly and then guesses the day correctly, they get 365/31 points and then 31 points. The product is 365 points, which is exactly the reward for guessing the birthday in one shot.

What if their first guess is wrong? Then if they get it right on the next go, it seems they should only get 364 points, because they already knew one guess to avoid. Thus we could say the value of knowing one unbirthday is 365/364 points, so that we recover 365 when we multiply by 364.

If we continue this line of thought, our wish list for some kind of measure of information content might be:

  • The information content of a fact X determines the information content of the negation of X.

  • If there is more than one way to calculate the information content, then they must all lead to the same result.

  • The information content of the conjunction of A and B given background knowledge X is determined by the information content of A given X, and the information content of B given we know A and X.

…​and so on, leading to Cox’s Theorem, from which we conclude:

information content = probability

However, information theorists take logarithms of probabilities before working with them. In his seminal paper A Mathematical Theory of Communication, Claude Shannon cites Hartley (Transmission of Information) and observes that in the context of communication, logarithms dovetail better with:

  1. Practice: "time, bandwidth, number of relays, etc. tend to vary linearly with the logarithm".

  2. Intuition: "two punched cards should have twice the capacity of one".

  3. Mathematics: calculations often involve multiplying by the number of possibilities; following Napier, we use logarithms to simplify computations.

We make a couple more aesthetic choices. Firstly, the information content ought to go up when a fact is harder to guess. Thus we start with reciprocals of probabilities, or equivalently, discard the minus signs after taking logarithms of probabilities.

Secondly, as Shannon had computers in mind, he chose base 2 logarithms, and called the unit of information a bit, a word coined by John Tukey.

The Shannon information content of an outcome \(x\) is:

\[ h(x) = \lg \frac{1}{P(x)} \]

{-# LANGUAGE CPP #-}
#ifdef __HASTE__
import Haste.DOM
import Haste.Events
import Control.Monad (void)
#endif
import Control.Monad (replicateM)
import Data.Maybe (catMaybes)
import Data.List (sortOn, genericLength)
import System.Random

-- Base-2 logarithm, and the Shannon information content of an outcome
-- with probability p.
lg = logBase 2
shannon p = -(lg p)

where \(P(x)\) is the probability of \(x\) occurring.
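
For instance, revisiting the birthday game (a small sketch; these names are ours, not part of the demo below):

-- Bits for guessing the whole birthday, the month of December, and the day
-- within December. The first is the sum of the other two: logarithms turn
-- the game's multiplicative points into additive bits.
birthdayBits, monthBits, dayBits :: Double
birthdayBits = shannon (1/365)  -- about 8.51 bits
monthBits    = shannon (31/365) -- about 3.56 bits
dayBits      = shannon (1/31)   -- about 4.95 bits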

Let \(X\) be a random variable and \(\mathcal{A}_X\) the set of possible values of \(X\). We call \(\mathcal{A}_X\) the alphabet of \(X\), though in general this term may be a bit misleading as \(\mathcal{A}_X\) could be, for example, the set of all 5-letter strings.

The entropy of \(X\) is the average information content:

\[ H(X) = \sum P(x) h(x) [x \in \mathcal{A}_X] \]

-- Association list of symbols and their probabilities.
type RandomVariable a = [(a, Double)]
mean :: (Double -> Double) -> RandomVariable a -> Double
mean f rv = sum [p * f p | (_, p) <- rv]
variance :: (Double -> Double) -> RandomVariable a -> Double
variance f rv = sum [p*(f p - mu)^2 | (_, p) <- rv] where mu = mean f rv
entropy :: RandomVariable a -> Double
entropy rv = mean shannon rv

The raw bit content of X is

\[ H_0(X) = \lg |\mathcal{A}_X| \]

size = genericLength
rawBitContent rv = lg $ size rv

This measures the room we need to record one symbol of \(X\).

Exercise: Show \( 0 \le H(X) \le H_0(X) \). When do we have equality on either side?

The answers follow directly from the mathematics, but we can interpret them qualitatively. Consider a lopsided distribution where one symbol almost always appears while the others are rare. Then the average symbol will be the common one, which gives us almost no information. In the extreme case, the other symbols are impossible so we learn nothing from the only symbol that exists. In contrast, if the distribution is uniform, we have no idea what symbol to expect next (assuming at least two symbols).
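
As a quick numerical check of these bounds (a sketch using the functions above; the two distributions are invented for illustration):

-- A uniform distribution attains the raw bit content, while a lopsided one
-- has entropy near zero: roughly [(2.0,2.0),(0.24,2.0)].
boundsCheck :: [(Double, Double)]
boundsCheck = [(entropy rv, rawBitContent rv) | rv <- [uniform, lopsided]]
  where
  uniform  = [(c, 1/4) | c <- "abcd"]
  lopsided = [('a', 0.97), ('b', 0.01), ('c', 0.01), ('d', 0.01)]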

Lossy Compression

Ever filled out a form asking you to write one letter per box? This is perhaps the simplest lossy compression scheme. It works when long names are rare. In a fixed amount of space, the form accurately records everything except for some improbable cases.

What if we reject any improbable message, long or short? Then a small amount of space may hold a long message, provided it is not too improbable. For example, suppose a one-million-bit message is unlikely to contain more than a few 1s; perhaps it describes the inhabitants of a city of population one million who have won a lottery jackpot. Then by storing the positions of any 1s, perhaps five 20-bit numbers may suffice for describing almost any one-million-bit message, since 20 bits are enough to index any of the million positions (\(2^{20} > 10^6\)).

We generalize this idea. The smallest \(\delta\)-sufficient subset \(S_\delta\) is the smallest subset of \(\mathcal{A}_X\) such that:

\[ P(x \in S_\delta) \ge 1 - \delta \]

We can compute this by sorting the symbols in order of increasing probability, then discarding the least probable symbols for as long as their total probability stays at most \(\delta\). In our lottery example, if we consider each possible one-million-bit string to be a symbol, then the only strings left standing are those containing at most a few 1s.

-- Running cumulative probabilities, with symbols sorted from least to most probable.
cumu rv = scanl1 (\(_,p) (b,q) -> (b,p + q)) $ sortOn snd rv
-- Discard the least probable symbols while their total probability stays at most delta.
sufficient delta rv = map fst $ dropWhile ((<= delta) . snd) $ cumu rv

The essential bit content of X is:

\[ H_\delta(X) = \lg |S_\delta| \]

essential delta rv = lg $ size $ sufficient delta rv

That is, if we are willing to risk failure with probability \(\delta\), then we only need \(H_\delta(X)\) bits.
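
To see the savings concretely, we can build the random variable for several independent outcomes and reuse the functions above (a rough sketch; power, fourTosses, and the 90/10 coin are our own illustrative choices, not part of the demo):

-- The random variable X^n: strings of n independent symbols of X.
power :: RandomVariable a -> Int -> RandomVariable [a]
power rv n = [(map fst xs, product $ map snd xs) | xs <- replicateM n rv]

-- Four tosses of a 90/10 coin: recording them exactly takes lg 16 = 4 bits,
-- but accepting a 1% risk of failure cuts this to lg 11, about 3.46 bits.
fourTosses :: (Double, Double)
fourTosses = (essential 0 rv, essential 0.01 rv)
  where rv = power [('h', 0.9), ('t', 0.1)] 4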

Typicality

Let \(H = H(X)\) be the entropy of \(X\). Let \(X^N\) be the random variable whose outcome is \(N\) independent outcomes of \(X\). How much can we compress the messages of \(X^N\) with at most a probability \(\delta\) of failure?

By definition, the best answer is \(H_\delta(X^N)\), but reasoning about this quantity is difficult, so we sneak up on it by showing a message is most likely a member of a certain typical set, and studying the asymptotic limits of the size of this typical set.

Define the typical set \(T_{N\beta}\) by:

\[ T_{N \beta} = \left\{ x \in \mathcal{A}_{X^N} : \left| \frac{h(x)}{N} - H \right| < \beta \right\} \]

-- All length-n strings over the alphabet of rv whose per-symbol information
-- content is within beta of the entropy. Exhaustive search, so only for small n.
typical rv n beta = [s | s <- replicateM n $ map fst rv,
  abs (info rv s/fromIntegral n - h) < beta]
  where h = entropy rv

We compute \(h(x)\) for an outcome \(x\) of \(X^N\) by summing the Shannon information contents of each of the \(N\) independent outcomes of \(X\):

pr rv sym = maybe (error "bad symbol") id $ lookup sym rv
info rv s = sum $ shannon . pr rv <$> s

We may think of the typical set as the set of all strings whose information content is close to the expected information content \(N H\).
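
For a small alphabet and short strings we can enumerate the typical set directly (an illustrative sketch with made-up parameters):

-- Ten tosses of a 90/10 coin with beta = 0.2: only the strings with exactly
-- one 't' are typical, so this evaluates to 10.
typicalCount :: Int
typicalCount = length $ typical [('h', 0.9), ('t', 0.1)] 10 0.2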

It turns out the typical set is a Goldilocks set. There may be messages more likely than those in the typical set, but they are so few in number that they are seldom seen. There may be many messages less likely than those in the typical set, but they are so improbable that they are also seldom seen.

We now see why it pays to design definitions to fit with standard probability theory. For an outcome \(x\) of \(X^N\), the weak law of large numbers (via Chebyshev's inequality) gives:

\[ P(x \in T_{N\beta}) \ge 1 - \frac{\sigma^2 }{\beta^2 N} \]

where \(\sigma^2\) is the variance of the Shannon information content of \(X\). As claimed, a string most likely lies in the typical set.
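
The bound also tells us how many symbols we need before the typical set captures a given amount of probability. Here is a small helper that inverts the inequality (a sketch; enoughSymbols is our own name, not part of the demo):

-- Smallest N with sigma^2 / (beta^2 N) <= delta, so that the bound above
-- guarantees P(x in T_{N beta}) >= 1 - delta.
enoughSymbols :: Double -> Double -> RandomVariable a -> Int
enoughSymbols beta delta rv = ceiling $ variance shannon rv / (beta^2 * delta)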

We undo the logarithms to express the typical set’s defining inequality in terms of probabilities: multiply \(|h(x)/N - H| < \beta\) through by \(N\), exponentiate base 2, and take reciprocals, which reverses the inequalities:

\[ 2^{-N(H+\beta)} \le P(x) \le 2^{-N(H-\beta)} \]

The probabilities of the members of \(T_{N\beta}\) sum to at most 1, so the left inequality implies:

\[ |T_{N\beta}| 2^{-N(H+\beta)} < 1 \]

so:

\[ |T_{N\beta}| < 2^{N(H+\beta)} \]

Asymptotic equipartition principle: For sufficiently large \(N\), the outcome \(x\) of \(N\) i.i.d. random variables \(X^N = (X_1, \ldots, X_N)\) almost certainly belongs to a certain set of size \(2^{N H}\), each member with a probability "close" to \(2^{-N H}\).

More formally, given \(\epsilon > 0\) and a probability \(\delta > 0\), choose \(\beta = \epsilon\) and \(N_0\) so that \(P(T_{N\beta}) \ge 1 - \delta\) when \(N \ge N_0\), that is, \(\frac{\sigma^2}{\epsilon^2 N_0} \le \delta\). Then:

\[ \lg |T_{N\beta}| < N(H + \epsilon) \]

We place scare quotes around "close" because the probabilities of two elements in the typical set can differ by a factor that is exponential in our parameters.

Shannon’s source coding theorem

The bound for \(|T_{N\beta}|\) immediately yields an upper bound for the size of the best possible set for lossy compression:

\[ H_\delta(X^N) \le \lg |T_{N\beta}| < N(H + \epsilon) \]

How about the other side? Can \(S_\delta\), or any other set, be significantly smaller than the typical set but still compress just as well?

Let \(S\) be a set such that \(|S| \le 2^{N(H - 2\beta)}\). Then

\[\begin{align} P(x\in S) &= P(x\in S \cap T_{N\beta}) + P(x\in S\cap \overline{T_{N\beta}}) \\ &\le |S| \max \{ P(x) : x \in T_{N\beta} \} + P(x \notin T_{N\beta}) \\ &\le 2^{N(H-2\beta)} 2^{-N(H-\beta)} + \frac{\sigma^2}{\beta^2 N} \\ &= 2^{-N\beta} + \frac{\sigma^2}{\beta^2 N} \end{align}\]

Hence for any given \(\delta < 1\), we can choose \(\beta\) and \(N_0\) so that \(P(x \in S) < 1 - \delta\) for all \(N \ge N_0\); that is, \(S\) fails to compress as well as \(S_\delta\), or the typical set.

Summing up, we have Shannon’s source coding theorem: Given \(\epsilon > 0\) and \(0 < \delta < 1\) there exists a positive integer \(N_0\) such that for all \(N > N_0\):

\[ \left| \frac{H_\delta (X^N)}{N} - H \right| < \epsilon \]

In other words, with a negligible risk of information loss, we can compress a string of \(N\) symbols to about \(N H\) bits, and no better.
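
As a rough numeric illustration (reusing the hypothetical power helper sketched earlier; the figures are approximate), take the 90/10 coin, whose entropy is about 0.47 bits:

-- Essential bit content per symbol for ten tosses at delta = 0.1: roughly
-- 0.56, still above H. The theorem says this ratio tends to H as N grows.
perSymbol :: Double
perSymbol = essential 0.1 (power [('h', 0.9), ('t', 0.1)] 10) / 10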

Examples

See David MacKay, Information Theory, Inference, and Learning Algorithms, which explains all the above in detail.

Our interactive demo above gives concrete numbers.

We expect symbols (single characters) and their frequencies, which we read with a primitive parser:

parse :: String -> RandomVariable Char
parse s = (\(c, f) -> (c, fromIntegral f / fromIntegral total)) <$> tab
  where
  tab = catMaybes $ charFreq . filter (/= ' ') <$> lines s
  total = sum $ snd <$> tab
  charFreq "" = Nothing
  charFreq (c:n) = Just (c, read n)
english = unlines
  [ "a 8167"
  , "b 1492"
  , "c 2782"
  , "d 4253"
  , "e 12702"
  , "f 2228"
  , "g 2015"
  , "h 6094"
  , "i 6966"
  , "j 0153"
  , "k 0772"
  , "l 4025"
  , "m 2406"
  , "n 6749"
  , "o 7507"
  , "p 1929"
  , "q 0095"
  , "r 5987"
  , "s 6327"
  , "t 9056"
  , "u 2758"
  , "v 0978"
  , "w 2360"
  , "x 0150"
  , "y 1974"
  , "z 0077"
  ]
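
As a quick check, the parser handles the fair-coin table used by one of the demo buttons (parseCheck is our own name, unused by the demo):

-- Evaluates to [('h',0.5),('t',0.5)].
parseCheck :: RandomVariable Char
parseCheck = parse "h 1\nt 1\n"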

Another example is rolling two six-sided dice; since our parser expects single-character symbols, we write the totals 10, 11, and 12 as a, b, and c:

twoDice = unlines
  [ "2 1"
  , "3 2"
  , "4 3"
  , "5 4"
  , "6 5"
  , "7 6"
  , "8 5"
  , "9 4"
  , "a 3"
  , "b 2"
  , "c 1"
  ]

We show the entropy and related statistics of the corresponding random variable. Next, we randomly generate a message \(x\) of length 100 according to the table, and compute how far its Shannon information content per symbol deviates from the entropy, that is, the criterion used to define the typical set. We call this quantity \(t\).

We also compute a bound that \(t\) exceeds with at most 20% probability in a message of length 100. Experimentally, the bound seems loose, which is perhaps unsurprising as the bound derives from the Weak Law of Large Numbers. Our main goal is to establish a theoretical limit for lossy compression, rather than produce a practical algorithm. Indeed, it’s unclear to me how to efficiently encode or decode a member of a typical set \(T\) so it takes about \(\lg |T|\) bits.

-- Sample n symbols from rv by inverse transform sampling: draw a uniform r
-- between 0 and 1 and take the first symbol whose cumulative probability
-- reaches r.
roll rv n g
  | n == 0 = []
  | otherwise = x:roll rv (n - 1) g'
  where
  (r, g') = random g
  x = fst $ head $ dropWhile ((< r) . snd) $ cumu rv

go g s = let
  h = entropy rv
  sigma2 = variance shannon rv
  x = roll rv 100 g
  hx100 = info rv x / 100
  rv = parse s in
    showStat "raw bit content" (rawBitContent rv)
  . showStat "H = entropy" h
  . showStat "variance" sigma2
  . showStat "x <- sample X^100" x
  . showStat "h(x) / 100" hx100
  . showStat "t = |h(x) / 100 - H|" (abs $ hx100 - h)
  . showStat "with at most 20% probability, t exceeds" (sqrt $ sigma2 / (0.2 * 100))
  $ ""

showStat s f = (s++) . (": "++) . shows f . ('\n':)

#ifdef __HASTE__
main :: IO ()
main = withElems ["in", "out", "go"] $ \[iEl, oEl, goB] -> do
  let
    setInp s = setProp iEl "value" s >> setProp oEl "value" ""
    handle buttonId s = do
      Just b <- elemById buttonId
      void $ b `onEvent` Click $ const $ setInp s
  setInp english
  handle "english" english
  handle "fair" "h 1\nt 1\n"
  handle "foul" "h 9\nt 1\n"
  handle "2d6" twoDice
  void $ goB `onEvent` Click $ const $ do
    g <- newStdGen
    setProp oEl "value" . go g =<< getProp iEl "value"
#endif

Without spaces, English text takes about 4.7 bits per character, that is, if we interpret an N-letter string as a big base-26 number and convert it to binary, we wind up with about 4.7N bits.

Going by just the letter frequencies, we find the entropy is around 4.2 bits, so given enough text, we can save about 0.5 bits per character (over 10%) with negligible probability of data loss.
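
These figures follow from the definitions above (a quick sketch; englishStats is our own name and the values are rounded):

-- Roughly (4.70, 4.2): the raw bit content lg 26 versus the entropy of the
-- letter-frequency table.
englishStats :: (Double, Double)
englishStats = (rawBitContent rv, entropy rv)
  where rv = parse english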


Ben Lynn blynn@cs.stanford.edu 💡