The Ghost of Statistics Past

Flip a coin 20 times.

  1. Suppose we see heads each time. This would cause me to strongly believe the coin is unfair.

  2. Suppose we get THHTHHHTTTHHTHTHHTHH. Let us call this particular sequence \(S\). Seeing \(S\) would not cause me to strongly believe the coin is unfair. We count 12 heads in \(S\), so the coin may favour heads, but there’s too little evidence to say. Indeed, I’d believe there’s still a decent chance that the coin may even favour tails slightly.

How can we mathematically formalize my beliefs? According to my undergrad textbook:

  1. For a fair coin, the probability of getting 20 heads in 20 flips is \(2^{-20}\), which is less than 1 in a million. This is small, so the coin is likely unfair.

  2. For a fair coin, the probability of seeing at least 12 heads is approximately 0.25. As 0.25 is not small, we lack significant evidence that the coin is unfair.

The authors later state in a carefully indented paragraph: “…the smaller the probability of a result as unusual as (or even more unusual than) the observed one, the stronger our feeling that the coin is a trick coin” (emphasis theirs). Or in more general terms: for a hypothesis \(H\) and data \(D\), if \(P(D|H)\) is small then \(H\) is likely false.

At the time, I accepted this principle without question: surely my university professors knew what they were doing! Today, I reject the above statement, for reasons we’ll discuss later.

From Contraposition to Contradiction

So that we can discuss it later, let’s name the above principle. Let’s call it quasi-contraposition, because one argument for it proceeds as follows. Suppose \(A\) implies \(B\). By the law of contraposition, if \(B\) is false, then it follows that \(A\) is false:

\[ A \rightarrow B \iff \neg B \rightarrow \neg A \]
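The equivalence can be checked mechanically by trying all four truth assignments. This is a minimal sketch; the names `implies` and `contrapositionHolds` are introduced here for illustration only.

```haskell
-- Material implication: a -> b is false only when a is true and b is false.
implies :: Bool -> Bool -> Bool
implies a b = not a || b

-- Compare A -> B with (not B) -> (not A) for every truth assignment.
contrapositionHolds :: Bool
contrapositionHolds =
  and [ implies a b == implies (not b) (not a)
      | a <- [False, True], b <- [False, True] ]
```

Evaluating `contrapositionHolds` in GHCi gives `True`. No such exhaustive check is available once "is likely" enters the picture, which is where the trouble starts.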

We could ostensibly replace \(A\) with \(H\), and \(B\) with “\(D\) is unlikely” to yield: \(H\) implies that \(D\) is unlikely, so if \(D\) is observed then it follows that \(H\) is likely false:

\[ H \rightarrow (\neg D \mbox{ is likely}) \iff D \rightarrow (\neg H \mbox{ is likely}) \]

For now, we work through the details of the procedure given in my textbook for the sequence \(S\). We’re instructed to use the binomial distribution for this problem, so we count 12 heads in \(S\), and compute \(P(X\ge 12)\), that is, the probability of seeing at least 12 heads in 20 flips of a fair coin:

\[ \sum_k {20 \choose k} 2^{-20} [12 \le k \le 20] \]

We can compute this with a little Haskell:

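> -- ch n k is the binomial coefficient "n choose k", computed in floating point.
> let { ch n 0 = 1; ch n k = n*(ch (n - 1) (k - 1)) / fromIntegral k }
> -- P(X >= 12): sum the coefficients for k = 12..20 and divide by 2^20.
> sum[ch 20 k | k <- [12..20]] / 2^20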

The probability is indeed a bit larger than 0.25, and since this is greater than 0.05, we deem our finding insignificant, that is, we conclude we lack strong evidence to dispute the hypothesis that the coin is fair.
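As a cross-check that avoids floating point altogether, the same tail probability can be computed as an exact fraction. This is only a sketch; `choose` and `fairTail` are names made up here, not anything from the textbook.

```haskell
import Data.Ratio ((%))

-- Binomial coefficient in exact integer arithmetic.
choose :: Integer -> Integer -> Integer
choose n k = product [n - k + 1 .. n] `div` product [1 .. k]

-- Exact probability of at least k heads in n flips of a fair coin.
fairTail :: Integer -> Integer -> Rational
fairTail n k = sum [choose n j | j <- [k .. n]] % 2 ^ n
```

Here `fairTail 20 12` is the fraction 263950/1048576 in lowest terms, and `fromRational (fairTail 20 12)` is about 0.2517.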

Lies of Omission

Although the end result matches our intuition, there are peculiarities in the procedure:

  1. The only fact we remember about the sequence \(S\) is the number 12. Why should all sequences containing exactly 12 heads be treated the same?

  2. Why do we compute \(P(X \ge 12)\)? Where did this inequality come from? We know there are exactly 12 heads and no more!

In other words, we deliberately throw away information. Twice.

Why do we wilfully neglect some of our data? I wonder what I would have answered as an undergrad. Perhaps: "The probability of seeing any particular sequence, such as \(S\), is always \(2^{-20}\), so focusing on a particular sequence obviously fails. Since each coin flip is independent of the others, it makes sense just to count the number of heads instead."

"As for the inequality: the probability of seeing exactly \(k\) heads is:

\[ P(X = k) = {20 \choose k} 2^{-20} \]

which is always too small to work with. If we replace it with \(P(X \ge k)\) for large \(k\) and \(P(X \le k)\) for small \(k\) then we get a probability that is tiny for extreme values of \(k\), but huge for reasonable values of \(k\). In other words, we get a number that can distinguish between likely and unlikely \(k\)."
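To put numbers on this argument, here is a minimal sketch for flips of a fair coin. The names `choose`, `pmf` and `tailProb` are introduced here, and switching tails at \(n/2\) is an assumption on my part, since the answer above only speaks of "large" and "small" \(k\).

```haskell
-- Binomial coefficient in exact integer arithmetic.
choose :: Integer -> Integer -> Integer
choose n k = product [n - k + 1 .. n] `div` product [1 .. k]

-- P(X = k) for n flips of a fair coin.
pmf :: Integer -> Integer -> Double
pmf n k = fromIntegral (choose n k) / 2 ^ n

-- One-sided tail: P(X >= k) when k is at least n/2, otherwise P(X <= k).
tailProb :: Integer -> Integer -> Double
tailProb n k
  | 2 * k >= n = sum [pmf n j | j <- [k .. n]]
  | otherwise  = sum [pmf n j | j <- [0 .. k]]
```

In GHCi, `pmf 20 10` is about 0.176, the largest value any single count can reach, whereas `tailProb 20 12` is about 0.25 and `tailProb 20 20` is below one in a million.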

In short, my younger self would say we do what we do because it works. We play with the data until we find a number that becomes tiny when conditions are extreme, so we have something that’s practical (we need only compare a number against 0.05) yet convincing (because it involves fancy mathematics).

Isn’t this intellectually unsatisfying? On the one hand, it certainly sounds better to say "following standard procedure, the P-value is less than 0.05; therefore we have significant evidence the hypothesis is false" instead of "\(k\) seems kind of extreme so the hypothesis is probably false". On the other hand, if we’re going to all this trouble to quantify how strongly we believe a hypothesis is true, why not do a proper job and justify each step, rather than settle on some ad hoc procedure?

Perhaps the procedure only appears ad hoc because its derivation is omitted as too complicated for students fresh out of high school. Let’s suppose this is the case and try to derive probability theory from first principles, one of which the authors insist is quasi-contraposition.

The probability of seeing any particular sequence of 20 flips such as \(S\) is \(2^{-20}\), which is tiny. By quasi-contraposition, seeing an "unusual" outcome means the coin is likely a trick coin. So no matter what, we should always believe the coin is unfair!

By the same token, consider rolling a fair \(2^{20}\)-sided die. After a single roll, we see a number that has a \(2^{-20}\) chance of showing up. Since this is a “small” number, are we to conclude that the die is unfair after all?

The inescapable conclusion: quasi-contraposition is wrong. Evidently, there are times when all outcomes are "unusual", but one of them has to occur.
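The failure shows up after very few flips. The sketch below applies quasi-contraposition naively to the exact sequence observed, using the 0.05 cutoff from earlier as the meaning of "small"; `naiveRejects` and `firstAlwaysRejected` are hypothetical names introduced for illustration.

```haskell
-- The naive rule: call the fair-coin hypothesis "likely false" whenever the
-- probability of the exact observed sequence of n flips is below a threshold.
-- Every sequence of n fair flips has probability 0.5 ^ n.
naiveRejects :: Double -> Int -> Bool
naiveRejects threshold n = 0.5 ^ n < threshold

-- Smallest number of flips at which every possible sequence is rejected.
-- Assumes threshold > 0; otherwise the search never terminates.
firstAlwaysRejected :: Double -> Int
firstAlwaysRejected threshold = head [n | n <- [0 ..], naiveRejects threshold n]
```

Here `firstAlwaysRejected 0.05` is 5: from five flips onwards, the naive rule condemns a fair coin no matter which sequence comes up.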

Master Probability With This One Weird Trick

If quasi-contraposition is wrong, then what should we do instead?

Whatever it is, it must capture our intuition. If we flip a coin 20 times and see 20 heads, we suspect the coin is unfair. If we see the sequence \(S\), we are much less suspicious. Either event occurs with probability \(2^{-20}\) so there must be other information that affects our beliefs. What could it be?

The answer is that we are aware that trick coins exist, and willing to entertain the possibility that the coin in question is such a coin. For a fair coin, the probability of seeing 20 heads in a row is \(2^{-20}\), but for certain trick coins the probability is much higher. Indeed, an extremely unfair coin might show heads every time. We think: "Is this a fair coin that just happened to come up heads every time, or is this a trick coin that heavily favours heads? Surely the latter is likelier!"

How about the sequence \(S\)? For a fair coin, the probability of seeing the sequence \(S\) is also \(2^{-20}\), but unlike the previous case, we intuitively feel the probability of seeing \(S\) is not much higher for a trick coin. In fact, it turns out the probability of seeing \(S\) maxes out for a coin that shows heads with probability \(12/20\), at a value that is less than double \(2^{-20}\) (exercise). We think: "The sequence \(S\) is unlikely for a fair coin, but it’s about as unlikely or far more unlikely for trick coins too."
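A quick numerical check of this claim (not a solution to the exercise) is sketched below. The names `s`, `seqProb` and `ratioToFair` are introduced here for illustration.

```haskell
-- The observed sequence S from the text: 12 heads and 8 tails.
s :: String
s = "THHTHHHTTTHHTHTHHTHH"

-- Probability of seeing exactly these flips, in this order, from a coin
-- that shows heads with probability p.
seqProb :: Double -> String -> Double
seqProb p flips = product [if c == 'H' then p else 1 - p | c <- flips]

-- How many times likelier S is under bias p than under a fair coin.
ratioToFair :: Double -> Double
ratioToFair p = seqProb p s / seqProb 0.5 s
```

In GHCi, `ratioToFair 0.6` is about 1.496, and `maximum (map ratioToFair [0.01,0.02..0.99])` does not exceed it. For comparison, `seqProb 1 (replicate 20 'H') / seqProb 0.5 (replicate 20 'H')` is \(2^{20}\): a coin that always shows heads explains 20 heads over a million times better than a fair coin does.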

We can mathematically formalize these thoughts with one simple trick. Rather than \(P(D|H)\), we flip it around and ask for \(P(H|D)\). In other words, given the data, we find a number that represents how strongly we believe the hypothesis is true.

The probability \(P(H|D)\) is the one true principle we’ve been seeking. It’s the truth, the whole truth, and nothing but the truth. It’s the number that represents how strongly we should believe \(H\), given what we’ve seen so far. With \(P(H|D)\), the difficulties we encountered simply melt away.

Worked Example

We can compute \(P(H|D)\) with Bayes' Theorem:

\[ P(H|D) = P(H) P(D|H) / P(D) \]

Thus our previous work has not been in vain. Computing \(P(D|H)\) is useful; it’s just not our final answer.

What about \(P(D)\)? This is the probability that \(D\) occurs, but without assuming any hypothesis in particular. Or, more accurately, with default degrees of belief in each possible hypothesis; degrees of belief held prior to examining the evidence \(D\). Similarly, \(P(H)\) is how strongly we believe \(H\) to be true in the absence of the data \(D\).
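Spelled out, if \(H_1, H_2, \ldots\) are the mutually exclusive hypotheses we are prepared to consider (the index \(i\) is notation introduced here), then \(P(D)\) is the prior-weighted average of the individual likelihoods:

\[ P(D) = \sum_i P(H_i) P(D|H_i) \]

This is just the law of total probability. It assumes the hypotheses cover every possibility we allow for, so that the \(P(H_i)\) sum to 1.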

Let us say we are willing to consider the following 11 hypotheses: the coin shows heads with one of the probabilities 0, 0.1, 0.2, …, 1. Furthermore we believe each possibility is equally likely.

First suppose our data \(D\) is 20 heads in 20 coin flips. As before, let \(H\) be the hypothesis that the coin is fair. We find:

\[ P(D) = \frac{1}{11} \sum_p p^{20} [p \in \{0, 0.1, ..., 1\}] \]

which is:

> sum[p^20 | p <- [0,0.1..1]] / 11

We have \(P(D|H) = 2^{-20}\), and \(P(H) = 1/11\), hence:

\[ P(H|D) = (1/11) \times 2^{-20} / 0.103... = 8.41... \times 10^{-7} \]

In other words, our belief that the coin is fair has dropped from \(1/11\) to less than one in a million.

Now suppose our data \(D\) is the sequence \(S\). This time:

\[ P(D) = \frac{1}{11} \sum_p p^{12} (1-p)^8 [p \in \{0, 0.1, ..., 1\}] \]

which is:

> sum[p^12*(1 - p)^8 | p <- [0,0.1..1]] / 11

Even though \(P(D|H)\) is again \(2^{-20}\), we find:

\[ P(H|D) = (1/11) \times 2^{-20} / (3.43... \times 10^{-7}) = 0.252... \]

Thus our belief that the coin is fair has increased from \(1/11\) to over \(1/4\).

The Bayesian approach has outdone my textbook. We get meaningful results without throwing away any information. We used the entire sequence, not just the number of heads. No inequalities were needed.
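Nothing forces us to stop at the fair-coin hypothesis: the same calculation yields our updated belief in all 11 hypotheses at once. The following is a minimal sketch under the assumptions above (11 candidate biases, uniform prior); `biases`, `likelihood` and `posterior` are names introduced here.

```haskell
-- The 11 candidate biases 0, 0.1, ..., 1, built from integers to avoid
-- rounding drift in floating-point enumeration.
biases :: [Double]
biases = [fromIntegral i / 10 | i <- [0 .. 10 :: Int]]

-- Probability of one specific sequence containing h heads and t tails,
-- for a coin that shows heads with probability p.
likelihood :: Int -> Int -> Double -> Double
likelihood h t p = p ^ h * (1 - p) ^ t

-- Posterior belief in each bias after seeing h heads and t tails,
-- starting from a prior of 1/11 on each.
posterior :: Int -> Int -> [(Double, Double)]
posterior h t = [(p, w / total) | (p, w) <- zip biases weights]
  where
    weights = [likelihood h t p / 11 | p <- biases]  -- P(H_i) * P(D|H_i)
    total   = sum weights                            -- this is P(D)
```

In GHCi, `posterior 12 8 !! 5` pairs the bias 0.5 with a belief of about 0.252, and `posterior 20 0 !! 5` pairs it with about 8.41e-7, matching the figures above.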

Wilful Negligence

What if we discard information anyway, and only use the fact that exactly 12 heads were flipped? In this case, we find:

\[ P(D) = \frac{1}{11} \sum_p {20 \choose 12} p^{12} (1-p)^8 [p \in \{0, 0.1, ..., 1\}] \]

and \(P(D|H) = {20 \choose 12} 2^{-20}\). When computing \(P(H|D)\), the factor \({20 \choose 12}\) cancels out, and we arrive at the same answer. In other words, we’ve shown it’s fine to forget the particular sequence and only count the number of heads after all. What is not fine is doing so without justification.
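Written out, with \(p\) ranging over \(\{0, 0.1, ..., 1\}\) as before, the cancellation is:

\[ P(H|D) = \frac{\frac{1}{11} {20 \choose 12} 2^{-20}}{\frac{1}{11} \sum_p {20 \choose 12} p^{12} (1-p)^8} = \frac{2^{-20}}{\sum_p p^{12} (1-p)^8} \]

The right-hand side is what we get from the full sequence \(S\) once the two factors of \(1/11\) cancel, namely \(0.252...\)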

It is also reassuring that using all available information gives an answer that is at least as good as using only partial information (in this case, they agree). Contrast this to quasi-contraposition, which leads to nonsense if we focus on a particular sequence of flips.

What if we go further and introduce inequalities as before? The probability that we see at least 12 heads over all possible coins is:

\[ P(D) = \sum_k \frac{1}{11} \sum_p {20 \choose k} p^k (1-p)^{20-k} [p \in \{0, 0.1, ..., 1\}, 12 \le k \le 20] \]

And for a fair coin:

\[ P(D|H) = \sum_k {20 \choose k} 2^{-20} [12 \le k \le 20] \]

We find:

> let pd = sum[(ch 20 k)*p^k*(1-p)^(20-k) | p <- [0,0.1..1], k <- [12..20]] / 11
> pd
> let pdh = sum[(ch 20 k) / 2^20 | k <- [12..20]]
> pdh

and hence \(P(D|H)/P(D) = 0.578...\) which means \(P(H|D)\) is smaller than \(P(H)\). That is, the evidence weakens our belief that the coin is fair. Recall seeing exactly 12 heads strengthens our belief that the coin is fair, so by introducing an inequality, we discard so much information that our conclusion runs contrary to the truth.
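To set the two summaries of the data side by side, the sketch below computes the factor \(P(D|H)/P(D)\) for any set of head counts, using exact fractions internally. It assumes the same 11 equally likely biases; the names `choose`, `biases`, `pExactly` and `updateFactor` are introduced here for illustration.

```haskell
import Data.Ratio ((%))

-- Binomial coefficient in exact integer arithmetic.
choose :: Integer -> Integer -> Integer
choose n k = product [n - k + 1 .. n] `div` product [1 .. k]

-- The 11 candidate biases as exact fractions 0/10, 1/10, ..., 10/10.
biases :: [Rational]
biases = [i % 10 | i <- [0 .. 10]]

-- P(exactly k heads in 20 flips) for a coin with bias p.
pExactly :: Integer -> Rational -> Rational
pExactly k p = fromIntegral (choose 20 k) * p ^ k * (1 - p) ^ (20 - k)

-- P(D|H) / P(D), where D is "the head count lies in ks", H is "the coin is
-- fair", and the prior is uniform over the 11 biases.
updateFactor :: [Integer] -> Double
updateFactor ks = fromRational (pdh / pd)
  where
    pdh = sum [pExactly k (1 % 2) | k <- ks]
    pd  = sum [pExactly k p | p <- biases, k <- ks] / 11
```

In GHCi, `updateFactor [12]` is about 2.78, so exactly 12 heads multiplies our belief in a fair coin nearly threefold, while `updateFactor [12..20]` is about 0.578, so "at least 12 heads" nearly halves it.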

The above is enough for me to shun my frequentist textbook and join the "Bayesian revolution":

  • It is natural to ask if given evidence strengthens or weakens a hypothesis, and by how much, rather than merely decide if a result is "significant". All else being equal, I’d choose the method that can handle this over the one that can’t.

  • We saw that discarding information can hurt our results. In our example, frequentism preserved enough data to lead to an acceptable conclusion, but do we trust it to work for other problems? How do we know it hasn’t thrown away too much data?

  • The frequentist approach fails to mirror the way I think. Frequentism is like doing taxes: a bunch of arbitrary laws and procedures which we follow to get some number that we hope is right.

  • The Bayesian approach matches my intuition, and feels like a generalization of logical reasoning.

  • The Bayesian approach forces us to be explicit about our assumptions, such as 11 equally likely hypotheses. With frequentism, somebody assumed something long ago, figured some stuff out, and handed us a distribution and a procedure. Who knows what the implicit assumptions are?

Further Reading

See Probability Theory: The Logic of Science by E. T. Jaynes. Laplace wrote: "Probability theory is nothing but common sense reduced to calculation". Jaynes explains how and why, though a vital step in his argument, Cox’s Theorem, turns out to require more axioms.

Ben Lynn