# The Markov and Chebyshev Inequalities

We intuitively feel it is rare for an observation to deviate greatly from the expected value. Markov’s inequality and Chebyshev’s inequality place this intuition on firm mathematical ground.

I use the following graph to remember them.

Here, \(n\) is some positive number. The blue line (the function that takes the value \(0\) for all inputs below \(n\), and \(n\) otherwise) always lies under the green line (the identity function).

## Markov’s Inequality

Suppose \(n\) is a positive integer. Since the blue line lies under the green line:

Suppose the random variable \(X\) takes nonnegative integer values. Let \(p(i)\) be the probability of \(i\) occuring, and multiply the \(i\)th term by \(p(i)\):

In other words, we have *Markov’s inequality*:

The graph captures this inequality, and also makes it clear why equality is attained only when \(p(i) = 0\) for all \(i \ne 0, n\) (the only two points where the two functions agree).

The argument generalizes to any random variable that takes nonnegative values.

## Chebyshev’s Inequality

We start from Markov’s inequality:

Then we complain about the \(n\) in the denominator. The bound is so weak. Can’t we do better than a denominator that increases only linearly? Why not at least quadratically? Why not \(n^2\)?

No problem! We simply square everything:

[Actually, we cannot square quantities arbitrarily. More precisely, we have replaced \(n\) by \(n^2\), and are now considering the random variable \(X^2\) instead of \(X\).]

But nobody likes \(E[X^2]\); everybody prefers the related \(E[(X - E[X])^2]\), aka
the *variance*. In general, we
prefer *central moments* to *raw moments* because they describe our data
independently of translation.

Thus we replace \(X\) by \(X - E[X]\) to get Chebyshev’s Inequality:

We let \(\mu = E[X]\) and \(\sigma = \sqrt{E[(X-E[X])^2]}\), and substitute \(n\) with \(\sigma n\) so it looks better:

We can remove the restriction that \(X\) is nonnegative: the random variables \(X^2\) and \(|X - \mu|\) are nonnegative even if \(X\) is not.

## Higher Moments

A similar argument works for any power \(m\) higher than 2. None of them are named after anyone, as far as I know. For any random variable \(X\) and positive \(n\), we have:

and:

## An Exponential Improvement

No matter how high the moment, the bound we get only decreases polynomially. But it turns out we can derive a bound that decreases exponentially by combining all the moments together, so to speak:

Under certain conditions, manipulating this weighted sum of all raw moments leads to the Chernoff bound.

*blynn@cs.stanford.edu*💡