## The Markov and Chebyshev Inequalities

We intuitively feel it is rare for an observation deviate greatly from the expected value. Markov’s inequality and Chebyshev’s inequality place this intuition on firm mathematical ground.

I use the following graph to remember them.

Here, $n$ is some positive number. The blue line (the function that takes the value $0$ for all inputs below $n$, and $n$ otherwise) always lies under the green line (the identity function).

## Markov’s Inequality

Suppose $n$ is a positive integer. Consider the output of the above functions on inputs 0, 1, 2, … On the blue line, we have:

($n$ zeroes followed by an infinite sequence of $n$s), while on the green line we have:

Suppose the random variable $X$ takes nonnegative integer values. Let $p(i)$ be the probability of $i$ occuring. For both sequences, multiply the $i$th term by $p(i)$, and compare their sums:

In other words, we have *Markov’s inequality*:

The graph captures this inequality, and also makes it clear why equality is attained only when $p(i) = 0$ for all $i \ne 0, n$ (the only two points where the two functions agree).

The argument generalizes to any random variable that takes nonnegative values.

## Chebyshev’s Inequality

We start from Markov’s inequality:

Then we complain about the $n$ in the denominator. The bound is so weak. Can’t we do better than a denominator that increases only linearly? Why not at least quadratically? Why not $n^2$?

No problem! We simply square everything:

[Actually, we cannot square quantities arbitrarily. More precisely, we have replaced $n$ by $n^2$, and are now considering the random variable $X^2$ instead of $X$.]

But nobody likes $E[X^2]$; everybody prefers the related $E[(X - E[X])^2]$, aka
the *variance*. In general, we
prefer *central moments* to *raw moments* because they describe our data
independently of translation.

Thus we replace $X$ by $X - E[X]$ to get Chebyshev’s Inequality:

We let $\mu = E[X]$ and $\sigma = \sqrt{E[(X-E[X])^2]}$, and substitute $n$ with $\sigma n$ so it looks better:

Observe we can remove the restriction that $X$ is nonnegative: above, we view $X^2$ and $|X - \mu|$ as random variables, which are nonnegative even if $X$ is not.

## Higher Moments

A similar argument works for any power $m$ higher than 2. None of them are named after anyone, as far as I know. For any random variable $X$ and positive $n$, we have:

and:

## An Exponential Improvement

No matter how high the moment, the bound we get only decreases polynomially. But it turns out we can derive a bound that decreases exponentially by combining all the moments together, so to speak:

Under certain conditions, manipulating this weighted sum of all raw moments leads to the Chernoff bound.