The Power Method

Recall that, inspired by fixed-point theorems, we repeatedly applied a linear map to argue that eigenvectors exist.

Might this idea be as useful in practice as it is in theory? Let’s examine a simple case. Let \( \newcommand{\x}{\mathbf{x}} \newcommand{\v}{\mathbf{v}} \newcommand{\e}{\mathbf{e}} D\) be a diagonal matrix, and let \(\lambda_1, …​, \lambda_n\) be the entries on the diagonal. Take any vector \(\x\) written with respect to the standard basis:

\[ \x = c_1 \e_1 + …​ + c_n \e_n \]

We have:

\[ D^k (c_1 \e_1 + …​ + c_n \e_n) = \lambda_1^k c_1 \e_1 + …​ + \lambda_n^k c_n \e_n \]

Suppose \(|\lambda_1| > |\lambda_i| \) for all \(i > 1.\) Then as \(k\) increases, the first term dominates:

\[ D^k \x = \lambda_1^k \left(c_1 \e_1 + \left(\frac{\lambda_2}{\lambda_1}\right)^k c_2 \e_2 + …​
\left(\frac{\lambda_n}{\lambda_1}\right)^k c_n \e_n\right) \approx \lambda_1^k c_1 \e_1 \]

Hence starting from an arbitrary \(\x\) with \(c_1 \ne 0\), repeatedly applying \(D\) and normalizing the output vector converges to the eigenvector \(\e_1\).

More generally, suppose \(n\times n\) matrix \(T\) is diagonalizable, and furthermore, one of its eignenvalues \(\lambda\) is the dominant eigenvalue, that is, \(|\lambda|\) is larger than the magnitude of any other eigenvalue.

Let \(\v\) be a unit eigenvector corresponding to \(\lambda\) Then given an approximation \(\x\) to \(\v\), we can refine it by computing:

\[ \x' = \frac{T \x}{|T \x|} \]

jsEval "curl_module('Matrix.ob')"
import Matrix

warm t x = map (recip (norm y) *) y where
  y = matrixFun t x

This can be seen by replacing \(\e_1,…​,\e_n\) in the equations above with unit eigenvectors \( \v, \v_2, …​, \v_n\) corresponding to the eigenvalues \(\lambda, \lambda_2, …​, \lambda_n\). We have:

\[ \x = c \v + c_2 \v_2 + …​ + c_n \v_n \]

for some \( c, c_2, …​, c_n\). Then since \(T\) is diagonal with respect to the unit eigenvectors, iterating the above computation \(k\) times yields:

\[ \frac{T^k \x}{|T^k \x|} = \frac{\lambda^k c \v}{|\lambda^k c \v|} + \frac{\lambda^k c_2 \v_2}{|\lambda^k c_2 \v_2|} + …​ + \frac{\lambda^k c_n \v_n}{|\lambda^k c_n \v_n|} \]

The first term dominates the others, because as \(k\) increases \(|\lambda_i^k| / |\lambda^k|\) approaches \(0\) whenever \(\lambda_i < \lambda,\) thus:

\[ \frac{T^k \x}{|T^k \x|} \approx \frac{\lambda^k c \v}{|\lambda^k c \v|} \]

We assume \(c \ne 0\), which is likely if \(\x\) is randomly chosen. Since \(|\v| = 1\), when \(\lambda, c\) are real, the right-hand side is \(\pm \v\). In particular, when \(\lambda \lt 0\), each iteration flips the sign, though still brings \(\x\) closer to one of \(\pm \v\).

Recall the matrix describing the Fibonacci recurrence:

rabbit = [ [1, 1]
         , [1, 0]
	 ]

Starting from \( (1.2, 3.4)\), we rapidly approach an eigenvector:

powerMethod t x = iterate (warm t) x

take 8 $ powerMethod rabbit [1.2, 3.4]

We can recover the approximation for the dominant eigenvalue from an approximate corresponding eigenvector \(\x\) via:

\[ \lambda \approx \x^T T \x \]

eigenpair t x = (head $ head $ matrixMul [x] $ matrixMul t $ transpose [x], x)

eigenpair rabbit $ powerMethod rabbit [1.2, 3.4] !! 20

We’ve confirmed that the dominant eigenvalue is the golden ratio \( (1 + \sqrt{5}) / 2. \)

In the complex case, we have \(e^{i\theta} \v\) instead of \(\pm \v\), and we recover the eigenvalue via:

\[ \lambda \approx \x^* T \x \]

Inverse Power Method

If \(\v\) is an eigenvector of two matrices \(A, B\) with corresponding eigenvalues \(a, b\), then by linearity \(\v\) is an eigenvector of \(A + B\) with corresponding eigenvalue \(a + b:\)

\[ (A + B) \v = A \v + B \v = a \v + b \v = (a + b) \v \]

Also, if \(A\) is invertible, then we have:

\[ \v = A^{-1} A \v = A^{-1} a \v \]

thus \(\v\) is also an eigenvector of \(A^{-1}\) with corresponding eigenvalue \(a^{-1}\).

Then for any non-eigenvalue scalar \(\mu\), an eigenpair \((\lambda, \v)\) of \(T\) corresponds to the eigenpair \(((\lambda - \mu)^{-1}, \v)\) of \((T - \mu)^{-1}\), where \( T - \mu \) is invertible because \(\mu\) is not an eigenvalue by assumption.

If \(\mu\) is closer to \(\lambda\) than any other eigenvalue of \(T\), that is, among all the eigenvalues, the magnitude \( |\lambda - \mu| \) is smallest, then its corresponding eigenvalue \((\lambda - \mu)^{-1}\) of \( (T - \mu)^{-1} \) is the largest eigenvalue, so we can use the above power method to find \(\v .\)

For example, zero is closer to the non-dominant eigenvalue of the rabbit matrix, so we can set \( \mu = 0 \) to find it:

inversePowerMethod t mu x = powerMethod t' x where
  Matrix t' = recip $ Matrix t - Scalar mu

eigenpair rabbit $ inversePowerMethod rabbit 0 [1.2, 3.4] !! 20
wp0 = [ [2, 0, 0]
      , [0, 3, 4]
      , [0, 4, 9]
      ]

Power iteration finds the dominant eigenvalue:

eigenpair wp0 $ powerMethod wp0 [1,2,3] !! 24

On a whim, we try \( \mu = 0: \)

eigenpair wp0 $ inversePowerMethod wp0 0 [1,2,3] !! 24

One more to find. Let’s try \( \mu = 5, \) roughly halfway between the eigenvalues we’ve found so far:

eigenpair wp0 $ inversePowerMethod wp0 5 [1,2,3] !! 24

Despite its simplicity, the power method easily found all the eigenvalues (1, 2, 11) of the example matrix.

PageRank

A billion monkeys with a billion computers each do the following:

  1. Visit a random webpage.

  2. Roll a 20-sided die:

    • On 1, 2, or 3: go back to step 1.

    • Otherwise, follow a random link on the page, and go to back to step 2. If there are no links on the page, go back to step 1.

What happens as time passes?

If a webpage has inbound links from many other webpages, then it will likely be visited often. Conversely, a webpage with few inbound links is unlikely to be visited. In the extreme case, if a webpage has no inbound links, then it is only visited when chosen randomly in step 1.

Suppose the monkeys' actions are synchronized. Then the webpages being visited turns out to converge to a steady state in the sense that the expected number of monkeys visiting a given webpage approaches a constant. Dividing this constant by the number of monkeys yields the proportion of the population visiting a webpage at once, and is known as its PageRank score.

We can compute the PageRank of all webpages by using the power method to find the eigenvector of the normalized adjacency matrix of the web.

In more detail, let \(n\) be the number webpages. For simplicity, we process them so that there are no self-links, and there is at most one link to any other page. Let \(T\) be a \(n \times n\) matrix, and let \(a_{ij}\) denote the entry at row \(i\) and column \(j\). Set \(a_{ij}\) to the probability that a monkey currently visiting website \(j\) will next visit website \(i\).

Then if the webpage \(j\) contains \(k\) links and one of them goes to \(i\), we have:

\[ a_{ij} = 0.15 / n + 0.85 / k \]

Otherwise:

\[ a_{ij} = 0.15 / n \]

Let’s take an example from Wikipedia’s entry on PageRank, whose weighted adjacency matrix is:

wp1 = transpose $ map parse $ lines [r|A M
B A
C AD
D F
E CDG
F DE
G F
H GI
I HJ
J IL
K IJ
L KM
M N
N LM
|]
  where
  parse r = let
    [_, tgts] = words r
    w = 1.0 / fromIntegral (length tgts)
    in [if c `elem` tgts then w else 0 | c <- ['A'..'N']]

wp1

Since we list the the direct successors of each node, we need to transpose to get the matrix we want.

We adjust the weights to accomodate the 15% chance of visiting a random internet page, rather than clicking a random link:

dampen d m = map go <$> m where
  w = recip $ fromIntegral $ length m
  go x = (1 - d) * w + d * x

dampen 0.85 wp1

If \(\x\) is an \(n\)-vector whose \(i\)th entry is the expected proportion of the population currently visiting webpage \(i\), then \(T \x\) describes the expected proportion of the population visiting each webpage after the next step. We observe the long-term behaviour by repeatedly applying \(T\).

Strictly speaking, this differs to our description of the power method above, because we skip the normalization step. But another way to view it is that we’re applying the poewr method with the \(L^1\) norm instead of the usual \(L^2\) norm, which suits Markov models.

Since each entry of \(T\) is positive, and since each of its rows describes a probability distribution, it is easy to show that \(T\) has a dominant eigenvalue of 1, thus power iteration is guaranteed to find the PageRank score of each webpage.

pageRank m = iterate (matrixFun $ dampen 0.85 m) iv where
  iv = replicate n $ recip $ fromIntegral n
  n = length m
pageRank wp1 !! 32

This should agree with the numbers shown on the bar chart.

Let’s try the second example from Wikipedia. The node A lacks out-edges; recall this means we treat it as a webpage with a link to every webpage, including itself.

This time, we list the direct predecessors, which turns out to require one transposition so we can easily weight the links by outdegree, and another transposition to restore the matrix to the form we want.

wp2 = weight $ (++ replicate 5 (replicate 11 0)) $ map parse $ lines [r|A D
B CDEFGHI
C B
D E
E FGHIJK
F E
|]
  where
  parse r = let
    [_, tgts] = words r
    in [if c `elem` tgts then 1.0 else 0 | c <- ['A'..'K']]

weight m = transpose $ go <$> transpose m where
  go row
    | d == 0 = const (recip $ fromIntegral $ length m) <$> row
    | otherwise = (recip d *) <$> row
    where d = sum row

wp2

We print the PageRank as percentages, which should match the figures given on Wikipedia:

(100*) <$> pageRank wp2 !! 32

I first got internet access in the 1990s. I usually searched the internet with AltaVista, though I’d often also try other sites like Yahoo and Excite because good results were hard to find. Then in 1998, at a local ACM programming contest, another competitor told me about Google, and not long afterwards, it became my search engine of choice.

In those days, search engines seemed to rank pages based on their content, and so were gamed via simple tricks like keyword stuffing. Google’s PageRank handily defeated this sort of nonsense, yielding results of higher quality.


Ben Lynn blynn@cs.stanford.edu 💡