## 2012年4月14日星期六

### 淺談 Google 的 PageRank 算則

Google 的搜尋結果排名法，稱為 PageRank。經過多年發展，現時 PageRank 算法的內容已經不為人知，不過 Larry Page 與 Sergey Brin （Google 兩位創辦人）起初是有將描述此算則的論文擺上網的，有興趣的讀者不妨看一看。

• 儘管圖中 A, B 兩個網頁都各得一票，但 B 網頁應該比 A 網頁重要，原因是投票給 B 的 b 網頁有人引用，投票給 A 的 a 網頁卻無。投票者較權威，得票者亦應該更權威。
• 儘管 B, C 均各得一票，但 B 的排名應該比 C 高，原因是投票給它們的 b, c 兩個網頁同樣權威，但是 b 網頁將票全投給 B，而 c 網頁卻投了三個網頁，所以 C 網頁未必如 B 網頁那麼有用。

$p_i = \sum_{\textrm{page } j \textrm{ has link to page } i} \frac{p_j}{\textrm{no. of out-links of page } j}.$

Google 最初期的 PageRank，其實還有另一步 ── 它實際上將 $p_i$ 定義為
$p_i = \frac{1-d}{n} + d\left(\sum_{\textrm{page } j \textrm{ has link to page } i} \frac{p_j}{\textrm{no. of out-links of page } j}\right).$

PageRank 的加權平均方法，其實是有理論基礎的。假設 $C_j$ 代表網頁 $j$ 有幾多條 out-links，而 $g_{ij}$ 代表「網頁 $j$ 引用了網頁 $i$」的布林值，亦即當網頁 $j$ 引用了網頁 $i$ 的時候，$g_{ij}$ 為 $1$，否則為 $0$。於是我們可以將 $p_i$ 的定義以矩陣形式寫成
$\begin{eqnarray} \mathbf{p} &=& \left[\frac{1-d}{n} \begin{pmatrix}1&\ldots&1\\\vdots&&\vdots\\1&\ldots&1\end{pmatrix} + d\mathbf{G}\begin{pmatrix}\frac{1}{C_1}\\&\ddots&\\&&\frac{1}{C_n}\end{pmatrix}\right] \mathbf{p}\\ &=& \mathbf{A}\mathbf{p} \quad\textrm{ (say)}. \end{eqnarray}$

#### 5 則留言:

Chainsaw Riot 說...

The suffocated 說...

I totally agree with you that the intuition and gut feels separate the geniuses from the work mules. It is like comparing Thomas Edison vs. Nicholas Tesla. The former used brute force and dirty business practices to become successful and the latter was just purely a genius who was suppressed and shot down by rivals. Edison said Genius was 1% inspiration and 99% perspiration. That pretty much proved that he was not a true genius because Tesla probably was 99% inspiration and 1% perspiration. If Tesla had gotten funding for his research, the earth would be a different world today.

I still remembered my first exposure to Google search years ago. I was using MetaCrawler and Yahoo search at the time. I usually had to go through 5 to 10 pages of results to finally find what I was looking for. One day a friend told me to give Google a try. What I sought showed up as No 1 result as if the search engine could read my mind. On that day, I removed my bookmark on MetaCrawler and Yahoo search.

I think knowing how to rank the pages is one thing. Figuring a way to actually implement the idea is another thing. There are so many webpages on the Internet. How can one distill the world of Internet into some ranking numbers quickly to keep the ranking up to date. I believe MapReduce enabled that. Today, Hadoop is an Open Source version used by many company for other applications.