Top 10 Data Mining Algorithms (1): PageRank

Source: Internet
Author: User

This series of articles covers the top 10 data mining algorithms selected in 2006 (see Figure 1). Each article focuses on the origin and main idea of an algorithm and does not go into the concrete implementation. If you find any mistakes in the text, please point them out so we can discuss them.

Figure 1 (from an article by Idmer)

Of these algorithms, the most eye-catching is naturally PageRank, one of Google's core technologies. So this series begins by exploring how PageRank came about.

2. Core Ideas

As the saying goes, to judge a person, look at their friends. In other words, the more outstanding friends a person has, the more likely that person is outstanding too. Transferred to web pages, this becomes: "the more high-quality pages link to a page, the more likely that page is high-quality."

The core idea of PageRank is this simple but effective observation. From it, an intuitive formula follows:

$$R(i) = \sum_{j \in B(i)} R(j) \tag{1}$$

R(x) denotes the PageRank of page x, and B(x) denotes the set of all pages that link to x.

Formula (1) says that the importance of a web page equals the sum of the importance of all the pages that link to it. At first glance, formula (1) expresses the core idea accurately. A closer look reveals a flaw: no matter how many hyperlinks page j has, as long as j links to i, page i receives the full importance of j. When j has more than one hyperlink, this leads to unreasonable results. For example, a newly launched site N has only two incoming hyperlinks, one from the famous, long-established portal F and the other from an unknown website U. According to formula (1), R(N) = R(F) + R(U) > R(F), so we would conclude that N is better than F. This clearly contradicts common sense.

A simple way to fix this flaw: when j has multiple hyperlinks (say N(j) of them), each link carries an importance of R(j)/N(j). Formula (1) then becomes:

$$R(i) = \sum_{j \in B(i)} \frac{R(j)}{N(j)} \tag{2}$$

N(j) denotes the number of hyperlinks on page j.
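To make the difference concrete, the following Python sketch evaluates formulas (1) and (2) for the new site N from the example above. The rank values and link counts are made-up numbers chosen only for illustration; they do not come from the article.

```python
# Hypothetical values, chosen only for illustration.
rank = {"F": 100.0, "U": 1.0}    # assumed R(F) and R(U)
out_links = {"F": 50, "U": 2}    # assumed N(F) and N(U)
backlinks_of_n = ["F", "U"]      # B(N): the pages that link to N

# Formula (1): every backlink passes on its full importance.
r_n_formula1 = sum(rank[j] for j in backlinks_of_n)

# Formula (2): every backlink passes on R(j) / N(j).
r_n_formula2 = sum(rank[j] / out_links[j] for j in backlinks_of_n)

print(r_n_formula1)  # 101.0, larger than R(F) = 100
print(r_n_formula2)  # 2.5, far smaller than R(F)
```

With these assumed numbers, formula (1) ranks N above F, while formula (2) keeps N well below it.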

Figure 2 (from Lawrence Page's paper)

As Figure 2 shows, for N to rank higher than F, N must receive hyperlinks from many important websites or from a huge number of unknown sites. That is acceptable, so formula (2) can be considered an accurate expression of the core idea. To normalize the results, a constant c is added to formula (2), giving formula (3):

$$R(i) = c \sum_{j \in B(i)} \frac{R(j)}{N(j)} \tag{3}$$
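As a minimal sketch, one application of formula (3) to every page could look like the code below: each page j hands R(j)/N(j) to every page it links to, and c rescales the result. The three-page graph and the function name `update_ranks` are hypothetical, and the sketch assumes that c is chosen so the ranks sum to 1 and that every page has at least one outgoing link.

```python
from fractions import Fraction

# Hypothetical 3-page web: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def update_ranks(links, rank):
    """Apply formula (3) once to every page and return the new ranks."""
    raw = {page: Fraction(0) for page in links}
    for j, targets in links.items():
        share = rank[j] / len(targets)   # R(j) / N(j)
        for i in targets:                # j belongs to B(i)
            raw[i] += share
    c = 1 / sum(raw.values())            # assumed: c makes the ranks sum to 1
    return {page: c * value for page, value in raw.items()}

start = {page: Fraction(1, 3) for page in links}
print(update_ranks(links, start))
```

Starting from equal ranks of 1/3, one update gives A = 1/3, B = 1/6 and C = 1/2. Exact fractions are used only to keep the arithmetic easy to check by hand.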

3. Calculation

Formula (3) shows that PageRank is defined recursively. In other words, to compute the PageRank of one page, you must first know the PageRank of other pages. You therefore need to choose reasonable initial PageRank values. But if there were a way to obtain reasonable initial values, would this algorithm still be needed? And what would be the point of an algorithm that depends heavily on its initial values?

A PageRank algorithm that depends on reasonable initial values is meaningless; only one that does not depend on the initial values is useful. That is, we need a method of calculation that converges to the same result no matter how the initial values are set. To find one, we need to look at the problem from a different angle, that of linear algebra.

Treat each page as a node and each hyperlink as a directed edge, and the entire Internet becomes a directed graph. Represent this graph with an adjacency matrix M: if page i has a hyperlink to page j, then the matrix element m[i][j] = 1; otherwise m[i][j] = 0. For Figure 3 we have:

$$M = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 \end{pmatrix}$$

Figure 3
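The adjacency matrix can be built from a list of hyperlinks. The sketch below assumes the pages of Figure 3 are numbered 1 to 4; the edge list is read off from the matrix in the text, since the figure itself is not reproduced here.

```python
# Hyperlinks of Figure 3 as (source page, target page), pages numbered from 1.
# The list is inferred from the matrix given in the text.
edges = [(1, 2), (1, 3), (2, 4), (3, 1), (4, 1), (4, 2), (4, 3)]
n = 4

# m[i][j] = 1 if page i+1 links to page j+1, else 0 (Python indices start at 0).
m = [[0] * n for _ in range(n)]
for src, dst in edges:
    m[src - 1][dst - 1] = 1

for row in m:
    print(row)
```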

Looking at matrix M, row i marks the pages that page i links to, and column j marks the pages that link to page j. Divide each element of M by the sum of the elements in its row, then transpose the result (swap rows and columns) to get M^T. Row i of M^T multiplied by the vector of PageRank values is then exactly the sum in formula (3). For the example in Figure 3, this gives the matrix:

$$M^T = \begin{pmatrix} 0 & 0 & 1 & 1/3 \\ 1/2 & 0 & 0 & 1/3 \\ 1/2 & 0 & 0 & 1/3 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$
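The normalize-then-transpose step can be sketched in a few lines of Python. The function name `normalized_transpose` is hypothetical, and leaving an all-zero row for a page without outgoing links is an assumption, since the article does not discuss that case.

```python
from fractions import Fraction

m = [[0, 1, 1, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0],
     [1, 1, 1, 0]]

def normalized_transpose(m):
    """Divide each row of m by its row sum, then transpose the result."""
    n = len(m)
    rows = []
    for row in m:
        total = sum(row)
        # Assumption: a page with no out-links keeps an all-zero row.
        rows.append([Fraction(x, total) if total else Fraction(0) for x in row])
    # Entry [j][i] of the result is entry [i][j] of the normalized matrix.
    return [[rows[i][j] for i in range(n)] for j in range(n)]

mt = normalized_transpose(m)
for row in mt:
    print([str(x) for x in row])
```

For the Figure 3 matrix this reproduces the matrix shown above, and each column of the result sums to 1.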

Treat R as a matrix with n rows and 1 column (a column vector), and formula (3) becomes

$$R = c\,M^T R \tag{4}$$

In formula (4), R can be seen as an eigenvector of M^T whose corresponding eigenvalue is 1/c. (Recall the definition from linear algebra: for a matrix A, if there is a nonzero column vector x and a number λ such that Ax = λx, then x is called an eigenvector of A and λ the corresponding eigenvalue.) The power method computes the principal eigenvector independently of the initial value, so as long as R is computed as the principal eigenvector, the problem of choosing reasonable initial values is solved.
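A minimal sketch of the power method on the Figure 3 matrix illustrates this independence from the starting vector. The function name `power_method`, the tolerance, and the iteration cap are illustrative choices, and rescaling to a 1-norm of 1 after each step is an assumption that stands in for the constant c.

```python
# Normalized, transposed link matrix for Figure 3 (values from the text).
mt = [[0.0, 0.0, 1.0, 1 / 3],
      [0.5, 0.0, 0.0, 1 / 3],
      [0.5, 0.0, 0.0, 1 / 3],
      [0.0, 1.0, 0.0, 0.0]]

def power_method(a, start, tol=1e-12, max_iter=1000):
    """Repeat r <- a r, rescaled to 1-norm 1, until r stops changing."""
    r = [x / sum(start) for x in start]
    for _ in range(max_iter):
        nxt = [sum(a_ij * r_j for a_ij, r_j in zip(row, r)) for row in a]
        norm = sum(abs(x) for x in nxt)
        nxt = [x / norm for x in nxt]
        if max(abs(x - y) for x, y in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r

r_uniform = power_method(mt, [1, 1, 1, 1])
r_skewed = power_method(mt, [1, 0, 0, 0])
print([round(x, 4) for x in r_uniform])
print([round(x, 4) for x in r_skewed])
```

Both starting vectors end at approximately (4/13, 3/13, 3/13, 3/13). Because every column of this matrix sums to 1, the eigenvalue is 1 and c = 1 for this example.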

The result of the power method is independent of the initial value because the iteration eventually converges to a fixed value. So before using the power method, we must make sure that it can converge. In the hyperlink structure of the Internet, however, a closed loop prevents the power method from converging. A closed loop is a group of pages that link to each other but not to any page outside the group, as shown in Figure 4:

Figure 4 (from a PPT by Soumya Sanyal)

The 4 green pages form a closed loop. The PageRank of these pages keeps accumulating during the calculation, so the result does not converge. A closer look shows that the red page passes its PageRank to the green pages, and the green pages swallow it. Larry Page calls this situation a rank sink.
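One simple way this failure can show up is sketched below on a hypothetical three-page graph (not the graph of Figure 4): page 0 links to page 1, while pages 1 and 2 link only to each other. Repeated multiplication drains all rank from page 0 into the loop, and the ranks inside the loop keep swapping instead of settling.

```python
from fractions import Fraction

# Hypothetical graph: page 0 -> page 1; pages 1 and 2 link only to each other.
# Column j holds the shares that page j passes on (same layout as M^T).
p = [[0, 0, 0],
     [1, 0, 1],
     [0, 1, 0]]

def step(p, r):
    """One multiplication r <- p r."""
    return [sum(a * x for a, x in zip(row, r)) for row in p]

history = [[Fraction(1, 3)] * 3]
for _ in range(4):
    history.append(step(p, history[-1]))

for r in history:
    print([str(x) for x in r])
```

After the first step page 0 has rank 0, and from then on the vector alternates between (0, 2/3, 1/3) and (0, 1/3, 2/3), so the iteration never converges.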

What do you do if you keep following the links on a page and find that you are always wandering around the same few pages? You close the current page and open a new one. That situation is just like a rank sink, so the same idea can be used to solve the rank sink problem. Adding an escape factor E to formula (3) gives:

$$R(i) = c \sum_{j \in B(i)} \frac{R(j)}{N(j)} + c\,E(i) \tag{5}$$

E(i) is the escape factor for page i.

Writing (5) in matrix form gives:

$$R = c\,M^T R + c\,E = c\,(M^T R + E)$$

where the 1-norm of the column vector R (the sum of all its elements) is 1.

Let 1 be a row vector with n columns whose elements are all 1. Since 1 R = 1, E can be written as (E × 1) R, and the above can be rewritten as

$$R = c\,(M^T + E \times \mathbf{1})\,R \tag{6}$$

In formula (6), as long as R is computed as an eigenvector of (M^T + E × 1), both the initial-value problem and the closed-loop problem are solved.
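The sketch below applies the power method to (M^T + E × 1) on a hypothetical three-page graph that contains a rank sink. The uniform escape factor of 0.05 per page, the function name `pagerank`, the tolerance, and the iteration cap are all assumptions made for illustration; the article does not prescribe values for E.

```python
# Hypothetical graph with a rank sink, in M^T layout:
# page 0 -> page 1; pages 1 and 2 link only to each other.
mt = [[0.0, 0.0, 0.0],
      [1.0, 0.0, 1.0],
      [0.0, 1.0, 0.0]]
e = [0.05, 0.05, 0.05]   # assumed escape factor, uniform over the pages

def pagerank(mt, e, start, tol=1e-12, max_iter=1000):
    """Power method on (M^T + E x 1); rescaling to 1-norm 1 plays the role of c."""
    r = [x / sum(start) for x in start]
    for _ in range(max_iter):
        total = sum(r)   # this is 1 R, equal to 1 after rescaling
        nxt = [sum(a * x for a, x in zip(row, r)) + e_i * total
               for row, e_i in zip(mt, e)]
        norm = sum(nxt)
        nxt = [x / norm for x in nxt]
        if max(abs(x - y) for x, y in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r

r_a = pagerank(mt, e, [1, 1, 1])
r_b = pagerank(mt, e, [1, 0, 0])
print([round(x, 4) for x in r_a])
print([round(x, 4) for x in r_b])
```

With the escape factor in place, both starting vectors converge to the same ranks, and page 0 keeps a small nonzero rank (1/23, about 0.0435) instead of being drained completely by the loop.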

4. Resources

Finding information is easy, but finding good information is not. This section shares some of the better material I have found.

1. The paper by PageRank's creators: The PageRank Citation Ranking: Bringing Order to the Web

http://ilpubs.stanford.edu:8090/422/

2. A PPT that explains PageRank very well: The PageRank Citation Ranking–redone

http://wenku.baidu.com/view/30657568a98271fe910ef975.html?from=related

3. A good introductory article on PageRank: Google's Secret - PageRank Thorough Explanation (Chinese version)

http://www.kreny.com/pagerank_cn.htm

4. The linear algebra background used in this article

http://ceee.rice.edu/Books/LA/eigen/
