Source: http://blog.csdn.net/hguisu/article/details/7996185 (with thanks to the original author)
1. PageRank algorithm Overview
PageRank, literally "page rank", is also known as page level, Google left-side ranking, or simply page ranking.
Google founders Larry Page and Sergey Brin built an early prototype of their search system around this link analysis algorithm in 1997, and since Google's unprecedented commercial success the algorithm has become a computational model of great interest to other search engines and to academia. Many important link analysis algorithms in use today are derived from PageRank. PageRank is the method Google uses to identify the level/importance of a web page, and it is the only standard Google uses to measure how good or bad a website is. After blending in all the other factors, such as Title tags and Keywords tags, Google adjusts its results by PageRank so that pages with a higher "level/importance" rise in the search results, improving the relevance and quality of those results. The scale runs from 0 to 10, with 10 being the maximum. A higher PR value means the page is more popular (more important). For example, a PR value of 1 indicates a site that is not very popular, while a PR value of 7 to 10 indicates a very popular (or extremely important) site. A PR value of 4 already marks a fairly good website. Google has set its own site's PR value to 10, which says that Google is an extremely popular, or extremely important, site.
2. From In-Link Counts to PageRank
Before PageRank, researchers had already proposed using a page's number of incoming links for link analysis. This in-link-count method assumes that the more in-links a page has, the more important it is. Many early search engines adopted in-link counts as their link analysis method, and it gave a noticeable improvement in search quality. PageRank, in addition to considering the number of in-links, also takes the quality of the linking pages into account; combining the two gives a better criterion for judging the importance of a web page.
For a web page A, its PageRank rests on the following two basic assumptions:
• Quantity assumption: in the web graph model, the more in-links a page node receives, the more important that page is.
• Quality assumption: the in-links pointing to page A differ in quality; high-quality pages pass more weight through their links. So the more high-quality pages point to page A, the more important A is.
Using these two assumptions, the PageRank algorithm starts by giving every page the same importance score, then updates each page node's PageRank score through iterative, recursive computation until the scores stabilize. The result of the PageRank computation is an evaluation of page importance that has nothing to do with the user's query; in other words, the algorithm is topic-independent. Suppose a search engine's ranking function ignored content similarity entirely and sorted purely by PageRank: what would that engine look like? It would return the same results for every query, namely the pages with the highest PageRank values.
3. PageRank algorithm principle
PageRank's calculations take full advantage of two assumptions: the quantity hypothesis and the quality hypothesis. The steps are as follows:
1) Initial phase: a web graph is built from the link relationships, and every page is assigned the same PageRank value; after several rounds of computation each page receives its final PageRank value. As each round of computation proceeds, the current PageRank value of every page is updated.
2) Updating in one round: when recomputing a page's PageRank score, each page distributes its current PageRank value evenly over the out-links it contains, so that every out-link receives a corresponding weight. Each page then sums the weights passed in by all the in-links pointing to it to obtain its new PageRank score. When every page has received its updated PageRank value, one round of PageRank computation is complete (see the sketch below).
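As an illustration of this round of redistribution, here is a minimal Go sketch (the function name updateRound, the map-based representation, and the tiny example graph are our own choices, not the article's; the damping factor of section 3.4 is not yet included):

package main

import "fmt"

// updateRound performs one round of the update described above: every page
// splits its current score evenly over its out-links, and each page's new
// score is the sum of the shares passed in by the pages linking to it.
func updateRound(outLinks map[string][]string, pr map[string]float64) map[string]float64 {
	next := make(map[string]float64, len(pr))
	for page := range pr {
		next[page] = 0 // every page starts the round with a zero score
	}
	for page, links := range outLinks {
		if len(links) == 0 {
			continue // dangling pages are dealt with by the damping factor (section 3.4)
		}
		share := pr[page] / float64(len(links)) // evenly distributed weight per out-link
		for _, target := range links {
			next[target] += share
		}
	}
	return next
}

func main() {
	// A tiny hypothetical web; every page begins with the same score.
	outLinks := map[string][]string{"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
	pr := map[string]float64{"A": 1, "B": 1, "C": 1}
	fmt.Println(updateRound(outLinks, pr)) // scores after one round
}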
3.2 Basic Ideas:
If page T links to page A, it indicates that the owner of T considers A important and therefore passes part of T's importance score on to A. That importance score is: PR(T)/L(T),
where PR(T) is the PageRank value of T and L(T) is the number of out-links of T.
The PageRank value of A is the accumulation of such importance scores from a series of pages like T.
In other words, a hyperlink to a page is a vote for that page, and a page's vote count is determined by the importance of all the pages linking to it. A page's PageRank is obtained recursively from the importance of all the pages that link into it. A page linked to by many pages will have a higher rank; conversely, a page with no in-links has no rank at all.
3.3 Simple PageRank Calculation:
Suppose a collection consists of only four pages: A, B, C, and D. If B, C, and D all link only to A, then A's PR (PageRank) value is the sum of the PR values of B, C, and D: PR(A) = PR(B) + PR(C) + PR(D).
Continuing, suppose B also links to C (in addition to A), and D links to all three other pages, including A. A page cannot cast its vote twice, so B gives each page it links to half a vote; by the same logic, only one third of D's vote is counted toward A's PageRank: PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3.
In other words, a page's PR value is divided by its total number of out-links before being passed on: PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D).
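As a quick worked instance of this rule, assume for illustration that all four pages start with the same value PR = 0.25 (the article does not specify initial values here). Then:

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
      = 0.25/2 + 0.25/1 + 0.25/3
      ≈ 0.125 + 0.250 + 0.083 = 0.458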
Example:
The example shown in Figure 1 illustrates the specific computational process of PageRank.
3.4 Correcting the PageRank Formula:
Because some pages have zero out-links (pages that do not link to any other page, also called isolated or dangling pages), the simple scheme above breaks down, and not every page can be reached by merely following links. The PageRank formula is therefore revised by adding a damping factor q to the simple formula; q is generally set to q = 0.85.
Its meaning is: at any moment, q is the probability that a user who has arrived at a page will keep browsing by following its links; 1 - q = 0.15 is the probability that the user stops clicking and jumps to a random new URL instead. This value is applied to all pages and estimates the probability that the page will be bookmarked by a surfer.
Finally, all of these contributions are converted to a fraction and multiplied by the coefficient q. Under this algorithm no page's PageRank will be 0; in effect, Google gives every page a minimum value.
The formula is:
PR(A) = (1 - q) + q * (PR(T1)/L(T1) + PR(T2)/L(T2) + ... + PR(Tn)/L(Tn))
where T1 ... Tn are the pages that link to A. This is the formula S. Brin and L. Page define in "The Anatomy of a Large-Scale Hypertextual Web Search Engine" (Computer Networks and ISDN Systems).
So the PageRank of a page is computed from the PageRank of other pages. Google recalculates the PageRank of every page over and over again. If every page is given an arbitrary (non-zero) initial PageRank value, then after repeated calculation the PR values of all pages tend toward stable, normal values. That is why search engines can use it.
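A minimal Go sketch of iterating this corrected formula until the values stabilize might look as follows (a hypothetical illustration, not the article's code; the names pageRank, inLinks, and outDegree and the 3-page example graph are ours):

package main

import (
	"fmt"
	"math"
)

// pageRank iterates PR(A) = (1-q) + q*(PR(T1)/L(T1)+...+PR(Tn)/L(Tn))
// until no page's value changes by more than eps. Every page must appear
// as a key of inLinks (with an empty slice if nothing links to it).
func pageRank(inLinks map[string][]string, outDegree map[string]int, q, eps float64) map[string]float64 {
	pr := make(map[string]float64)
	for page := range inLinks {
		pr[page] = 1.0 // every page starts with the same score
	}
	for {
		next := make(map[string]float64, len(pr))
		delta := 0.0
		for page, sources := range inLinks {
			sum := 0.0
			for _, t := range sources {
				sum += pr[t] / float64(outDegree[t]) // each in-link contributes PR(T)/L(T)
			}
			next[page] = (1 - q) + q*sum
			delta += math.Abs(next[page] - pr[page])
		}
		pr = next
		if delta < eps { // the last two rounds are nearly identical: converged
			return pr
		}
	}
}

func main() {
	// Hypothetical 3-page graph: A -> B, A -> C, B -> C, C -> A.
	inLinks := map[string][]string{"A": {"C"}, "B": {"A"}, "C": {"A", "B"}}
	outDegree := map[string]int{"A": 2, "B": 1, "C": 1}
	fmt.Println(pageRank(inLinks, outDegree, 0.85, 1e-6))
}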
4. Computing PageRank with the Power Method (an application of linear algebra)
4.1 Complete formula:
For this section, you can read: The math behind Google
First, the complete formula is obtained:
Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan ("Searching the Web") express it more accurately as:
PR(p_i) = (1 - q)/N + q * Σ_{p_j ∈ M(p_i)} PR(p_j)/L(p_j)
Here p_i is the page under consideration, M(p_i) is the set of pages that link to p_i, L(p_j) is the number of out-links of page p_j, and N is the total number of pages.
The PageRank values form the entries of the dominant eigenvector of a modified adjacency matrix of the web graph. This eigenvector is R = [PR(p_1), PR(p_2), ..., PR(p_N)]^T.
R is a solution of the equation
R = [(1 - q)/N, (1 - q)/N, ..., (1 - q)/N]^T + q * M * R
where M is the matrix whose entry in row i, column j is ℓ(p_i, p_j): if page p_j has a link to page p_i, then ℓ(p_i, p_j) = 1/L(p_j);
otherwise ℓ(p_i, p_j) = 0.
4.2 Using the Power Method to Find PageRank
The PageRank formula above can then be converted into the problem of solving for the value R that satisfies
R = A × R,
where the matrix A = q × P + (1 - q) × e × e^T / N; P is the probability transfer matrix, and e is the N-dimensional column vector whose components are all 1 (so e × e^T is an N × N matrix of all ones).
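To spell out that conversion (our reconstruction, since the original formula images do not survive here): writing the per-page formula for all pages at once, and assuming the PageRank values are normalized so that their sum e^T × R equals 1, gives

R = (1 - q)/N × e + q × P × R
  = [q × P + (1 - q) × e × e^T / N] × R      (because e × e^T × R = e when e^T × R = 1)
  = A × R

so R is an eigenvector of A with eigenvalue 1, which is exactly what the power method below computes.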
The Power method calculation process is as follows:
X = any initial vector   // the initial PageRank value of every page, usually all 1s
R = A * X;
while (true) {
    if (||X - R|| < epsilon) {   // the last two results are nearly identical: converged
        return R;                // R holds the PageRank values
    } else {
        X = R;
        R = A * X;
    }
}
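A compact, runnable Go version of this loop might look as follows (a sketch under our own naming; the dense [][]float64 representation of A and the small 3-page transfer matrix are illustrative assumptions, since the article's figures are not reproduced here):

package main

import (
	"fmt"
	"math"
)

// matVec returns the product A*x for a dense square matrix A.
func matVec(a [][]float64, x []float64) []float64 {
	r := make([]float64, len(x))
	for i, row := range a {
		for j, v := range row {
			r[i] += v * x[j]
		}
	}
	return r
}

// powerMethod repeats R = A*X until X and R are nearly identical.
func powerMethod(a [][]float64, eps float64) []float64 {
	n := len(a)
	x := make([]float64, n)
	for i := range x {
		x[i] = 1 // the initial PageRank value of every page is 1
	}
	r := matVec(a, x)
	for {
		diff := 0.0
		for i := range x {
			diff += math.Abs(x[i] - r[i])
		}
		if diff < eps {
			return r
		}
		x = r
		r = matVec(a, x)
	}
}

func main() {
	// A = q*P + (1-q)/N * e*e^T for a hypothetical 3-page transfer matrix P
	// (column j holds the probabilities of leaving page j).
	q, n := 0.85, 3.0
	p := [][]float64{
		{0, 0, 1},
		{0.5, 0, 0},
		{0.5, 1, 0},
	}
	a := make([][]float64, 3)
	for i := range a {
		a[i] = make([]float64, 3)
		for j := range a[i] {
			a[i][j] = q*p[i][j] + (1-q)/n
		}
	}
	fmt.Println(powerMethod(a, 1e-6))
}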
4.3 Solution steps:
First: the computation of the probability transfer matrix P.
We begin by building a model of the link relationships between pages, i.e. we need an appropriate data structure to represent the connections between pages.
1) First we describe the relationships between pages in the form of a graph:
Now assume there are only three pages in the collection: A, B, and C; their abstract structure is shown in Figure 1:
Figure 1: Link relationships between the pages
Obviously, this graph is strongly connected (from any node, every other node can be reached).
2) We represent the directed graph with a matrix:
The vertex relations of this graph are represented by an adjacency matrix P: if vertex (page) i has a link to vertex (page) j, then P_ij = 1; otherwise P_ij = 0, as shown in Figure 2. If the total number of pages is N, the web link matrix is an N × N matrix.
3) Web link probability matrix
Each row is then divided by the number of its non-zero entries (the number of non-zero entries in a row is that page's number of out-links), giving a new matrix P', as shown in Figure 3. This matrix records the probability of jumping from each page to every other page: the entry in row i, column j is the probability that a user goes from page i to page j. In Figure 1, page A links to B and C, so a user at A jumps to B or to C with probability 1/2 each.
4) Probability transfer matrix P
The transpose of P' is what is actually used in the calculation; it is the probability transfer matrix P mentioned above, shown in Figure 4 (a code sketch of these steps follows the figure captions below):
Figure 2: Web link matrix
Figure 3: Web link probability matrix
Figure 4: Transpose of P' (the probability transfer matrix)
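Since Figures 2-4 are images that do not survive in this copy, the following Go sketch reconstructs steps 2)-4) for an assumed link structure (our illustration; the actual entries in the figures may differ): it builds the adjacency matrix, normalizes each row to get P', and transposes it to get P.

package main

import "fmt"

// transferMatrix turns an adjacency matrix (adj[i][j] = 1 when page i links to page j)
// into the probability transfer matrix P: normalize each row, then transpose.
func transferMatrix(adj [][]float64) [][]float64 {
	n := len(adj)
	p := make([][]float64, n)
	for i := range p {
		p[i] = make([]float64, n)
	}
	for i := 0; i < n; i++ {
		out := 0.0
		for j := 0; j < n; j++ {
			out += adj[i][j] // number of out-links of page i
		}
		for j := 0; j < n; j++ {
			if out > 0 {
				// P'[i][j] = probability of going from page i to page j;
				// transposing puts that value at P[j][i].
				p[j][i] = adj[i][j] / out
			}
		}
	}
	return p
}

func main() {
	// Assumed 3-page link structure: A -> B, A -> C, B -> C, C -> A.
	adj := [][]float64{
		{0, 1, 1}, // A links to B and C
		{0, 0, 1}, // B links to C
		{1, 0, 0}, // C links to A
	}
	for _, row := range transferMatrix(adj) {
		fmt.Println(row)
	}
}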
Second: the computation of matrix A.
1) The probability transfer matrix P is as computed above.
2) The term (1 - q) × e × e^T / N is an N × N matrix every entry of which is (1 - q)/N; here N = 3, so every entry is 0.15/3 = 0.05.
3) Matrix A: A = q × P + (1 - q) × e × e^T / N = 0.85 × P + 0.15 × e × e^T / N.
The initial PageRank value of every page is 1, that is, x^T = (1, 1, 1).
Third: the iterative computation of PageRank.
The first step: compute R = A × X.
Because X still differs greatly from R, the iteration continues.
The second step: set X = R and compute R = A × X again.
This process is iterated again and again ...
until the last two results are approximately or exactly equal, i.e. R finally converges and R is approximately equal to X, at which point the computation stops. The final R contains the PageRank value of each page.
Computing the PageRank values with the power method always converges, i.e. the number of iterations is finite.
Larry Page and Sergey Brin proved theoretically that, no matter how the initial values are chosen, this algorithm guarantees that the page rank estimates converge to their true values.
Since the number of pages on the internet is enormous, the matrix above would in theory have as many elements as the square of the number of pages. If we assume there are one billion pages, the matrix has 10^18 (a billion billion) elements. Multiplying such a large matrix involves a huge amount of computation. Larry Page and Sergey Brin used sparse matrix computation techniques to greatly reduce the computational load.
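To make the scale argument concrete: a dense N × N matrix for N = 10^9 pages would need on the order of 10^18 entries, whereas storing only the actual links (say a few dozen out-links per page on average) needs only on the order of 10^10 to 10^11 entries. The following Go sketch (our own illustration, not Page and Brin's implementation) stores just the non-zero links and performs update rounds without ever materializing the full matrix:

package main

import "fmt"

// link records one non-zero entry of the transfer matrix:
// page From passes weight 1/L(From) to page To.
type link struct {
	From, To int
}

// sparseUpdate performs one PageRank round touching only the existing links.
func sparseUpdate(n int, links []link, outDeg []int, pr []float64, q float64) []float64 {
	next := make([]float64, n)
	for i := range next {
		next[i] = (1 - q) / float64(n) // random-jump share for every page
	}
	for _, l := range links {
		next[l.To] += q * pr[l.From] / float64(outDeg[l.From])
	}
	return next
}

func main() {
	// Same assumed 3-page graph as above: A(0)->B(1), A->C(2), B->C, C->A.
	links := []link{{0, 1}, {0, 2}, {1, 2}, {2, 0}}
	outDeg := []int{2, 1, 1}
	pr := []float64{1.0 / 3, 1.0 / 3, 1.0 / 3}
	for i := 0; i < 50; i++ {
		pr = sparseUpdate(3, links, outDeg, pr, 0.85)
	}
	fmt.Println(pr) // converged PageRank values
}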
5. Advantages and disadvantages of PageRank algorithm
Advantages:
It is a static algorithm, independent of the query: the PageRank values of all pages are computed offline, which effectively reduces the amount of computation needed at query time and greatly shortens the query response time.
Disadvantages:
1) Users' queries are topic-specific, but PageRank ignores topic relevance, so the relevance of the results to the query topic is reduced.
2) Old pages tend to rank higher than new pages, because even a very good new page will not have many in-links at first, unless it is a subpage of an existing site.
Go PageRank algorithm