Study and collation of the PageRank algorithm

I have recently been looking into questions around graph computing frameworks, so I decided to review and tidy up my notes on their classic test algorithm, PageRank. If anything here is wrong or incomplete, corrections are welcome!

I. PageRank background

PageRank was introduced in 1998 by Google's founders, Larry Page and Sergey Brin, and applied to ranking the results of Google's search engine. It was one of Google's early core technologies, and it is the standard Google uses to measure the quality of a website.
II. Google search engine workflow

First, look at how Google processes a search, as shown in the figure:
How does Google sort the pages it finds? Take a search for "network red" as an example. In brief, the search engine proceeds as follows:
1. Segment the query "network red" into the words "network" and "red".
2. Use the inverted index to return the documents that contain both "network" and "red", and sort them by relevance.
The relevance here is based mainly on content. However, some spam pages contain the query words many times without being what the user needs, so the importance of the page itself also plays a very large role in the ranking.
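The two steps above can be sketched in a few lines of Python. This is only an illustration: the toy corpus `docs`, the helper names `build_inverted_index` and `search`, and the crude relevance score (a count of query-term occurrences) are all assumptions made for the example, not how Google actually scores documents.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> terms after word segmentation.
docs = {
    "d1": ["network", "red", "news"],
    "d2": ["network", "security"],
    "d3": ["red", "network", "red"],
}

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def search(index, docs, query_terms):
    """Return ids of documents containing every query term,
    sorted by a crude relevance score (occurrences of the query terms)."""
    if not query_terms:
        return []
    candidates = set.intersection(*(index.get(t, set()) for t in query_terms))

    def score(doc_id):
        return sum(docs[doc_id].count(t) for t in query_terms)

    # Highest score first; ties broken by document id for a stable order.
    return sorted(candidates, key=lambda d: (-score(d), d))

index = build_inverted_index(docs)
results = search(index, docs, ["network", "red"])
print(results)
```

A content-only score like this one is exactly what a spam page can game by repeating the query words, which is why a link-based importance score is needed as well.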
So how do you measure the importance of a web page itself? Besides text, pictures, videos and other information, every HTML document on the Internet also contains a large number of links, and these links can be used to discover important pages.
Treat each web page as a node and each link as a directed edge. Intuitively, in the example graph more pages link to B and E, so B and E should be more important; and because the page B that links to C is itself very important, the importance of C is also very high. This gives a qualitative way to judge the importance of a page:
1. The more times a page is linked to, the greater its importance.
2. A page that is linked to by highly important pages is also highly important.
How can the importance of web pages be measured quantitatively? This is where PageRank comes in.
III. What is PageRank?

PageRank is a technique search engines use to compute a page's rank from the links between pages. Google uses it to indicate the level, or importance, of a web page. PageRank levels run from 0 to 10; the higher the PR value, the more popular (that is, the more important) the page. PageRank approximates the probability that a user who clicks links at random on the Internet arrives at a particular page. In general, pages that can be reached from more places are more important and therefore have a higher PageRank. To view the PageRank value of a page, install the Google Toolbar and enable its PageRank feature, install the SearchStatus plugin in Firefox, or query http://pr.chinaz.com.
IV. Core ideas of PageRank

PageRank determines the importance of all web pages from the recursive relationship "a page that is linked to from many high-quality pages must itself be a high-quality page". Its core ideas are two points:
1. If a page is linked to by many other pages, the page is important, that is, its PageRank value is relatively high.
2. If a page with a high PageRank value links to another page, the PageRank value of the linked page rises accordingly.
V. A simple model for calculating PageRank

This section computes the importance of web pages quantitatively, that is, it computes PageRank values.

1. Simple PageRank calculation model

Suppose a collection of only four pages: A, B, C and D. If all the other pages link to A, then A's PR value is the sum of the PR values of B, C and D:

$$PR(A) = PR(B) + PR(C) + PR(D)$$
Now assume that B also links to C, and that D links to three pages including A. Each link counts as a vote for the importance of the page it points to, and a page cannot vote twice: its single vote is split among its links. So B casts half a vote for each page it links to. By the same logic, D casts only one third of a vote for A, so

$$PR(A) = \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3}$$
In other words, each page's PR value is divided by its total number of outgoing links, which gives the general formula:

$$PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)}$$
where PR(A) is the PageRank value of page A, and L(B) is the total number of outgoing links on page B.
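As a quick numeric check of the four-page example, the sketch below evaluates A's PR value for one round of voting. The starting value of 0.25 per page is an assumption made for illustration (the text does not fix one), and D's three targets are taken to be A, B and C, which is the only possibility if a page does not link to itself.

```python
# Assumed starting values: every page begins with the same PR of 0.25.
pr = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

# Outgoing links from the example: B -> A, C; C -> A; D -> A, B, C.
links = {"B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}

# PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D), summed over pages that link to A.
pr_a = sum(pr[page] / len(targets)
           for page, targets in links.items()
           if "A" in targets)

print(round(pr_a, 4))  # 0.125 + 0.25 + 0.0833...
```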
Simplified PageRank model: the link relationships between pages on the Internet can be viewed as a graph. Assume that the next page a surfer visits is always reached through a link on the current page. Under this simplified model, the PageRank value of any page i can be expressed as:

$$PR_i = \sum_{j \in B_i} \frac{PR_j}{L_j}$$

PR_i: PageRank value of page i
PR_j: PageRank value of page j
L_j: number of outgoing links on page j
B_i: the set of all pages that link to page i
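The simplified model can be applied repeatedly as an update rule. The sketch below is a minimal illustration: the function name `simple_pagerank_step` and the three-page graph are hypothetical, chosen so that every page has at least one outgoing link and the values settle.

```python
def simple_pagerank_step(pr, out_links):
    """One update of PR_i = sum over j in B_i of PR_j / L_j."""
    new_pr = {page: 0.0 for page in pr}
    for j, targets in out_links.items():
        if not targets:
            continue  # a page without out-links passes nothing on
        share = pr[j] / len(targets)  # PR_j / L_j
        for i in targets:
            new_pr[i] += share
    return new_pr

# Hypothetical graph: A -> B, C; B -> C; C -> A.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

# Start from a uniform distribution and apply the update many times.
pr = {page: 1.0 / len(out_links) for page in out_links}
for _ in range(100):
    pr = simple_pagerank_step(pr, out_links)

print({page: round(value, 3) for page, value in pr.items()})
```

On this graph the values settle at 0.4 for A, 0.2 for B and 0.4 for C: A passes half its value to B, so B ends up with half of A's value.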
A real hyperlink environment is not so ideal, and there PageRank faces two problems: rank leak and rank sink.
1. Rank leak: a page that has no outgoing links causes rank to leak away. Solutions:
(1) Recursively remove the nodes that have no out-links, and add them back after the other nodes have been calculated.
(2) Add edges from each node that has no out-links back to the vertices that point to it.
2. Rank sink: a group of tightly interlinked pages that form a loop within the page graph produces a rank sink if no links lead out of the group. Rank flows into the group but never flows back out.
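Both problems are easy to reproduce with the simplified model. The sketch below uses two hypothetical three-page graphs and only demonstrates the symptoms; it does not implement either of the remedies listed above.

```python
def step(pr, out_links):
    """One update of the simplified model: each page splits its PR among its out-links."""
    new_pr = {page: 0.0 for page in pr}
    for j, targets in out_links.items():
        for i in targets:
            new_pr[i] += pr[j] / len(targets)
    return new_pr

def iterate(out_links, rounds):
    """Start from a uniform distribution and apply `step` a fixed number of times."""
    pr = {page: 1.0 / len(out_links) for page in out_links}
    for _ in range(rounds):
        pr = step(pr, out_links)
    return pr

# Rank leak: C has no outgoing links, so whatever reaches C is lost.
leak_graph = {"A": ["B"], "B": ["C"], "C": []}
leak = iterate(leak_graph, 5)
print("leak total:", sum(leak.values()))

# Rank sink: B and C link only to each other, so they absorb all of A's rank.
sink_graph = {"A": ["B"], "B": ["C"], "C": ["B"]}
sink = iterate(sink_graph, 5)
print("sink:", sink)
```

In the leak graph the total rank drops to zero after three rounds. In the sink graph A ends up with nothing while B and C hold everything; this particular two-page loop also keeps swapping values between B and C instead of settling.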
VI. PageRank's random surfer model

Assume that an Internet user starts browsing from a random web page and keeps clicking links on the current page to reach the next one. Eventually the surfer gets bored and starts again from a random page. The probability that the surfer visits a given page in this way equals the PageRank value of that page. This stochastic model is closer to real browsing behavior and, to a certain extent, solves the rank leak and rank sink problems, ensuring that PageRank has a unique value. The random surfer model can be represented as a graph:
Add a direct path between every pair of vertices. At each vertex the surfer follows one of the original (blue) edges with probability d, and one of the added (red) edges with probability 1-d.
Because the number of web pages is huge, the adjacency matrix of the links between pages is a very large sparse matrix, so the link relationships are stored as adjacency lists. The PageRank formula of the random surfer model, in which the sum runs over the pages p_j that link to page p_i, is:

$$PR(p_i) = \frac{1-q}{N} + q \sum_{p_j \to p_i} \frac{PR(p_j)}{L(p_j)}$$
N: total number of pages in the network
q: damping factor, usually set to 0.85; the probability of continuing to browse by following a hyperlink
1-q: probability of jumping to a random new page
PR(p_j): PR value of page p_j
L(p_j): number of outgoing links on page p_j
The PageRank of a page is computed from the PageRank of other pages, so Google calculates the PageRank of every page repeatedly. Give each page an arbitrary non-zero initial PageRank. Because the equation R = M*R has the form of a Markov chain, if the chain converges then R has a unique solution. The PageRank values of all nodes are therefore computed by iteration; after repeated calculation, the PR values of the pages settle and become stable.
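The iteration can be sketched as follows. This is a minimal illustration under stated assumptions rather than Google's implementation: the function name `pagerank`, the stopping tolerance `tol`, and the four-page graph are hypothetical, and rank held by pages without outgoing links is spread evenly over all pages, which is one common convention that the text does not prescribe.

```python
def pagerank(out_links, q=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(p_i) = (1 - q)/N + q * sum(PR(p_j) / L(p_j)) until the values settle."""
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # non-zero starting values
    for _ in range(max_iter):
        # Assumption: rank sitting on pages with no out-links is shared by all pages.
        dangling = sum(pr[p] for p in pages if not out_links[p])
        new_pr = {p: (1.0 - q) / n + q * dangling / n for p in pages}
        for j in pages:
            for i in out_links[j]:
                new_pr[i] += q * pr[j] / len(out_links[j])
        delta = sum(abs(new_pr[p] - pr[p]) for p in pages)
        pr = new_pr
        if delta < tol:  # stop once an extra round changes almost nothing
            break
    return pr

# Hypothetical graph: A -> B, C; B -> C; C -> A; D -> C (nothing links to D).
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
print({page: round(value, 4) for page, value in ranks.items()})
```

Page D receives no links, so it keeps only the random-jump share (1 - q)/N = 0.0375, while C, which three pages link to, ranks highest.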
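The PageRank values form an eigenvector of a special matrix. In matrix form, with R the column vector of all PR values:

$$R = \begin{bmatrix} (1-q)/N \\ (1-q)/N \\ \vdots \\ (1-q)/N \end{bmatrix} + q \begin{bmatrix} \ell(p_1,p_1) & \ell(p_1,p_2) & \cdots & \ell(p_1,p_N) \\ \ell(p_2,p_1) & \ell(p_2,p_2) & \cdots & \ell(p_2,p_N) \\ \vdots & \vdots & \ddots & \vdots \\ \ell(p_N,p_1) & \ell(p_N,p_2) & \cdots & \ell(p_N,p_N) \end{bmatrix} R$$

where \ell(p_i, p_j) = 1/L(p_j) if page p_j links to page p_i, and 0 otherwise.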
The graph corresponding to the page transition matrix is:
Each row lists the outbound links of one node; for example, the first row shows that node A links to nodes B and C. Each column lists the inbound links of one node; for example, the first column shows that only node C points to node A.
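The row and column reading of the matrix can be checked with a small sketch. The four-node graph here is hypothetical: it only reproduces the two facts stated above (A links to B and C, and only C links to A), and the remaining edges are made up for illustration.

```python
# Hypothetical graph: A -> B, C; B -> D; C -> A, B, D; D -> C.
nodes = ["A", "B", "C", "D"]
out_links = {"A": ["B", "C"], "B": ["D"], "C": ["A", "B", "D"], "D": ["C"]}

def adjacency_matrix(nodes, out_links):
    """Entry [r][c] is 1 if the node of row r links to the node of column c."""
    position = {name: k for k, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for source, targets in out_links.items():
        for target in targets:
            matrix[position[source]][position[target]] = 1
    return matrix

matrix = adjacency_matrix(nodes, out_links)
out_degree = [sum(row) for row in matrix]        # links leaving each node
in_degree = [sum(col) for col in zip(*matrix)]   # links arriving at each node

for name, row in zip(nodes, matrix):
    print(name, row)
```

The first row reads 0, 1, 1, 0 (A links to B and C), and the first column reads 0, 0, 1, 0 (only C links to A).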
Suppose that when we surf the Internet, the choice of the next page has nothing to do with the pages viewed in the past and depends only on the current page. This selection process can then be regarded as a stochastic process with a finite number of states and discrete time, and its state transition rule is described by a Markov chain.
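According to the basic properties of Markov chains, a regular Markov chain has a stable (stationary) distribution. Writing P for the chain's transition matrix and \pi for the stationary distribution as a row vector, it satisfies:

$$\pi P = \pi, \qquad \sum_i \pi_i = 1$$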
A concrete example follows. Because a page's PR value is shared equally among its outgoing links, divide each entry in a row of the matrix by the number of links in that row to get the following new matrix:
Assuming that the initial PR value of each node is 1, we have the following matrix R
where each row corresponds to one node.
Then transpose the first matrix (the link matrix after the division above) to get
Multiply the transposed matrix by the initial matrix R to get the matrix
Keep multiplying by the transposed matrix until the result converges; the final matrix holds the PR value of each node.
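The steps of this example (divide each row by its number of links, transpose, then multiply repeatedly) can be sketched in plain Python. The 4x4 link matrix below is hypothetical, since the matrices of the original example are not reproduced here, and it is chosen so that every row has at least one link and the iteration settles. No damping factor is applied, matching the steps as described.

```python
# Hypothetical link matrix (rows = out-links): A -> B, C; B -> D; C -> A, B, D; D -> C.
adjacency = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]

def row_normalise(matrix):
    """Divide each entry by the number of links in its row (assumes no empty rows)."""
    return [[value / sum(row) for value in row] for row in matrix]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

m_t = transpose(row_normalise(adjacency))
r = [1.0, 1.0, 1.0, 1.0]  # initial PR value of 1 for every node

for _ in range(100):  # repeated multiplication by the transposed matrix
    r = mat_vec(m_t, r)

print([round(value, 4) for value in r])
```

For this matrix the values settle at 0.5, 0.75, 1.5 and 1.25 for A, B, C and D. They sum to 4 because every node started with a PR value of 1 and each multiplication only redistributes that total.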