Using MapReduce to implement the PageRank algorithm


Reprinted from: http://www.cnblogs.com/fengfenggirl/p/pagerank-introduction.html


PageRank is Google's page-ranking algorithm, and it was the magic behind Google's early success. I had experimented with it before but never understood it thoroughly; these past few days I went through it again, and this post summarizes the basic principles of the PageRank algorithm.

I. What is PageRank

The name PageRank can be read two ways: as "web page rank", or as a reference to Larry Page, Google's co-founder (and later CEO), because he is one of the algorithm's inventors (^_^). The PageRank algorithm computes a PageRank value for every page, then ranks pages by importance according to that value. Its idea is to simulate a leisurely surfer: the surfer first opens a random web page, spends a few minutes on it, then jumps to one of the links that page points to, and keeps wandering aimlessly from page to page in this way. PageRank estimates the probability that this leisurely surfer ends up on each page.

II. The simplest PageRank model

The web pages of the Internet can be viewed as a directed graph in which each page is a node; if page A has a link to page B, there is a directed edge A->B. Here is a simple example (the original figure shows a four-page graph: A links to B, C, and D; B links to A and D; C links to A; and D links to B and C):

In this example there are only four pages. If the surfer is currently on page A, the leisurely surfer jumps to B, C, or D with probability 1/3 each, where 3 is the number of A's out-links; in general, if a page has k out-links, the probability of following any one of them is 1/k. Similarly, the probability of jumping from D to B or C is 1/2 each, and the probability of jumping from B to C is 0. A transfer matrix is commonly used to represent the surfer's jump probabilities: if N is the number of pages, the transfer matrix M is an N*N square matrix; if page j has k out-links, then m[i][j] = 1/k for every page i that j links to, and m[i][j] = 0 for all other pages i. The example graph above corresponds to the following transfer matrix:
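The matrix image did not survive the repost; reconstructed from the link structure above (rows and columns ordered A, B, C, D, with column j holding page j's out-link probabilities), it is:

M = \begin{pmatrix}
  0   & 1/2 & 1 & 0   \\
  1/3 & 0   & 0 & 1/2 \\
  1/3 & 0   & 0 & 1/2 \\
  1/3 & 1/2 & 0 & 0
\end{pmatrix}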

Initially, assume the surfer is equally likely to be on each page, i.e. with probability 1/n, so the initial probability distribution is an n-dimensional column vector V0 whose entries are all 1/n. Right-multiplying the transfer matrix M by V0 gives the probability distribution over the network after one jump, MV0; an (n x n) matrix times an (n x 1) vector is again an (n x 1) vector. The calculation of V1 goes as follows:
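Filling in the missing figure, with V0 = (1/4, 1/4, 1/4, 1/4)' and the matrix M above:

V_1 = M V_0 = \begin{pmatrix}
  0   & 1/2 & 1 & 0   \\
  1/3 & 0   & 0 & 1/2 \\
  1/3 & 0   & 0 & 1/2 \\
  1/3 & 1/2 & 0 & 0
\end{pmatrix}
\begin{pmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{pmatrix}
= \begin{pmatrix} 9/24 \\ 5/24 \\ 5/24 \\ 5/24 \end{pmatrix}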

Note that m[i][j] being nonzero in row i of M means there is a link from page j to page i; multiplying M by V0 therefore accumulates, for each page, the probability flowing in from all pages. For example, the probability of reaching page A becomes 9/24. Having obtained V1, right-multiply M by V1 to get V2, and so on; eventually V converges, i.e. V_n = M V_{n-1} stabilizes. For the example above, repeated iteration converges to V = [3/9, 2/9, 2/9, 2/9]'.
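As a sanity check (this sketch is mine, not from the original post), a few lines of Python/numpy reproduce the converged vector:

import numpy as np

# Transfer matrix of the example graph, pages ordered A, B, C, D.
M = np.array([
    [0,   1/2, 1, 0],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   0, 1/2],
    [1/3, 1/2, 0, 0],
])

v = np.full(4, 1/4)  # V0: the uniform distribution over the four pages
for _ in range(30):  # roughly 30 iterations are enough to converge
    v = M @ v        # V_n = M * V_{n-1}

print(v)             # -> approximately [0.333 0.222 0.222 0.222]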

III. Termination points

The behavior of the surfer is an example of a Markov process. For the iteration to converge, one condition must hold: the graph must be strongly connected, that is, from any page the surfer can reach any other page.

Real web graphs do not satisfy strong connectivity, because some pages do not link to any other page. If we run the computation above, a surfer who reaches such a page is stuck in a dead end, looking around with nowhere to go, and the transfer probability accumulated so far is wiped out; as this goes on, every element of the final probability distribution vector approaches 0. Suppose we remove the link from C to A in the figure above, so that C becomes a termination point; we then get the following graph:

The corresponding transfer matrix is:
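Reconstructed from the modified graph (column C is now all zeros, because C has no out-links):

M = \begin{pmatrix}
  0   & 1/2 & 0 & 0   \\
  1/3 & 0   & 0 & 1/2 \\
  1/3 & 0   & 0 & 1/2 \\
  1/3 & 1/2 & 0 & 0
\end{pmatrix}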

Iterating as before, probability mass leaks out through C on every step, and eventually all elements go to 0:
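Already the first step loses probability mass (the entries of V1 sum to 18/24 rather than 1), and in the limit nothing is left:

V_1 = M V_0 = \begin{pmatrix} 3/24 \\ 5/24 \\ 5/24 \\ 5/24 \end{pmatrix},
\qquad
V_n \to \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}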

IV. The trap problem

Another problem is the trap problem: some pages do not link to any other page, but do link to themselves, for example:

A surfer who reaches page C is caught as if in a trap, sucked into a vortex with no way back out of C. In the end all of the probability mass is transferred to C, the probability values of all other pages become 0, and the entire page ranking loses its meaning. The transfer matrix corresponding to the graph above is:
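Reconstructed from the graph (column C now keeps all of C's probability on C itself):

M = \begin{pmatrix}
  0   & 1/2 & 0 & 0   \\
  1/3 & 0   & 0 & 1/2 \\
  1/3 & 0   & 1 & 1/2 \\
  1/3 & 1/2 & 0 & 0
\end{pmatrix}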

As the iteration continues, it becomes this:
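All of the probability mass ends up on C:

V_n \to \begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}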

V. Solving the termination-point and trap problems

The process above overlooks something: our surfer is a leisurely surfer, not a stupid one. He is smart as well as leisurely. He is leisurely in that he wanders aimlessly, always picking pages at random; he is smart in that when he reaches a termination page or a trap page (such as C in the two examples above), he does not sit there fretting; he simply types a random address into the browser's address bar. That address might even be the page he is already on, but it gives him a chance to escape the abyss. To model this smart, leisurely surfer, we improve the algorithm: at every step, the surfer may no longer want to look at the current page; if so, he does not click any link on it, but quietly types another address into the address bar, jumping to each of the n pages with probability 1/n. Suppose the surfer keeps viewing the current page with probability a; then he jumps via the address bar with probability (1-a), and the original iteration formula becomes:
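The formula image is missing; written out, this is the standard PageRank update, in the post's notation, where e is the n-dimensional all-ones column vector:

V_{n+1} = a \, M V_n + (1 - a) \, \frac{e}{n}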

Now let's recompute the probability distribution for the web graph with the trap:

Repeating the iteration, we get:
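The figure with the result did not survive the repost, and the post does not say which value of a it used here; assuming a = 0.8, the iteration can be reproduced with a small sketch:

import numpy as np

# Transfer matrix of the trap graph (C links only to itself).
M = np.array([
    [0,   1/2, 0, 0],
    [1/3, 0,   0, 1/2],
    [1/3, 0,   1, 1/2],
    [1/3, 1/2, 0, 0],
])

a = 0.8              # assumed damping factor: probability of following a link
n = 4
v = np.full(n, 1/n)  # start from the uniform distribution
for _ in range(50):
    v = a * (M @ v) + (1 - a) / n  # V_{n+1} = a*M*V_n + (1-a)*e/n

print(v)             # -> approximately [0.101 0.128 0.642 0.128]

With a = 0.8 this converges to V = [15/148, 19/148, 95/148, 19/148]'.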

You can see that C still receives a large share of the PageRank value, but the other pages now get some value too; and given C's link structure, it is reasonable that its weight comes out larger.

VI. Using MapReduce to calculate PageRank

The computation above uses repeated matrix multiplication, iterating until the probability distribution vector changes very little between rounds; convergence generally takes around 30 iterations. For the real web the transfer matrix is enormous: there are now more than 10 billion pages, so the transfer matrix would be a 10-billion-by-10-billion matrix, and computing directly by matrix multiplication is infeasible. We need to solve it with MapReduce. In fact, Google originally invented MapReduce precisely for the distributed computation of PageRank over large-scale web data. There are many ways to implement PageRank with MapReduce; here I compute it in a simple way.

Considering that the transfer matrix is a very sparse matrix, we can store it in sparse form: each page in the web graph, together with the pages it links to, becomes one row. The web graph of section IV is then represented as follows:

A    B    C    D
B    A    D
C    C
D    B    C

A has three out-links, pointing to B, C, and D. In fact, the page structure data we crawl looks exactly like this.

1. Map Stage

The map operation processes one row at a time, distributing the current page's probability value over its out-links as 1/k each, where k is the number of out-links of the current page. For example, the first row (with A's current probability 1/4) outputs <B, 1/3*1/4>, <C, 1/3*1/4>, <D, 1/3*1/4>;

2. Reduce Stage

The reduce operation collects the values with the same page ID and accumulates them with weights: pj = a*(p1 + p2 + ... + pm) + (1-a)*1/n, where p1 ... pm are the contributions from the m pages that point to page j, and n is the total number of pages.

The idea is that simple, but in practice there is a wrinkle: in the map stage we need to know the current probability value of the page on each line, so a separate file holds the previous round's probability distribution, and the data must first be sorted so that a page's link line and its probability-value line arrive at the same mapper. The whole process per iteration is: sort first, then run the PageRank map and reduce (see the example input below):
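The flow diagram is missing from the repost. To make the data flow concrete, here is a hypothetical first-round mapper input after sorting, with the format inferred from the mapper code below: link lines come from the crawled graph file, and value lines, tagged with a lowercase 'a' in the second field, carry the previous round's distribution (fields are tab-separated):

A    B    C    D
A    a    0.25
B    A    D
B    a    0.25
C    C
C    a    0.25
D    B    C
D    a    0.25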

Doing one iteration this way takes two MapReduce jobs, but the first is just a simple sort with no other action. I use Python with Hadoop's streaming interface.

The sortmapper.py code is as follows:

#!/bin/python
"""Mapper for sort: pass each input line through unchanged."""
import sys

for line in sys.stdin:
    print(line.strip())

sortreducer.py is the same:

#!/bin/python
"""Reducer for sort: pass each input line through unchanged."""
import sys

for line in sys.stdin:
    print(line.strip())

The pagerankmapper.py code:

1 "Mapper of Pangerank algorithm"
 2 import sys
 3 id1 = Id2 = None
 4 Heros = value = None
 5 Count1 = C Ount2 = 0
 6 7 for line in 
 Sys.stdin:
 8     data = Line.strip (). Split (' \ t ')
 9     If len (data) = = 3 and dat A[1] = = ' A ': # This is the Pangerank
value         count1 + = 1 each         if count1 >= 2
:             print '%s\ t%s '% (id1,0.0)         id1 = data[0] [         value = float (data[2]) +     else: #This the Link relation         id2 = data[0]         heros = data[1:]     if id1 = = Id2 and id1:         v = Value/len (heros)
for         hero in heros:             print '%s\t%s '% (hero,v)         print '%s\t%s ' % (id1,0.0)         id1 = Id2 = None         count1 = 0

The pagerankreducer.py code:
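The reducer listing did not survive the repost. Below is a minimal reconstruction, not the author's original code, that applies the reduce formula pj = a*(p1 + ... + pm) + (1-a)*1/n; the damping factor a and the page count n are assumed values that would have to match the job's configuration:

"""Reducer of the PageRank algorithm (reconstructed sketch)"""
import sys

a = 0.8  # assumed damping factor
n = 4    # assumed total number of pages

cur_id = None
total = 0.0

for line in sys.stdin:
    page_id, value = line.strip().split('\t')
    if cur_id is not None and page_id != cur_id:
        # all contributions for cur_id have arrived: apply the PageRank formula
        print('%s\ta\t%s' % (cur_id, a * total + (1 - a) / n))
        total = 0.0
    cur_id = page_id
    total += float(value)

if cur_id is not None:
    print('%s\ta\t%s' % (cur_id, a * total + (1 - a) / n))

The output lines have the same id / 'a' / value shape as the value lines the mapper expects, so they can be fed into the next round's sort.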
