Introduction to the PageRank algorithm


This explanation combines two articles. The first is "PageRank algorithm introduction and map-reduce implementation", source: http://www.cnblogs.com/fengfenggirl/p/pagerank-introduction.html

The second is "PageRank Introduction - Chuanjiang Q&A.docx": http://docs.babel.baidu.com/doc/ee14bd65-ba71-4ebb-945b-cf279717233b

PageRank, the page-ranking algorithm, was the magic behind Google's fortune. I had experimented with it before but never understood it thoroughly; over the past few days I went through it again, and here I summarize the basic principles of the PageRank algorithm.

I. What is PageRank

The "Page" in PageRank can be read as "web page", making it a page-ranking score, or as Larry Page, since he is one of the algorithm's inventors and was Google's CEO (^_^). The PageRank algorithm computes a PageRank value for each page and then ranks pages by importance according to that value. Its idea is to simulate a leisurely surfer: the surfer opens a random web page, spends a few minutes on it, then jumps to one of the links that page points to, and keeps wandering aimlessly from page to page in this way. PageRank estimates how this leisurely surfer's probability is distributed across the pages.

II. The simplest PageRank model

The web pages on the Internet can be viewed as a directed graph in which each page is a node; if page A has a link to page B, there is a directed edge A->B. Here is a simple example:

In this example there are only four pages. If the surfer is currently at page A, the leisurely surfer will jump to B, C, or D with probability 1/3 each, because A has 3 out-links; in general, if a page has k out-links, the probability of following any particular one of them is 1/k. Similarly, the probability of jumping from D to B or to C is 1/2 each, and the probability of jumping from B to C is 0, since B has no link to C. A transfer (transition) matrix is commonly used to describe the surfer's jump probabilities: if N is the number of pages, the transfer matrix M is an N*N square matrix; if page j has k out-links, then M[i][j] = 1/k for every page i that j links to, and M[i][j] = 0 for all other pages i. The example graph above corresponds to the following transfer matrix:
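(The matrix figure from the original post is not reproduced in this copy. Reconstructing it from the link structure used throughout the example, namely A -> B, C, D; B -> A, D; C -> A; D -> B, C, with rows and columns both ordered A, B, C, D, the transfer matrix is:)

         A     B     C     D
    A [  0    1/2    1     0  ]
    B [ 1/3    0     0    1/2 ]
    C [ 1/3    0     0    1/2 ]
    D [ 1/3   1/2    0     0  ]

Column j holds page j's out-link probabilities, so every column sums to 1.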

Initially, assume the surfer is equally likely to be at any page, i.e. at each page with probability 1/n, so the initial probability distribution is an n-dimensional column vector V0 whose entries are all 1/n. Multiplying the transfer matrix M by V0 gives the probability distribution over the pages after one jump, MV0; an (n x n) matrix times an (n x 1) vector is again an (n x 1) vector. The calculation of V1 is as follows:

Note that M[i][j] is nonzero exactly when there is a link from page j to page i; the first row of M multiplied by V0 accumulates the probability of reaching page A from every page, which comes to 9/24. Having obtained V1, we multiply M by V1 to get V2, and so on; eventually V converges, i.e. Vn = M * V(n-1). For the example graph above, continued iteration converges to V = [3/9, 2/9, 2/9, 2/9]':
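(The intermediate iteration figures are also missing from this copy. As a sanity check, the iteration can be sketched in a few lines of Python with numpy; the matrix below is the reconstructed one above, so treat the link structure as an assumption about the example graph.)

import numpy as np

# Transfer matrix of the example graph; rows and columns are ordered A, B, C, D.
M = np.array([[0.,   1/2., 1.,  0.  ],
              [1/3., 0.,   0.,  1/2.],
              [1/3., 0.,   0.,  1/2.],
              [1/3., 1/2., 0.,  0.  ]])

v = np.full(4, 1/4.)       # V0: the surfer starts on each page with probability 1/n
print(M.dot(v))            # V1 = M*V0 = [9/24, 5/24, 5/24, 5/24]

for _ in range(50):        # keep multiplying: Vn = M*V(n-1)
    v = M.dot(v)
print(v)                   # about [0.3333, 0.2222, 0.2222, 0.2222], i.e. [3/9, 2/9, 2/9, 2/9]

The first multiplication reproduces the 9/24 for page A mentioned above, and repeated multiplication settles at [3/9, 2/9, 2/9, 2/9].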

III. Termination points

The surfer's behavior is an example of a Markov process. For it to converge, one condition must hold:

    • The graph is strongly connected, that is, any page can be reached from any other page.

Real web graphs do not satisfy strong connectivity, because some pages do not link to any other page. If we run the computation above, a surfer who reaches such a page is cornered, at a loss with nowhere to go, and the probability that has accumulated there is simply lost at the next step; as this continues, every element of the final probability distribution vector approaches 0. Suppose we remove the link from C to A in the figure above, so that C becomes a termination point; we get the following figure:

The corresponding transfer matrix is:
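(The matrix figure is again missing from this copy. With the C -> A link removed, column C of the reconstructed matrix becomes all zeros, which is exactly what makes the total probability leak away:)

         A     B     C     D
    A [  0    1/2    0     0  ]
    B [ 1/3    0     0    1/2 ]
    C [ 1/3    0     0    1/2 ]
    D [ 1/3   1/2    0     0  ]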

With successive iterations, every element of V eventually goes to 0.

IV. The trap problem

Another problem is the trap problem: some pages have no links to other pages but do link to themselves, for example the following graph:

Once the surfer reaches page C, it is as if he has fallen into a trap, caught in a vortex he can never leave. Eventually all of the probability mass is transferred to C, the probability value of every other page drops to 0, and the whole page ranking loses its meaning. The transfer matrix corresponding to the graph above is:
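(Reconstructed as before; C now links only to itself, so column C has a single 1 on the diagonal and C keeps whatever probability reaches it:)

         A     B     C     D
    A [  0    1/2    0     0  ]
    B [ 1/3    0     0    1/2 ]
    C [ 1/3    0     1    1/2 ]
    D [ 1/3   1/2    0     0  ]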

As the iteration continues, all of the probability flows into C, and V approaches [0, 0, 1, 0]'.

V. Solving the termination-point and trap problems

In the process above we overlooked one thing: our surfer is a leisurely surfer, not a foolish one. He is smart as well as leisurely. He wanders aimlessly and always picks pages at random, but when he lands on a termination page or a trap page (like C in the two examples above), he does not panic helplessly; he simply types a random address into the browser's address bar. That address may of course be the current page, but it gives him a chance to escape the dead end. To model this smart, leisurely surfer we improve the algorithm: at every step the surfer may decide he no longer wants to look at the current page; if so, instead of clicking a link on it, he quietly types another address into the address bar, jumping to each of the n pages with probability 1/n. Suppose the probability that the surfer keeps viewing (and clicking from) the current page at each step is a; then the probability that he jumps via the address bar is (1 - a), and the original iteration formula becomes:
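(The formula image is missing from this copy. Written out, the damped update the paragraph describes, and which the reducer code later in this post implements, is:)

    V' = a * M * V + (1 - a) * e / n

where e is the n-dimensional column vector of all ones: with probability a the surfer follows a link from the current page, and with probability (1 - a) he teleports to a page chosen uniformly at random, so every page receives a fixed (1 - a)/n share on top of the damped link contribution a*M*V.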

Now let's compute the probability distribution for the web graph with the trap:

Repeating the iteration, we get:
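(The iteration figure is missing here. A minimal numpy sketch of the damped iteration on the trap graph, assuming the reconstructed matrix and the weight a = 0.8 used later in this post, gives the converged values:)

import numpy as np

# Trap graph from section IV: C links only to itself; rows/columns ordered A, B, C, D.
M = np.array([[0.,   1/2., 0., 0.  ],
              [1/3., 0.,   0., 1/2.],
              [1/3., 0.,   1., 1/2.],
              [1/3., 1/2., 0., 0.  ]])

a, n = 0.8, 4
v = np.full(n, 1.0 / n)               # start from the uniform distribution
for _ in range(40):
    v = a * M.dot(v) + (1 - a) / n    # damped update: V' = a*M*V + (1-a)*e/n
print(v)                              # about [0.1014, 0.1284, 0.6419, 0.1284]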

You can see that although C takes a large share of the PageRank value, the other pages still get some value; and given C's link structure, it is reasonable that its weight should be larger.

VI. Using Map-Reduce to calculate PageRank

The computation above uses matrix multiplication iteratively, until the probability distribution vector barely changes between iterations; in general it converges after 30-some iterations. For the real web structure the transfer matrix is enormous: there are now more than 10 billion pages, so the transfer matrix would be a 10-billion by 10-billion matrix, and computing it by direct matrix multiplication is not feasible; a Map-Reduce approach is needed instead. In fact, Google originally invented Map-Reduce precisely for the distributed computation of PageRank over large-scale web graphs. There are many ways to implement PageRank with Map-Reduce; here I give a simple one.

Since the transfer matrix is very sparse, we can store it in sparse form: each page of the web graph, together with the pages it links to, becomes one row. The web graph from Section IV is then represented as follows:

A    B    C    D
B    A    D
C    C
D    B    C

The first row says that A has three out-links, pointing to B, C, and D. This is, in fact, the form in which crawled page-structure data naturally comes.

1. The Map stage

The map operation processes each row: for every out-link it emits the current page's probability value times 1/k, where k is the number of out-links of the current page. For example, the first row produces <B, 1/3*1/4>, <C, 1/3*1/4>, <D, 1/3*1/4>.
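For the first iteration, where every page starts with probability 1/4, the map output over the four rows would be: from A's row, <B, 1/12>, <C, 1/12>, <D, 1/12>; from B's row, <A, 1/8>, <D, 1/8>; from C's row, <C, 1/4>; and from D's row, <B, 1/8>, <C, 1/8>. (The actual mapper below also emits <page, 0.0> for the page itself, so that a page receiving no contributions still appears in the output.)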

2. The Reduce stage

The reduce operation collects the emitted values by page ID, accumulates them, and applies the damping weight: pj = a*(p1 + p2 + ... + pm) + (1-a)*1/n, where m is the number of pages that link to page j and n is the total number of pages.
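Continuing the first-iteration example with a = 0.8 and n = 4: page C receives 1/12 from A, 1/4 from itself, and 1/8 from D, a total of 11/24 ≈ 0.4583, so pC = 0.8 * 0.4583 + 0.2 / 4 ≈ 0.4167, which matches the first row of the iteration table at the end of this post.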

The idea is that simple, but in practice the map stage needs to know the current page's probability value for each row. A separate file holds the previous round's probability distribution, and we first sort so that a page's link row and its probability-value row end up in the same mapper. The whole process is as follows:
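(As an illustration of what the mapper sees, assuming the file formats implied by the scripts below: the probability file stores one line per page in the form "<page> TAB a TAB <value>", where the literal 'a' in the second field marks it as a value line, and the link file stores the tab-separated rows shown above. After concatenating the two files and sorting, the two lines for each page end up adjacent, e.g. for page A in the first iteration:)

    A    a    0.25
    A    B    C    D

(The relative order of a page's two lines depends on the sort locale; the mapper below handles either order.)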

One iteration therefore amounts to two MapReduce jobs, but the first MapReduce is just a simple sort that does no other work. Python is used via Hadoop streaming.

The sortmapper.py code is as follows:

1 #!/bin/python2 "Mapper for Sort" 3 import sys4 for line in Sys.stdin:5      print Line.strip ()

sortreducer.py is the same:

1 #!/bin/python2 "Reducer for Sort" 3 import sys4 for line in Sys.stdin:5       print Line.strip ()

The pagerankmapper.py code:

1 "Mapper of Pangerank algorithm" 2 import sys 3 ID1 = ID2 = None 4 Heros = value = None 5 count1 = Count2 = 0 6  7 for Sys.stdin:8     data = Line.strip (). Split (' \ t ') 9     If len (data) = = 3 and data[1] = = ' A ': # this is the PA Ngerank value10         count1 + = 111         if count1 >= 2:12             print '%s\t%s '% (id1,0.0)         id1 = Data[0]15
   value = float (data[2]) +     else: #This the link relation17         id2 = data[0]18         heros = data[1:]19     if id1 = = Id2 and id1:20         v = value/len (heros) for         hero in heros:22             print '%s\t%s '% (hero,v)         print '%s \t%s '% (id1,0.0)         id1 = Id2 = None25         count1 = 0

The pagerankreducer.py code:

1 "Reducer of PageRank algorithm" 2 import sys 3 last = None 4 values = 0.0 5 Alpha = 0.8 6 N = 4# Size of the Web PA GES 7 for line in Sys.stdin:8     data = Line.strip (). Split (' \ t ') 9     hero,value = Data[0],float (data[1])     if data [0]! = last:11         if last:12             values = Alpha * values + (1-alpha)/N13             print '%s\ta\t%s '% (last,values)         Last = data[0]15         values = value16     else:17         values + = value #accumulate The page rank value18 if last:19     Values = Alpha * values + (1-alpha)/N20     print '%s\ta\t%s '% (last,values)

The following script imitates the Map-Reduce process locally under Linux:

#!/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
export PATH

max=10
for i in `seq 1 $max`
do
    echo "$i"
    cat links.txt pagerank.value > tmp.txt
    cat tmp.txt | sort | python pagerankmapper.py | sort | python pagerankreducer.py > pagerank.value
done
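(To actually run this sketch, links.txt holds the four tab-separated rows from the sparse representation above, and pagerank.value must be seeded with the initial uniform distribution in the same three-field format the reducer emits; the file names and seed values here follow from the scripts rather than being stated explicitly in the original:)

    A    a    0.25
    B    a    0.25
    C    a    0.25
    D    a    0.25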

The mapper and reducer code can run on Hadoop without any modification; the Hadoop streaming invocation is:

#!/bin/bash

# sort
mapper=sortmapper.py
reducer=sortreducer.py
input1="yours hdfs dir"/links.txt
input2="yours hdfs dir"/pagerank.value
output="yours hdfs dir"/tmp.txt

# both the link file and the value file are inputs to the sort job
hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -mapper $mapper \
    -reducer $reducer -file *.py \
    -input "$input1" -input "$input2" \
    -output "$output"

# calculate PageRank
mapper=pagerankmapper.py
reducer=pagerankreducer.py
input="yours hdfs dir"/tmp.txt
output="yours hdfs dir"/pagerank.value

hadoop jar contrib/streaming/hadoop-*streaming*.jar \
    -mapper $mapper \
    -reducer $reducer -file *.py \
    -input "$input" \
    -output "$output"

See the references for how to use Python with Hadoop. My Python code reads with a strong C flavor; please bear with me!

Running on the trap graph with a cycle from Section IV, with weight a = 0.8, for 40 iterations gives the following results:

Iteration   A               B               C               D
    1       0.15            0.216666666667  0.416666666667  0.216666666667
    2       0.136666666666  0.176666666666  0.51            0.176666666666
    3       0.120666666666  0.157111111111  0.565111111111  0.157111111111
    4       0.112844444444  0.145022222222  0.597111111111  0.145022222222
    5       0.108008888889  0.138100740741  0.615789629629  0.138100740741
    6       0.105240296296  0.134042666667  0.62667437037   0.134042666667
    7       0.103617066667  0.131681145679  0.633020641975  0.131681145679
    8       0.102672458272  0.130303676049  0.636720189629  0.130303676049
    9       0.10212147042   0.129500792625  0.638876944329  0.129500792625
   10       0.10180031705   0.129032709162  0.640134264625  0.129032709162
   11       0.101613083665  0.128759834878  0.640867246578  0.128759834878
   12       0.101503933951  0.128600756262  0.641294553524  0.128600756262
   13       0.101440302505  0.128508018225  0.641543661044  0.128508018225
   14       0.10140320729   0.128453954625  0.64168888346   0.128453954625
   15       0.10138158185   0.128422437127  0.641773543895  0.128422437127
   16       0.101368974851  0.128404063344  0.64182289846   0.128404063344
   17       0.101361625338  0.128393351965  0.641851670733  0.128393351965
   18       0.101357340786  0.128387107543  0.641868444129  0.128387107543
   19       0.101354843017  0.128383467227  0.64187822253   0.128383467227
   20       0.101353386891  0.128381345029  0.641883923053  0.128381345029
   21       0.101352538012  0.128380107849  0.641887246292  0.128380107849
   22       0.10135204314   0.128379386609  0.641889183643  0.128379386609
   23       0.101351754644  0.128378966148  0.641890313062  0.128378966148
   24       0.101351586459  0.128378721031  0.641890971481  0.128378721031
   25       0.101351488412  0.128378578135  0.64189135532   0.128378578135
   26       0.101351431254  0.128378494831  0.641891579087  0.128378494831
   27       0.101351397932  0.128378446267  0.641891709536  0.128378446267
   28       0.101351378507  0.128378417955  0.641891785584  0.128378417955
   29       0.101351367182  0.128378401451  0.641891829918  0.128378401451
   30       0.10135136058   0.128378391829  0.641891855763  0.128378391829
   31       0.101351356732  0.12837838622   0.64189187083   0.12837838622
   32       0.101351354488  0.12837838295   0.641891879614  0.12837838295
   33       0.10135135318   0.128378381043  0.641891884735  0.128378381043
   34       0.101351352417  0.128378379932  0.64189188772   0.128378379932
   35       0.101351351973  0.128378379284  0.64189188946   0.128378379284
   36       0.101351351714  0.128378378906  0.641891890474  0.128378378906
   37       0.101351351562  0.128378378686  0.641891891065  0.128378378686
   38       0.101351351474  0.128378378558  0.64189189141   0.128378378558
   39       0.101351351423  0.128378378483  0.641891891611  0.128378378483
   40       0.101351351393  0.128378378439  0.641891891728  0.128378378439

You can see that the PageRank values have essentially stabilized, and they agree with the matrix-iteration results computed earlier.

Note added 2014.11.16: the Map-Reduce computation described above is problematic on a multi-node cluster. Because of data sharding, the output of some nodes may not land in the same block as the PageRank value file, so in the second map's input the contribution of those nodes to other nodes is ignored. The problem does not occur on a single node. (Thanks to netizen @Orange.)

That concludes this introduction to PageRank. If you want to go deeper, see the original paper or the references below.

References

1. Mining of Massive Datasets

2. An Introduction to Information Retrieval

3. Using Python to manipulate Hadoop

4. A JavaScript visualization of the PageRank calculation process (may require a proxy to access), on the author's blog
