PageRank: algorithm overview, MapReduce design ideas, and source code analysis


I have been interested in the PageRank algorithm for a long time, but only at the conceptual level, without studying it in depth. While learning and summarizing MapReduce examples recently, I revisited the PageRank algorithm and implemented it on top of MapReduce.

1. What is PageRank?

PageRank, also known as page rank or page level, is named after Larry Page, Google's co-founder. It computes a PageRank value for each page and ranks pages by importance according to that value. The basic idea is that for a page A, the more pages that link to A, and the larger the PageRank values of those linking pages, the larger the PageRank value of page A.

2. Simple PageRank calculation

First, we simplify the entire Web: 1. treat each page as a node; 2. if page A links to page B, add a directed edge from A to B. The whole Web is thus abstracted into a directed graph.
Now suppose the entire Web consists of only four pages, A, B, C, and D, with the link structure described below.

Clearly this graph is strongly connected (every node can be reached from any other node). Page A links to pages B, C, and D, that is, A has 3 outlinks, so it jumps to each of B, C, and D with probability 1/3. In general, if a page has k outlinks, the probability of jumping to each of them is 1/k. Likewise, the probabilities of B jumping to A, C, D are 1/2, 0, 1/2; the probabilities of C jumping to A, B, D are 1, 0, 0; and the probabilities of D jumping to A, B, C are 0, 1/2, 1/2.

Usually we use a suitable data structure to represent the link relationships between pages. Suppose there are n pages in total; we build an n x n matrix in which row i holds the probabilities with which the other pages transfer to page i, and column j holds the probabilities with which page j transfers to the other pages. Such a matrix is called a transfer matrix. For the graph above, the transfer matrix M is:
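Written out from the jump probabilities listed above (pages ordered A, B, C, D, column j giving page j's outgoing probabilities):

              A     B     C     D
        A  [  0    1/2    1     0  ]
        B  [ 1/3    0     0    1/2 ]
        C  [ 1/3    0     0    1/2 ]
        D  [ 1/3   1/2    0     0  ]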

In M, the first column holds the probabilities with which page A transfers to each page, and the first row holds the probabilities with which each page transfers to page A. Initially every page has the same PageRank value, 1/n, which here is 1/4. Then, for page A, from the PageRank value of each page and the probability with which each page transfers to A, we can compute A's PageRank value for the next round. Here, page B transfers 1/2 of its PageRank to A and page C transfers all of its PageRank to A, so A's new PageRank value is 1/4 * 1/2 + 1/4 * 1 = 9/24.

For ease of calculation, we collect the initial PageRank values of all pages into a column vector V0. Then, using the transfer matrix, we can compute the new round of PageRank values for every page at once: V1 = M * V0. With V0 = (1/4, 1/4, 1/4, 1/4)^T this gives V1 = (9/24, 5/24, 5/24, 5/24)^T.

Having obtained the new PageRank vector V1, we multiply it by M again to get the next update, and keep iterating in this way. It can be proved that V eventually converges; at that point the iteration stops and V holds the final PageRank value of each page.
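As an illustration, here is a minimal single-machine sketch of this iteration for the four-page example (the class name SimplePageRank and the 40-iteration cap are only illustrative, not part of the original program):

// Single-machine sketch of the power iteration V_{k+1} = M * V_k
// for the four-page example (pages ordered A, B, C, D).
public class SimplePageRank {
    public static void main(String[] args) {
        double[][] m = {
                {0.0,     1.0 / 2, 1.0, 0.0},
                {1.0 / 3, 0.0,     0.0, 1.0 / 2},
                {1.0 / 3, 0.0,     0.0, 1.0 / 2},
                {1.0 / 3, 1.0 / 2, 0.0, 0.0}
        };
        double[] v = {0.25, 0.25, 0.25, 0.25};       // V0: every page starts at 1/4
        for (int iter = 0; iter < 40; iter++) {      // iterate until the vector (approximately) converges
            double[] next = new double[v.length];
            for (int i = 0; i < v.length; i++) {
                for (int j = 0; j < v.length; j++) {
                    next[i] += m[i][j] * v[j];       // row i of M times the current vector
                }
            }
            v = next;
        }
        System.out.println(java.util.Arrays.toString(v));  // final PageRank values of A, B, C, D
    }
}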

3. Handling Dead Ends (termination points)

The PageRank calculation above requires the entire Web graph to be strongly connected. In reality the Web is not strongly connected; there are pages with no outlinks at all, called dead ends (termination points). For example, consider the following graph:

Here node C is a dead end. The reason the algorithm above converges successfully is, to a large extent, that every column of the transfer matrix sums to 1 (every page has at least one outlink). When node C has no outlinks, the corresponding column of the transfer matrix M becomes all zeros, as shown below.
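Assuming the graph is the four-page example from section 2 with C's only outlink removed (which is consistent with the calculation in the paragraphs below: A links to B, C, D; B links to A, D; D links to B, C; C links to nothing), the transfer matrix becomes:

              A     B     C     D
        A  [  0    1/2    0     0  ]
        B  [ 1/3    0     0    1/2 ]
        C  [ 1/3    0     0    1/2 ]
        D  [ 1/3   1/2    0     0  ]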

Starting from this transfer matrix and the initial PageRank column vector, some of the total PageRank leaks away at every iteration, because the column for C sums to 0, and the PageRank vector eventually converges to all zeros.

One way to solve this problem is to repeatedly remove the dead-end nodes and their related edges from the graph. The removal has to be iterated, because deleting one dead end may create new dead ends, so we keep going until no dead ends remain. Then we compute PageRank for all remaining nodes, and finally compute the PageRank of each removed dead end in the reverse order of removal.

For example, we first remove node C and find that no new dead ends appear. Then A, B, and D compute their PageRank values with the method above: their initial PageRank values are all 1/3, A now has two outlinks, B has two outlinks, and D has one outlink. Assuming the resulting PageRank of A is x, of B is y, and of D is z, the PageRank value of C is 1/3 * x + 1/2 * z, because in the original graph A had three outlinks and D had two, one of each pointing to C.
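A minimal single-machine sketch of the removal step is shown below; it is not part of the article's MapReduce program, and the adjacency map simply encodes the dead-end example assumed above:

import java.util.*;

// Repeatedly remove dead ends (nodes with no outlinks) and the edges pointing to them,
// keeping the removal order so their PageRank can be reconstructed afterwards in reverse.
public class DeadEndRemoval {
    public static void main(String[] args) {
        Map<String, Set<String>> links = new HashMap<>();
        links.put("A", new HashSet<>(Arrays.asList("B", "C", "D")));
        links.put("B", new HashSet<>(Arrays.asList("A", "D")));
        links.put("C", new HashSet<>());                       // C is the dead end
        links.put("D", new HashSet<>(Arrays.asList("B", "C")));

        Deque<String> removed = new ArrayDeque<>();            // stack of removed dead ends
        boolean changed = true;
        while (changed) {                                      // removing one dead end may create new ones
            changed = false;
            for (String node : new ArrayList<>(links.keySet())) {
                if (links.get(node).isEmpty()) {
                    links.remove(node);                        // drop the dead end itself
                    for (Set<String> out : links.values()) {
                        out.remove(node);                      // drop all edges pointing to it
                    }
                    removed.push(node);
                    changed = true;
                }
            }
        }
        // Popping the stack yields the reverse of the removal order,
        // which is the order in which the dead ends' PageRank is reconstructed.
        System.out.println("Remaining graph: " + links);
        System.out.println("Reconstruction order: " + removed);
    }
}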

4. Handling Spider Traps

As you can imagine, turning the real Web's link relationships into a transfer matrix inevitably produces a sparse matrix. Repeatedly multiplying by a sparse matrix makes the intermediate PageRank vectors increasingly uneven (a small fraction of the values become large, while most become small or close to 0). Spider-trap nodes exacerbate this effect: these are nodes that do have outlinks, but only link back to themselves. For example:

If you iterate on this graph with the method above, you will find that the PageRank of every node is gradually transferred to node C, while the PageRank of all other nodes approaches zero.

To solve this problem, we need to smooth the PageRank calculation by adding teleporting (a random jump factor). The intuition is that a user browsing the Web does not only follow links; with some probability he or she types an address directly into the address bar. This avoids the situation where the user gets stuck on a page that only links to itself, or on a page with no outlinks at all.

After adding the jump factor, the formula for computing the PageRank vector becomes:

    V' = (1 - β) * M * V + β * e / N

where β is usually set to a small value (0.2 or 0.15), e is the column vector whose components are all 1, and N is the total number of pages. The factor 1/N appears because a random jump lands on any given page with probability 1/N; with β = 0.2 this matches the 0.8 * PR + 0.2 * 0.25 smoothing applied in the code in section 6. Each new PageRank vector therefore depends partly on the transfer matrix and partly on the small probability of a random jump.

Calculating PageRank with this improved formula suppresses the spider-trap effect, and every page ends up with a reasonable PageRank value.
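In code, the only change to the single-machine sketch from section 2 is the update step. Here is a hedged sketch of that damped update (dampedStep is a hypothetical helper, not part of the article's program; β = 0.2 matches the 0.8f*pr + 0.2f*0.25f line in the MapReduce code below):

// One damped update: v' = (1 - beta) * M * v + beta * e / n, with e the all-ones vector.
static double[] dampedStep(double[][] m, double[] v, double beta) {
    int n = v.length;
    double[] next = new double[n];
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            sum += m[i][j] * v[j];                 // contribution via the transfer matrix
        }
        next[i] = (1 - beta) * sum + beta / n;     // plus the random-jump share
    }
    return next;
}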

5. PageRank design ideas based on MapReduce

Based on the example above, our input sample looks like this:
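A hypothetical version of that sample, for the four-page graph of section 2 with every page starting at PageRank 0.25 (one line per page):

A 0.25 B C D
B 0.25 A D
C 0.25 A
D 0.25 B C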

In each row, the first column is a page, the second column is that page's current PageRank value, and the remaining columns are the pages it links to.
Because the PageRank value has to be computed iteratively, each MapReduce job produces output in the same format as its input, so that the output of one job can be fed directly into the next job as input.

So after each pass, the output should look like this:
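For instance, with the hypothetical input above, the first pass (including the jump-factor smoothing that the reduce step in section 6 applies) would produce output along these lines, with the exact decimals depending on float rounding:

A   0.35 B C D
B   0.21666667 A D
C   0.21666667 A
D   0.21666667 B C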

5.1 Design of the map process
    • Parse each line of text to get the current page, the PageRank value of the current page, and the other pages the current page links to.
    • Count the pages the current page links to, then compute the current page's contribution to each of them.
    • Two kinds of output are produced:
      • In the first kind of < key,value>, key is one of the outlinked pages and value is the contribution the current page gives to it.
      • In the second kind of < key,value>, key is the current page and value is the list of all pages it links to.
      • To distinguish the two kinds, the value of the first kind is prefixed with "@" and the value of the second kind is prefixed with "&".

After the map process, you get the following results:
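Continuing with the hypothetical input above, the mapper would emit pairs along these lines:

From "A 0.25 B C D":  < B, @0.0833333 >, < C, @0.0833333 >, < D, @0.0833333 >, < A, & B C D >
From "B 0.25 A D":    < A, @0.125 >, < D, @0.125 >, < B, & A D >
From "C 0.25 A":      < A, @0.25 >, < C, & A >
From "D 0.25 B C":    < B, @0.125 >, < C, @0.125 >, < D, & B C >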

After the map output is written, the shuffle phase sorts and merges it (this is done automatically by the framework), and the results look like the following:
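For the hypothetical pairs above, each reduce key would then receive a merged list roughly like this:

< A, [@0.125, @0.25, & B C D] >
< B, [@0.0833333, @0.125, & A D] >
< C, [@0.0833333, @0.125, & A] >
< D, [@0.0833333, @0.125, & B C] >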

As the shuffled results show, each output key of the shuffle phase is a page, and the corresponding value is a list containing two kinds of entries: the contribution values received from other pages, and the list of all pages the key page links to.

5.2 Design of the reduce process
    • The output of the shuffle phase is the input of reduce.
    • The key of the reduce input is used directly as the output key.
    • The value of the reduce input is parsed as a list:
      • if an entry in the list starts with "@", the string after "@" is converted to a float and added to the running sum of contributions;
      • if an entry starts with "&", the string after "&" is extracted as the current page's outlink list;
      • the summed contribution values (smoothed with the jump factor from section 4) and the extracted outlink string are concatenated to form the output value of reduce.
6. Source Code Analysis

Let's take a look at the PageRank implementation code:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Pagerank_fourth {

    /* map process */
    public static class LxnMapper extends Mapper<Object, Text, Text, Text> {
        private String id;
        private float pr;
        private int count;
        private float average_pr;

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer str = new StringTokenizer(value.toString()); // parse the input line
            id = str.nextToken();                    // first token: the current page
            pr = Float.parseFloat(str.nextToken());  // second token: its current PageRank value
            count = str.countTokens();               // remaining tokens: number of pages it links to
            average_pr = pr / count;                 // contribution of the current page to each outlinked page
            String linkIds = "&";                    // the two kinds of output are tagged with '@' and '&'
            while (str.hasMoreTokens()) {
                String linkId = str.nextToken();
                context.write(new Text(linkId), new Text("@" + average_pr)); // <outlinked page, contribution received>
                linkIds += " " + linkId;
            }
            context.write(new Text(id), new Text(linkIds)); // <current page, all outlinked pages>
        }
    }

    /* reduce process */
    public static class LxnReduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String lianjie = "";
            float pr = 0;
            /* For every val in values, the first character ('@' or '&') tells us whether to add it
               to the sum of contributions received by the current page (its new PageRank value)
               or to take it as the list of all pages the current page links to. */
            for (Text val : values) {
                if (val.toString().substring(0, 1).equals("@")) {
                    pr += Float.parseFloat(val.toString().substring(1));
                } else if (val.toString().substring(0, 1).equals("&")) {
                    lianjie += val.toString().substring(1);
                }
            }
            pr = 0.8f * pr + 0.2f * 0.25f; // add the jump factor to smooth the result
            String result = pr + lianjie;
            context.write(key, new Text(result));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "172.16.10.15:9001");
        String pathIn1 = "hdfs://172.16.10.15:9000/user/hadoop/pagerank_fourthinput";
        String pathOut = "hdfs://172.16.10.15:9000/user/hadoop/pagerank_fourthoutput";
        Job job = new Job(conf, "page rank");
        job.setJarByClass(Pagerank_fourth.class);
        job.setMapperClass(LxnMapper.class);
        job.setReducerClass(LxnReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(pathIn1));
        FileOutputFormat.setOutputPath(job, new Path(pathOut));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This completes a single iteration, but the PageRank values usually need 30 to 40 iterations to converge, so the program needs a small change to run the iterations automatically.
Only the main function needs to change. The modified main function is shown below:

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "172.16.10.15:9001");
        String pathIn1 = "hdfs://172.16.10.15:9000/user/hadoop/pagerank_fourthinput";
        String pathOut = "hdfs://172.16.10.15:9000/user/hadoop/pagerank_fourthoutput0";
        for (int i = 1; i <= 40; i++) {              // add a for loop to run the iterations
            Job job = new Job(conf, "page rank");
            job.setJarByClass(Pagerank_fourth.class);
            job.setMapperClass(LxnMapper.class);
            job.setReducerClass(LxnReduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(pathIn1));
            FileOutputFormat.setOutputPath(job, new Path(pathOut));
            pathIn1 = pathOut;                       // this round's output becomes the next round's input
            pathOut = pathOut + i;                   // give the next round a new output path
            job.waitForCompletion(true);             // System.exit() removed so the loop can continue
        }
    }

After iterating 40 times in this way, the PageRank values of the pages converge to their final values.

