PageRank page ranking algorithm


The links between pages on the Internet can be viewed as a graph, and the importance of a page is determined by the votes it receives from the pages that link to it. A page with more inbound links receives a higher rank, while a page with few or no inbound links receives a lower one. The higher a page's PR value, the more important the page.

Suppose pages A, B, C, and D form a collection, and pages B, C, and D all link to A. Then A's PR value is the sum of the PR values of B, C, and D:

PR(A) = PR(B) + PR(C) + PR(D)

Continuing with the assumption above, suppose B links not only to A but also to C and D, C links not only to A but also to B, and D links only to A. When calculating A's PR value, B can cast only 1/3 of its PR value as a vote and C only 1/2, while D, linking only to A, casts its full PR value. The sum for A thus becomes:

PR(A) = PR(B)/3 + PR(C)/2 + PR(D)

Generalizing, the formula for calculating a page's PR value is:

PR(u) = Σ_{v ∈ B_u} PR(v) / L(v)

where B_u is the set of all pages that link to page u, v is a page belonging to B_u, and L(v) is the number of outbound links of page v (that is, its out-degree).

Figure 1-1

Table 1-2  PR values calculated from Figure 1-1

                   PR(A)     PR(B)     PR(C)     PR(D)
Initial value      0.25      0.25      0.25      0.25
One iteration      0.125     0.333     0.083     0.458
Two iterations     0.1665    0.4997    0.0417    0.2912
......             ......    ......    ......    ......
N iterations       0.1999    0.3999    0.0666    0.3333

As Table 1-2 shows, after several iterations the PR values gradually converge and stabilize.
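Figure 1-1 itself is not reproduced here, but the values in Table 1-2 are consistent with the link structure of the dataset used later in Code 1-7 (a→b,c,d; b→a,d; c→d; d→b). Assuming that structure, a minimal stand-alone Java sketch of the synchronous iteration reproduces the table:

public class SimplePageRank {
    public static void main(String[] args) {
        // Initial PR values from Table 1-2 (assumed graph: a->b,c,d; b->a,d; c->d; d->b)
        double a = 0.25, b = 0.25, c = 0.25, d = 0.25;
        for (int i = 1; i <= 20; i++) {
            // Each page's new PR is the sum of PR(v)/L(v) over its in-links,
            // computed from the previous iteration's values (synchronous update)
            double na = b / 2;              // a <- b (b has out-degree 2)
            double nb = a / 3 + d;          // b <- a (out-degree 3), d (out-degree 1)
            double nc = a / 3;              // c <- a (out-degree 3)
            double nd = a / 3 + b / 2 + c;  // d <- a (3), b (2), c (1)
            a = na; b = nb; c = nc; d = nd;
            System.out.printf("iteration %d: %.4f %.4f %.4f %.4f%n", i, a, b, c, d);
        }
    }
}

The first two passes print 0.125/0.333/0.083/0.458 and 0.1665/0.4997/0.0417/0.2912, and by iteration 20 the values settle near 0.1999/0.3999/0.0666/0.3333, matching the table.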

In a real network environment, however, hyperlinks are not so ideal, and this simple PageRank model runs into the following two problems:

1. Rank leak

As shown in Figure 1-3, if a page has no outbound links, like node A, a rank leak occurs: after many iterations, the PR values of all pages tend to 0.

Figure 1-3

                   PR(A)     PR(B)     PR(C)     PR(D)
Initial value      0.25      0.25      0.25      0.25
One iteration      0.125     0.125     0.25      0.25
Two iterations     0.125     0.125     0.125     0.25
Three iterations   0.125     0.125     0.125     0.125
......             ......    ......    ......    ......
N iterations       0         0         0         0

Table 1-4  Results after several iterations of Figure 1-3

2. Rank sink

As shown in Figure 1-5, if a page has no inbound links, like node A, then after many iterations the PR value of A tends to 0.

Figure 1-5

                   PR(A)     PR(B)     PR(C)     PR(D)
Initial value      0.25      0.25      0.25      0.25
One iteration      0         0.375     0.25      0.375
Two iterations     0         0.375     0.375     0.25
Three iterations   0         0.25      0.375     0.375
......             ......    ......    ......    ......
N iterations       0         ......    ......    ......

Table 1-5  Results after several iterations of Figure 1-5

Now suppose a user starts browsing from a random page and keeps clicking the links on the current page, until either reaching a page with no outbound links or getting bored, at which point the user jumps to another random page and starts a new round of browsing. This model is clearly closer to real user habits. To handle pages that have no outbound links, a damping factor d is introduced to represent the probability that a user who has reached a page continues browsing by following its links; it is generally set to 0.85, and 1 - d is then the probability that the user stops clicking and jumps to a random page to start a new round of browsing. The PageRank formula under this random-browsing model (as implemented in Code 1-9 below) becomes:

PR(u) = (1 - d) + d × Σ_{v ∈ B_u} PR(v) / L(v)

Figure 1-6

Code 1-7 sets up the dataset based on Figure 1-6

[email protected]:/data# cat links
a	b,c,d
b	a,d
c	d
d	b
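As a quick sanity check of the damped formula against this dataset, here is a minimal stand-alone sketch (assuming synchronous updates and the initial PR of 1.0 that Code 1-8 assigns); after 10 iterations it should reproduce, up to floating-point formatting, the final output shown later in Code 1-12:

public class DampedPageRank {
    private static final double DAMPING = 0.85; // damping factor d

    public static void main(String[] args) {
        // Initial PR of 1.0 per page, matching GraphBuilder in Code 1-8
        double a = 1.0, b = 1.0, c = 1.0, d = 1.0;
        for (int i = 0; i < 10; i++) {
            // PR(u) = (1 - d) + d * sum of PR(v)/L(v) over u's in-links
            double na = (1 - DAMPING) + DAMPING * (b / 2);
            double nb = (1 - DAMPING) + DAMPING * (a / 3 + d);
            double nc = (1 - DAMPING) + DAMPING * (a / 3);
            double nd = (1 - DAMPING) + DAMPING * (a / 3 + b / 2 + c);
            a = na; b = nb; c = nc; d = nd;
        }
        System.out.printf("a=%.4f b=%.4f c=%.4f d=%.4f%n", a, b, c, d);
    }
}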

The main purpose of the GraphBuilder step is to build the link graph between pages and assign an initial PR value to each page. Since our dataset already encodes the link graph, the code in Code 1-8 only needs to assign an equal initial PR value to each page.

Code 1-8

package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GraphBuilder {

    public static class GraphBuilderMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Text url = new Text();
        private Text linkUrl = new Text();

        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            String[] tuple = value.toString().split("\t");
            url.set(tuple[0]);
            if (tuple.length == 2) {
                // The page has outbound links: keep them and append the initial PR of 1.0
                linkUrl.set(tuple[1] + "\t1.0");
            } else {
                // The page has no outbound links: emit only the initial PR of 1.0
                linkUrl.set("\t1.0");
            }
            context.write(url, linkUrl);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJobName("Graph Builder");
        job.setJarByClass(GraphBuilder.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(GraphBuilderMapper.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
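For the dataset in Code 1-7, and given that Hadoop's TextOutputFormat joins key and value with a tab, GraphBuilder's output records would look roughly like this (illustrative; columns are tab-separated):

a	b,c,d	1.0
b	a,d	1.0
c	d	1.0
d	b	1.0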

The main purpose of the PageRankIter step is to iterate the PageRank calculation until a termination condition is met, such as convergence or a predetermined number of iterations. Here a preset iteration count is used, and the step is run that many times.

Code 1-9

package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankIter {

    private static final double DAMPING = 0.85;

    public static class PRIterMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Text prKey = new Text();
        private Text prValue = new Text();

        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            String[] tuple = value.toString().split("\t");
            if (tuple.length <= 2) {
                return;
            }
            String[] linkPages = tuple[1].split(",");
            double pr = Double.parseDouble(tuple[2]);
            // Distribute the page's current PR value evenly among its outbound links
            for (String page : linkPages) {
                if (page.isEmpty()) {
                    continue;
                }
                prKey.set(page);
                prValue.set(tuple[0] + "\t" + pr / linkPages.length);
                context.write(prKey, prValue);
            }
            // Pass the link list through, marked with a leading "|", for the next iteration
            prKey.set(tuple[0]);
            prValue.set("|" + tuple[1]);
            context.write(prKey, prValue);
        }
    }

    public static class PRIterReducer extends Reducer<Text, Text, Text, Text> {
        private Text prValue = new Text();

        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws java.io.IOException, InterruptedException {
            String links = "";
            double pageRank = 0;
            for (Text val : values) {
                String tmp = val.toString();
                if (tmp.startsWith("|")) {
                    // This value carries the page's link list, not a PR contribution
                    links = tmp.substring(tmp.indexOf("|") + 1);
                    continue;
                }
                String[] tuple = tmp.split("\t");
                if (tuple.length > 1) {
                    pageRank += Double.parseDouble(tuple[1]);
                }
            }
            // Apply the damped PageRank formula
            pageRank = (1 - DAMPING) + DAMPING * pageRank;
            prValue.set(links + "\t" + pageRank);
            context.write(key, prValue);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJobName("PageRankIter");
        job.setJarByClass(PageRankIter.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(PRIterMapper.class);
        job.setReducerClass(PRIterReducer.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
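To trace one iteration on the record a<TAB>b,c,d<TAB>1.0, PRIterMapper would emit roughly the following key-value pairs (illustrative; 0.3333 stands for 1.0 divided by a's three out-links):

(b, "a	0.3333")
(c, "a	0.3333")
(d, "a	0.3333")
(a, "|b,c,d")

PRIterReducer then sums the contributions arriving at each page, applies PR = (1 - d) + d × sum, and re-attaches the link list (the value marked with "|") so that its output can serve directly as input to the next iteration.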

The PageRankViewer step outputs the final result of the iterative calculation, sorted by PageRank value from largest to smallest. It needs no reducer, and its output key-value pairs are <PageRank, URL>. Since Hadoop sorts map output keys in ascending order by default, a custom comparator (DescFloatComparator) is set to reverse the sort order.

Code 1-10

package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankViewer {

    public static class PageRankViewerMapper extends Mapper<LongWritable, Text, FloatWritable, Text> {
        private Text outPage = new Text();
        private FloatWritable outPr = new FloatWritable();

        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            String[] line = value.toString().split("\t");
            outPage.set(line[0]);
            outPr.set(Float.parseFloat(line[2]));
            // Emit <PageRank, URL> so the framework sorts by PR value
            context.write(outPr, outPage);
        }
    }

    // Sorts keys in descending order by negating the default float comparison
    public static class DescFloatComparator extends FloatWritable.Comparator {
        public float compare(WritableComparator a, WritableComparable<FloatWritable> b) {
            return -super.compare(a, b);
        }

        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            return -super.compare(b1, s1, l1, b2, s2, l2);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJobName("PageRankViewer");
        job.setJarByClass(PageRankViewer.class);
        job.setOutputKeyClass(FloatWritable.class);
        job.setSortComparatorClass(DescFloatComparator.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(PageRankViewerMapper.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}
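Finally, the PageRankDriver class in Code 1-11 chains the three steps together: it runs GraphBuilder once, runs PageRankIter for the requested number of iterations, feeds the last iteration's output to PageRankViewer, and keeps all intermediate data under a randomly named (UUID) directory that is removed at the end.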

Code 1-11

package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PageRankDriver {

    public static void main(String[] args) throws Exception {
        if (args == null || args.length != 3) {
            throw new RuntimeException("Enter input path, output path, and iteration count");
        }
        int times = Integer.parseInt(args[2]);
        if (times <= 0) {
            throw new RuntimeException("The number of iterations must be greater than 0");
        }
        // Keep intermediate data under a random UUID directory for this run
        String uuid = "/" + java.util.UUID.randomUUID().toString();
        String[] forGB = {args[0], uuid + "/data0"};
        GraphBuilder.main(forGB);
        // Each iteration reads /dataN and writes /dataN+1
        String[] forItr = {"", ""};
        for (int i = 0; i < times; i++) {
            forItr[0] = uuid + "/data" + i;
            forItr[1] = uuid + "/data" + (i + 1);
            PageRankIter.main(forItr);
        }
        String[] forRV = {uuid + "/data" + times, args[1]};
        PageRankViewer.main(forRV);
        // Clean up the intermediate data
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(uuid);
        fs.deleteOnExit(path);
    }
}

Running Code 1-11 produces the result shown in Code 1-12.

Code 1-12

[email protected]:/data# hadoop jar pagerank.jar com.hadoop.mapreduce.PageRankDriver /data /output 10
......
[email protected]:/data# hadoop fs -ls -R /output
-rw-r--r--   1 root supergroup          0 2017-02-10 17:55 /output/_SUCCESS
-rw-r--r--   1 root supergroup            2017-02-10 17:55 /output/part-r-00000
[email protected]:/data# hadoop fs -cat /output/part-r-00000
1.5149547	b
1.3249696	d
0.78404236	a
0.37603337	c
