The links between pages on the Internet can be viewed as a graph. The importance of a page is determined by the links that other pages point at it, which act as votes: a page with more inbound links ranks higher, while a page with few or no inbound links ranks lower. The higher a page's PR (PageRank) value, the more important the page is.
Suppose pages A, B, C, and D form a collection, and pages B, C, and D all link to A. Then A's PR value is the sum of the PR values of B, C, and D:
PR(A) = PR(B) + PR(C) + PR(D)
Continuing with the above assumption, suppose B also links to C and D in addition to A, C also links to B in addition to A, and D links only to A. Then, when calculating A's PR value, B can cast only 1/3 of its PR value as a vote, C can cast only 1/2 of its PR value, and D, which links only to A, casts its full PR value. So A's PR value should be:
PR(A) = PR(B)/3 + PR(C)/2 + PR(D)
From this, the general formula for the PR value of a page u can be written as:
PR(u) = Σ PR(v)/L(v), summed over all pages v in B_u
where B_u is the set of all pages that link to page u, v is a page belonging to the set B_u, and L(v) is the number of outbound links of page v (that is, its out-degree).
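As a quick illustration of this formula, the following small Java snippet (not one of the numbered code listings in this section) computes PR(A) for the example above, assuming for illustration that B, C, and D each currently hold a PR value of 0.25 and have 3, 2, and 1 outbound links respectively:
public class BasicPageRankExample {
	public static void main(String[] args) {
		// Illustrative values only: B, C and D each currently hold PR = 0.25,
		// with out-degrees L(B) = 3, L(C) = 2 and L(D) = 1 as in the example above.
		double prB = 0.25, prC = 0.25, prD = 0.25;
		int lB = 3, lC = 2, lD = 1;
		// Each page that links to A contributes PR(v) / L(v) to PR(A):
		double prA = prB / lB + prC / lC + prD / lD;
		System.out.println("PR(A) = " + prA); // 0.0833 + 0.125 + 0.25 = 0.4583...
	}
}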
Figure 1-1
Table 1-2 The PR values calculated from the graph in Figure 1-1
|                  | PR(A)  | PR(B)  | PR(C)  | PR(D)  |
| Initial value    | 0.25   | 0.25   | 0.25   | 0.25   |
| One iteration    | 0.125  | 0.333  | 0.083  | 0.458  |
| Two iterations   | 0.1665 | 0.4997 | 0.0417 | 0.2912 |
| ......           | ...... | ...... | ...... | ...... |
| N iterations     | 0.1999 | 0.3999 | 0.0666 | 0.3333 |
As Table 1-2 shows, after several iterations the PR values gradually converge and stabilize.
However, in a real network environment hyperlinks are not so ideal, and the PageRank model described above runs into the following two problems:
1. Ranking leak
As shown in Figure 1-3, if a page has no outbound links, as with node A, a ranking leak occurs: after many iterations, the PR values of all pages tend to 0.
Figure 1-3
|                  | PR(A) | PR(B) | PR(C) | PR(D) |
| Initial value    | 0.25  | 0.25  | 0.25  | 0.25  |
| One iteration    | 0.125 | 0.125 | 0.25  | 0.25  |
| Two iterations   | 0.125 | 0.125 | 0.125 | 0.25  |
| Three iterations | 0.125 | 0.125 | 0.125 | 0.125 |
| ......           | ......| ......| ......| ......|
| N iterations     | 0     | 0     | 0     | 0     |
Table 1-4 shows the results after several iterations for the graph in Figure 1-3.
2. Ranking sink
As shown in Figure 1-5, if a page has no inbound links, as with node A, then after many iterations the PR value of A tends to 0.
Figure 1-5
|                  | PR(A) | PR(B) | PR(C) | PR(D) |
| Initial value    | 0.25  | 0.25  | 0.25  | 0.25  |
| One iteration    | 0     | 0.375 | 0.25  | 0.375 |
| Two iterations   | 0     | 0.375 | 0.375 | 0.25  |
| Three iterations | 0     | 0.25  | 0.375 | 0.375 |
| ......           | ......| ......| ......| ......|
| N iterations     | 0     | ......| ......| ......|
Table 1-5 shows the results after several iterations for the graph in Figure 1-5.
We assume that a user starts browsing from a random page and keeps clicking the links on the current page, until either reaching a page with no outbound links or getting bored and randomly jumping to another page to start a new round of browsing. This model is clearly closer to real user behavior. To handle pages that have no outbound links, a damping factor d is introduced; it represents the probability that the user keeps following links after reaching a page and is usually set to 0.85, while 1 - d is the probability that the user stops clicking and randomly jumps to another page to start a new round of browsing. The PageRank formula under this random browsing model is therefore:
PR(u) = (1 - d) + d * Σ PR(v)/L(v)
where, as before, the sum runs over all pages v in B_u.
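The following is a minimal in-memory sketch of this damped iteration (it is not one of the numbered code listings; the class name and the iteration count of 10 are chosen only for illustration). It applies the formula to the same four-page link graph described by the dataset in code 1-7 below, and its values should approach the final ranking shown later in code 1-12:
import java.util.*;

public class SimplePageRank {
	public static void main(String[] args) {
		double d = 0.85; // damping factor
		// adjacency list of the four-page example graph: page -> its outbound links
		Map<String, String[]> links = new LinkedHashMap<>();
		links.put("A", new String[] { "B", "C", "D" });
		links.put("B", new String[] { "A", "D" });
		links.put("C", new String[] { "D" });
		links.put("D", new String[] { "B" });

		// start every page with PR = 1.0, as the GraphBuilder step does
		Map<String, Double> pr = new HashMap<>();
		for (String page : links.keySet()) {
			pr.put(page, 1.0);
		}

		for (int i = 0; i < 10; i++) { // fixed number of iterations
			Map<String, Double> next = new HashMap<>();
			for (String page : links.keySet()) {
				next.put(page, 0.0);
			}
			// each page distributes its PR value evenly among its outbound links
			for (Map.Entry<String, String[]> e : links.entrySet()) {
				double share = pr.get(e.getKey()) / e.getValue().length;
				for (String target : e.getValue()) {
					next.put(target, next.get(target) + share);
				}
			}
			// apply the damping factor: PR(u) = (1 - d) + d * sum of contributions
			for (String page : next.keySet()) {
				next.put(page, (1 - d) + d * next.get(page));
			}
			pr = next;
		}
		System.out.println(pr); // B and D should end up with the highest values
	}
}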
Figure 1-6
Code 1-7 sets up the dataset based on Figure 1-6
[email protected]:/data# cat links
A	B,C,D
B	A,D
C	D
D	B
The main purpose of the GraphBuilder step is to build the link graph between the pages and assign an initial PR value to each page. Since our dataset already describes the link graph (each line contains a page followed by its comma-separated outbound links), code 1-8 only needs to assign the same initial PR value of 1.0 to every page.
Code 1-8
package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GraphBuilder {

	public static class GraphBuilderMapper extends Mapper<LongWritable, Text, Text, Text> {
		private Text url = new Text();
		private Text linkUrl = new Text();

		protected void map(LongWritable key, Text value, Context context)
				throws java.io.IOException, InterruptedException {
			String[] tuple = value.toString().split("\t");
			url.set(tuple[0]);
			if (tuple.length == 2) { // the page has outbound links
				linkUrl.set(tuple[1] + "\t1.0");
			} else { // the page has no outbound links
				linkUrl.set("\t1.0");
			}
			context.write(url, linkUrl);
		}
	}

	protected static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		job.setJobName("GraphBuilder");
		job.setJarByClass(GraphBuilder.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		job.setMapperClass(GraphBuilderMapper.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		job.waitForCompletion(true);
	}
}
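Assuming Hadoop's default TextOutputFormat, which joins key and value with a tab character, the records this step produces for the dataset in code 1-7 should look roughly like the following (illustrative, not captured from an actual run):
A	B,C,D	1.0
B	A,D	1.0
C	D	1.0
D	B	1.0
Each record carries a page, its outbound links, and its current PR value, which is exactly the tab-separated format the PageRankIter step in code 1-9 parses.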
The main purpose of the PageRankIter step is to iteratively compute the PageRank values until a termination condition is met, such as convergence or a predetermined number of iterations. Here a preset number of iterations is used, and the step is simply run that many times.
Code 1-9
package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankIter {
	private static final double DAMPING = 0.85;

	public static class PRIterMapper extends Mapper<LongWritable, Text, Text, Text> {
		private Text prKey = new Text();
		private Text prValue = new Text();

		protected void map(LongWritable key, Text value, Context context)
				throws java.io.IOException, InterruptedException {
			String[] tuple = value.toString().split("\t");
			if (tuple.length <= 2) {
				return;
			}
			String[] linkPages = tuple[1].split(",");
			double pr = Double.parseDouble(tuple[2]);
			// distribute the current PR value evenly among the outbound links
			for (String page : linkPages) {
				if (page.isEmpty()) {
					continue;
				}
				prKey.set(page);
				prValue.set(tuple[0] + "\t" + pr / linkPages.length);
				context.write(prKey, prValue);
			}
			// re-emit the link list, marked with "|", so the reducer can rebuild the graph
			prKey.set(tuple[0]);
			prValue.set("|" + tuple[1]);
			context.write(prKey, prValue);
		}
	}

	public static class PRIterReducer extends Reducer<Text, Text, Text, Text> {
		private Text prValue = new Text();

		protected void reduce(Text key, Iterable<Text> values, Context context)
				throws java.io.IOException, InterruptedException {
			String links = "";
			double pageRank = 0;
			for (Text val : values) {
				String tmp = val.toString();
				if (tmp.startsWith("|")) { // the link list of this page
					links = tmp.substring(tmp.indexOf("|") + 1);
					continue;
				}
				String[] tuple = tmp.split("\t");
				if (tuple.length > 1) { // a PR contribution from an inbound link
					pageRank += Double.parseDouble(tuple[1]);
				}
			}
			// apply the damping factor: PR(u) = (1 - d) + d * sum of contributions
			pageRank = (1 - DAMPING) + DAMPING * pageRank;
			prValue.set(links + "\t" + pageRank);
			context.write(key, prValue);
		}
	}

	protected static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		job.setJobName("PageRankIter");
		job.setJarByClass(PageRankIter.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		job.setMapperClass(PRIterMapper.class);
		job.setReducerClass(PRIterReducer.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		job.waitForCompletion(true);
	}
}
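To make the data flow concrete, here is a hand-traced example (worked out by hand, not actual job output) of what one iteration does with page B's record from the dataset above, assuming every page still has PR = 1.0:
Mapper input:    B	A,D	1.0
Mapper output:   (A, "B	0.5")  (D, "B	0.5")  (B, "|A,D")
Reducer for B:   the "|A,D" record restores B's link list, while the contributions
                 from A (1.0/3) and D (1.0/1) are summed, giving
                 PR(B) = (1 - 0.85) + 0.85 * (1/3 + 1) ≈ 1.283
Reducer output:  B	A,D	1.283...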
The PageRankViewer step outputs the final result of the iterative calculation, sorted by PageRank value from largest to smallest. It does not need a reduce phase, and its output key-value pairs take the form <PageRank, URL>.
Code 1-10
package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRankViewer {

	public static class PageRankViewerMapper extends Mapper<LongWritable, Text, FloatWritable, Text> {
		private Text outPage = new Text();
		private FloatWritable outPr = new FloatWritable();

		protected void map(LongWritable key, Text value, Context context)
				throws java.io.IOException, InterruptedException {
			String[] line = value.toString().split("\t");
			outPage.set(line[0]);
			outPr.set(Float.parseFloat(line[2]));
			// emit <PageRank, URL> so the framework sorts the records by PageRank
			context.write(outPr, outPage);
		}
	}

	// sort the FloatWritable keys in descending instead of ascending order
	public static class DescFloatComparator extends FloatWritable.Comparator {
		public int compare(WritableComparable a, WritableComparable b) {
			return -super.compare(a, b);
		}

		public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
			return -super.compare(b1, s1, l1, b2, s2, l2);
		}
	}

	protected static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);
		job.setJobName("PageRankViewer");
		job.setJarByClass(PageRankViewer.class);
		job.setOutputKeyClass(FloatWritable.class);
		job.setSortComparatorClass(DescFloatComparator.class);
		job.setOutputValueClass(Text.class);
		job.setMapperClass(PageRankViewerMapper.class);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		job.waitForCompletion(true);
	}
}
The PageRankDriver step chains the three jobs together: it first runs GraphBuilder, then runs PageRankIter the specified number of times, feeding each iteration's output into the next through a temporary directory, and finally runs PageRankViewer to produce the ranked result.
Code 1-11
package com.hadoop.mapreduce;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PageRankDriver {

	public static void main(String[] args) throws Exception {
		if (args == null || args.length != 3) {
			throw new RuntimeException("Enter input path, output path, and iteration count");
		}
		int times = Integer.parseInt(args[2]);
		if (times <= 0) {
			throw new RuntimeException("The number of iterations must be greater than 0");
		}
		// use a random temporary directory for the intermediate data of each iteration
		String uuid = "/" + java.util.UUID.randomUUID().toString();

		// step 1: build the link graph and assign the initial PR values
		String[] forGB = { args[0], uuid + "/data0" };
		GraphBuilder.main(forGB);

		// step 2: run the PageRank iteration the requested number of times
		String[] forItr = { "", "" };
		for (int i = 0; i < times; i++) {
			forItr[0] = uuid + "/data" + i;
			forItr[1] = uuid + "/data" + (i + 1);
			PageRankIter.main(forItr);
		}

		// step 3: output the final ranking in descending order of PageRank
		String[] forRV = { uuid + "/data" + times, args[1] };
		PageRankViewer.main(forRV);

		// mark the temporary directory for deletion when the program exits
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf);
		Path path = new Path(uuid);
		fs.deleteOnExit(path);
	}
}
Running code 1-11 produces the result shown in code 1-12.
Code 1-12
[email protected]:/data# hadoop jar PageRank.jar com.hadoop.mapreduce.PageRankDriver /data /output 10
......
[email protected]:/data# hadoop fs -ls -R /output
-rw-r--r--   1 root supergroup      0 2017-02-10 17:55 /output/_SUCCESS
-rw-r--r--   1 root supergroup        2017-02-10 17:55 /output/part-r-00000
[email protected]:/data# hadoop fs -cat /output/part-r-00000
1.5149547 B
1.3249696 D
0.78404236 A
0.37603337 C