PageRank algorithm

The PageRank algorithm was the "Heaven Sword" with which Google once dominated web search. It was invented by Larry Page and Sergey Brin at Stanford University; the paper is "The PageRank Citation Ranking: Bringing Order to the Web".

This article first uses some reference material to introduce the problem, then gives several implementations of PageRank, and finally extends the PageRank algorithm to the MapReduce framework.

PageRank's core ideas are 2 points:

1. If a web page is linked to by many other web pages, the page is more important, that is, its PageRank value will be relatively high;

2. If a page with a high PageRank value links to another page, the PageRank value of the page being linked to rises accordingly.

Here is a picture from Wikipedia. Each ball represents a web page, and the size of the ball reflects the PageRank value of that page. Many pages link to page B and page E, so B and E have high PageRank values. Although very few pages point to C, the most important page, B, points to it, so C's PageRank value is larger than E's.

Reference content:

1. Wikipedia article on PageRank

2. Google Secrets: PageRank thoroughly explained (Chinese version)

3. Numerical Analysis and Algorithms, page 161, application example: Google's PageRank algorithm

4. Numeric Methods with MATLAB, or its Chinese translation, MATLAB Numerical Computation

5. Computing web page PageRank with the MapReduce idea; PageRank and Markov chains

1. Problem background

From reference 3

2. Mathematical modeling

From reference 3: understand the web link matrix $g$, the Markov process ("web surfing"), the transition matrix $a$, and the probability $p$ that a user follows a link on the current page (usually 0.85).

Finally, we get the equation $ax=x$; in other words, $x$ is an eigenvector of the matrix $a$ for the eigenvalue 1.
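To make the equation $ax=x$ concrete, here is a minimal sketch in plain Python. It assumes the usual definition from reference 4: column $j$ of $a$ holds $p/c_j+\delta$ for every page that page $j$ links to and $\delta=(1-p)/n$ elsewhere, where $c_j$ is the number of out-links of page $j$, and a page with no out-links gets $1/n$ everywhere. The function names and the 0-based page numbering are made up for illustration; the edge list is read off the six-page MATLAB example in section 3.

```python
def transition_matrix(links, n, p=0.85):
    """Build the n-by-n matrix a from (source, target) links, pages 0..n-1.

    Assumed definition: a[i][j] = p/c_j + delta if page j links to page i,
    delta otherwise, and 1/n in every row of a column with no out-links.
    """
    out_degree = [0] * n
    for source, _ in links:
        out_degree[source] += 1
    delta = (1.0 - p) / n
    a = [[0.0] * n for _ in range(n)]
    for j in range(n):
        base = delta if out_degree[j] else 1.0 / n
        for i in range(n):
            a[i][j] = base
    for source, target in links:
        a[target][source] += p / out_degree[source]
    return a


def mat_vec(a, x):
    """Return the matrix-vector product a*x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


# Six-page example; links are (source, target) pairs, numbered from 0.
links = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (3, 0), (4, 5), (5, 0)]
a = transition_matrix(links, 6)
column_sums = [sum(a[i][j] for i in range(6)) for j in range(6)]

# PageRank values quoted in section 3 of this article for the same graph.
x = [0.2675, 0.2524, 0.1323, 0.1698, 0.0625, 0.1156]
residual = max(abs(u - v) for u, v in zip(mat_vec(a, x), x))
print(column_sums)  # every column should sum to 1 (up to rounding)
print(residual)     # small, because x is only quoted to four decimals
```

Each column summing to 1 is what makes $a$ the transition matrix of a Markov chain, and the small residual shows that the quoted PageRank vector is (approximately) a fixed point of $a$.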

Next, the Gershgorin disk theorem is used to show that 1 is the dominant eigenvalue of the matrix $a$, so the problem can be solved with the power method.
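The disk-theorem argument can be spelled out as follows. This is only a sketch, and it assumes that every column of $a$ sums to 1 and that all entries $a_{ij}$ are positive, which holds for the usual definition when $p<1$; the symbols $e$ and $\lambda$ are introduced here for illustration. First, with $e=(1,\dots,1)^{T}$, the column sums give

$$a^{T}e=e,$$

so 1 is an eigenvalue of $a^{T}$ and therefore of $a$. Second, applying the disk theorem to $a^{T}$ (which has the same eigenvalues as $a$), every eigenvalue $\lambda$ lies in some disk

$$|\lambda-a_{jj}|\le\sum_{i\ne j}a_{ij}=1-a_{jj},$$

and hence

$$|\lambda|\le|\lambda-a_{jj}|+a_{jj}\le 1.$$

Because $a_{jj}>0$, each disk touches the unit circle only at the point 1, so $|\lambda|=1$ forces $\lambda=1$. Every other eigenvalue is strictly smaller in modulus, which is the property the power method relies on.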

For a detailed introduction to the power method, see another article: Numerical Methods Using Matlab, Chapter 3: Matrix eigenvalues and singular values.

3. Solving PageRank

Suppose the page link model is the one shown on the right side of the figure above.

(1) Power method

The wiki describes a simple PageRank algorithm that ignores the transition probability and just iterates. On each pass every page's PageRank value is updated: each page divides its PageRank value evenly among all the pages it points to, and each page adds up what it receives from all the pages pointing to it as its new PageRank value. This repeats until the PageRank values of all pages converge or some threshold condition is met.
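A single pass of this simple update can be sketched in a few lines of Python. The sketch uses exact fractions so the arithmetic is easy to follow by hand; the function name `simple_step` is made up for illustration, and the six-page link structure is an assumption read off the MATLAB example later in this section.

```python
from fractions import Fraction


def simple_step(ranks, out_links):
    """One update of the simple algorithm: every page splits its current
    value evenly among the pages it links to, and each page sums what it
    receives. Assumes every page has at least one out-link."""
    new_ranks = {page: Fraction(0) for page in ranks}
    for page, targets in out_links.items():
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# Six-page graph: page -> pages it links to (assumed from the MATLAB example).
out_links = {1: [2], 2: [3, 4], 3: [4, 5, 6], 4: [1], 5: [6], 6: [1]}
ranks = {page: Fraction(1, 6) for page in out_links}
ranks = simple_step(ranks, out_links)
for page in sorted(ranks):
    print(page, ranks[page])
```

Starting from 1/6 everywhere, page 1 receives the full value of pages 4 and 6 and ends up with 1/3, while page 5 receives only a third of page 3's value and drops to 1/18. The values still sum to 1, since each page only redistributes what it had.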

The MapReduce implementation of PageRank later in this article follows this idea. Taking the transition probability into account is similar: multiply the accumulated value by the transition probability and then add the random jump probability.

Following this idea, the MATLAB code below computes the PageRank value of each page.

    n = 6;
    i = [2 3 4 4 5 6 1 6 1];
    j = [1 2 2 3 3 3 4 5 6];
    G = sparse(i,j,1,n,n);

    % power method
    for j = 1:n
       L{j} = find(G(:,j));     % pages that page j links to
       c(j) = length(L{j});     % out-degree of page j
    end

    p = .85;
    delta = (1-p)/n;

    x = ones(n,1)/n;
    z = zeros(n,1);
    cnt = 0;
    while max(abs(x-z)) > .0001
       z = x;
       x = zeros(n,1);
       for j = 1:n
          if c(j) == 0
             x = x + z(j)/n;                 % no out-links: jump to any page
          else
             x(L{j}) = x(L{j}) + z(j)/c(j);  % split the previous value among all linked pages
          end
       end
       x = p*x + delta;
       cnt = cnt+1;
    end

The resulting vector $x$ holds the PageRank value of each page. Although they have the same number of links, page ① ranks higher than page ④ and page ⑤. Page ② has the second highest PageRank value because page ① links to it, so it basks in the reflected glory of page ①.

x = 0.2675 0.2524 0.1323 0.1698 0.0625 0.1156

Another article gives a Python implementation of the algorithm. Its author uses the third-party module python-graph, which implements many graph algorithms. The module has to be installed before use, with the following commands:

    easy_install python-graph-core
    easy_install python-graph-dot

The python version of the algorithm implementation:

    # coding=utf-8

    # python-graph https://code.google.com/p/python-graph/

    # Import graphviz
    import graphviz as gv

    # Import pygraph
    from pygraph.classes.digraph import digraph
    from pygraph.readwrite.dot import write


    # Define pagerank function
    def pagerank(graph, damping_factor=0.85, max_iterations=100,
                 min_delta=0.00001):
        """
        Compute and return the PageRank in a directed graph.

        @type  graph: digraph
        @param graph: Digraph.

        @type  damping_factor: number
        @param damping_factor: PageRank damping factor.

        @type  max_iterations: number
        @param max_iterations: Maximum number of iterations.

        @type  min_delta: number
        @param min_delta: Smallest variation required for a new iteration.

        @rtype:  dict
        @return: Dict containing all the nodes' PageRank.
        """
        nodes = graph.nodes()
        graph_size = len(nodes)
        if graph_size == 0:
            return {}
        # value for nodes without inbound links
        min_value = (1.0 - damping_factor) / graph_size

        # initialize the page rank dict with 1/N for all nodes
        # pagerank = dict.fromkeys(nodes, 1.0 / graph_size)
        pagerank = dict.fromkeys(nodes, 1.0)

        for i in range(max_iterations):
            diff = 0  # total difference compared to last iteration
            # compute each node's PageRank based on inbound links
            for node in nodes:
                rank = min_value
                for referring_page in graph.incidents(node):
                    rank += damping_factor * pagerank[referring_page] / len(graph.neighbors(referring_page))

                diff += abs(pagerank[node] - rank)
                pagerank[node] = rank

            print('This is NO.%s iteration' % (i + 1))
            print(pagerank)
            print('')

            # stop if PageRank has converged
            if diff < min_delta:
                break

        return pagerank


    # Graph creation
    gr = digraph()

    # Add nodes and edges
    gr.add_nodes(["1", "2", "3", "4"])

    gr.add_edge(("1", "2"))
    gr.add_edge(("1", "3"))
    gr.add_edge(("1", "4"))
    gr.add_edge(("2", "3"))
    gr.add_edge(("2", "4"))
    gr.add_edge(("3", "4"))
    gr.add_edge(("4", "2"))

    # Draw as PNG
    # dot = write(gr)
    # gvv = gv.readstring(dot)
    # gv.layout(gvv, 'dot')
    # gv.render(gvv, 'png', 'model.png')

    pagerank(gr)

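After 32 iterations the result is the same as the earlier MATLAB result (the output below has six nodes, so it comes from the six-page graph of the MATLAB example rather than the four-node graph built in the listing above):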

    This is NO.32 iteration
    {'1': 0.2675338708706491, '3': 0.13227261904986046, '2': 0.2524037902400518, '5': 0.062477242064127136, '4': 0.1697488529161491, '6': 0.1155828978186352}

(2) Using the special structure of the Markov matrix

From reference content 4, where $\delta=\frac{1-p}{n}$

That is, by decomposing the matrix $a$ we never need to form $a$ explicitly; we only need to solve a system of linear equations.

    function x = pagerank1(G)
    % PAGERANK1  Google's PageRank, version 1 - hujiawei

    % if nargin < 3, p = .85; end
    p = 0.85;

    % Eliminate any self-referential links
    G = G - diag(diag(G));

    % c = out-degree, r = in-degree
    [n,n] = size(G);
    c = sum(G,1);   % column sums
    r = sum(G,2);   % row sums

    % Scale column sums to be 1 (or 0 where there are no out links).
    k = find(c~=0);
    D = sparse(k,k,1./c(k),n,n);

    % Solve (I - p*G*D)*x = e
    e = ones(n,1);
    I = speye(n,n);
    x = (I - p*G*D)\e;

    % Normalize so that sum(x) == 1.
    x = x/sum(x);
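The same linear-system idea can be sketched in plain Python without any matrix library. The reasoning assumed here is the standard one: if $a=pGD+ez^{T}$, then $x=ax$ becomes $(I-pGD)x=e(z^{T}x)$, and since $z^{T}x$ is just a scalar we can solve $(I-pGD)x=e$ and rescale afterwards. The helper names, the 0-based numbering, and the hand-written Gaussian elimination are illustrative choices; the edge list is the six-page example from the power-method code, and self-links are assumed to be absent.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    rows = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(col + 1, n):
            factor = rows[r][col] / rows[col][col]
            for k in range(col, n + 1):
                rows[r][k] -= factor * rows[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(rows[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (rows[i][n] - tail) / rows[i][i]
    return x


def pagerank_linear(links, n, p=0.85):
    """PageRank via (I - p*G*D) x = e, then rescaling so that sum(x) == 1.

    links holds (source, target) pairs with pages numbered 0..n-1.
    """
    out_degree = [0] * n
    for source, _ in links:
        out_degree[source] += 1
    # Start from the identity and subtract p*G*D entry by entry.
    system = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for source, target in links:
        system[target][source] -= p / out_degree[source]
    x = solve(system, [1.0] * n)
    total = sum(x)
    return [value / total for value in x]


links = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (2, 5), (3, 0), (4, 5), (5, 0)]
x = pagerank_linear(links, 6)
print([round(value, 4) for value in x])
```

For this graph the result should agree with the power-method values above to about three decimal places. A dense solve like this is only reasonable for tiny graphs; the MATLAB version relies on sparse matrices for the same reason.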

(3) A clever solution: inverse iteration

This method cleverly exploits round-off error in MATLAB: rounding turns the singular matrix $i-a$ into a nonsingular one. MATLAB only prints some warnings at run time, and the result is the same as that of the other algorithms.

    function x = pagerank2(G)
    % PAGERANK2  Google's PageRank, version 2 - hujiawei
    % using the inverse iteration method

    % if nargin < 3, p = .85; end
    p = 0.85;

    % Eliminate any self-referential links
    G = G - diag(diag(G));

    % c = out-degree, r = in-degree
    [n,n] = size(G);
    c = sum(G,1);   % column sums
    r = sum(G,2);   % row sums

    % Scale column sums to be 1 (or 0 where there are no out links).
    k = find(c~=0);
    D = sparse(k,k,1./c(k),n,n);

    % Solve (I - p*G*D)*x = e
    e = ones(n,1);
    I = speye(n,n);
    % x = (I - p*G*D)\e;
    delta = (1-p)/n;
    A = p*G*D + delta;
    x = (I - A)\e;

    % Normalize so that sum(x) == 1.
    x = x/sum(x);

Finally, here is a nice piece of code from reference 4 that simulates random surfing to generate the matrix $g$.

    function [U,G] = surfer(root,n)
    % SURFER  Create the adjacency graph of a portion of the Web.
    %    [U,G] = surfer(root,n) starts at the URL root and follows
    %    Web links until it forms an adjacency graph with n nodes.
    %    U = a cell array of n strings, the URLs of the nodes.
    %    G = an n-by-n sparse matrix with G(i,j)=1 if node j is linked to node i.
    %
    %    Example:  [U,G] = surfer('http://www.harvard.edu',500);
    %    See also PAGERANK.
    %
    %    This function currently has two defects.  (1) The algorithm for
    %    finding links is naive.  We just look for the string 'http:'.
    %    (2) An attempt to read from a URL that is accessible, but very slow,
    %    might take an unacceptably long time to complete.  In some cases,
    %    it may be necessary to have the operating system terminate MATLAB.
    %    Key words from such URLs can be added to the skip list in surfer.m.

    % Initialize

    clf
    shg
    set(gcf,'doublebuffer','on')
    axis([0 n 0 n])
    axis square
    axis ij
    box on
    set(gca,'position',[.01])
    uicontrol('style','frame','units','normal','position',[.]);
    uicontrol('style','frame','units','normal','position',[.]);
    t1 = uicontrol('style','text','units','normal','position',[.02 .10 .94 .04], ...
       'horiz','left');
    t2 = uicontrol('style','text','units','normal','position',[.02 .02 .94 .04], ...
       'horiz','left');
    slow = uicontrol('style','toggle','units','normal', ...
       'position',[.],'string','slow','value',0);
    quit = uicontrol('style','toggle','units','normal', ...
       'position',[.],'string','quit','value',0);

    U = cell(n,1);
    hash = zeros(n,1);
    G = logical(sparse(n,n));
    m = 1;
    U{m} = root;
    hash(m) = hashfun(root);

    j = 1;
    while j < n & get(quit,'value') == 0

       % Try to open a page.

       try
          set(t1,'string',sprintf('%5d %s',j,U{j}))
          set(t2,'string','');
          drawnow
          page = urlread(U{j});
       catch
          set(t1,'string',sprintf('fail: %5d %s',j,U{j}))
          drawnow
          continue
       end
       if get(slow,'value')
          pause(.)
       end

       % Follow the links from the open page.

       for f = findstr('http:',page);

          % A link starts with 'http:' and ends with the next quote.

          e = min([findstr('"',page(f:end)) findstr('''',page(f:end))]);
          if isempty(e), continue, end
          url = deblank(page(f:f+e-2));
          url(url<' ') = '!';   % Nonprintable characters
          if url(end) == '/', url(end) = []; end

          % Look for links that should be skipped.

          skips = {'.gif','.jpg','.pdf','.css','lmscadsi','cybernet', ...
                   'search.cgi','.ram','www.w3.org', ...
                   'scripts','netscape','shockwave','webex','fansonly'};
          skip = any(url=='!') | any(url=='?');
          k = 0;
          while ~skip & (k < length(skips))
             k = k+1;
             skip = ~isempty(findstr(url,skips{k}));
          end
          if skip
             if isempty(findstr(url,'.gif')) & isempty(findstr(url,'.jpg'))
                set(t2,'string',sprintf('skip: %s',url))
                drawnow
                if get(slow,'value')
                   pause(.)
                end
             end
             continue
          end

          % Check if page is already in url list.

          i = 0;
          for k = find(hash(1:m) == hashfun(url))';
             if isequal(U{k},url)
                i = k;
                break
             end
          end

          % Add a new url to the graph if there are fewer than n.

          if (i == 0) & (m < n)
             m = m+1;
             U{m} = url;
             hash(m) = hashfun(url);
             i = m;
          end

          % Add a new link.

          if i > 0
             G(i,j) = 1;
             set(t2,'string',sprintf('%5d %s',i,url))
             line(j,i,'marker','.','markersize',6)
             drawnow
             if get(slow,'value')
                pause(.)
             end
          end
       end

       j = j+1;
    end
    delete(t1)
    delete(t2)
    delete(slow)
    set(quit,'string','close','callback','close(gcf)','value',0)

    %------------------------

    function h = hashfun(url)
    % Almost unique numeric hash code for pages already visited.
    h = length(url) + 1024*sum(url);

4. Implementation of the PageRank algorithm under the MapReduce framework

Implementing PageRank under the MapReduce framework with the iterative idea (or power method) from the wiki described earlier is simple; you may want to read reference 5 first.

The article using-mapreduce-to-compute-pagerank is more detailed and is worth consulting.

What follows is an assignment from my big data course, which asks for the simple algorithm from the wiki to be implemented as PageRank under the MapReduce framework. The data set describes relationships between Twitter users, which can be treated as links between web pages. The TA did not require us to write code and run it on the data set (it is more than 1 GB), so the Python version below is an idealized one that has not been validated on a real large data set. I am also not yet familiar with writing MapReduce functions in the Python frameworks, so I implemented a concise, testable PageRank algorithm instead.

1. Input and output format

The input of the map function is <node, list of edges leaving that node>, where a node is an object that holds its current PageRank value. The output is <node, PageRank value of a referring node / total number of edges leaving that referring node>.

The input of the reduce function is <node, PageRank value of a referring node / total number of edges leaving that referring node>, and the output is <node, list of edges leaving that node>, where the node now holds its updated PageRank value.
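The format above can be illustrated with a small, framework-free Python sketch. Everything here is an assumption made for illustration: records are plain dicts of the form `{node: (rank, out_links)}` rather than node objects, the mapper also emits each node's edge list so that the reducer can hand it back, and an explicit `shuffle` function stands in for the grouping that a MapReduce framework performs between the two phases. The four-user graph is the one used in the example later in this section.

```python
from collections import defaultdict
from fractions import Fraction


def map_phase(records):
    """records: {node: (rank, out_links)}. Emits (key, value) pairs:
    a rank share for every linked node, plus each node's own edge list."""
    for node, (rank, out_links) in records.items():
        yield node, ("links", out_links)
        for target in out_links:
            yield target, ("share", rank / len(out_links))


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the shares of each node and reattach its edge list."""
    records = {}
    for node, values in grouped.items():
        rank = sum((v for kind, v in values if kind == "share"), Fraction(0))
        out_links = [v for kind, v in values if kind == "links"][0]
        records[node] = (rank, out_links)
    return records


# Four users with initial rank 1/4, split between two mappers.
quarter = Fraction(1, 4)
mapper_a = {1: (quarter, [2, 3, 4]), 2: (quarter, [3, 4])}
mapper_b = {3: (quarter, [1, 4]), 4: (quarter, [2])}
pairs = list(map_phase(mapper_a)) + list(map_phase(mapper_b))
records = reduce_phase(shuffle(pairs))
for node in sorted(records):
    print(node, records[node])
```

After this one round the ranks are 1/8, 1/3, 5/24 and 1/3 for users 1 to 4. Because the reducer's output has the same shape as the mapper's input, the result can be fed straight into the next iteration.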

Pseudocode: [on a whim I wrote it in English]

    process the data to the form of {node i:[its adjacent node list],...}
    while the sum of difference between the last two pagerank values > threshold

        map({node i:[its adjacent node list],...}):
            map_output={}
            for every node j in adjacent node list:
                put or sum up {j:(i, PageRank(i)/length(adjacent node list))} into map_output
            return map_output

        reduce(map_output):
            reduce_output={}
            for every entry {j:(i, PageRank(i)/length(adjacent node list))} in map_output:
                put or sum up all pagerank values for node j with its adjacent node list into reduce_output
            return reduce_output

2. Example demonstration

Suppose users 1, 2, 3 and 4 are related as shown in the following figure:

Suppose there are 2 mappers (A and B) and 1 reducer (C), and the initial PageRank value of each of the 4 nodes is 0.25.

The data about users 1 and 2 is read and processed by mapper A, and the data about users 3 and 4 is read and processed by mapper B [I verified that even if one user's data is read by different mappers, the final converged result is almost the same].

The input and output of the map are as follows:

The input and output of reduce are as follows. The input is the output of the 2 mappers, and the output carries the updated PageRank values of the nodes.

After the reducer finishes, its result is fed back into the mappers. This repeats until the number of iterations exceeds a set limit, or until the sum of the absolute differences between the PageRank values of all nodes from two consecutive iterations (the 2-norm could be used instead) falls below a set threshold.
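The stopping test can be sketched as a small helper. The function name `has_converged`, the dict-of-ranks representation, and the `norm` parameter are assumptions made for illustration; the default threshold of 0.0001 is the value used in the experiments below.

```python
import math


def has_converged(previous, current, threshold=0.0001, norm="l1"):
    """Compare two rank dicts keyed by node; True once the change is small."""
    diffs = [current[node] - previous[node] for node in current]
    if norm == "l1":
        change = sum(abs(d) for d in diffs)        # sum of absolute differences
    elif norm == "l2":
        change = math.sqrt(sum(d * d for d in diffs))  # Euclidean (2-norm) distance
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    return change < threshold


previous = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
current = {1: 0.125, 2: 1.0 / 3, 3: 5.0 / 24, 4: 1.0 / 3}
print(has_converged(previous, current))  # False: the ranks are still moving
print(has_converged(current, current))   # True: no change at all
```

In a driver loop this check would sit next to the iteration counter, so that the loop ends on whichever condition is met first.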

3. Experimental results of the example

(1) First, use the power method in MATLAB to compute the result for this example with p=1.0 [its main purpose is to verify the correctness of the Python version that follows]

Matlab source code is as follows:

    n = 4;
    i = [2 3 4 3 4 4 1 2];
    j = [1 1 1 2 2 3 3 4];
    G = sparse(i,j,1,n,n);

    [n,n] = size(G);
    for j = 1:n
       L{j} = find(G(:,j));     % pages that page j links to
       c(j) = length(L{j});     % out-degree of page j
    end

    % power method
    p = 1.0;
    delta = (1-p)/n;
    x = ones(n,1)/n;
    z = zeros(n,1);
    cnt = 0;
    while max(abs(x-z)) > .0001
       z = x;
       x = zeros(n,1);
       for j = 1:n
          if c(j) == 0
             x = x + z(j)/n;
          else
             x(L{j}) = x(L{j}) + z(j)/c(j);
          end
       end
       x = p*x + delta;
       cnt = cnt+1;
    end
    sprintf('pagerank result:')
    x

The results are:

0.1072 0.3571 0.2143 0.3214

(2) The MATLAB version of PageRank does not iterate in the MapReduce style, so I wrote another Python version of the PageRank algorithm that follows the MapReduce idea (note: it does not use Python's map and reduce functions; I chose an implementation that is easier to understand), with a threshold of 0.0001 and a maximum of 100 iterations.

    # coding=utf-8

    __author__ = 'hujiawei'
    __doc__ = 'pagerank mapreduce'


    class Node:
        def __init__(self, id, pk):
            self.id = id
            self.pk = pk


    def pk_map(map_input):
        # spread each node's rank evenly over the nodes it links to
        map_output = {}
        for node, outlinks in map_input.items():
            for link in outlinks:
                size = len(outlinks)
                if link in map_output:
                    map_output[link] += float(node.pk) / size
                else:
                    map_output[link] = float(node.pk) / size
        return map_output


    def pk_reduce(reduce_input):
        # add up the contributions collected by all mappers
        for result in reduce_input:
            for node, value in result.items():
                node.pk += value


    def pk_clear(nodes):
        for node in nodes:
            node.pk = 0


    def pk_last(nodes):
        # snapshot of the current ranks, used for the convergence test
        lastnodes = []
        for node in nodes:
            lastnodes.append(Node(node.id, node.pk))
        return lastnodes


    def pk_diff(nodes, lastnodes):
        diff = 0
        for i in range(len(nodes)):
            print('node pk %f, last node pk %f ' % (nodes[i].pk, lastnodes[i].pk))
            diff += abs(nodes[i].pk - lastnodes[i].pk)
        return diff


    def pk_test1():
        node1 = Node(1, 0.25)
        node2 = Node(2, 0.25)
        node3 = Node(3, 0.25)
        node4 = Node(4, 0.25)
        nodes = [node1, node2, node3, node4]
        threshold = 0.0001
        max_iters = 100

        for iter_count in range(max_iters):
            iter_count += 1
            lastnodes = pk_last(nodes)
            print('============ map count %d =================' % (iter_count))
            in1 = {node1: [node2, node3, node4], node2: [node3, node4]}
            in2 = {node3: [node1, node4], node4: [node2]}

            mapout1 = pk_map(in1)
            mapout2 = pk_map(in2)

            for node, value in mapout1.items():
                print(str(node.id) + ' ' + str(value))

            for node, value in mapout2.items():
                print(str(node.id) + ' ' + str(value))

            print('============ reduce count %d =================' % (iter_count))

            reducein = [mapout1, mapout2]
            pk_clear(nodes)
            pk_reduce(reducein)

            for node in nodes:
                print(str(node.id) + ' ' + str(node.pk))

            diff = pk_diff(nodes, lastnodes)
            if diff < threshold:
                break
    i