Hadoop Preparation (I): A Preliminary Study of the PageRank Algorithm

Source: Internet
Author: User

Why does PageRank appear in my Hadoop study notes? The first week of the Hadoop course covered Google's three major papers (GFS, MapReduce, and Bigtable), the origins of Hadoop's ideas, and PageRank together with its MapReduce solution. I have not yet worked out how distributed computing is used to process the PageRank of trillions of web pages. Before getting to that, I spent some time understanding the basic PageRank algorithm, and several articles explain it very clearly. (I think mathematics is the trend: without a good grounding in linear algebra, advanced mathematics, discrete mathematics, and so on, many paths are closed.)

To be honest, the explanation of the PageRank algorithm in the training courseware is too abstract. It never shows why the matrix has to look the way it does: what a row means, what a column means, why we multiply by a vector of four rows and one column, where the convergence (PR) formula comes from, and why we multiply at all. After listening three times, I finally found the answers I wanted here.
Following the same path as the courseware, the author answers the questions I was left with after listening to it:
> http://blog.codinglabs.org/articles/intro-to-pagerank.html (really well written)

In addition, there are two small web resources about the PageRank algorithm: an app in which you can drag the page relationships around and calculate the G values, at https://googledrive.com/host/0B2GQktu-wcTiaWw5OFVqT1k3bDA, and an explanation of its algorithm at http://www.nowherenearithaca.com/2013/04/explorating-googles-pagerank.html. That algorithm also adds a 1/6 matrix for dead ends. I do not know whether this is necessary, since there is already a (1-alpha) * 1/(page count) matrix.
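One way to probe the dead-end question is to look at column sums. The following is a minimal Python sketch on a hypothetical three-page graph (not the six-page example from the web app) in which page 2 has no outgoing links. The names `google_matrix`, `patch_dead_ends`, and `column_sums` are made up for this note.

```python
ALPHA = 0.85  # damping factor discussed in the text


def google_matrix(h, alpha=ALPHA):
    """G = alpha * H + (1 - alpha) / N * (all-ones matrix)."""
    n = len(h)
    return [[alpha * h[i][j] + (1 - alpha) / n for j in range(n)] for i in range(n)]


def patch_dead_ends(h):
    """Replace every all-zero column (a page with no outgoing links) by 1/N."""
    n = len(h)
    dead = [j for j in range(n) if all(h[i][j] == 0 for i in range(n))]
    return [[1 / n if j in dead else h[i][j] for j in range(n)] for i in range(n)]


def column_sums(m):
    n = len(m)
    return [sum(m[i][j] for i in range(n)) for j in range(n)]


# Hypothetical graph: page 0 -> 1, 2; page 1 -> 2; page 2 links to nothing.
# Column j holds the probabilities of leaving page j.
H = [
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]

sums_plain = column_sums(google_matrix(H))
sums_patched = column_sums(google_matrix(patch_dead_ends(H)))
print(sums_plain)
print(sums_patched)
```

Under these assumptions the dead-end column sums to only 0.15 when teleporting is the only correction, so some rank leaks away at every step, while the 1/N patch brings every column back to a sum of 1.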

Some people in the study group never understood why Google settled on an alpha value of 0.85, or how the formula for the G matrix (given in the solution notes further down) was obtained.

Because the first week's assignment is to calculate PageRank in code, I also tested in code whether this value is necessary.

The values you see in the hyperlink matrix, such as 1/3, 1/3, and 1/3, are the probabilities that a user on the current page jumps to each of the pages it links to. However, if a page has only outgoing links and no incoming links (or is completely isolated, which is called a dead end), the vectors of some pages in this matrix may be all 0.
For example, suppose I change the matrix of the first question so that no page links to A:
/*         A     B     C  D     E  */
/* A */ {  0,    0,    0, 0,    0 },
/* B */ { 1/3f,  0,    0, 0,    0 },
/* C */ { 1/3f,  0,    0, 1/2f, 1 },
/* D */ { 1/3f, 1/2f,  0, 0,    0 },
/* E */ {  0,   1/2f,  1, 1/2f, 0 }
Iterating directly with this unmodified hyperlink matrix gives:
Staring iteration 4...
0 0*0 0*0 0*0.5 0*0 0*0.5 <0>
1 0.3333333*0 0*0 0*0.5 0*0 0*0.5 <0>
2 0.3333333*0 0*0 0*0.5 0.5*0 1*0.5 <0.5>
3 0.3333333*0 0.5*0 0*0.5 0*0 0*0.5 <0>
4 0*0 0.5*0 1*0.5 0.5*0 0*0.5 <0.5>
As you can see, this drives the PR of B and D to 0 along with A, which is incorrect.
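The collapse can be reproduced in a few lines. This is a minimal Python sketch, assuming a uniform starting vector of 0.2 per page (the text does not say which start was used for this run); `step` is a made-up helper name.

```python
def step(matrix, q):
    """One iteration: q_next = matrix * q (each row times the column vector q)."""
    return [sum(m_ij * q_j for m_ij, q_j in zip(row, q)) for row in matrix]


# Modified hyperlink matrix from the text: row A is all zeros (nothing links to A).
H = [
    [0,   0,   0, 0,   0],  # A
    [1/3, 0,   0, 0,   0],  # B
    [1/3, 0,   0, 1/2, 1],  # C
    [1/3, 1/2, 0, 0,   0],  # D
    [0,   1/2, 1, 1/2, 0],  # E
]

q = [0.2] * 5  # assumed uniform starting vector
for _ in range(4):
    q = step(H, q)

print([round(x, 4) for x in q])
```

Because nothing flows into A, its value is 0 after the first step; B then receives nothing from A, D receives nothing from A or B, and all of the rank ends up shared between C and E.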

So Google added the possibility that a user on a page will, with a small probability (for example, 1 - 0.85), open some other page unrelated to the current one. This is called teleporting.
Therefore, 0.85 * hyperlink matrix is followed by a second term that spreads the remaining 0.15 over the number of pages. The division by the number of pages can be understood as a matrix of random jumps to any page, that is, a matrix whose entries are all 1/(number of pages). This gives pages that nothing links to a very small nonzero value; for example, in the first week's G matrix the "offset" probability of such entries is 0.15 / 5 = 0.03.
With this term, the problem no longer occurs.
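The construction just described can be sketched in Python as follows. This is only an illustration of the formula G = 0.85 * hyperlink matrix + 0.15 / (number of pages); `google_matrix` is a made-up name, and the matrix is the modified one from the example above.

```python
ALPHA = 0.85  # damping factor from the text


def google_matrix(h, alpha=ALPHA):
    """Scale every link probability by alpha and add the teleport share (1 - alpha) / N."""
    n = len(h)
    teleport = (1 - alpha) / n
    return [[alpha * x + teleport for x in row] for row in h]


# Modified hyperlink matrix from the text (no page links to A).
H = [
    [0,   0,   0, 0,   0],  # A
    [1/3, 0,   0, 0,   0],  # B
    [1/3, 0,   0, 1/2, 1],  # C
    [1/3, 1/2, 0, 0,   0],  # D
    [0,   1/2, 1, 1/2, 0],  # E
]

G = google_matrix(H)
column_sums = [sum(G[i][j] for i in range(5)) for j in range(5)]

for row in G:
    print("  ".join(f"{x:.4f}" for x in row))
print(column_sums)
```

Every former zero becomes 0.03, the 1/3 entries become about 0.3133, the 1/2 entries become 0.455, and the 1 becomes 0.88, while each column still sums to 1.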

After adding teleporting:
Staring iteration 4...
0 0.03*0.02999999 0.03*0.0385 0.03*0.4361937 0.03*0.0548625 0.03*0.4404438 <0.02999999>
1 0.3133333*0.02999999 0.03*0.0385 0.03*0.4361937 0.03*0.0548625 0.03*0.4404438 <0.03849999>
2 0.3133333*0.02999999 0.03*0.0385 0.03*0.4361937 0.455*0.0548625 0.88*0.4404438 <0.4361937>
3 0.3133333*0.02999999 0.455*0.0385 0.03*0.4361937 0.03*0.0548625 0.03*0.4404438 <0.05486249>
4 0.03*0.02999999 0.455*0.0385 0.88*0.4361937 0.455*0.0548625 0.03*0.4404438 <0.4404437>
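The values in the listing above can be cross-checked by iterating the teleporting matrix until it stops changing. This is a minimal Python sketch; the tolerance of 1e-10, the uniform start, and the names `step` and `pagerank` are assumptions for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math

ALPHA = 0.85


def step(matrix, q):
    """One iteration: q_next = matrix * q."""
    return [sum(m_ij * q_j for m_ij, q_j in zip(row, q)) for row in matrix]


def pagerank(g, q0, tol=1e-10, max_iter=1000):
    """Iterate until successive vectors are within tol in Euclidean distance."""
    q = list(q0)
    for _ in range(max_iter):
        nxt = step(g, q)
        if math.dist(nxt, q) <= tol:
            return nxt
        q = nxt
    return q


# Modified hyperlink matrix from the text (no page links to A).
H = [
    [0,   0,   0, 0,   0],  # A
    [1/3, 0,   0, 0,   0],  # B
    [1/3, 0,   0, 1/2, 1],  # C
    [1/3, 1/2, 0, 0,   0],  # D
    [0,   1/2, 1, 1/2, 0],  # E
]
N = len(H)
G = [[ALPHA * x + (1 - ALPHA) / N for x in row] for row in H]

pr = pagerank(G, [0.2] * N)
print([round(x, 7) for x in pr])
```

With teleporting in place, A settles at the 0.03 floor instead of 0, and B and D keep small nonzero values instead of vanishing.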
This is my understanding after reading the article. If you understand it differently, please correct me.

 

The question and my solution are attached below. I used C# for the processing; any language would do.
1. The basic process is to set up the initial hyperlink matrix (based on probabilities), then, with alpha = 0.85, apply the formula G = 0.85 * hyperlink matrix + (1 - 0.85) / (number of pages) * all-ones matrix to obtain the G matrix.
 Note that the sum of each page's column in the G matrix must not exceed 1, otherwise the result diverges; it should be exactly 1 for the iteration to converge properly.

All subsequent operations use this fixed G matrix: q(n+1) = G * q(n)

2. Convergence and stopping condition: the iteration ends when the Euclidean distance between two successive vectors is small enough (see distance and similarity measures).
In addition, numerical experiments with the initial vector q0 show that it has little influence on the result. Starting from five 1s, the iteration reaches the stopping threshold after 14 rounds; starting from five 0.2s, it takes 13 rounds. The only difference between the results is a constant multiple of the vector, and the proportions among the pages are the same.
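Both notes can be checked numerically. This is a minimal Python sketch using the matrix of the question; the tolerance of 1e-5 and the names `step`, `iterate`, and `normalized` are assumptions for illustration, and double precision is used instead of the `float` of the C# program, so the round counts may not match the text exactly.

```python
import math

ALPHA = 0.85

# Hyperlink matrix of the question (column j = links leaving page j).
H = [
    [0,   1/2, 1/2, 0, 1/2],  # A
    [1/3, 0,   0,   0, 0],    # B
    [1/3, 0,   0,   1, 1/2],  # C
    [1/3, 0,   0,   0, 0],    # D
    [0,   1/2, 1/2, 0, 0],    # E
]
N = len(H)
G = [[ALPHA * x + (1 - ALPHA) / N for x in row] for row in H]


def step(matrix, q):
    """One iteration: q_next = matrix * q."""
    return [sum(m_ij * q_j for m_ij, q_j in zip(row, q)) for row in matrix]


def iterate(matrix, q0, tol=1e-5, max_iter=1000):
    """Repeat q <- matrix * q until successive vectors are within tol
    (Euclidean distance). Returns (vector, rounds performed)."""
    q = list(q0)
    for rounds in range(1, max_iter + 1):
        nxt = step(matrix, q)
        if math.dist(nxt, q) <= tol:
            return nxt, rounds
        q = nxt
    return q, max_iter


def normalized(q):
    total = sum(q)
    return [x / total for x in q]


column_sums = [sum(G[i][j] for i in range(N)) for j in range(N)]
q_fifths, rounds_fifths = iterate(G, [0.2] * N)
q_ones, rounds_ones = iterate(G, [1.0] * N)

print(column_sums)
print(rounds_fifths, [round(x, 4) for x in normalized(q_fifths)])
print(rounds_ones, [round(x, 4) for x in normalized(q_ones)])
```

Since every column of G sums to 1, each step preserves the total of the vector, which is why the run started from five 1s ends up as five times the run started from five 0.2s, with identical proportions.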

Question:

1. Based on the "4 Web Page Model" provided on page 1 of the slides, assume that there are five web pages: A, B, C, D, and E.
1) Page A has links pointing to B, C, and D.
2) Page B has links pointing to A and E.
3) Page C has links pointing to A and E.
4) Page D has a link pointing to C.
5) Page E has links pointing to A and C.
A. Write out the Google matrix of this link structure. Which page do you think is the most important (has the highest PR value)?
B. Calculate the PR values of these five pages by hand or by program. You may use any programming language, and you are welcome to share your program and results on the forum. (Optional)
C. Where are the main difficulties in PR calculation when there are many pages? How does MapReduce solve this problem? (Optional)
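The link list in the question can be turned into the hyperlink matrix mechanically. This is a minimal Python sketch, reading item 3 as page C and using made-up names (`PAGES`, `LINKS`, `hyperlink_matrix`); it covers only the matrix part of sub-question A.

```python
PAGES = ["A", "B", "C", "D", "E"]

# Outgoing links of each page, as listed in the question.
LINKS = {
    "A": ["B", "C", "D"],
    "B": ["A", "E"],
    "C": ["A", "E"],
    "D": ["C"],
    "E": ["A", "C"],
}


def hyperlink_matrix(pages, links):
    """Entry [i][j] is the probability of jumping from page j to page i,
    i.e. 1 / (number of outgoing links of j) if j links to i, else 0."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    h = [[0.0] * n for _ in range(n)]
    for src, targets in links.items():
        for dst in targets:
            h[index[dst]][index[src]] = 1 / len(targets)
    return h


def google_matrix(h, alpha=0.85):
    """G = alpha * H + (1 - alpha) / N * (all-ones matrix)."""
    n = len(h)
    return [[alpha * x + (1 - alpha) / n for x in row] for row in h]


H = hyperlink_matrix(PAGES, LINKS)
G = google_matrix(H)
for name, row in zip(PAGES, G):
    print(name, "  ".join(f"{x:.4f}" for x in row))
```

Each column of `H` describes one source page and each row collects what flows into one target page, which is the same orientation as the matrix in the C# program that follows.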

using System;

namespace ConsoleApplication1
{
    class Program
    {
        // Hyperlink matrix; GetGoogleMatrix() overwrites it in place with the G matrix.
        static float[,] arrSrcMatrix;
        static float alpha = 0.85f;
        static float[] curPageRankMatrix;
        static int iterationTime;

        static void Main(string[] args)
        {
            // Column j: probabilities of leaving page j. Row i: what flows into page i.
            arrSrcMatrix = new float[5, 5]
            {
                /*          A     B     C     D  E    */
                /* A */ { 0,    1/2f, 1/2f, 0, 1/2f },
                /* B */ { 1/3f, 0,    0,    0, 0    },
                /* C */ { 1/3f, 0,    0,    1, 1/2f },
                /* D */ { 1/3f, 0,    0,    0, 0    },
                /* E */ { 0,    1/2f, 1/2f, 0, 0    }
            };
            GetGoogleMatrix();
            curPageRankMatrix = new float[5] { 0.2f, 0.2f, 0.2f, 0.2f, 0.2f };
            iterationTime = 0;
            // Stopping threshold. 0.00001 is an assumed value; it is consistent
            // with the run of 13 iterations shown in the calculation result.
            double endValue = 0.00001d;
            while (true)
            {
                iterationTime++;
                var nextMatrix = DoIterate(curPageRankMatrix);
                // Euclidean distance between successive PageRank vectors
                double cnt = 0.00d;
                for (var m = 0; m < curPageRankMatrix.Length; m++)
                {
                    cnt += Math.Pow(nextMatrix[m] - curPageRankMatrix[m], 2);
                }
                if (Math.Sqrt(cnt) <= endValue)
                {
                    break;
                }
                else
                {
                    curPageRankMatrix = nextMatrix;
                }
            }
        }

        /// <summary>
        /// G = 0.85 * hyperlink matrix + 0.15 / page count * all-ones matrix
        /// </summary>
        static void GetGoogleMatrix()
        {
            int pageCount = arrSrcMatrix.GetUpperBound(0) + 1;
            for (var m = 0; m <= arrSrcMatrix.GetUpperBound(0); m++)
            {
                Console.Write(string.Format("{0}\t", m));
                for (var n = 0; n <= arrSrcMatrix.GetUpperBound(1); n++)
                {
                    arrSrcMatrix[m, n] = arrSrcMatrix[m, n] * alpha + (1 - alpha) / pageCount;
                    Console.Write(string.Format("{0}\t", arrSrcMatrix[m, n]));
                }
                Console.WriteLine();
            }
        }

        /// <summary>
        /// One iteration: returns G * curPageRankMatrix.
        /// The vector length must equal the number of pages.
        /// </summary>
        /// <param name="curPageRankMatrix">current PageRank vector</param>
        static float[] DoIterate(float[] curPageRankMatrix)
        {
            float[] tgt = new float[curPageRankMatrix.Length];
            Console.WriteLine("Staring iteration " + iterationTime + "...");
            for (var m = 0; m <= arrSrcMatrix.GetUpperBound(0); m++)
            {
                if (m >= tgt.Length) break;
                float cur = 0.0f;
                Console.Write(string.Format("{0}\t", m));
                for (var n = 0; n <= arrSrcMatrix.GetUpperBound(1); n++)
                {
                    cur += arrSrcMatrix[m, n] * curPageRankMatrix[n];
                    Console.Write(string.Format("{0}*{1}\t", arrSrcMatrix[m, n], curPageRankMatrix[n]));
                }
                tgt[m] = cur;
                Console.Write(string.Format("<{0}>", tgt[m]));
                Console.WriteLine();
            }
            return tgt;
        }
    }
}
Calculation result

c:\Users\shixun\Desktop>leleapplication1.exe

G matrix as printed (rows 0 to 4 are pages A to E):
0   0.03        0.455   0.455   0.03    0.455
1   0.3133333   0.03    0.03    0.03    0.03
2   0.3133333   0.03    0.03    0.88    0.455
3   0.3133333   0.03    0.03    0.03    0.03
4   0.03        0.455   0.455   0.03    0.03

PR vector entering each iteration (the second factor of every product the program prints):
Iteration   A           B            C           D            E
1           0.2         0.2          0.2         0.2          0.2
2           0.285       0.08666666   0.3416667   0.08666666   0.2
3           0.2970417   0.11075      0.2694167   0.11075      0.2120417
4           0.2816885   0.1141618    0.298417    0.1141618    0.1915708
5           0.2867636   0.1098117    0.2882669   0.1098117    0.205346
6           0.2864555   0.1112497    0.2918617   0.1112497    0.1991834
7           0.2859753   0.1111624    0.2903775   0.1111624    0.2013223
8           0.2862164   0.1110263    0.2910763   0.1110263    0.2006544
9           0.2861718   0.1110946    0.2907452   0.1110946    0.2008936
10          0.2861617   0.111082     0.2908922   0.111082     0.2007819
11          0.2861714   0.1110791    0.2908311   0.1110791    0.200839
12          0.2861685   0.1110819    0.2908558   0.1110819    0.2008119
13          0.2861685   0.1110811    0.2908457   0.1110811    0.2008235

Iteration 13 produces A = 0.2861689, C = 0.29085, and E = 0.2008189, with B and D at about 0.1111, and the run stops there.
