PageRank algorithm: MapReduce implementation ("Big Chuang _ Community Division")

Source: Internet
Author: User

For an analysis of the PageRank algorithm and a Python implementation, see: http://blog.csdn.net/gamer_gyt/article/details/47443877

For example:

Assume that each page has its own default PR value. The PR value acts as an attribute attached to the page that identifies its level or importance, and pages are ranked by it. Suppose a page with ID 1 has a PR value of 10 and links to 4 pages: id=3, id=6, id=8, and id=9. The page id=1 then contributes a PR value of 2.5 to each of the 4 pages id=3, 6, 8, 9. To calculate the PR value of a page, say id=3, take the sum of the contributions that all other pages make to id=3 and apply the function "PR" = "sum" * 0.85 + 0.15. The PR value of each page, which serves as the index for ranking pages, is obtained by repeating this calculation over multiple iterations.
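The arithmetic in this example can be checked with a minimal Python sketch. The function names contribution and updated_pr are made up for illustration and do not come from the original article; only the numbers and the formula "PR" = "sum" * 0.85 + 0.15 are taken from the text.

```python
def contribution(pr, out_links):
    """PR value that a page passes to each page it links to."""
    return pr / len(out_links)


def updated_pr(contributions, damping=0.85, base=0.15):
    """Apply PR = sum * 0.85 + 0.15 to the contributions a page receives."""
    return sum(contributions) * damping + base


# Page id=1 has PR 10 and links to pages 3, 6, 8 and 9.
share = contribution(10, [3, 6, 8, 9])
print(share)  # 2.5

# If id=1 were the only page linking to id=3, its new PR would be
# 2.5 * 0.85 + 0.15.
new_pr = updated_pr([share])
print(new_pr)
```

In a full run, updated_pr would be called for every page in each iteration, using the contributions computed from the previous iteration's PR values.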

1: Prepare data

In theory, the data is the information about all the pages in a particular closed system, collected by a web crawler. To test the program, you can instead generate your own data in a defined format. Here is the data I used for testing and how it is stored:


(Note: In custom simulation data, all pages are "equal" when the initial PR value is chosen; there is no large difference at the outset between your own page and a popular page such as Google's. After the calculation is applied according to the rules, however, the PR values are no longer the same: many other pages may link to Google, for example, so its PR will naturally be higher than yours. Choosing the initial values this way is therefore realistic, that is, every page gets the same default PR value. Even if the initial values are not the same, the total PR of the whole system and the final PR of each page may change, but this change does not affect the ranking obtained by comparing the pages' PR values.)
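The claim about initial values can be tried out on a toy graph. The sketch below is only an illustration under stated assumptions: the 4-page link graph, the starting values, and the names iterate_pr and ranking are all invented here, and the update rule is the "PR" = "sum" * 0.85 + 0.15 function described earlier. It is a single-process loop, not the MapReduce job itself.

```python
def iterate_pr(links, initial, rounds=100):
    """Repeat the PR update for a fixed number of rounds."""
    pr = dict(initial)
    for _ in range(rounds):
        received = {page: 0.0 for page in links}
        for page, targets in links.items():
            share = pr[page] / len(targets)  # equal split across out-links
            for target in targets:
                received[target] += share
        pr = {page: received[page] * 0.85 + 0.15 for page in links}
    return pr


def ranking(pr):
    """Page IDs ordered from highest to lowest PR."""
    return sorted(pr, key=pr.get, reverse=True)


# Hypothetical graph: every page has at least one out-link.
links = {1: [2, 3], 2: [3], 3: [1], 4: [3]}

equal_start = iterate_pr(links, {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0})
unequal_start = iterate_pr(links, {1: 10.0, 2: 0.5, 3: 2.0, 4: 7.0})

print(ranking(equal_start))
print(ranking(unequal_start))
```

On this particular graph both starting points produce the same ordering of pages after 100 rounds. This is one small experiment and not a proof for arbitrary graphs.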

2: MapReduce process

By default, map accepts its input as <key,value> pairs of the form <offset, text line>, such as <0,1 5 2 3 4 5><20,2 3 5 8 9>.

Goal: Convert the default data format into custom <key,value> pairs.

Known: Between the map and reduce phases, the Hadoop framework automatically sorts the map output and collects all values that share a key into a list, in the form <key,list(1,1,1,2)>. For example, suppose the page id=2 is linked from the 4 pages id=1, 6, 7, 8, and these 4 pages contribute (1.25,1,5/3,5) respectively. If the page ID is used as the key, the Hadoop framework passes <2,list(1.25,1,5/3,5)> to reduce.
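The map step and the grouping that Hadoop performs can be imitated in plain Python. This is a sketch under loud assumptions, not the author's code: the line format "page_id pr link_1 link_2 ..." is a guess at the test data layout, the four input lines are invented so that they reproduce the contributions (1.25,1,5/3,5) from the text, and map_line and shuffle are hypothetical names. The shuffle function only stands in for what the framework does automatically.

```python
from collections import defaultdict


def map_line(line):
    """Emit (target_id, contribution) pairs for one input line.

    Assumed (hypothetical) format: "<page_id> <pr> <link_1> <link_2> ...".
    """
    fields = line.split()
    pr = float(fields[1])
    links = fields[2:]
    if not links:
        return []  # a page without out-links contributes nothing
    share = pr / len(links)
    return [(target, share) for target in links]


def shuffle(pairs):
    """Stand-in for Hadoop's sort/group step: collect values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Invented test data: pages 1, 6, 7 and 8 all link to page 2.
lines = ["1 5 2 3 4 5", "6 5 2 3 4 7 9", "7 5 2 4 6", "8 5 2"]
pairs = [pair for line in lines for pair in map_line(line)]
grouped = shuffle(pairs)
print(grouped["2"])
```

Note that this sketch emits only contributions. A job that runs several iterations would also need to pass each page's link list through to the next round, which is left out here.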

Reduce phase:

Analysis: This phase is relatively simple: read the <key,value> pairs output by map and parse them.

Action: Sum the numbers in values; the sum is used to compute the PageRank value of the corresponding ID.
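A reduce step along these lines might look like the following sketch. The name reduce_page is hypothetical, and applying "PR" = "sum" * 0.85 + 0.15 inside the reducer is an assumption based on the function given in the example at the start of the article.

```python
def reduce_page(page_id, values, damping=0.85, base=0.15):
    """Sum the contributions for one page and apply PR = sum * 0.85 + 0.15."""
    total = sum(values)
    return page_id, total * damping + base


# Input in the form <2,list(1.25,1,5/3,5)> from the map phase.
page_id, pr = reduce_page("2", [1.25, 1, 5 / 3, 5])
print(page_id, round(pr, 4))  # 2 7.7292
```

The output pair can be written back in the same format as the input, so that the next iteration reads the updated PR values.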

The results look like this:


The code is as follows


Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
