Calculate the PageRank value of Wikipedia


This is the first time I have written a CSDN blog post, and I found its xheditor not very easy to use. This article is an assignment from our WBIA (Web Based Information Architecture) course and was pasted directly from the Word assignment report, so the paste did not go smoothly. First, the images did not come through and have to be uploaded and inserted by hand. Second, the pasted formatting still has some minor issues to fix. In any case, please bear with a first-time CSDN blogger. If you want the source code, please leave a comment.

Assignment requirements: http://net.pku.edu.cn/~Wbia/2012fall/project1.html, Problem 1: PageRank. This project is implemented in Java.

1. Key points of the project

1.1 PageRank value calculation formula

This project uses the random surfer model to calculate the PageRank value, with the formula described below.

In the formula, d is the probability that the user continues browsing by following a hyperlink, and 1 - d is the probability that the user jumps at random to a new web page. The initial PageRank value of every page is set to 1, which keeps the total PageRank value unchanged across iterations.
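
Based on this description and on the update rule used in section 2.4, the formula is the standard random-surfer form: PR(p) = (1 - d) + d * Σ PR(q) / L(q), where the sum runs over the pages q that link to p and L(q) is the number of out-links of q. As a quick numerical check (with d = 0.85, an assumed value not stated in the report): if the pages linking to p together contribute a fraction of 2.0, then PR(p) = 0.15 + 0.85 * 2.0 = 1.85.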

1.2 Convergence of PageRank in iterative computation

The PageRank iteration is considered to have converged when the new and old PageRank values before and after an update differ by less than a specified threshold. This project uses a max_pagerank_error threshold: if the number of pages whose old and new PageRank values differ by more than max_pagerank_error is smaller than max_error_page, the iteration has converged and the algorithm ends. In the reported run, with max_pagerank_error = 0.001 and max_error_page = max_page/1000 + 1 = 12, the algorithm converged after 37 iterations.
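
A minimal sketch of this convergence test (the names below are illustrative, not the identifiers used in the project):

    import java.util.Map;

    // Converged when fewer than maxErrorPage pages still changed by more than maxPagerankError.
    static boolean hasConverged(Map<String, Double> oldPr, Map<String, Double> newPr,
                                double maxPagerankError, int maxErrorPage) {
        int errorPages = 0;
        for (Map.Entry<String, Double> e : newPr.entrySet()) {
            if (Math.abs(e.getValue() - oldPr.get(e.getKey())) > maxPagerankError) {
                errorPages++;
            }
        }
        return errorPages < maxErrorPage;
    }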

1.3 The dangling node problem

A dangling node is a node with out-degree 0, that is, a node that has no links to any other node. When the PageRank value is computed iteratively, other page nodes pass part of their PageRank value to such a node through their links, but because the node has no links of its own, it never passes any PageRank value on to other nodes. As the iteration proceeds, the dangling nodes gradually absorb PageRank value and become a PageRank black hole, which is unfair to the other nodes.

There are two main ways to handle dangling nodes. The first is mentioned in the original PageRank paper: delete the links pointing to dangling nodes (this may turn a non-dangling node into a new dangling node, so the deletion has to be applied iteratively), compute the PageRank values of the remaining non-dangling nodes until convergence, then add the deleted links back to the link graph and recompute PageRank until convergence again. This method reduces the amount of PageRank value absorbed by dangling nodes. The second method is to create a virtual node, give every dangling node a link to this virtual node, and give the virtual node a link to every other node. In each iteration, the PageRank value absorbed by the dangling nodes flows to the virtual node, which then distributes it to all other nodes. This project adopts the second method.

 

2. Project Implementation

2.1 Overall project framework

This project consists of four classes: pagerankonwiki, linkgraphmaker, pagerankcalculator, and topnpagerank. The pagerankonwiki class is the main class; it contains the main() function and the project's parameter variables, and it calls the other three classes. The overall class view of the project is as follows.

2.2 pagerankonwiki class

The pagerankonwiki class is the main class of this project. It holds the project parameters and variables, and the remaining three classes are called from its main() function.

2.3 linkgraphmaker class

This class builds the link graph. It has three main methods: buildrawgraph(), cleangraph(), and printgraph().

buildrawgraph() reads the data from the file smallwiki-flattened.xml, extracts each page and its link information, and builds the initial link graph. The page title is extracted with the regular expression <title>(.+?)</title>, while links are extracted with a regular expression matching the [[Target]] and [[Target|anchor text]] wiki-link syntax (note: when the regular expressions are written in Java they need the corresponding escaping, and some of the parentheses are capture groups used to extract the variables later). This method also creates the virtual node.
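
A sketch of this extraction step with reconstructed regular expressions (the patterns, class, and method names here are assumptions, not the project's actual code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    class LinkExtractor {
        // Reconstructed patterns: group(1) captures the page title / link target.
        static final Pattern TITLE = Pattern.compile("<title>(.+?)</title>");
        // Matches [[Target]] and [[Target|anchor text]].
        static final Pattern LINK = Pattern.compile("\\[\\[([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]");

        // Return the link targets found in one page's wiki text.
        static List<String> extractLinks(String pageText) {
            List<String> targets = new ArrayList<>();
            Matcher m = LINK.matcher(pageText);
            while (m.find()) {
                targets.add(m.group(1));
            }
            return targets;
        }
    }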

cleangraph() processes the link graph created by buildrawgraph(). If the page a link points to is not present in the file (for example links to images or to pages outside this Wikipedia data set), the link is deleted. It then checks whether a node is dangling; if so, a link from that node to the virtual node is created.
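
A minimal sketch of this cleanup pass, assuming the link graph is stored as an adjacency map from page title to outgoing link targets (the names are illustrative):

    import java.util.Map;
    import java.util.Set;

    // Remove links to pages not in the graph, then attach dangling nodes to the virtual node.
    static void cleanGraph(Map<String, Set<String>> graph, String virtualNode) {
        for (Map.Entry<String, Set<String>> e : graph.entrySet()) {
            // Drop links whose target page does not exist in this data set.
            e.getValue().retainAll(graph.keySet());
            // A node left with no out-links is dangling: give it a single link to the virtual node.
            if (e.getValue().isEmpty() && !e.getKey().equals(virtualNode)) {
                e.getValue().add(virtualNode);
            }
        }
    }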

printgraph() prints the established link graph to the linkgraph.txt file.

2.4 pagerankcalculator class

This class computes the PageRank values. Its main methods are calpagerank(), initpagerankvalue(), iterationforpagerank(), and printpagerank().

calpagerank() is the main method of this class. It first calls initpagerankvalue() to initialize the variables related to PageRank, then calls iterationforpagerank() in a loop to compute the PageRank value iteratively until the algorithm converges, and finally calls printpagerank() to print the PageRank values (a combined sketch of this flow is given at the end of this subsection).

initpagerankvalue() initializes the variables used in the PageRank calculation.

iterationforpagerank() performs one update of the PageRank values. The calculation works as follows: each page node distributes its PageRank value evenly to the page nodes it links to, each page node sums up the PageRank value allocated to it (prfragment), and the formula PageRank = (1 - D) + D * prfragment is then used for the update. To reduce the number of stored links, the link graph does not create real links from the virtual node to every other node, so the virtual node is handled separately in each iteration: its PageRank value is distributed to all other nodes. In addition, the PageRank values before and after the update are compared; if the number of page nodes whose difference exceeds max_pagerank_error is not larger than max_error_page, the algorithm has converged, and an end flag is returned to calpagerank() to signal that the algorithm is finished.
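
A compact sketch of one such update pass, again assuming an adjacency-map link graph and an illustrative name for the virtual node (this is not the project's actual code):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // One PageRank iteration. graph: page -> out-links, pr: page -> current PageRank, d: damping factor.
    static Map<String, Double> iterate(Map<String, Set<String>> graph,
                                       Map<String, Double> pr,
                                       String virtualNode, double d) {
        Map<String, Double> fragment = new HashMap<>();
        for (String page : graph.keySet()) {
            fragment.put(page, 0.0);
        }
        int n = graph.size();
        for (Map.Entry<String, Set<String>> e : graph.entrySet()) {
            String page = e.getKey();
            if (page.equals(virtualNode)) {
                // The virtual node stores no real links: spread its value over all other nodes.
                double share = pr.get(page) / (n - 1);
                for (String other : graph.keySet()) {
                    if (!other.equals(virtualNode)) {
                        fragment.put(other, fragment.get(other) + share);
                    }
                }
            } else {
                // A normal page splits its value evenly among its out-links
                // (after cleangraph() every normal page has at least one out-link).
                Set<String> out = e.getValue();
                double share = pr.get(page) / out.size();
                for (String target : out) {
                    fragment.put(target, fragment.get(target) + share);
                }
            }
        }
        // Apply PageRank = (1 - d) + d * prfragment to every node.
        Map<String, Double> next = new HashMap<>();
        for (String page : graph.keySet()) {
            next.put(page, (1 - d) + d * fragment.get(page));
        }
        return next;
    }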

printpagerank() prints the PageRank value of each node to pagerank.txt.
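
Putting the pieces together, the control flow of calpagerank() might look like the following sketch, reusing the illustrative hasConverged() and iterate() helpers sketched earlier (again, not the project's actual code):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // Sketch of the calpagerank() flow: initialize, iterate until convergence, print.
    static void calPageRank(Map<String, Set<String>> graph, String virtualNode, double d,
                            double maxPagerankError, int maxErrorPage, int maxIterationNumber) {
        // initpagerankvalue(): every node starts with PageRank 1.
        Map<String, Double> pr = new HashMap<>();
        for (String page : graph.keySet()) {
            pr.put(page, 1.0);
        }
        // iterationforpagerank() is repeated until convergence or the iteration cap.
        for (int i = 0; i < maxIterationNumber; i++) {
            Map<String, Double> next = iterate(graph, pr, virtualNode, d);
            boolean done = hasConverged(pr, next, maxPagerankError, maxErrorPage);
            pr = next;
            if (done) {
                break;
            }
        }
        // printpagerank(): here simply written to standard output instead of pagerank.txt.
        for (Map.Entry<String, Double> e : pr.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }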

2.5 topnpagerank class

This class sorts the PageRank values computed by the pagerankcalculator class from largest to smallest and outputs the top N values (topn is set to 10 and can be changed). Its methods are sortpage() and printpagerank().

sortpage() sorts the pages by PageRank value using the Comparable interface.
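
One way to realize this step, shown here with a Comparator over map entries rather than Comparable for brevity (the names are illustrative, not the project's):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Return the top-n pages by PageRank, largest first, skipping the virtual node.
    static List<Map.Entry<String, Double>> topN(Map<String, Double> pr,
                                                String virtualNode, int n) {
        List<Map.Entry<String, Double>> pages = new ArrayList<>(pr.entrySet());
        pages.removeIf(e -> e.getKey().equals(virtualNode));
        pages.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return pages.subList(0, Math.min(n, pages.size()));
    }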

printpagerank() prints the page nodes with the top N PageRank values to topnpagerank.txt; the virtual node is not printed.

3. Output results

3.1 Link graph in linkgraph.txt

The link graph is shown as follows:

This figure is obtained by traversing the hash table. The first three numbers in each row are the node's serial number, the row number at which it is stored in the link graph, and the node's number of links; they are followed by the page node's title and then its outgoing links.

3.2 PageRank values in pagerank.txt:


Here, iterationnumber indicates the number of iterations performed and max_iteration_number the maximum number of iterations allowed. Since the iteration count stayed below the maximum, the run ended because the algorithm converged rather than because the iteration limit was reached.

3.3 Top 10 pages by PageRank in topnpagerank.txt

 
