I didn't feel like doing anything else last night, and I suddenly remembered that I hadn't updated my blog in a long time. I've nearly finished learning shell, though I only studied it from the book. The Hadoop course taught by teacher Huang Yihua has also wrapped up; I rushed through it and never summarized it. So today I'm writing up the PageRank code for the English Wikipedia data that I wrote some time ago.
What is PageRank?
PageRank is a technology used by search engines to calculate the ranking of webpages based on the links between webpages.
PageRank is the method Google uses to measure the level, or importance, of a webpage. Its level ranges from 1 to 10. The higher the PR value, the more popular (the more important) the page is.
PageRank's basic design ideas and principles
A webpage that is linked to by many high-quality webpages is most likely a high-quality webpage itself.
Conditions for a webpage to have a higher PR value:
- Many webpages link to it;
- High-quality webpages link to it.
PageRank Simplified Model
We can regard the links between webpages on the Internet as a directed graph.
For any webpage Pi, its PageRank value can be expressed as:

PR(Pi) = Σ_{Pj ∈ Bi} PR(Pj) / Lj

where Bi is the set of all webpages that link to webpage Pi, and Lj is the number of out-links of webpage Pj.
Defects in simplified models
The real web's hyperlink environment is not that ideal, so PageRank faces two problems:
- Rank leak: a webpage with no outbound links leaks its rank out of the system; with the formula above, after several iterations all PageRank values tend to 0.
- Rank sink: a group of tightly interlinked webpages with no links pointing out of the group traps rank inside it; after several iterations, the PR values of webpages outside that loop tend to 0.
Solution to the defects of the simplified model -- use the random browsing model:
- Suppose a surfer starts browsing from a random webpage.
- The surfer keeps clicking links on the current webpage to move to the next one.
- Eventually the surfer gets bored and jumps to a new random webpage to start again.
- The probability that such a random surfer visits a webpage in this way is exactly that webpage's PageRank value.
This random model is closer to how users actually browse.
With the random browsing model, the PR value calculation formula becomes:

PR(Pi) = (1 - d) / N + d * Σ_{Pj ∈ Bi} PR(Pj) / Lj

where d is the probability of continuing to browse along hyperlinks (the damping factor) and 1 - d is the probability that the user jumps to a random new page. The probability of jumping to any particular page is therefore (1 - d) / N, where N is the total number of webpages.
According to this formula, each webpage's PR value converges to a stable value after roughly 10 iterations (this can be proven with the Markov chain convergence theorem; I'm not very familiar with it, so interested readers can look it up).
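To make the formula concrete, here is a tiny standalone Java sketch (not part of the MapReduce code) that iterates the damped formula on a made-up three-page graph; the graph, names, and uniform initial values are purely illustrative.

```java
// Toy illustration of the damped PageRank formula on a 3-page graph:
// A -> B, A -> C, B -> C, C -> A. Values stabilize after a handful of iterations.
public class TinyPageRank {
    public static void main(String[] args) {
        double d = 0.85;                        // damping factor
        int n = 3;                              // pages: 0 = A, 1 = B, 2 = C
        int[][] inLinks = {{2}, {0}, {0, 1}};   // who links to each page
        int[] outDegree = {2, 1, 1};            // out-link count of each page
        double[] pr = {1.0 / n, 1.0 / n, 1.0 / n};

        for (int iter = 0; iter < 10; iter++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double sum = 0.0;
                for (int j : inLinks[i]) {
                    sum += pr[j] / outDegree[j];   // share received from each in-link
                }
                next[i] = (1 - d) / n + d * sum;   // the damped formula above
            }
            pr = next;
        }
        System.out.printf("PR(A)=%.4f PR(B)=%.4f PR(C)=%.4f%n", pr[0], pr[1], pr[2]);
    }
}
```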
Use MapReduce to implement PageRank
Obviously, this kind of repeated, data-heavy computation is well suited to MapReduce. How is it done? The task is completed in three phases:
Phase1: GraphBuilder
Build the hyperlink graph between webpages
Phase2: PageRankIter
Iteratively compute the PageRank value of each webpage
Phase3: RankViewer
Output the webpages sorted by PageRank value from high to low
Specific design:
Each line of the en-wiki data corresponds to one page. The page title is wrapped in "<title>Name of the article</title>", and every link to another page is enclosed in a pair of double square brackets, "[[Name of other article]]". What we need to do is use the PageRank algorithm to compute the importance of each page and output the pages from high to low importance. The first step is to build the link graph for every page; this is implemented by the GraphBuilder class, and the structure is similar to an adjacency-list representation of a graph. The second step is to run the PageRank algorithm for a certain number of iterations to obtain each page's PR value; the algorithm is implemented in the PageRankIter class. The third step is to display the result, output in descending order in the format "PRValue title"; this is implemented by the PageRankViewer class. Finally, the whole algorithm needs to be tied together, which is done by the main function.
The following describes the specific implementation of each class:
1. Overall Class View
2. GraphBuilder class
The map function builds the link graph. The first problem to solve is how to extract the title and the link titles. We use regular expressions for this: "(?<=<title>)(.*?)(?=</title>)" matches the title, and "(?<=\\[\\[)(.+?)(?=\\])" matches the link titles. Note that in the second pattern, '[' and ']' are metacharacters in regular expressions and must be escaped, and in a Java string the backslash itself also needs escaping, which is why two backslashes appear. After extraction, the key is set to the title, and the initial PR value together with the out-link list is emitted as the value in the format "PRValue%link1 link2 link3...". Both the key and value types are Text.
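A minimal sketch of such a mapper might look like this (the class and field names, and the way empty matches are skipped, are illustrative choices rather than the exact code; the initial PR value of 0.5 follows the experiment settings described at the end):

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GraphBuilder {
    public static class GraphBuilderMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Pattern TITLE =
                Pattern.compile("(?<=<title>)(.*?)(?=</title>)");
        private static final Pattern LINK =
                Pattern.compile("(?<=\\[\\[)(.+?)(?=\\])");
        private static final String INIT_PR = "0.5";    // initial PR value

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String page = value.toString();
            Matcher t = TITLE.matcher(page);
            if (!t.find()) {
                return;                                  // skip lines without a title
            }
            String title = t.group(1);

            StringBuilder out = new StringBuilder(INIT_PR).append("%");
            Matcher l = LINK.matcher(page);
            while (l.find()) {
                out.append(l.group(1)).append(" ");      // space-separated out-link list
            }
            // key = page title, value = "PRValue%link1 link2 link3 ..."
            context.write(new Text(title), new Text(out.toString().trim()));
        }
    }
}
```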
In the reduce stage, no extra work is needed: the default (identity) reduce function simply emits the key-value pairs produced in the map stage.
The main function sets the parameters of the whole job and runs it. Because the default reduce function is used, you need to set the map output key/value types as well as the job output key/value types, otherwise a Type Mismatch error may occur. Because the output file is also used as the input of the next job, the output format is set to the binary SequenceFileOutputFormat; once the next job's input format is set to the same binary format, its map can recover the key and value by itself. In addition, you need to delete the output directory before running the job.
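A minimal sketch of that job setup, reusing the GraphBuilderMapper sketched above (class names and argument handling are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class GraphBuilderDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "GraphBuilder");
        job.setJarByClass(GraphBuilderDriver.class);
        job.setMapperClass(GraphBuilder.GraphBuilderMapper.class);
        // The default (identity) reducer is used, so map and job output types must match.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Write a SequenceFile so the next job can read typed keys and values directly.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path out = new Path(args[1]);
        out.getFileSystem(conf).delete(out, true);   // delete old output before running
        FileOutputFormat.setOutputPath(job, out);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```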
3. PageRankIter class
The map input and output keys and values are all of type Text. The map has to both preserve the graph structure and distribute the page's PR value to each of the titles it links to. It reads its own PR value and out-link titles from the input value; since the value format is "PRValue%link1 link2 link3...", two splits yield the PR value and the linkTitle array. Each element of the linkTitle array is then assigned a share of the PR value and emitted: the key is the linkTitle and the value is PRValue / linkTitle.length. The graph structure "link1 link2 link3 ..." is also emitted.
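A minimal sketch of such a map function, reading the previous job's SequenceFile output (names and the handling of pages without out-links are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PageRankIterMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // value looks like "PRValue%link1 link2 link3 ..."
        String[] parts = value.toString().split("%", 2);
        double pr = Double.parseDouble(parts[0]);
        if (parts.length > 1 && !parts[1].trim().isEmpty()) {
            String[] links = parts[1].trim().split(" ");
            double share = pr / links.length;            // each out-link gets an equal share
            for (String link : links) {
                context.write(new Text(link), new Text(String.valueOf(share)));
            }
            // re-emit the graph structure so the reducer can rebuild "PRValue%links"
            context.write(key, new Text(parts[1].trim()));
        }
    }
}
```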
A custom combiner is needed because the same linkTitle may appear several times on one page. If a value carries the graph structure, it is forwarded unchanged; otherwise the PR contributions are simply summed before being emitted.
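The combiner logic might be sketched like this (the numeric-parse check plays the role of the isDouble helper described below; this is only a sketch, not the exact code):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        boolean sawNumber = false;
        for (Text v : values) {
            String s = v.toString();
            try {
                sum += Double.parseDouble(s);   // a distributed PR contribution
                sawNumber = true;
            } catch (NumberFormatException e) {
                context.write(key, v);          // graph structure: forward unchanged
            }
        }
        if (sawNumber) {
            context.write(key, new Text(String.valueOf(sum)));  // pre-summed contribution
        }
    }
}
```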
In the reduce stage, the PR contributions received from every linking page are summed, and the new PR value is obtained from the formula newPR = (1 - d) + d * sum. Note that the random-jump term is set to (1-d) instead of (1-d)/N, mainly because N is very large, so (1-d)/N is tiny and can be ignored. At first I tried to obtain N in GraphBuilder's map by setting a global static variable and incrementing it each time map processed a row. This worked on the local machine, but on the cluster all the resulting PR values came out infinite; I then realized that map runs separately on each node, so N never changed from its initial value of 0, and the division by zero produced those infinities. I then thought about writing a separate job just to compute N, but decided it was not worth it and dropped the idea. After the new PR value is obtained, the output value is set to "PRValue%link1 link2 link3..." for the next job to use.
The isDouble function determines whether a string is a floating-point number; it is used to tell whether a received value carries the graph structure or a distributed PR contribution.
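Putting the reduce logic and isDouble together, a sketch might look like this (the damping factor of 0.85 follows the experiment settings below; names are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PageRankIterReducer extends Reducer<Text, Text, Text, Text> {
    private static final double D = 0.85;   // damping factor

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        String links = "";
        for (Text v : values) {
            String s = v.toString();
            if (isDouble(s)) {
                sum += Double.parseDouble(s);   // a distributed PR contribution
            } else {
                links = s;                      // the page's out-link list
            }
        }
        double newPr = (1 - D) + D * sum;       // (1-d)/N approximated as (1-d), as above
        context.write(key, new Text(newPr + "%" + links));
    }

    // true if the string parses as a floating-point number
    private static boolean isDouble(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}
```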
The main function sets up the whole job, with both the input and output formats set to the binary SequenceFile format. Everything else is basically the same as the main function of the previous stage.
4. PageRankViewer class
In the map stage, we extract the PR value from the value, set the key to the PR value and the value to the title, and emit the pair.
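A sketch of such a map function, assuming the PR value is emitted as a DoubleWritable key so it can be sorted numerically (the key type is an assumption, not stated in the original):

```java
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PageRankViewerMapper extends Mapper<Text, Text, DoubleWritable, Text> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // value is "PRValue%link1 link2 ..."; keep only the PR value and swap key and value
        double pr = Double.parseDouble(value.toString().split("%", 2)[0]);
        context.write(new DoubleWritable(pr), key);
    }
}
```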
Because the results must be output in descending order, you need to write your own Comparator; it only has to return the negation of the result of the parent class's compare function.
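A minimal sketch of such a comparator, built on Hadoop's DoubleWritable.Comparator (the class name is illustrative):

```java
import org.apache.hadoop.io.DoubleWritable;

public class DescComparator extends DoubleWritable.Comparator {
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // negate the parent's result to sort in descending order
        return -super.compare(b1, s1, l1, b2, s2, l2);
    }
}
```

It would be registered on the viewer job with job.setSortComparatorClass(DescComparator.class).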
The main function sets the related parameters and the input and output value types; it is basically the same as in the previous classes.
5. main Function
The main function controls the whole job flow, including checking the parameters, setting the input/output paths, and setting the number of PageRank iterations.
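A sketch of how the driver might chain the three phases; the run helpers and directory names are illustrative placeholders rather than the actual code, and the iteration count of 10 follows the experiment settings below:

```java
public class PageRankDriver {
    private static final int ITERATIONS = 10;   // number of PageRank iterations

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: PageRank <input> <output>");
            System.exit(2);
        }
        String data = args[1] + "/Data0";
        // Phase 1: build the link graph from the raw wiki pages
        // (run(in, out) is a hypothetical helper wrapping each class's job setup)
        GraphBuilder.run(args[0], data);
        // Phase 2: iterate PageRank, each job's output feeding the next job's input
        for (int i = 0; i < ITERATIONS; i++) {
            String next = args[1] + "/Data" + (i + 1);
            PageRankIter.run(data, next);
            data = next;
        }
        // Phase 3: sort by PR value in descending order and write the final result
        PageRankViewer.run(data, args[1] + "/FinalRank");
    }
}
```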
Running result:
In this experiment, we set the initial PR value of every title to 0.5, the damping factor to 0.85, and the number of PageRank iterations to 10. Below are the top 30 titles and their PR values from running on the cluster:
I won't paste the full code here; it should be straightforward to implement following this idea.