Link mining algorithms: the PageRank algorithm and the HITS algorithm


Reference: http://blog.csdn.net/hguisu/article/details/7996185
More data mining algorithms: https://github.com/linyiqun/DataMiningAlgorithm

Link Analysis

Link analysis has two classic algorithms: one is the PageRank algorithm, and the other is the HITS algorithm. Plainly speaking, both do link analysis. How do they do it? Keep reading.

PageRank algorithm

To explain what the PageRank algorithm is for, we have to start from search engines, because the origin of the PageRank algorithm is directly tied to them.

Search Engine

The earliest search engines were built around two core steps. Step 1: build a large resource database. Step 2: build an index library pointing into that information. Then comes the user's search operation. How is the lookup done? The method most people would think of is keyword matching: for example, if I enter the keyword "Zhang San", the engine searches the resources for documents containing the words "Zhang San". Under pure keyword matching, the more often "Zhang San" occurs in an article, the more likely that article is taken to be the query target. (A fairer measure is the number of occurrences divided by the article's total length; a ratio is obviously more impartial.) It sounds reasonable on reflection. All right, keep going.
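As a minimal sketch (my own illustration, not code from the original post; the class and method names are made up), a keyword-ratio scorer of this kind could look like:

import java.util.Arrays;

public class KeywordScorer {

    // Score a document by keyword frequency ratio:
    // occurrences of the keyword divided by the total word count.
    public static double score(String document, String keyword) {
        String[] words = document.split("\\s+");
        long hits = Arrays.stream(words)
                          .filter(w -> w.equalsIgnoreCase(keyword))
                          .count();
        return words.length == 0 ? 0.0 : (double) hits / words.length;
    }

    public static void main(String[] args) {
        // "Zhang" appears 2 times among 8 words, so the score is 0.25
        System.out.println(score("Zhang San met Zhang San and Li Si", "Zhang"));
    }
}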

Term spam attack

Since we now know the core principle of the search, if I want my web page to appear higher in the search results, I just have to stuff the page with the corresponding keywords. For example, write "Zhang San" 10,000 times inside a hidden div in the HTML, so the rendered page is unaffected, and my goal is achieved. This is a term spam attack.

PageRank algorithm principle

Since the keyword matching approach is so vulnerable to attack, what is a better way? This is where the famous PageRank algorithm appeared: a new page ranking/importance algorithm, first devised by Google's founders. The PageRank algorithm abandons keywords entirely. Each page has its own PageRank value, which represents the importance of the page; the higher the PR value, the higher the final position. How is the importance of a page measured? The answer is: by the other pages that link to it. In a word, the more pages link to your content, the more famous your page is. A page's PR value is computed from the PR values of the pages linking to it, and a simple calculation goes as follows:

Suppose a collection consists of only 4 pages: A, B, C, and D. If the other pages all link only to A, then A's PR (PageRank) value is the sum of the PR values of B, C, and D.

Continuing, assume that B also links to C, and that D links to all 3 other pages, including A. A page cannot vote twice, so B gives each of its targets half a vote. By the same logic, D casts only one third of its vote for A's PageRank.

In other words, each outgoing link carries the page's PR value divided by the page's total number of outgoing links.

Putting these rules together gives the concrete computational process of PageRank.
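Concretely, for the four-page example above (a worked equation added here; it follows directly from the rules just stated), B splits its vote between A and C, C gives its whole vote to A, and D gives A one third:

    PR(A) = \frac{PR(B)}{2} + \frac{PR(C)}{1} + \frac{PR(D)}{3}

In general, a page u receives from every page v that links to it the share PR(v)/L(v), where L(v) is v's number of outgoing links and B_u is the set of pages linking to u:

    PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}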

All of this assumes the pages contain links. But a page may have no links at all, and in that case the surfer is taken to jump to any page with equal probability. So the final formula turns out this way:

    PR(p_i) = q \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)} + (1 - q)

where M(p_i) is the set of pages that link to p_i and L(p_j) is the number of outgoing links of page p_j. q is called the damping coefficient.

The calculation process of PageRank

The computational process of PageRank is not really complicated; its matrix expression is as follows:

    R' = q \, P^T R + \frac{1-q}{N} \, \mathbf{e}

where R is the vector of the pages' PR values, e is the all-ones vector, and N is the total number of pages; the 1 − q of the per-page formula has become (1 − q)/N here. The algorithm is actually an application of the power method: the multiplication is repeated until the computed values converge, and then it ends.

Assume the matrix A = q × P^T + ((1 − q)/N) × E, where E is the all-ones matrix and P is the link probability matrix, which expresses the link relationships as probabilities. P is obtained from the 0/1 link matrix, whose entry [i][j] indicates that page i has a link to page j, by dividing each row by that row's number of links. The conversion is as follows:

[Figure 2: the web link matrix; Figure 3: the web link probability matrix P; Figure 4: the transposed matrix P^T]

Why transpose the matrix? Originally A[i][j] represented a link from page i to page j; after the transpose it represents the probability of a link from page j to page i. The key thing is to remember this. Finally A is computed; you can understand it as a web link probability matrix, and at the end you only need to multiply it by the pages' PR values.
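The original figures are not reproduced here, so as an illustration (using the sample link data given later: 1→2, 1→3, 2→3, 3→1), the link matrix L, the probability matrix P, and its transpose P^T would be:

    L = \begin{pmatrix} 0 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \quad
    P = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}, \quad
    P^T = \begin{pmatrix} 0 & 0 & 1 \\ 1/2 & 0 & 0 \\ 1/2 & 1 & 0 \end{pmatrix}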

Now initialize the vector R = [1, 1, 1], representing the initial PR values of the pages. Multiplying by the matrix A, the first new PR value is R'[0] = A[0][0]×R[0] + A[0][1]×R[1] + A[0][2]×R[2], and because A[i][j] expresses the probability of a link from page j to page i, this is exactly the core principle described above. The new vector R is then multiplied by the probability matrix again, iterating until the values converge.
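As a worked check (my own arithmetic, using q = 0.5 and the sample matrices above, so A = 0.5 P^T + (0.5/3) E), the first iteration gives

    R' = \begin{pmatrix} 1/6 & 1/6 & 2/3 \\ 5/12 & 1/6 & 1/6 \\ 5/12 & 2/3 & 1/6 \end{pmatrix}
         \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix}
       = \begin{pmatrix} 1.0 \\ 0.75 \\ 1.25 \end{pmatrix}

which matches the first block of the program output shown at the end.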

PageRank Summary

The computational process of PageRank has been cleverly shifted to the calculation of matrices, making the process very streamlined.

Link spam attack

Knowing the principle of the PageRank algorithm, namely that ranking rests on the number of links, if I want my own page ranked near the front, I just need a lot of pages linking to me; academically this is called a link spam attack. But there is a catch: PR values are relative. A page's PR value depends on the PR values of the pages pointing to it, and if those are not high, the target page's PR will not be high either. So simply creating a pile of zombie pages that all point to my landing page will not push its PR up. That is why the more common trick is to put links on portal sites, major forums, or the comment sections of sites like Sina's news center, planting links there that point at the target. Direct countermeasures against this kind of cheating are currently weak; what is used instead is TrustRank, a trust-based ranking detection: first pick a set of trusted pages as a reference, then compute the PR value of your page. If your page itself is ordinary but its PR value is exceptionally high, then there is a good chance your page is problematic.

HITS

The HITS algorithm is also a link analysis algorithm, and in some respects it is quite similar to the PageRank algorithm; the two are often compared. One obvious difference is that HITS processes a small-scale collection of pages, and it is query-dependent: first a query q is entered, and assuming the retrieval system returns n pages, the HITS algorithm takes, say, 200 of them (an assumed value) as sample data for analysis and returns the more valuable pages among them.

HITS algorithm principle

HITS measures each page i with two values, a[i] and h[i]: a represents the authority value, and h represents the hub value.

The main idea is: the higher the authority values of the pages that page i points to, the larger i's hub value; and the larger the hub values of the pages that point to i, the higher i's authority value. The two quantities reinforce each other. The following picture makes it straightforward:


[Figure 3: hub and authority weight calculation]
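In formulas (a standard statement of the HITS update, matching the loop in the implementation below), one round of the mutual reinforcement before normalization is

    a_i = \sum_{j \,:\, j \to i} h_j, \qquad h_i = \sum_{j \,:\, i \to j} a_j

after which both vectors are normalized, here by dividing each by its maximum component.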

If you understand the principle of the PageRank algorithm, understanding HITS should be easy. The final output ranks the pages by their authority value from high to low.

HITS algorithm description

For the specifics, compare with the program I wrote below.

HITS summary

Thinking from the angle of link anti-cheating, HITS is more vulnerable to a link spam attack; after all, with so few pages in the sample, the probability of the result being skewed is larger.

Implementations of the PageRank algorithm and the HITS algorithm

Finally, I implemented both algorithms myself. The input data is the same text file (each record means that a link exists from page i to page j):

1 2
1 3
2 3
3 1
The algorithms are not too difficult:

package DataMining_PageRank;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.text.MessageFormat;
import java.util.ArrayList;

/**
 * PageRank page ranking algorithm tool class
 *
 * @author lyq
 */
public class PageRankTool {
    // test input data file path
    private String filePath;
    // total number of pages
    private int pageNum;
    // link relationship matrix
    private double[][] linkMatrix;
    // PageRank value vector, one entry per page
    private double[] pageRankVector;
    // page labels
    private ArrayList<String> pageClass;

    public PageRankTool(String filePath) {
        this.filePath = filePath;
        readDataFile();
    }

    /**
     * Read the link data from the file
     */
    private void readDataFile() {
        File file = new File(filePath);
        ArrayList<String[]> dataArray = new ArrayList<String[]>();

        try {
            BufferedReader in = new BufferedReader(new FileReader(file));
            String str;
            String[] tempArray;
            while ((str = in.readLine()) != null) {
                tempArray = str.split(" ");
                dataArray.add(tempArray);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        pageClass = new ArrayList<>();
        // collect the distinct page labels
        for (String[] array : dataArray) {
            for (String s : array) {
                if (!pageClass.contains(s)) {
                    pageClass.add(s);
                }
            }
        }

        int i;
        int j;
        pageNum = pageClass.size();
        linkMatrix = new double[pageNum][pageNum];
        pageRankVector = new double[pageNum];
        for (int k = 0; k < pageNum; k++) {
            // the initial PageRank value of every page is 1
            pageRankVector[k] = 1.0;
        }

        for (String[] array : dataArray) {
            i = Integer.parseInt(array[0]);
            j = Integer.parseInt(array[1]);
            // set linkMatrix[i-1][j-1] to 1: page i contains a link to page j
            linkMatrix[i - 1][j - 1] = 1;
        }
    }

    /**
     * Turn the link matrix into the transposed probability matrix
     */
    private void transferMatrix() {
        int count;
        for (double[] array : linkMatrix) {
            // count the outgoing links of this page
            count = 0;
            for (double d : array) {
                if (d == 1) {
                    count++;
                }
            }
            // divide each link by the out-degree to get a probability
            for (int i = 0; i < array.length; i++) {
                if (array[i] == 1) {
                    array[i] /= count;
                }
            }
        }

        double t;
        // transpose the matrix to obtain the probability transfer matrix
        for (int i = 0; i < linkMatrix.length; i++) {
            for (int j = i + 1; j < linkMatrix[0].length; j++) {
                t = linkMatrix[i][j];
                linkMatrix[i][j] = linkMatrix[j][i];
                linkMatrix[j][i] = t;
            }
        }
    }

    /**
     * Compute the PageRank values by the power method and print them
     */
    public void printPageRankValue() {
        transferMatrix();
        // damping factor
        double damp = 0.5;
        // link probability matrix
        double[][] A = new double[pageNum][pageNum];
        double[][] e = new double[pageNum][pageNum];

        // apply the formula A = d*P^T + (1-d)*E/m, where m is the total
        // page count and d is the damping factor
        double temp = (1 - damp) / pageNum;
        for (int i = 0; i < e.length; i++) {
            for (int j = 0; j < e[0].length; j++) {
                e[i][j] = temp;
            }
        }

        for (int i = 0; i < pageNum; i++) {
            for (int j = 0; j < pageNum; j++) {
                A[i][j] = damp * linkMatrix[i][j] + e[i][j];
            }
        }

        // error value, used as the convergence criterion
        double errorValue = Integer.MAX_VALUE;
        double[] newPRVector = new double[pageNum];
        // convergence is reached when the average PR error is below 0.001
        while (errorValue > 0.001 * pageNum) {
            System.out.println("**********");
            for (int i = 0; i < pageNum; i++) {
                temp = 0;
                // multiply A by pageRankVector; by the power method this
                // converges to the final PageRank values
                for (int j = 0; j < pageNum; j++) {
                    temp += A[i][j] * pageRankVector[j];
                }
                // temp is the new total PageRank value of page i
                newPRVector[i] = temp;
                System.out.println(temp);
            }

            errorValue = 0;
            for (int i = 0; i < pageNum; i++) {
                errorValue += Math.abs(pageRankVector[i] - newPRVector[i]);
                // replace the old vector with the new one
                pageRankVector[i] = newPRVector[i];
            }
        }

        String name = null;
        temp = 0;
        System.out.println("--------------------");
        for (int i = 0; i < pageNum; i++) {
            System.out.println(MessageFormat.format(
                    "PageRank value of page {0}: {1}", pageClass.get(i),
                    pageRankVector[i]));
            if (pageRankVector[i] > temp) {
                temp = pageRankVector[i];
                name = pageClass.get(i);
            }
        }
        System.out.println(MessageFormat.format("Highest ranked page: {0}", name));
    }
}
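The excerpt does not include a test client; a minimal hypothetical driver (the file name is a placeholder) would be:

package DataMining_PageRank;

public class Client {
    public static void main(String[] args) {
        // placeholder path to the link data file shown above
        String filePath = "input.txt";
        PageRankTool tool = new PageRankTool(filePath);
        tool.printPageRankValue();
    }
}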
The following is the implementation of the HITS algorithm:

package DataMining_HITS;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

/**
 * HITS link analysis algorithm tool class
 *
 * @author lyq
 */
public class HITSTool {
    // input data file path
    private String filePath;
    // page count
    private int pageNum;
    // authority value of each page
    private double[] authority;
    // hub value of each page
    private double[] hub;
    // link relationship matrix
    private int[][] linkMatrix;
    // page labels
    private ArrayList<String> pageClass;

    public HITSTool(String filePath) {
        this.filePath = filePath;
        readDataFile();
    }

    /**
     * Read the link data from the file
     */
    private void readDataFile() {
        File file = new File(filePath);
        ArrayList<String[]> dataArray = new ArrayList<String[]>();

        try {
            BufferedReader in = new BufferedReader(new FileReader(file));
            String str;
            String[] tempArray;
            while ((str = in.readLine()) != null) {
                tempArray = str.split(" ");
                dataArray.add(tempArray);
            }
            in.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        pageClass = new ArrayList<>();
        // collect the distinct page labels
        for (String[] array : dataArray) {
            for (String s : array) {
                if (!pageClass.contains(s)) {
                    pageClass.add(s);
                }
            }
        }

        int i;
        int j;
        pageNum = pageClass.size();
        linkMatrix = new int[pageNum][pageNum];
        authority = new double[pageNum];
        hub = new double[pageNum];
        for (int k = 0; k < pageNum; k++) {
            // the initial authority value and hub value are both 1
            authority[k] = 1;
            hub[k] = 1;
        }

        for (String[] array : dataArray) {
            i = Integer.parseInt(array[0]);
            j = Integer.parseInt(array[1]);
            // set linkMatrix[i-1][j-1] to 1: page i contains a link to page j
            linkMatrix[i - 1][j - 1] = 1;
        }
    }

    /**
     * Output the result page, i.e. the page with the highest authority value
     */
    public void printResultPage() {
        // maximum hub and authority values, used for normalization
        double maxHub = 0;
        double maxAuthority = 0;
        int maxAuthorityIndex = 0;
        // error value, used as the convergence criterion
        double error = Integer.MAX_VALUE;
        double[] newHub = new double[pageNum];
        double[] newAuthority = new double[pageNum];

        while (error > 0.01 * pageNum) {
            for (int k = 0; k < pageNum; k++) {
                newHub[k] = 0;
                newAuthority[k] = 0;
            }

            // mutual update of the hub and authority values
            for (int i = 0; i < pageNum; i++) {
                for (int j = 0; j < pageNum; j++) {
                    if (linkMatrix[i][j] == 1) {
                        newHub[i] += authority[j];
                        newAuthority[j] += hub[i];
                    }
                }
            }

            maxHub = 0;
            maxAuthority = 0;
            for (int k = 0; k < pageNum; k++) {
                if (newHub[k] > maxHub) {
                    maxHub = newHub[k];
                }
                if (newAuthority[k] > maxAuthority) {
                    maxAuthority = newAuthority[k];
                    maxAuthorityIndex = k;
                }
            }

            error = 0;
            // normalization
            for (int k = 0; k < pageNum; k++) {
                newHub[k] /= maxHub;
                newAuthority[k] /= maxAuthority;
                error += Math.abs(newHub[k] - hub[k]);
                System.out.println(newAuthority[k] + ":" + newHub[k]);
                hub[k] = newHub[k];
                authority[k] = newAuthority[k];
            }
            System.out.println("---------");
        }

        System.out.println("****Authority and hub values after final convergence****");
        for (int k = 0; k < pageNum; k++) {
            System.out.println("Page " + pageClass.get(k) + ": "
                    + authority[k] + ":" + hub[k]);
        }
        System.out.println("Page with the highest authority: page "
                + pageClass.get(maxAuthorityIndex));
    }
}
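Likewise, a minimal hypothetical driver for the HITS tool (again with a placeholder file name):

package DataMining_HITS;

public class Client {
    public static void main(String[] args) {
        // placeholder path to the same link data file
        String filePath = "input.txt";
        HITSTool tool = new HITSTool(filePath);
        tool.printResultPage();
    }
}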
The outputs of the two programs are as follows:

PageRank algorithm:

**********
1.0
0.7499999999999999
1.25
**********
1.125
0.75
1.1249999999999998
**********
1.0624999999999998
0.78125
1.15625
**********
1.078125
0.7656249999999998
1.1562499999999998
**********
1.0781249999999998
0.7695312499999998
1.1523437499999998
**********
1.0761718749999998
0.7695312499999998
1.1542968749999996
**********
1.0771484374999996
0.7690429687499997
1.1538085937499996
--------------------
PageRank value of page 1: 1.077
PageRank value of page 2: 0.769
PageRank value of page 3: 1.154
Highest ranked page: 3
HITS algorithm:

0.5:1.0
0.5:0.5
1.0:0.5
---------
0.3333333333333333:1.0
0.6666666666666666:0.6666666666666666
1.0:0.3333333333333333
---------
0.2:1.0
0.6000000000000001:0.6000000000000001
1.0:0.2
---------
0.125:1.0
0.625:0.625
1.0:0.125
---------
0.07692307692307693:1.0
0.6153846153846154:0.6153846153846154
1.0:0.07692307692307693
---------
0.04761904761904762:1.0
0.6190476190476191:0.6190476190476191
1.0:0.04761904761904762
---------
0.029411764705882356:1.0
0.6176470588235294:0.6176470588235294
1.0:0.029411764705882356
---------
****Authority and hub values after final convergence****
Page 1: 0.029411764705882356:1.0
Page 2: 0.6176470588235294:0.6176470588235294
Page 3: 1.0:0.029411764705882356
Page with the highest authority: page 3
Both results rank page 3 highest.

