Google PageRank algorithm


1. Google PageRank algorithm

1.1 PageRank Concept
In the early days of the Internet, search engines ranked web pages by the number of occurrences of the search terms on the page, adjusting the weight using page length and the importance hints of HTML tags. Link popularity instead determines the importance of the current page from the other documents that link to it (its inbound links); ranking this way makes it much harder to fool a search engine with artificially crafted pages.

PageRank computes the importance of each page and assigns a different weight to each inbound link: the more important the linking page, the more weight its link carries. The importance of the current page is thus determined by the importance of the pages that link to it.

1.2 PageRank algorithm 1

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where:
PR(A): the PageRank of page A,
PR(Ti): the PageRank of page Ti, which links to page A,
C(Ti): the number of outbound links on page Ti,
d: the damping factor, with a value between 0 and 1.

Several observations follow: 1) the algorithm ranks individual pages, not whole sites; 2) the PageRank of a page is determined by the PageRank of the pages that link to it, but each link contributes differently: the more outbound links page Ti has, the less it contributes to the current page, and the more highly ranked pages link to A, the higher A's PageRank; 3) the damping factor reduces the contribution of other pages to the ranking of the current page.


1.3 Random surfer model
Lawrence Page and Sergey Brin proposed the random surfer model of user behavior to explain the algorithm above. They model a user's clicking on links as random behavior that ignores content: the probability that the user clicks a particular link on a page is determined by the number of links on that page, which is where the term PR(Ti)/C(Ti) comes from. The probability of reaching a page by random surfing is therefore the sum of the probabilities of clicking the links that point to it. The damping factor d is introduced because a user will not click links forever; out of boredom, they will eventually jump to a random page. d can be regarded as the probability that the user continues clicking links, and (1-d) is the constant contribution every page receives from these random jumps.
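The random surfer model can be simulated directly. The following minimal sketch (an illustration, not Google's implementation) walks a surfer over the three-page example used later in this article, assuming the link structure A→B, A→C, B→C, C→A; the visit frequencies approximate the normalized PageRanks of algorithm 2 below.

```python
import random

def simulate(links, d=0.5, steps=500_000, seed=42):
    """Random surfer: with probability d follow a random link on the
    current page, otherwise jump to a page chosen uniformly at random."""
    rng = random.Random(seed)
    pages = list(links)
    visits = dict.fromkeys(pages, 0)
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if links[page] and rng.random() < d:
            page = rng.choice(links[page])
        else:
            page = rng.choice(pages)
    return {p: n / steps for p, n in visits.items()}

# Three pages with links A->B, A->C, B->C, C->A (assumed structure)
freq = simulate({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
# The frequencies approach 14/39, 10/39 and 15/39, i.e. the algorithm-1
# values 14/13, 10/13, 15/13 divided by the number of pages N = 3.
```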

1.4 PageRank algorithm 2 (revision of algorithm 1)

PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where N is the total number of web pages on the Internet.

With this formula, the PageRanks of all pages form a probability distribution: they sum to 1, and a page's PageRank is the expected probability that a random surfer visits it. In algorithm 1, by contrast, each value is this probability multiplied by the total number of pages.
The explanations that follow are all based on algorithm 1, which is simpler to compute because N does not appear.


1.5 PageRank features
Under algorithm 1, the sum of all PageRanks equals the total number of pages on the Internet. When the number of pages is small, the system of PageRank equations can be solved directly; for the hundreds of millions of pages on the real web, solving the system exactly is impossible.

Consider three pages A, B and C with links A→B, A→C, B→C and C→A (the structure implied by the equations below). The damping factor is set to 0.5 here, although Lawrence Page and Sergey Brin actually used 0.85.

PR(A) = 0.5 + 0.5 PR(C)
PR(B) = 0.5 + 0.5 (PR(A)/2)
PR(C) = 0.5 + 0.5 (PR(A)/2 + PR(B))

Solving:
PR(A) = 14/13 = 1.07692308
PR(B) = 10/13 = 0.76923077
PR(C) = 15/13 = 1.15384615

and:
PR(A) + PR(B) + PR(C) = 3
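The small system above can be checked with exact rational arithmetic. The sketch below (assuming the link structure A→B, A→C, B→C, C→A) substitutes PR(B) and PR(C) into the PR(A) equation and solves for PR(A):

```python
from fractions import Fraction

d = Fraction(1, 2)
# PR(A) = (1-d) + d*PR(C)
# PR(B) = (1-d) + d*PR(A)/2
# PR(C) = (1-d) + d*(PR(A)/2 + PR(B))
# Substituting PR(B) and PR(C) into the PR(A) equation and collecting
# the PR(A) terms gives:
#   PR(A) * (1 - d**2/2 - d**3/2) = (1-d) * (1 + d + d**2)
pr_a = (1 - d) * (1 + d + d**2) / (1 - d**2 / 2 - d**3 / 2)
pr_b = (1 - d) + d * pr_a / 2
pr_c = (1 - d) + d * (pr_a / 2 + pr_b)

print(pr_a, pr_b, pr_c)    # 14/13 10/13 15/13
print(pr_a + pr_b + pr_c)  # 3
```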

1.6 Iterative PageRank
Google instead computes PageRank by approximate iteration: every page is given an initial value, and the formula above is applied repeatedly until the values approach the true PageRanks. According to the papers published by Lawrence Page and Sergey Brin, about 100 iterations are needed to obtain satisfactory values for the entire web; the small example here needs only about a dozen. During the iteration, the sum of the PageRanks converges to the total number of pages, so the average PageRank of a page is 1, and each individual value lies between (1-d) and (dN + (1-d)).

Iteration   PR(A)         PR(B)         PR(C)
0           1             1             1
1           1             0.75          1.125
2           1.0625        0.765625      1.1484375
3           1.07421875    0.76855469    1.15283203
4           1.07641602    0.76910400    1.15365601
5           1.07682800    0.76920700    1.15381050
6           1.07690525    0.76922631    1.15383947
7           1.07691973    0.76922993    1.15384490
8           1.07692245    0.76923061    1.15384592
9           1.07692296    0.76923074    1.15384611
10          1.07692305    0.76923076    1.15384615
11          1.07692307    0.76923077    1.15384615
12          1.07692308    0.76923077    1.15384615
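The iteration in the table can be reproduced in a few lines. Note that it updates the values in place (Gauss-Seidel style), which is what the table's numbers suggest:

```python
d = 0.5
pr = {"A": 1.0, "B": 1.0, "C": 1.0}  # iteration 0: every page starts at 1
for i in range(1, 13):
    # Same equations as before: links A->B, A->C, B->C, C->A
    pr["A"] = (1 - d) + d * pr["C"]
    pr["B"] = (1 - d) + d * pr["A"] / 2
    pr["C"] = (1 - d) + d * (pr["A"] / 2 + pr["B"])
    print(i, round(pr["A"], 8), round(pr["B"], 8), round(pr["C"], 8))
```

After about a dozen iterations the values agree with the exact solution 14/13, 10/13, 15/13 to eight decimal places.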

1.7 PageRank in the Google search engine
The ranking of a page in search results is determined by three factors: page-specific factors, the anchor text of inbound links, and PageRank.
Page-specific factors include the content, the title, and the URL of the page.
To produce search results, Google first computes an IR (information retrieval) score from the page-specific factors and the anchor text of inbound links, weighted by the position and prominence of the search terms on the page; this score measures the relevance of the page to the query. The IR score is then combined with PageRank as an indicator of the overall importance of the page. The two values can be combined in many ways, but clearly they cannot simply be added.
Because PageRank only has a significant effect on queries for single, unspecific words, the content-relevance score has a greater influence on queries composed of multiple search terms.

1.8 Using the Google Toolbar to display the PageRank of the current page
Google Toolbar is an Internet Explorer plug-in developed by Google, available for download from Google. Note that the PageRank display is an advanced feature: when it is enabled, user information is collected automatically and the toolbar upgrades itself automatically.
The PageRank shown in the toolbar is divided into 11 levels, from 0 to 10. Since actual PageRanks can range up to roughly dN + (1-d), the displayed level is presumably the logarithm of the actual value; with d = 0.85, the base of the logarithm is estimated to be between 6 and 7.

Google's directory service also displays page rank, divided into 7 levels; comparisons of the two scales have been published.

1.9 Impact of inbound links on PageRank
An inbound link always increases the PageRank of the current page, and the contribution is larger when the current page and the pages below it form a loop. For example (in the original figure, pages A, B, C and D form the loop A→B→C→D→A, and an external page X links to A), set the initial PageRank of each page to 1, the damping factor to 0.5, and PR(X)/C(X) = 10. It is easy to calculate:

PR(A) = 19/3 = 6.33
PR(B) = 11/3 = 3.67
PR(C) = 7/3 = 2.33
PR(D) = 5/3 = 1.67

If A were not part of a loop, the inbound link would contribute only 0.5 × 10 = 5 to its PageRank.
The larger the damping factor, the larger the gain in PageRank, and the more of that gain is received across the whole loop (that is, the benefit of the inbound link is spread more evenly over the loop pages). For the example above, changing the damping factor to 0.75 gives:

PR(A) = 419/35 = 11.97
PR(B) = 323/35 = 9.23
PR(C) = 251/35 = 7.17
PR(D) = 197/35 = 5.63

Besides the obvious increase in the PageRank of every page on the loop, the ratio PR(A)/PR(D) is noticeably reduced.
The total gain in PageRank, summed over all pages on the loop reached by the inbound link, is given by the following formula:

(d/(1-d)) × (PR(X)/C(X))

This formula can be derived straightforwardly.
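The formula can be checked numerically. The sketch below iterates the four-page loop with the inbound contribution PR(X)/C(X) = 10 and compares the total gain over the loop, relative to a bare loop where every page has PageRank 1, with (d/(1-d)) × (PR(X)/C(X)):

```python
def loop_pagerank(d, inbound=10.0, n=4, iters=500):
    """Pages 0..n-1 form a cycle; page 0 also receives `inbound` = PR(X)/C(X)."""
    pr = [1.0] * n
    for _ in range(iters):
        pr[0] = (1 - d) + d * (inbound + pr[-1])
        for i in range(1, n):
            pr[i] = (1 - d) + d * pr[i - 1]
    return pr

for d in (0.5, 0.75):
    gain = sum(loop_pagerank(d)) - 4.0  # a bare 4-page loop sums to 4
    # The two printed values coincide (up to float rounding):
    print(d, gain, d / (1 - d) * 10)
```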
1.10 Impact of outbound links on PageRank
Adding an outbound link does not change the total PageRank of the web, but the PageRank lost by one site equals the PageRank gained by the sites it links to. For two otherwise closed sites, when one links to the other, the loss and the gain both equal (d/(1-d)) × (PR(X)/C(X)); if the two sites link to each other, this value is reduced. The random surfer model explains the phenomenon: adding outbound links lowers the probability that the surfer visits the site's own pages. For example (two two-page sites {A, B} and {C, D}, with A additionally linking to C), with the damping factor set to 0.75:

PR(A) = 0.25 + 0.75 PR(B)
PR(B) = 0.25 + 0.375 PR(A)
PR(C) = 0.25 + 0.75 PR(D) + 0.375 PR(A)
PR(D) = 0.25 + 0.75 PR(C)

Solving:
PR(A) = 14/23
PR(B) = 11/23
PR(C) = 35/23
PR(D) = 32/23
PR(A) + PR(B) = 25/23
PR(C) + PR(D) = 67/23
PR(A) + PR(B) + PR(C) + PR(D) = 92/23 = 4

Page and Brin call a link that points to a page with no outbound links a dangling link. Dangling links have a negative effect on PageRank computation. For example (A and B link to each other, A also links to C, and C has no outbound links), with the damping factor at 0.75:

PR(A) = 0.25 + 0.75 PR(B)
PR(B) = 0.25 + 0.375 PR(A)
PR(C) = 0.25 + 0.375 PR(A)

Solving:
PR(A) = 14/23
PR(B) = 11/23
PR(C) = 11/23
PR(A) + PR(B) + PR(C) = 36/23 < 3

According to Page and Brin, Google's index contains a large number of dangling links, mainly because of pages excluded by robots.txt and because some indexed file types, such as PDF, contain no links. To eliminate the negative effect, Google removes dangling links from the database before computing PageRank, then calculates the PageRank flowing through them separately afterwards. PDF files can therefore still be published online with confidence.
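The removal step can be sketched as follows: dangling pages are deleted repeatedly (removing one dangling page may create new ones) until none remain, PageRank is computed on the remainder, and the dangling pages are folded back in afterwards. The sketch below shows only the removal step; the function name and data shape are illustrative, not Google's actual implementation:

```python
def drop_dangling(links):
    """Iteratively remove pages without outbound links, and links to them."""
    links = {p: list(targets) for p, targets in links.items()}
    while True:
        dangling = {p for p, targets in links.items() if not targets}
        if not dangling:
            return links
        links = {p: [t for t in targets if t not in dangling]
                 for p, targets in links.items() if p not in dangling}

# C is dangling (e.g. a PDF); after removal only the A<->B loop remains.
print(drop_dangling({"A": ["B"], "B": ["A", "C"], "C": []}))
# -> {'A': ['B'], 'B': ['A']}
```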

1.11 Impact of the number of pages
First look at an example (an external page X links to A; A links to B and C, which both link back to A). The damping factor is 0.75 and PR(X)/C(X) = 10:

PR(A) = 0.25 + 0.75 (10 + PR(B) + PR(C))
PR(B) = PR(C) = 0.25 + 0.75 (PR(A)/2)

Solving:
PR(A) = 260/14
PR(B) = 101/14
PR(C) = 101/14
PR(A) + PR(B) + PR(C) = 33

Adding page D (A now links to B, C and D, all of which link back to A):

PR(A) = 0.25 + 0.75 (10 + PR(B) + PR(C) + PR(D))
PR(B) = PR(C) = PR(D) = 0.25 + 0.75 (PR(A)/3)

Solving:
PR(A) = 266/14
PR(B) = 70/14
PR(C) = 70/14
PR(D) = 70/14
PR(A) + PR(B) + PR(C) + PR(D) = 34

After the page is added, the sum of all PageRanks increases by 1; A increases slightly, while B and C decrease significantly.
Now consider the second example (originally shown in a figure): X links to A, and A→B→C→A form a loop.

PR(A) = 0.25 + 0.75 (10 + PR(C))
PR(B) = 0.25 + 0.75 PR(A)
PR(C) = 0.25 + 0.75 PR(B)

Solving:
PR(A) = 517/37 = 13.97
PR(B) = 397/37 = 10.73
PR(C) = 307/37 = 8.30

Adding page D at the end of the chain (the loop becomes A→B→C→D→A):

PR(A) = 0.25 + 0.75 (10 + PR(D))
PR(B) = 0.25 + 0.75 PR(A)
PR(C) = 0.25 + 0.75 PR(B)
PR(D) = 0.25 + 0.75 PR(C)

Solving:
PR(A) = 419/35 = 11.97
PR(B) = 323/35 = 9.23
PR(C) = 251/35 = 7.17
PR(D) = 197/35 = 5.63

After the page is added, the sum of all PageRanks again increases by 1, but the PageRank of each individual page decreases, because the new page takes a share of the value produced by the inbound link. This result shows that adding pages lowers the PageRank of existing pages, revealing that Google's algorithm favors small sites. Of course, a large site can still raise its PageRank by attracting inbound links from other sites with its rich content.

1.12 Distributing PageRank for search engine optimization
First look at two examples. The damping factor is 0.5 and PR(X)/C(X) = 10 (X links to A; A links to B and C, which both link back to A).

When there is no link between B and C:

PR(A) = 0.5 + 0.5 (10 + PR(B) + PR(C))
PR(B) = 0.5 + 0.5 (PR(A)/2)
PR(C) = 0.5 + 0.5 (PR(A)/2)

Solving:
PR(A) = 8
PR(B) = 2.5
PR(C) = 2.5

When B and C link to each other:

PR(A) = 0.5 + 0.5 (10 + PR(B)/2 + PR(C)/2)
PR(B) = 0.5 + 0.5 (PR(A)/2 + PR(C)/2)
PR(C) = 0.5 + 0.5 (PR(A)/2 + PR(B)/2)

Solving:
PR(A) = 7
PR(B) = 3
PR(C) = 3

Although linking B and C to each other lowers the PageRank of A, it raises B and C. This fits the optimization strategy of improving all pages of a site rather than just the home page: only when every page's rank is improved can those pages rank well when a keyword hits them. The method is straightforward: distribute the contribution of inbound links as evenly as possible among all pages, and add links to low-ranked pages.
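Both variants above can be verified by iteration. The helper below (names are illustrative) simply runs the update equations to convergence:

```python
def solve(update, pages="ABC", iters=300):
    """Iterate an in-place PageRank update until it converges."""
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        update(pr)
    return pr

d, x = 0.5, 10.0  # damping factor and inbound contribution PR(X)/C(X)

def separate(pr):       # B and C do not link to each other
    pr["A"] = (1 - d) + d * (x + pr["B"] + pr["C"])
    pr["B"] = (1 - d) + d * pr["A"] / 2
    pr["C"] = (1 - d) + d * pr["A"] / 2

def cross_linked(pr):   # B and C link to each other
    pr["A"] = (1 - d) + d * (x + pr["B"] / 2 + pr["C"] / 2)
    pr["B"] = (1 - d) + d * (pr["A"] / 2 + pr["C"] / 2)
    pr["C"] = (1 - d) + d * (pr["A"] / 2 + pr["B"] / 2)

print(solve(separate))      # A≈8,  B=C≈2.5
print(solve(cross_linked))  # A≈7,  B=C≈3
```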

As long as usability is not affected, concentrating all outbound links on one or a few low-ranked pages effectively reduces their negative impact on PageRank. Consider the example (A links to B, C and D, which link back to A; the site also carries three outbound links to external pages), with the damping factor at 0.5.

When B, C and D each carry one outbound link:

PR(A) = 0.5 + 0.5 (PR(B)/2 + PR(C)/2 + PR(D)/2)
PR(B) = PR(C) = PR(D) = 0.5 + 0.5 (PR(A)/3)

Solving:
PR(A) = 1
PR(B) = 2/3
PR(C) = 2/3
PR(D) = 2/3

When the outbound links are concentrated on D:

PR(A) = 0.5 + 0.5 (PR(B) + PR(C) + PR(D)/4)
PR(B) = PR(C) = PR(D) = 0.5 + 0.5 (PR(A)/3)

Solving:
PR(A) = 17/13
PR(B) = 28/39
PR(C) = 28/39
PR(D) = 28/39

As the results show, the PageRank of every one of the pages A, B, C and D increases after the outbound links are concentrated.
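This comparison, too, can be checked by iteration (a sketch under the link structure assumed above; function names are illustrative):

```python
def solve(update, pages="ABCD", iters=300):
    """Iterate an in-place PageRank update until it converges."""
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        update(pr)
    return pr

d = 0.5

def spread_links(pr):    # B, C and D each carry one external outbound link
    pr["A"] = (1 - d) + d * (pr["B"] / 2 + pr["C"] / 2 + pr["D"] / 2)
    for p in "BCD":
        pr[p] = (1 - d) + d * pr["A"] / 3

def concentrated(pr):    # all three external links moved onto page D
    pr["A"] = (1 - d) + d * (pr["B"] + pr["C"] + pr["D"] / 4)
    for p in "BCD":
        pr[p] = (1 - d) + d * pr["A"] / 3

print(solve(spread_links))  # A≈1,          B=C=D≈0.667 (2/3)
print(solve(concentrated))  # A≈1.308 (17/13), B=C=D≈0.718 (28/39)
```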

1.13 Other factors affecting PageRank
Since Lawrence Page and Sergey Brin published their papers on PageRank, no factors beyond the link structure of the web have been widely confirmed to enter the PageRank algorithm. Lawrence Page himself, however, lists several potential factors in the PageRank patent description: the visibility of a link, the position of a link within the document, the distance between the web pages, the importance of the linking page, and the up-to-dateness of the linking page. With these additions, the random surfer model simulates human browsing behavior more closely.
Whether these additional factors are used in the actual computation of PageRank, and how they could be implemented, remains open.
First, the algorithm's formula must be extended:

PR(A) = (1-d) + d (PR(T1) × L(T1, A) + ... + PR(Tn) × L(Tn, A))

Here, L(Ti, A) is the evaluation value of the inbound link from Ti to A. It is composed of several factors and needs to be computed only once, before the iteration begins, which reduces the number of database queries; the factors would otherwise have to be queried again on every iteration.

In the patent description, Lawrence Page names two factors for link evaluation: the visibility of the link and its position in the document. The link evaluation replaces PR(Ti)/C(Ti) and expresses that the probability of each link on a page being clicked differs.
In the example, each link carries two attribute values: X for visibility, which is 1 if the link is not emphasized (e.g. bold or italic) and 2 otherwise; and Y for position, which is 1 if the link is in the lower part of the document and 3 otherwise. Then:

X(A, B) × Y(A, B) = 1 × 3 = 3
X(A, C) × Y(A, C) = 1 × 1 = 1
X(B, A) × Y(B, A) = 2 × 3 = 6
X(B, C) × Y(B, C) = 2 × 1 = 2
X(C, A) × Y(C, A) = 2 × 3 = 6
X(C, B) × Y(C, B) = 2 × 1 = 2

It follows that:
Z(A) = X(A, B) × Y(A, B) + X(A, C) × Y(A, C) = 4
Z(B) = X(B, A) × Y(B, A) + X(B, C) × Y(B, C) = 8
Z(C) = X(C, A) × Y(C, A) + X(C, B) × Y(C, B) = 8
The link evaluation formula (page T1 links to page T2) is:
L(T1, T2) = X(T1, T2) × Y(T1, T2) / Z(T1)

This gives:
L(A, B) = 0.75
L(A, C) = 0.25
L(B, A) = 0.75
L(B, C) = 0.25
L(C, A) = 0.75
L(C, B) = 0.25
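The link evaluations follow mechanically from the X and Y attributes. A short sketch, using the values from the example:

```python
# Visibility X and position Y per (source, target) link, as in the example
x = {("A", "B"): 1, ("A", "C"): 1, ("B", "A"): 2,
     ("B", "C"): 2, ("C", "A"): 2, ("C", "B"): 2}
y = {("A", "B"): 3, ("A", "C"): 1, ("B", "A"): 3,
     ("B", "C"): 1, ("C", "A"): 3, ("C", "B"): 1}

# Z(T) = sum of X*Y over all links leaving page T
z = {}
for (src, dst), xv in x.items():
    z[src] = z.get(src, 0) + xv * y[(src, dst)]

# L(T1, T2) = X(T1, T2) * Y(T1, T2) / Z(T1)
L = {link: x[link] * y[link] / z[link[0]] for link in x}
print(L)  # L(A,B)=0.75, L(A,C)=0.25, L(B,A)=0.75, ...
```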
Finally, the improved formula yields the PageRanks:
PR(A) = 0.5 + 0.5 (0.75 PR(B) + 0.75 PR(C))
PR(B) = 0.5 + 0.5 (0.75 PR(A) + 0.25 PR(C))
PR(C) = 0.5 + 0.5 (0.25 PR(A) + 0.25 PR(B))

Solving:
PR(A) = 819/693
PR(B) = 721/693
PR(C) = 539/693

To counter artificial rank optimization, the distance between pages can also influence link evaluation: a link within a site carries less weight than a link between sites. Page distance might be determined by whether the pages are on the same site or the same server, or even by physical distance.
Another parameter that can affect a page's importance is its up-to-dateness: the more newly created pages point to a page, the less likely its content is outdated.
To incorporate such factors, the formula is revised as follows:

L(Ti, A) = K(Ti, A) × K1(Ti) × ... × Km(Ti)

K(Ti, A) stands for the link's visibility and position weight, and Kn(Ti) for the influence of the n-th additional factor on page Ti. Consider the example, in which links from page C are four times as important as the others:

K(A) = 0.5
K(B) = 0.5
K(C) = 2

Calculating the PageRanks:
PR(A) = 0.5 + 0.5 × 2 PR(C)
PR(B) = 0.5 + 0.5 × 0.5 × 0.5 PR(A)
PR(C) = 0.5 + 0.5 (0.5 PR(B) + 0.5 × 0.5 PR(A))

Solving:
PR(A) = 4/3
PR(B) = 2/3
PR(C) = 5/6

Note that with these additional weights, the sum of all PageRanks no longer equals the number of pages.

1.14 improvement of PageRank algorithm

1. Topic-sensitive PageRank
In this algorithm, importance scores are precomputed offline: for each page, multiple importance scores are calculated, one per topic. At query time, these scores are combined according to the topics of the query to form a composite PageRank score. This yields more precise ranking values than the original single ranking value.

 

2. Quadratic extrapolation
This is a method for speeding up the PageRank computation. It periodically subtracts estimates of the non-principal eigenvectors of the matrix from the current iterate during the power iteration, greatly accelerating convergence. On a network graph of 80 million nodes, computing PageRank this way was 20%-300% faster than the original method.

3. BlockRank algorithm
This is another acceleration of the PageRank algorithm. It first partitions the web into regions (blocks) by domain and computes a local PageRank vector for each block; it then estimates the relative importance of each block (its BlockRank value) and weights each block's local PageRank vector by that BlockRank. The concatenated weighted local vectors approximate the global PageRank vector and serve as the starting vector for the standard PageRank algorithm. This reduces the number of iterations required and allows more time to be spent on blocks that converge slowly, improving efficiency. The local PageRank computations can be performed in parallel or in a distributed manner to save time, and their results can be reused in later computations.
