Chapter V 5.1 Analysis of PageRank algorithm

Source: Internet
Author: User

In the early days of the Internet, search engines ranked web pages mainly by how often the search phrase appeared on the page, adjusting the weight according to page length and the emphasis implied by HTML tags. Link popularity techniques instead count how many other documents link to a page to judge its importance, which effectively resists pages that are artificially crafted to deceive search engines (page spoofing). PageRank goes a step further when it calculates the importance of a page: each link is assigned a different weight, and a link from a more important page carries more weight. The importance of the current page is therefore determined by the importance of the pages that link to it. How important a web page is remains an inherently subjective matter that depends on the reader's interests, knowledge, and attitudes; nevertheless, much can still be said about the objective, relative importance of web pages. This article briefly introduces how PageRank rates web pages and how that rating measures the interest and attention people give to them.

The mathematical definition of PageRank:

PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Where: PR(A): the PageRank value (webpage level) of page A,
PR(Ti): the PageRank value of page Ti, where Ti is a page that links to page A,
C(Ti): the number of outbound links on page Ti,
d: the damping factor. Its value usually lies between 0.85 and 0.95, and 0.85 is the most common choice. This factor directly affects the convergence speed of the PageRank iteration.

The definition has three notable properties:
1) The algorithm does not rank by site; PageRank values are determined for individual pages.
2) The PageRank value of a page is determined by the PageRank values of the pages that link to it, but each link contributes differently. The more outbound links a page Ti has, the smaller its contribution to the current page A. The more pages link to A, the higher A's PageRank value.
3) The damping factor reduces the contribution that other pages make to the ranking of the current page A.
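To make the definition concrete, here is a minimal Python sketch that applies the formula once to a single page. The function name pagerank_of, the tiny graph, and the starting value of 1.0 for every page are assumptions chosen for illustration; they do not come from the original algorithm description.

```python
def pagerank_of(page, ranks, out_links, d=0.85):
    """Apply PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti)) once for one page."""
    total = 0.0
    for source, targets in out_links.items():
        if page in targets:
            # Each linking page shares its value equally among its out-links.
            total += ranks[source] / len(targets)
    return (1 - d) + d * total


# Hypothetical graph: T1 links only to A; T2 links to A, B and C.
out_links = {"T1": ["A"], "T2": ["A", "B", "C"], "A": [], "B": [], "C": []}
ranks = {name: 1.0 for name in out_links}  # assumed starting values

pr_a = pagerank_of("A", ranks, out_links)  # receives 1/1 from T1 and 1/3 from T2
pr_b = pagerank_of("B", ranks, out_links)  # receives only 1/3 from T2
print(round(pr_a, 4), round(pr_b, 4))      # about 1.2833 and 0.4333
```

T1 passes its whole value to A because it has a single out-link, while T2 passes only a third of its value to each target, which is the behavior described in property 2.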

Lawrence Page and Sergey Brin, the founders of Google, proposed a random surfer model of user behavior to explain the algorithm above. They treat a user's click on a link as a random act that ignores content. The probability that the user clicks any particular link on a page is determined by the number of links on that page, which is the reason for the term PR(Ti)/C(Ti). The probability of arriving at a page through random surfing is the sum of the probabilities of clicking each link that points to it from other pages. The damping factor d is introduced because a user cannot keep clicking links forever; out of fatigue, the user will often jump to another page at random. d can be regarded as the probability that the user keeps following links, and (1 - d) is the PageRank value that a page holds by itself, independent of incoming links. PageRank is a closed circulation model: the sum of the PageRank values of all pages equals the sum of their initial values.
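The random surfer model can be checked with a small simulation. The sketch below is illustrative only: the function simulate_surfer, the three-page graph, the step count, and the fixed seed are all assumptions. Visit frequencies are scaled by the number of pages so that they are comparable to values produced by the formula PR(A) = (1 - d) + d (...).

```python
import random


def simulate_surfer(out_links, d=0.85, steps=200_000, seed=0):
    """Estimate PageRank values from the visit frequencies of a random surfer."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = sorted(out_links)
    visits = {p: 0 for p in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        links = out_links[current]
        if links and rng.random() < d:
            current = rng.choice(links)  # keep clicking, probability d
        else:
            current = rng.choice(pages)  # get bored and jump to a random page
    n = len(pages)
    # Scale frequencies by n so the estimates sum to the number of pages.
    return {p: n * visits[p] / steps for p in pages}


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
estimates = simulate_surfer(graph)
print(estimates)
```

For this graph the formula gives roughly 1.163 for A, 0.644 for B, and 1.192 for C, and the simulated estimates should land close to those values. The treatment of a page without out-links (always jump) is a choice made for this sketch, not something the text specifies.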

From the mathematical definition of PageRank it is easy to obtain a fixed-point equation:

P = A P

P is the vector of PageRank values, and A is the link matrix.

The problem of determining the PageRank values is thereby converted into finding an eigenvector of A (the one with eigenvalue 1), which can be computed with an iterative method.
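One way to see why P = A P holds is to build A explicitly for a tiny graph. The sketch below assumes a particular construction that the text does not spell out: A = d M + ((1 - d) / N) J, where M is the column-normalized link matrix, J is the all-ones matrix, and N is the number of pages. This form matches the definition only when the values in P sum to N and every page has at least one out-link. The helper names link_matrix and mat_vec are hypothetical.

```python
def link_matrix(out_links, pages, d=0.85):
    """Build A = d*M + ((1-d)/N)*J; assumes every page has an out-link."""
    n = len(pages)
    index = {p: i for i, p in enumerate(pages)}
    a = [[(1 - d) / n] * n for _ in range(n)]
    for source, targets in out_links.items():
        j = index[source]
        for target in targets:
            # Column j spreads the damped value of `source` over its targets.
            a[index[target]][j] += d / len(targets)
    return a


def mat_vec(a, v):
    """Multiply matrix a (list of rows) by vector v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in a]


pages = ["A", "B", "C"]
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical graph
a = link_matrix(graph, pages)

p = [1.0] * len(pages)  # start with values that sum to N
for _ in range(200):    # repeated multiplication approaches the eigenvector
    p = mat_vec(a, p)

residual = max(abs(x - y) for x, y in zip(mat_vec(a, p), p))
print(p, residual)
```

Every column of this matrix sums to 1, so multiplying by it preserves the total of the vector, and the residual of A P - P shrinks toward zero as the multiplication is repeated.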

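Given any initial vector S (E is one possible choice), PageRank can be computed as follows:

R_0 = S

Loop:

    R_{i+1} = A R_i

    d = ||R_i||_1 - ||R_{i+1}||_1

    R_{i+1} = R_{i+1} + d E

    delta = ||R_{i+1} - R_i||_1

While delta > ε

Here ||x||_1 denotes the L1 norm of x, and E is a fixed vector over the pages that is used to put back the rank lost in each step.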
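The loop above can be sketched in Python as follows. Several details are assumptions rather than part of the source: A is taken to be the column-normalized link matrix scaled by the damping factor, E is a uniform vector whose entries sum to 1, R_0 is the all-ones vector, and the loop's d is named lost to avoid confusion with the damping factor. The graph is stored as adjacency lists instead of a full matrix, and every link target must also appear as a key.

```python
def pagerank(out_links, damping=0.85, eps=1e-10, max_iter=1000):
    """Iterate R <- A R + lost * E until the L1 change is at most eps."""
    pages = sorted(out_links)
    n = len(pages)
    e = {p: 1.0 / n for p in pages}  # E: uniform vector (assumption)
    r = {p: 1.0 for p in pages}      # R_0 = S, here all ones (assumption)
    for _ in range(max_iter):
        nxt = {p: 0.0 for p in pages}
        for source, targets in out_links.items():
            for target in targets:   # R_{i+1} = A R_i
                nxt[target] += damping * r[source] / len(targets)
        # d = ||R_i||_1 - ||R_{i+1}||_1 (all entries are non-negative)
        lost = sum(r.values()) - sum(nxt.values())
        for p in pages:              # R_{i+1} = R_{i+1} + d E
            nxt[p] += lost * e[p]
        delta = sum(abs(nxt[p] - r[p]) for p in pages)
        r = nxt
        if delta <= eps:             # loop while delta > eps
            break
    return r


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print({page: round(value, 4) for page, value in ranks.items()})
```

Because the lost rank is added back on every pass, the total of the vector stays equal to the number of pages, even when a page has no out-links. Storing only the links that exist keeps memory use proportional to the number of links instead of the square of the number of pages.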

The damping factor d from the definition of PageRank determines how fast the iteration converges; its value is generally 0.85. Because the link matrix is huge, the main obstacle to improving the algorithm is that space cannot simply be traded for speed. In practice, only a compromise between accuracy and efficiency can be achieved.
