Search Engine Algorithm Research topic four: Random surfing model Introduction

Source: http://www.t086.com/class/seo · December 19, 2017 · Search technology

Google's Lawrence Page and Sergey Brin gave a very simple and intuitive explanation of the PageRank (PR) algorithm. They view PageRank as a model of a user who does not care about the content of pages and clicks links at random.

The PageRank value of a web page determines the probability that this page is reached by such random surfing. The probability that the user clicks a particular link within a page depends only on the number of links on that page, which is the reason for the term PR(Ti)/C(Ti) above.

Thus, the probability that a page is reached by the random surfer is the sum of the probabilities of clicking the links pointing to it from other pages. In addition, the damping factor d reduces this probability. The damping factor is introduced because the user does not click links indefinitely; sooner or later the user gets bored and jumps to a random page instead.

The damping factor d is defined as the probability that the user keeps clicking links, so it depends on the expected number of clicks and is set between 0 and 1. The higher the value of d, the greater the probability that the user continues to click links. Accordingly, the probability that the user stops clicking and surfs to a random page instead appears in the formula as the constant (1-d). Regardless of its inbound links, the probability of randomly surfing to a page is always (1-d); this (1-d) is the baseline PageRank value that every page has on its own.

Lawrence Page and Sergey Brin published two different versions of the PageRank algorithm formula in different publications. In the second version of the algorithm, the PageRank value of page A is obtained as follows:

PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))  -- Algorithm 2

Here N is the total number of pages on the Internet. Algorithm 2 is not fundamentally different from algorithm 1. In the random surfing model, the PageRank value computed by algorithm 2 is the actual probability of reaching the page after clicking many links. Therefore, the PageRank values of all pages on the Internet form a probability distribution, and the sum of all PageRank values is 1.

Conversely, in the first algorithm the probability of randomly reaching a page is not divided by the total number of Internet pages. The PageRank value given by algorithm 2 is thus the expected probability that the page is reached once the user starts a surfing session. If the Internet has 100 pages and one of them has an algorithm-1 PageRank value of 2, then if the user restarts the surfing process 100 times, that page is visited on average 2 times. (Xdanger's note: the user randomly clicks links on each page to move to another page; at every click there is some probability that, out of fatigue, boredom, or any other reason, the user stops clicking, which is the meaning of the damping factor d. Each time the user stops clicking counts as the end of one visit; the user is then placed on a random page to begin another visit, and this procedure is repeated 100 times.)
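The random surfing model can also be simulated directly. The following sketch (the three-page graph and the 0.5 damping factor are borrowed from the example later in this article) lets a surfer walk for many steps and measures how often each page is visited; the visit frequencies approximate the algorithm-2 PageRank values:

```python
import random

random.seed(42)

# Three-page web from the example below: A links to B and C,
# B links to C, C links to A; d is the damping factor.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.5
pages = list(links)

visits = {p: 0 for p in pages}
page = random.choice(pages)
steps = 200_000
for _ in range(steps):
    visits[page] += 1
    if random.random() < d:
        page = random.choice(links[page])   # keep clicking: follow a random link
    else:
        page = random.choice(pages)         # get bored: jump to a random page

# Visit frequencies approximate the algorithm-2 PageRank values
# (14/39, 10/39, 15/39 for A, B, C), which sum to 1.
freq = {p: visits[p] / steps for p in pages}
print(freq)
```

The step count and seed are arbitrary choices for the sketch; more steps give a tighter approximation.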

As mentioned earlier, the two algorithms do not differ fundamentally from each other. Multiplying the PR(A) obtained by algorithm 2 by the total number N of Internet pages gives the PR(A) of algorithm 1, whose formula omits the 1/N term: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). Page and Brin mixed the two algorithms in their most famous publication, "The Anatomy of a Large-Scale Hypertextual Web Search Engine": the paper uses the algorithm-1 formula but claims that the PageRank values of Internet pages form a probability distribution summing to 1.

In the following, we will use algorithm 1, because it ignores the total number of web pages on the Internet and is therefore easier to calculate with.

Suppose a small website consists of three pages A, B, and C, where A links to B and C, B links to C, and C links to A. Although Page and Brin actually set the damping factor d to 0.85, we set it to 0.5 here to simplify the calculation. The exact value of the damping factor d certainly affects the PageRank values, but it does not affect the principle of the PageRank calculation. We therefore obtain the following equations for the PageRank values:

PR(A) = 0.5 + 0.5 PR(C)

PR(B) = 0.5 + 0.5 (PR(A)/2)

PR(C) = 0.5 + 0.5 (PR(A)/2 + PR(B))

These equations are easy to solve, yielding the following PageRank values for the individual pages:

PR(A) = 14/13 ≈ 1.07692308

PR(B) = 10/13 ≈ 0.76923077

PR(C) = 15/13 ≈ 1.15384615

It is clear that the sum of all PageRank values is 3, which equals the number of pages. As explained above, this result is not special to this simple example.
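For a graph this small, the system of equations can be solved exactly. A minimal Python sketch using exact fractions and Gauss-Jordan elimination (written out here only for illustration):

```python
from fractions import Fraction as F

# The system from the example above (d = 1/2), rearranged as M*x = b:
#   PR(A)                 - 1/2 PR(C) = 1/2
#  -1/4 PR(A) +    PR(B)              = 1/2
#  -1/4 PR(A) - 1/2 PR(B) +    PR(C)  = 1/2
M = [[F(1),     F(0),     F(-1, 2)],
     [F(-1, 4), F(1),     F(0)],
     [F(-1, 4), F(-1, 2), F(1)]]
b = [F(1, 2), F(1, 2), F(1, 2)]

def solve(M, b):
    """Gauss-Jordan elimination on an augmented matrix of Fractions."""
    n = len(M)
    aug = [row[:] + [bi] for row, bi in zip(M, b)]
    for i in range(n):
        piv = aug[i][i]
        aug[i] = [v / piv for v in aug[i]]          # normalize pivot row
        for j in range(n):
            if j != i:
                f = aug[j][i]
                aug[j] = [vj - f * vi for vj, vi in zip(aug[j], aug[i])]
    return [aug[i][n] for i in range(n)]

pr_a, pr_b, pr_c = solve(M, b)
print(pr_a, pr_b, pr_c)   # 14/13 10/13 15/13
```

Because the arithmetic uses `Fraction`, the result is exact and the sum comes out to exactly 3.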

For this simple three-page example, the PageRank values are easily obtained by solving the equations. In practice, however, the Internet contains billions of documents, and such a system of equations cannot be solved directly.

Iterative calculation of PageRank

Because of the sheer number of pages on the Internet, the Google search engine uses an approximate, iterative method to compute the PageRank values: each page is assigned an initial value, and then the PageRank formula above is applied repeatedly for a finite number of iterations. We again use the three-page example to illustrate the iterative calculation, with an initial value of 1 for each page.

Iteration    PR(A)         PR(B)         PR(C)
0            1             1             1
1            1             0.75          1.125
2            1.0625        0.765625      1.1484375
3            1.07421875    0.76855469    1.15283203
4            1.07641602    0.76910400    1.15365601
5            1.07682800    0.76920700    1.15381050
6            1.07690525    0.76922631    1.15383947
7            1.07691973    0.76922993    1.15384490
8            1.07692245    0.76923061    1.15384592
9            1.07692296    0.76923074    1.15384611
10           1.07692305    0.76923076    1.15384615
11           1.07692307    0.76923077    1.15384615
12           1.07692308    0.76923077    1.15384615

After only a few iterations we arrive at a good approximation of the exact PageRank values. According to publications by Lawrence Page and Sergey Brin, about 100 iterations were needed to obtain satisfactory PageRank values for the entire Internet.
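The iteration can be sketched in a few lines of Python. One caveat: the intermediate values published for this example are reproduced only if each page's new value is used immediately within the same sweep (an in-place, Gauss-Seidel-style update); whether that is how the original numbers were computed is an assumption inferred from matching them:

```python
# Iterative PageRank for the three-page example (algorithm 1, d = 0.5).
# A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.5
pr = {p: 1.0 for p in links}   # initial value 1 for every page

for _ in range(12):
    for page in links:          # update in place within each sweep
        inbound = [q for q in links if page in links[q]]
        pr[page] = (1 - d) + d * sum(pr[q] / len(links[q]) for q in inbound)

print({p: round(v, 8) for p, v in pr.items()})
```

After 12 sweeps the values agree with the exact solution 14/13, 10/13, 15/13 to eight decimal places.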

With the iterative calculation, the sum of the PageRank values of all pages still converges to the total number of pages, so the average PageRank value of a page is 1. The actual values lie between (1-d) and dN + (1-d), where N is the total number of Internet pages. The theoretical maximum occurs when all pages link to one page, and that page links only to itself.
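The upper bound can be checked with a small iteration: if all N pages link only to page X and X links only to itself, every other page settles at exactly (1-d), and X converges to dN + (1-d). A sketch with the assumed values d = 0.5 and N = 100:

```python
# Worst-case graph: all 100 pages link only to X, and X links to itself.
d, N = 0.5, 100
other = 1 - d        # pages with no inbound links get exactly (1 - d)
pr_x = 1.0           # start X at the usual initial value
for _ in range(200):
    # X receives d * (its own value plus (N - 1) pages each worth (1 - d))
    pr_x = (1 - d) + d * (pr_x + (N - 1) * other)

print(pr_x)   # converges to d*N + (1 - d) = 50.5
```

Solving the fixed-point equation pr_x = (1-d) + d*(pr_x + (N-1)(1-d)) by hand gives pr_x = 1 + d(N-1) = dN + (1-d), matching the bound stated above.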

