The mysteries of Google's search engine

Source: Internet
Author: User
1. Background and Problems

  • According to statistics, over 80% of users rely on search engines to obtain information.
  • Ranking websites is the core task of a web search engine.
  • Currently, Google's databases store tens of billions of web pages and serve more than 300 million queries a day.
2. Google Query Process
3. Google Search Core Algorithms

  • PageRank is the method Google uses to evaluate the importance of a web page, and Google ranks websites with it. When a user performs a search, Google outputs the matching websites in ranking order.
  • The mathematics used in the PageRank algorithm includes properties of positive matrices, eigenvalues and eigenvectors, the power iteration algorithm, and the Gauss-Seidel iteration algorithm.
  • PageRank is a number between 0 and 1. The larger the score, the more important the page is.
4. PageRank algorithm ideas

1) PageRank is based on the assumption that "a page linked to by many high-quality pages must itself be a high-quality page", and uses it to determine the importance of all web pages.

The importance is characterized by the probability of the web page being accessed.

  • Inbound links: a simple popularity indicator
  • Whether the inbound links come from popular pages: a well-founded popularity indicator
  • Number of outbound links on the source page of an inbound link: an indicator of the probability of being selected

2) PageRank is based on the following theory:

  • If webpage B has a link to webpage A (that is, B's link is an inbound link of A), then B considers A an "important" page. When the level (importance) of B is relatively high, A obtains a share of that importance through the inbound link from B; B's importance is divided equally among all of B's outbound links. (An outbound link is a link from a page to another website or page.)
  • In the PageRank algorithm, the level (importance) of a webpage is roughly determined by two factors: the number of its inbound links and the level (importance) of the pages those links come from.
5. PageRank computing

1) Adjacency matrix

  • The Internet is a directed graph.
  • Each webpage is a vertex of a graph.
  • Each hyperlink between webpages is a directed edge of a graph.
  • Use the adjacency matrix $G$ to represent the directed graph: if there is a hyperlink from page $j$ to page $i$, then $G_{ij} = 1$; otherwise $G_{ij} = 0$.

The adjacency matrix is very large and sparse (in the figure, entries equal to 1 are drawn in black and entries equal to 0 in white).


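  • Define the column sums and row sums of matrix $G$:

$$c_j = \sum_i G_{ij}, \qquad r_i = \sum_j G_{ij}$$

where $c_j$ (the column sum) is the number of outbound links of page $j$, and $r_i$ (the row sum) is the number of inbound links of page $i$.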
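As a minimal sketch of these definitions, the following Python builds the adjacency matrix for a hypothetical four-page web (the edge list is made up for illustration and pages are numbered from 0) and computes the column and row sums. It assumes the convention that entry (i, j) is 1 when page j links to page i.

```python
# Hypothetical 4-page web; each pair (j, i) means "page j links to page i".
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
n = 4

# G[i][j] = 1 if there is a hyperlink from page j to page i, else 0.
G = [[0] * n for _ in range(n)]
for j, i in edges:
    G[i][j] = 1

# Column sum c_j: number of outbound links of page j.
c = [sum(G[i][j] for i in range(n)) for j in range(n)]
# Row sum r_i: number of inbound links of page i.
r = [sum(G[i]) for i in range(n)]

print("c =", c)
print("r =", r)
```

Both sums count every link exactly once, so the totals of the two lists are always equal to the number of links.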


2) Transition probability matrix

  • Suppose that while surfing the Web we browse a page and then select the next page. The choice does not depend on which pages we visited in the past, only on the current page. This selection process can therefore be regarded as a random process with a finite number of states and discrete time, and its state transition rule can be described by a Markov chain.
  • Define the transition probability matrix $A = (a_{ij})$ with

$$a_{ij} = \frac{G_{ij}}{c_j}$$

where $c_j$ is the number of outbound links of page $j$; that is, a surfer on page $j$ follows each of its outbound links with equal probability.
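The construction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original author's code: the helper name `transition_matrix` and the three-page graph are hypothetical, and the sketch assumes every page has at least one outbound link, so that dividing each column of G by its column sum is well defined.

```python
def transition_matrix(G):
    """Column-normalize G: a[i][j] = G[i][j] / c_j, assuming every c_j > 0."""
    n = len(G)
    c = [sum(G[i][j] for i in range(n)) for j in range(n)]
    if any(cj == 0 for cj in c):
        raise ValueError("a page without outbound links needs separate handling")
    return [[G[i][j] / c[j] for j in range(n)] for i in range(n)]

# Hypothetical 3-page web with links 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
# G[i][j] = 1 when page j links to page i.
G = [
    [0, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]
A = transition_matrix(G)
for row in A:
    print(row)

# Each column sums to 1: from page j the surfer must move to some page.
col_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]
```

Column j of the result is the probability distribution of the next page, given that the surfer is currently on page j.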

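3) 85% and 15%

  • However, even though users usually move forward along the links on the current page, they also often jump to completely unrelated pages.
  • Based on statistics, Google uses 15% for this behavior: 85% of the time a user follows a link on the current page, and 15% of the time the user suddenly jumps to an unrelated page.
  • The state transition matrix is corrected accordingly:

$$A' = p\,A + \frac{1 - p}{M}\,E$$

where $p = 0.85$, $M$ is the number of pages, and $E$ is the $M \times M$ matrix whose entries are all 1. Every entry of $A'$ is positive, and each column of $A'$ still sums to 1.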
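A minimal sketch of the correction, assuming the usual form in which every entry of the transition matrix is scaled by p = 0.85 and the remaining 15% is spread evenly over all pages. The function name `corrected_matrix` and the three-page matrix are hypothetical and only serve the illustration.

```python
def corrected_matrix(A, p=0.85):
    """Return A' with a'[i][j] = p * a[i][j] + (1 - p) / n."""
    n = len(A)
    jump = (1.0 - p) / n  # probability of jumping to any particular page
    return [[p * A[i][j] + jump for j in range(n)] for i in range(n)]

# Transition matrix of a hypothetical 3-page web (each column sums to 1).
A = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
A1 = corrected_matrix(A)
for row in A1:
    print([round(v, 3) for v in row])
```

After the correction no entry is zero, so the surfer can reach any page from any other page in a single step. This positivity is what later allows a theorem about positive matrices to be applied.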


4) The final PageRank value of a webpage

  • By the basic properties of Markov chains, the matrix $A'$ above defines a regular Markov chain, which has a stationary distribution $x = (x_1, x_2, x_3, \ldots, x_M)^T$ satisfying

$$A' x = x, \qquad \sum_{i=1}^{M} x_i = 1$$

  • $x$ is the probability distribution of page visits in the limit state (as the number of transitions tends to infinity).
  • $x$ is defined as the PageRank vector of the web, and $x_i$ is the PageRank value of page $i$. It is reasonable that the higher the visit probability, the higher the importance.
  • Each component of $x$ must satisfy the equation

$$x_k = \sum_{j=1}^{M} a'_{kj}\, x_j$$

where $a'_{kj}$ are the entries of $A'$.

  • From another perspective, page $j$ splits its PageRank value $x_j$ into $c_j$ equal parts and "votes" with them for the pages it links to.
  • The PageRank value $x_k$ of page $k$ is the total of the votes that all pages cast for page $k$.
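The limit distribution can be approximated by repeatedly applying the corrected matrix to a starting distribution, which is the power iteration mentioned earlier. The sketch below is illustrative only: `power_iteration`, its tolerance, and the 3-by-3 matrix (a hypothetical three-page web with p = 0.85) are assumptions, not values from the original.

```python
def power_iteration(A1, tol=1e-12, max_iter=1000):
    """Approximate x with A1 x = x, starting from the uniform distribution."""
    n = len(A1)
    x = [1.0 / n] * n
    for _ in range(max_iter):
        # One transition step: y = A1 x.
        y = [sum(A1[i][j] * x[j] for j in range(n)) for i in range(n)]
        if sum(abs(y[i] - x[i]) for i in range(n)) < tol:
            return y
        x = y
    return x

# Corrected matrix of a hypothetical 3-page web (each column sums to 1).
A1 = [
    [0.05, 0.05, 0.90],
    [0.475, 0.05, 0.05],
    [0.475, 0.90, 0.05],
]
x = power_iteration(A1)
print([round(v, 4) for v in x])
```

Because each column of the matrix sums to 1, every iterate remains a probability distribution, so no renormalization step is needed.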

5) Discussion of the solution

The largest eigenvalue of $A'$ is 1, and $x$ is the eigenvector corresponding to the eigenvalue 1.

Q: Is the solution of the above equations unique? Is the solution meaningful (that is, does it contain no negative numbers and no numbers greater than 1)?

A: The solution of the above equations is unique, and all of its components are greater than 0.

Reason: the Perron-Frobenius theorem.

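6. Perron-Frobenius Theorem

1) If $A$ is a positive matrix (all entries are greater than 0), then:

The spectral radius of $A$ satisfies $r(A) > 0$, where $r(A) = \max_i |\lambda_i|$ and $\lambda_1, \lambda_2, \ldots, \lambda_n$ are the eigenvalues of $A$.

$\lambda = r(A)$ is an eigenvalue of $A$, and it is a simple eigenvalue.

There is a unique $x > 0$ such that $A x = r(A)\,x$ and $\sum_i x_i = 1$.

If $\lambda$ is an eigenvalue of $A$ and $\lambda \neq r(A)$, then $|\lambda| < r(A)$.

2) In the PageRank setting, $\lambda = 1$ is an eigenvalue of $A$ ($A x = x$).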
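A short derivation, added here as a sketch, shows why 1 is an eigenvalue of a transition matrix and why no eigenvalue can be larger in modulus. It assumes only that the entries $a_{ij}$ of $A$ are nonnegative and that each column of $A$ sums to 1; the all-ones vector $e$ is a symbol introduced for this note.

```latex
e^{T} A = e^{T}
\;\Longrightarrow\; A^{T} e = e
\;\Longrightarrow\; 1 \text{ is an eigenvalue of } A^{T}, \text{ hence of } A.

\text{If } A v = \lambda v,\ v \neq 0:\quad
|\lambda|\,\|v\|_{1} = \|A v\|_{1}
= \sum_{i}\Bigl|\sum_{j} a_{ij} v_{j}\Bigr|
\le \sum_{j}\Bigl(\sum_{i} a_{ij}\Bigr)|v_{j}|
= \|v\|_{1}
\;\Longrightarrow\; |\lambda| \le 1 .
```

Together these give a spectral radius of exactly 1. When the matrix is also positive, the theorem above then yields a unique positive eigenvector for the eigenvalue 1.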


7. Web page ranking example

For example, use the PageRank algorithm to compute the ranking of each web page in the following small network, with $p = 0.85$.


Solution

$x = (0.2675, 0.2524, 0.1323, 0.1698, 0.0625, 0.1156)^T$


  • Web page 1 has the highest importance. Although web page 2 has only one inbound link, that link is the only outbound link of web page 1, so the importance of web page 2 is also raised significantly.
  • Although web page 3 is linked from web page 2, it receives only half of web page 2's value.
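The figure of the small network is not reproduced in this text, so the link list in the sketch below is an assumption. It is one six-page structure that is consistent with the quoted result and with the two remarks above (page 1 has a single outbound link, to page 2, and page 2 splits its value between two pages). The function `pagerank` is likewise a hypothetical helper that uses the "voting" view with p = 0.85.

```python
def pagerank(links, n, p=0.85, iterations=200):
    """links: (source, target) pairs with pages numbered 1..n.
    Assumes every page has at least one outbound link."""
    out_count = [0] * n
    for src, _ in links:
        out_count[src - 1] += 1
    x = [1.0 / n] * n
    for _ in range(iterations):
        # Every page receives the random-jump share (1 - p) / n ...
        new = [(1.0 - p) / n] * n
        # ... plus an equal share of p * x_j from each page j linking to it.
        for src, dst in links:
            new[dst - 1] += p * x[src - 1] / out_count[src - 1]
        x = new
    return x

# Assumed link structure (not taken from the missing figure).
assumed_links = [(1, 2), (2, 3), (2, 4), (3, 4), (3, 5),
                 (3, 6), (4, 1), (5, 6), (6, 1)]
x = pagerank(assumed_links, n=6)
print([round(v, 4) for v in x])
```

Under this assumed structure the computed values agree with the quoted vector to within about 0.0001 per component, which supports the assumption but does not prove that the original figure used exactly these links.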
In fact, the problem is not that simple!

  • Google has to deal with tens of billions of web pages, which requires a huge amount of computation.
  • In particular, solving the eigenvalue problem $A x = x$ requires extremely high computing power, and attention must be paid to computational complexity. Many numerical computing tools are used here.
  • In addition, web page indexing, query processing, and other techniques are also involved.
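One of the numerical tools named earlier is Gauss-Seidel iteration. The sketch below is an assumption about how it could be applied, not a description of Google's implementation. It rewrites the eigenvector equation as the fixed-point equation x_i = (1 - p)/n + p * (sum of x_j / c_j over pages j linking to i), which has the same solution when every page has at least one outbound link, and it updates the components in place while storing only the links instead of a full matrix. The function name and the three-page graph are hypothetical.

```python
def pagerank_gauss_seidel(links, n, p=0.85, sweeps=200):
    """links: (source, target) pairs, pages numbered 0..n-1, no self-links,
    no duplicate links, and every page has at least one outbound link."""
    out_count = [0] * n
    inbound = [[] for _ in range(n)]
    for src, dst in links:
        out_count[src] += 1
        inbound[dst].append(src)
    x = [1.0 / n] * n
    for _ in range(sweeps):
        for i in range(n):
            # Gauss-Seidel: x[j] may already hold the value from this sweep.
            x[i] = (1.0 - p) / n + p * sum(x[j] / out_count[j] for j in inbound[i])
    return x

# Hypothetical 3-page web with links 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = [(0, 1), (0, 2), (1, 2), (2, 0)]
x = pagerank_gauss_seidel(links, 3)
print([round(v, 4) for v in x])
```

Each sweep costs time proportional to the number of links rather than to the square of the number of pages, which is the property that matters for a sparse graph with billions of pages.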
Reference: search by Google
