Mathematical principles for Search and page ranking

Source: Internet
Author: User
Tags: idf

First, Boolean algebra and search engines

A search engine is a tool we use every day. It is a very complex technology, and implementing one is not easy. However, technology can be viewed at two levels, "shu" (technique) and "Tao" (principle): the specific way of doing something is the technique, while the principles behind it are the Tao.

This article does not cover the techniques of building a search engine, but it can describe the Tao behind one.

The principle of a search engine is very simple compared with its technical implementation. Building a search engine roughly requires three things: automatically downloading as many web pages as possible, building fast and efficient indexes, and ranking pages fairly and accurately by relevance.

1. Boolean algebra

Boolean algebra operates on binary values. China's yin-yang doctrine is an embryonic form of binary, and binary as a counting system was completed by Indian scholars between the 5th and 2nd centuries BC. In the 17th century, Leibniz perfected the binary counting system and used 0 and 1 to represent its two digits, which is the binary we use today. In 1854, "The Laws of Thought" by George Boole (a secondary-school mathematics teacher in 19th-century Britain) showed people how to solve logic problems mathematically.

Boolean algebra has only two elements, 1 (TRUE) and 0 (FALSE), and three basic operations: AND, OR, and NOT. So what is the relationship between Boolean algebra and search?

Whether it is Google or Baidu, the basic principle of search is based on Boolean algebra. Suppose you want to find documents about the application of atomic energy, but you do not want to know how to build an atomic bomb. For each keyword the user enters, the search engine determines whether a document contains that keyword. If it does, the document is assigned the logical value true (1); otherwise it is assigned false (0). The corresponding query becomes "atomic energy AND application AND (NOT atomic bomb)", and every document in the search results must satisfy all three conditions at the same time. Under the rules of Boolean algebra, each of the three conditions evaluates to true or false, and from those values the engine can compute whether the document meets the requirements.
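As a minimal sketch of the query above, the Python snippet below evaluates "atomic energy AND application AND (NOT atomic bomb)" against a few documents. The sample documents, the function name `matches`, and the use of plain substring tests are assumptions made for illustration; a real search engine would consult an index rather than scan text.

```python
def matches(document: str) -> bool:
    """Evaluate: atomic energy AND application AND (NOT atomic bomb)."""
    text = document.lower()
    has_atomic_energy = "atomic energy" in text  # condition 1: True or False
    has_application = "application" in text      # condition 2: True or False
    has_atomic_bomb = "atomic bomb" in text      # condition 3, negated below
    return has_atomic_energy and has_application and not has_atomic_bomb


# Hypothetical sample documents, not taken from the article.
documents = [
    "Civil application of atomic energy in power plants",
    "Application of atomic energy: how to build an atomic bomb",
    "A history of steam engines",
]

# Keep only the documents for which the whole expression is True.
results = [doc for doc in documents if matches(doc)]
print(results)
```

Only the first sample document satisfies all three conditions: the second contains the excluded phrase, and the third contains none of the keywords.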

2. Index

Search engines can find the desired results using Boolean algebra, but how do they return thousands of results in a fraction of a second? Obviously, scanning all the text would be far too slow for any computer. This is why an index has to be built.

Google has a product manager interview question: how would you explain a search engine to your grandmother? If you answer at the technical level, you will basically be rejected. A good answer uses the analogy of a library's index cards. Each website is a book in the library, and each web page is a page of that book; with index cards or page numbers we can quickly find the book, or the page within a book, that we need.

A simple index structure uses a very long binary number to indicate whether a keyword appears in each document. The number has as many bits as there are documents, and each bit corresponds to one document: 1 means the document contains the keyword, and 0 means it does not. For the keyword "atomic energy", the binary representation might be 0100100011000001..., and for "application" it might be 0010100110000001.... Applying the Boolean AND operation to the two gives 0000100010000001..., which indicates that the fifth, ninth, and sixteenth documents meet the requirements. Computers perform Boolean operations very quickly: even the cheapest microcomputer today can do a 32-bit Boolean operation in a single instruction cycle, which means billions of such operations per second.
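The AND of the two bitmaps can be reproduced with a few lines of Python. This is a minimal sketch: the function names are hypothetical, only the first 16 bits shown in the text are used, and Python integers stand in for the long binary numbers an index would store.

```python
# Bit strings from the text: character i (counting from 1) is "1" if
# document i contains the keyword.
atomic_energy = "0100100011000001"
application = "0010100110000001"


def and_bitmaps(a: str, b: str) -> str:
    """AND two equal-length bit strings using integer bitwise AND."""
    width = len(a)
    result = int(a, 2) & int(b, 2)
    # Format back to binary, zero-padded to the original width.
    return f"{result:0{width}b}"


def matching_documents(bitmap: str) -> list:
    """Return the 1-based positions of the documents whose bit is set."""
    return [i for i, bit in enumerate(bitmap, start=1) if bit == "1"]


both = and_bitmaps(atomic_energy, application)
print(both)
print(matching_documents(both))
```

A single `&` on two integers processes every bit at once, which is the property that makes this kind of index fast.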

Second, page ranking technology

For a given query, a search engine may return tens of thousands of results, so how does it put the results you want to see first? How well this problem is solved largely determines the quality of the search engine. For a particular query, the ranking of the search results depends on two kinds of information: the quality of each page, and the relevance of the query to each page.

1. Web page quality: PageRank

The PageRank (PR) mathematical model was invented by Google's founders, Larry Page and Sergey Brin. On the Internet, if a web page is linked to by many other web pages, it is widely recognized and trusted, so its ranking is high. This is the core idea of PR, and the practice of exchanging links illustrates it well. PR treats links from different pages differently: a link from a high-ranking page carries more weight. So how is the weight of a page calculated?

Suppose that only four pages, x1, x2, x3, and x4, link to page y, and that their weights are 0.001, 0.01, 0.02, and 0.05. Then the PR of page y is 0.001 + 0.01 + 0.02 + 0.05 = 0.081. Computing PR for all pages amounts to matrix multiplication in linear algebra.
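Because the weight of each linking page itself depends on the pages that link to it, a common way to compute the weights is to start from equal values and repeat the multiplication until they settle. The sketch below illustrates that idea and is not Google's actual implementation: the tiny link graph is hypothetical, and the damping factor of 0.85 and the handling of pages without outgoing links are assumptions that do not appear in the text above.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively estimate page weights.

    links maps each page to the list of pages it links to; every linked
    page must also be a key. damping is an assumed parameter.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal weights

    for _ in range(iterations):
        # Each pass is equivalent to one matrix-vector multiplication.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its weight evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its weight evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: x1..x4 all link to y, and y links back to x1.
links = {"x1": ["y"], "x2": ["y"], "x3": ["y"], "x4": ["y"], "y": ["x1"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this graph y ends up with the highest weight because every other page links to it, and x1 ranks above x2, x3, and x4 because its only incoming link comes from the highly ranked y.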

2. Relevance of Web pages and queries

A simple way to measure the relevance of a web page to a query is to use the total frequency with which the query's keywords appear in the page. For example, if a query contains n keywords w1, w2, w3, ..., and their term frequencies (TF) in a particular web page are tf1, tf2, tf3, ..., then the relevance of the query to this page is

tf1+tf2+tf3+ ...

However, some words are of no use in determining the topic of a page. These are called stop words, such as "is", "in", and "the", and their weight is 0. More generally, a keyword that appears in fewer pages says more about a page's topic and should carry more weight. Therefore, the weight most widely used in information retrieval is the inverse document frequency (IDF), defined as log(D/Dw), where D is the total number of pages, Dw is the number of pages that contain keyword w, and the logarithm is taken to base 2. Assume the number of Chinese web pages is D = 1 billion. A stop word that appears in every page has Dw = 1 billion, so its IDF = log(1 billion / 1 billion) = log(1) = 0. "Atomic energy" appears in 2 million pages, that is, Dw = 2 million, so its weight is IDF = log(500) ≈ 8.97. "Application" appears in 500 million pages, so its IDF = log(2) = 1. With IDF, the relevance formula changes from a simple sum of term frequencies to a weighted sum, namely:

tf1*idf1+tf2*idf2+tf3*idf3+ ...

Weights computed in this way are objective, and they give a fairly accurate estimate of the relevance between keywords and web pages.
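The IDF values and the weighted sum can be checked with a short Python sketch. It assumes D = 1 billion pages and a base-2 logarithm, which are the values implied by the log(500) and log(2) figures; the page counts for "atomic energy" and "application" come from the text, while the function names, the "stop_word" placeholder key, and the term frequencies of the sample page are hypothetical.

```python
import math

TOTAL_PAGES = 1_000_000_000  # D, assumed to be 1 billion

# Dw: number of pages that contain each keyword.
PAGES_CONTAINING = {
    "atomic energy": 2_000_000,
    "application": 500_000_000,
    "stop_word": 1_000_000_000,  # placeholder for a word found in every page
}


def idf(keyword: str) -> float:
    """Inverse document frequency: log2(D / Dw)."""
    return math.log2(TOTAL_PAGES / PAGES_CONTAINING[keyword])


def relevance(term_frequencies: dict) -> float:
    """Weighted sum tf1*idf1 + tf2*idf2 + ... over the query keywords."""
    return sum(tf * idf(word) for word, tf in term_frequencies.items())


# Hypothetical term frequencies of the query keywords in one page.
page_tf = {"atomic energy": 0.002, "application": 0.02, "stop_word": 0.05}
score = relevance(page_tf)

for word in PAGES_CONTAINING:
    print(word, round(idf(word), 2))
print("relevance:", round(score, 4))
```

The stop word contributes nothing to the score no matter how often it appears, because its IDF is 0, while each occurrence of "atomic energy" counts almost nine times as much as an occurrence of "application".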

Reference book: The Beauty of Mathematics

Originally published at: http://www.ido321.com/1338.html
