Technology has two sides, "technique" (shu) and "principle" (dao): the specific way of doing things is the technique, and the underlying reasoning behind it is the principle.
The principle behind a search engine is actually very simple. Building one roughly requires doing three things:
Automatically download as many pages as possible;
Establish a fast and efficient index;
Rank web pages fairly and accurately according to relevance.
1 Boolean algebra
The theory of yin and yang in ancient China can be considered the earliest form of a binary system.
In 1854, George Boole's The Laws of Thought first showed people how to solve logic problems with mathematical methods.
In an AND operation, if either of the two operands is 0, the result is always 0.
In an OR operation, if either of the two operands is 1, the result is always 1.
The NOT operation turns 1 into 0 and 0 into 1.
The significance of Boolean algebra for mathematics is comparable to that of quantum mechanics for physics: both extended our understanding of the world from continuous states to discrete states.
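The three operations above can be sketched in a few lines of Python. This is only an illustration; the function names and_, or_, and not_ are made up for this example, and the values are restricted to 0 and 1.

```python
# AND, OR, and NOT on the values 0 and 1.
# The function names are hypothetical, chosen only for this illustration.

def and_(a, b):
    # The result is 0 as soon as either operand is 0.
    return a & b

def or_(a, b):
    # The result is 1 as soon as either operand is 1.
    return a | b

def not_(a):
    # Turns 1 into 0 and 0 into 1.
    return 1 - a

# Print the full truth table for AND and OR.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", and_(a, b), "OR:", or_(a, b))

print("NOT 0:", not_(0), "NOT 1:", not_(1))
```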
2 Index
Each site is like a book in a library: we cannot search for it shelf by shelf, so we first look up its location in the card catalog and then go directly to the right shelf to fetch it.
The simplest index structure uses a very long binary number for each keyword, with one bit per document indicating whether the keyword appears in that document.
Early search engines could index only the important keywords, because they were limited by the speed and capacity of computers. To this day, many academic journals still ask authors to provide 3-5 keywords.
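A minimal sketch of this bit-per-document index, assuming a tiny in-memory collection and whitespace tokenization. The names build_bitmap_index and matching_docs and the sample documents are invented for illustration. Python integers serve as the "very long binary number", and a two-keyword query becomes a single AND.

```python
# Hypothetical helpers for illustration; not taken from any real search engine.

def build_bitmap_index(docs):
    """Map each word to an integer whose bit i is 1 if the word appears in docs[i]."""
    index = {}
    for i, text in enumerate(docs):
        for word in set(text.lower().split()):
            index[word] = index.get(word, 0) | (1 << i)
    return index

def matching_docs(bits, n_docs):
    """Return the document numbers whose bit is set."""
    return [i for i in range(n_docs) if (bits >> i) & 1]

# Made-up sample documents.
docs = [
    "atomic energy research",
    "applications of atomic energy",
    "search engine index",
]
index = build_bitmap_index(docs)

# Documents containing both "atomic" and "energy": AND the two bit strings.
both = index.get("atomic", 0) & index.get("energy", 0)
print(matching_docs(both, len(docs)))  # [0, 1]
```

Because each keyword needs one bit for every document, most bits are 0 in a realistic collection, which is why this form is the simplest structure rather than a practical one.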
The indexes are very large and are stored across many servers in a distributed way. The common practice is to divide the index into many parts according to page number and store each part on a different server. Each time a query is accepted, it is distributed to all of these servers, which process it simultaneously and send their results to the primary server; the primary server merges them and returns the final result to the user.
Indexes of different tiers, such as frequently used and rarely used, need to be established based on the importance, quality, and access frequency of the web pages. The frequently used index requires faster access, more stored information, and faster updates, while the rarely used index has lower requirements.
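The flow described above can be sketched as follows. This is a toy under stated assumptions: the "servers" are plain dictionaries in one process, threads stand in for network calls, and pages are assigned to parts by page number modulo the number of parts. The names NUM_SHARDS, build_shards, search_shard, and search are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3  # assumption: an arbitrary illustrative number of index parts

def build_shards(docs, num_shards=NUM_SHARDS):
    """Split an inverted index into parts by page number (page_id % num_shards)."""
    shards = [{} for _ in range(num_shards)]
    for page_id, text in enumerate(docs):
        shard = shards[page_id % num_shards]
        for word in set(text.lower().split()):
            shard.setdefault(word, []).append(page_id)
    return shards

def search_shard(shard, word):
    """What one server does: look the word up in its own part of the index."""
    return shard.get(word, [])

def search(shards, word):
    """What the primary server does: query every part in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: search_shard(s, word), shards)
        return sorted(page_id for part in partials for page_id in part)

# Made-up sample pages.
docs = ["search engine", "index server", "search index", "web page", "search server"]
shards = build_shards(docs)
print(search(shards, "search"))  # [0, 2, 4]
```

A real system would send the query over the network and rank the merged results by relevance; here the merge step only sorts page numbers.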