Technology has two levels, shu (technique) and Tao (principle): the specific way of doing something is shu, while the principles and rules behind it are Tao.
The principle of a search engine is actually very simple. Building one roughly requires doing the following things:
Automatically download as many pages as possible;
Establish a fast and effective index;
Rank pages fairly and accurately according to relevance.
1 Boolean algebra
The theory of yin and yang in ancient China can be regarded as the earliest binary model.
In 1854, George Boole's "The Laws of Thought" showed for the first time how to solve logic problems with mathematical methods.
In an AND operation, if either of the two operands is 0, the result is always 0.
In an OR operation, if either of the two operands is 1, the result is always 1.
The NOT operation turns 1 into 0 and 0 into 1.
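The three rules above can be checked with a few lines of code. This is only a minimal sketch; the function names bool_and, bool_or, and bool_not are chosen here for illustration and do not come from the original text.

```python
# Boolean operations over the values 0 and 1.
# The function names are illustrative, not taken from the source.

def bool_and(a: int, b: int) -> int:
    """Result is 0 as soon as either operand is 0."""
    return a & b

def bool_or(a: int, b: int) -> int:
    """Result is 1 as soon as either operand is 1."""
    return a | b

def bool_not(a: int) -> int:
    """Turns 1 into 0 and 0 into 1."""
    return 1 - a

# Print the full truth table for AND and OR.
print("a b AND OR")
for a in (0, 1):
    for b in (0, 1):
        print(a, b, bool_and(a, b), " ", bool_or(a, b))

# Print the truth table for NOT.
for a in (0, 1):
    print("NOT", a, "=", bool_not(a))
```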
Boolean algebra is to mathematics what quantum mechanics is to physics: it extends our understanding of the world from continuous states to discrete states.
2 Index
Every web page is like a book in a library: we cannot find it by walking along the shelves, so we look it up in the card catalog to find its location and then go straight to the shelf to fetch it.
The simplest index structure uses one very long binary number per keyword to indicate whether that keyword appears in each document: one bit per document, set to 1 if the document contains the keyword and 0 otherwise.
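A rough sketch of such a bitmap index follows, using Python integers as the "very long binary numbers". The sample documents, the whitespace tokenization, and the helper names build_bitmap_index and matching_docs are all assumptions made for illustration; a real engine would be far more elaborate. It also shows why Boolean algebra matters here: a multi-keyword query becomes a bitwise AND, OR, or NOT on the bitmaps.

```python
# A toy bitmap index: for each word, bit i of its integer is 1
# if document i contains the word. All names and data are illustrative.

docs = [
    "atomic energy research",          # document 0
    "applications of atomic physics",  # document 1
    "solar energy applications",       # document 2
    "history of search engines",       # document 3
]

def build_bitmap_index(documents):
    """Map each word to an integer bitmap over the documents."""
    index = {}
    for i, text in enumerate(documents):
        for word in set(text.lower().split()):
            index[word] = index.get(word, 0) | (1 << i)
    return index

def matching_docs(bitmap, n_docs):
    """Decode a bitmap back into a list of document numbers."""
    return [i for i in range(n_docs) if (bitmap >> i) & 1]

index = build_bitmap_index(docs)
all_docs = (1 << len(docs)) - 1  # mask with one bit per document

# "atomic AND applications"
both = index["atomic"] & index["applications"]
# "atomic OR energy"
either = index["atomic"] | index["energy"]
# "energy AND NOT atomic" (mask the NOT so only real document bits remain)
energy_not_atomic = index["energy"] & (~index["atomic"] & all_docs)

print(matching_docs(both, len(docs)))               # [1]
print(matching_docs(either, len(docs)))             # [0, 1, 2]
print(matching_docs(energy_not_atomic, len(docs)))  # [2]
```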
Early search engines were limited by the speed and capacity of computers, and could only index important keywords. To this day, many academic journals still ask authors to provide 3-5 keywords.
Indexes are very large, so they are stored in a distributed manner across many servers. The common practice is to divide the index into many parts according to the serial number of the page and store each part on a different server. Each time a query arrives, it is distributed to all of these servers, which process it concurrently. Their results are then sent to a master server, which merges them and returns the final result to the user.
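The scatter-and-merge flow described above can be sketched in a single process, with a thread pool standing in for the separate servers. Everything here is an assumption for illustration: the shard count, assigning pages by page number modulo the shard count (splitting by ranges of page numbers would work as well), and the function names build_shards, search_shard, and search.

```python
# Toy sharded index: pages are split across shards by page number,
# each shard is searched concurrently, and the results are merged.
# Shard count, modulo assignment, and all names are illustrative.
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3

def build_shards(pages):
    """Build one small inverted index (word -> set of page IDs) per shard."""
    shards = [{} for _ in range(NUM_SHARDS)]
    for page_id, text in pages.items():
        shard = shards[page_id % NUM_SHARDS]
        for word in set(text.lower().split()):
            shard.setdefault(word, set()).add(page_id)
    return shards

def search_shard(shard, words):
    """Return the IDs of pages in this shard that contain every query word."""
    postings = [shard.get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()

def search(shards, query):
    """Play the master server: fan the query out, then merge the results."""
    words = query.lower().split()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = list(pool.map(lambda s: search_shard(s, words), shards))
    merged = set()
    for result in partial_results:
        merged |= result
    return sorted(merged)

pages = {
    0: "atomic energy research",
    1: "applications of atomic physics",
    2: "solar energy applications",
    3: "history of search engines",
    4: "atomic energy applications",
}
shards = build_shards(pages)
print(search(shards, "atomic energy"))        # [0, 4]
print(search(shards, "energy applications"))  # [2, 4]
```

In this sketch the merge step is a plain set union followed by sorting on page number; a real master server would also have to merge the shards' relevance rankings.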
Indexes of different levels, such as frequently used and rarely used ones, need to be built according to the importance, quality, and access frequency of the web pages. Frequently used indexes require fast access, more attached information, and quick updates; rarely used indexes have lower requirements.