In the following chapters, we will cover some basics of search engines. There is no shortcut to doing search well. The most basic requirement is to analyze 10 to 20 bad search results every day; only after doing this for some time do you develop a feel for it. Many engineers, however, do not manage to do this. The principle behind a search engine is actually very simple: automatically download as many web pages as possible, build a fast and effective index, and rank the pages fairly and accurately by relevance. We will introduce these one by one.
1. Boolean Algebra
Computing with Boolean values is quite simple. It is high-school material and will not be described here. Let's look at the relationship between document retrieval and Boolean operations. For each keyword the user enters, the search engine determines whether each document contains that keyword and gives the document a logical value accordingly: true (1) if it does, false (0) if it does not. For example, suppose we are looking for documents about computer applications but do not want documents about software. The query can be expressed by the statement "computer AND application AND (NOT software)".
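The query above can be sketched in a few lines of Python. This is only an illustration: the four documents, the `contains` helper, and the whitespace-based word matching are assumptions made for the example, not part of any real search engine.

```python
# Toy corpus: document number -> text (made up for illustration).
docs = {
    1: "computer application in hospitals",
    2: "computer software application guide",
    3: "application of statistics",
    4: "computer hardware and its application to physics",
}

def contains(text: str, keyword: str) -> bool:
    """Logical value for one keyword in one document: True (1) or False (0)."""
    return keyword in text.lower().split()

def matches(text: str) -> bool:
    """Evaluate the query: computer AND application AND (NOT software)."""
    return (
        contains(text, "computer")
        and contains(text, "application")
        and not contains(text, "software")
    )

hits = [doc_id for doc_id, text in docs.items() if matches(text)]
print(hits)  # [1, 4]
```

Document 2 is rejected because it contains "software", and document 3 because it lacks "computer".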
2. Index
Most people who use a search engine are surprised that it can find 10 million results in a very short time. Obviously, it cannot scan the text of every web page for each query, so the trick is to build an index. The simplest index structure uses one long binary number per keyword to indicate which documents the keyword appears in, with one bit for each document. For example, if the binary number for "computer" is 01001000110..., then the second, fifth, ninth, tenth... documents contain this keyword. Similarly, assume the binary number for "application" is 00101001100.... To find documents about "computer application", we only need to AND the two binary numbers; every document whose bit in the result is 1 meets the requirement.
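As a minimal sketch of this bitmap index, the two binary numbers from the text can be stored as Python integers and combined with the `&` operator. Only the first 11 bits shown in the text are used, and the helper names `to_bits` and `doc_numbers` are made up for this example.

```python
N_DOCS = 11  # only the first 11 bits given in the text are used here

# One binary number per keyword; the leftmost bit stands for document 1.
computer = int("01001000110", 2)
application = int("00101001100", 2)

# A document must contain both keywords, so AND the two numbers.
both = computer & application

def to_bits(value: int, width: int = N_DOCS) -> str:
    """Render the number as a fixed-width bit string, document 1 first."""
    return format(value, f"0{width}b")

def doc_numbers(value: int, width: int = N_DOCS) -> list[int]:
    """List the 1-based document numbers whose bit is set."""
    return [i + 1 for i, bit in enumerate(to_bits(value, width)) if bit == "1"]

print(doc_numbers(computer))  # [2, 5, 9, 10]
print(to_bits(both))          # 00001000100
print(doc_numbers(both))      # [5, 9]
```

Among the first 11 documents, only the fifth and the ninth contain both keywords.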
Because the number of web pages on the Internet is huge, and the number of distinct words on them is also very large, this index is enormous. The common practice is therefore to split the index into many parts by web page serial number and store the parts on different servers. When a query arrives, it is distributed to these servers, which process it in parallel and return their results to the master server. The master server merges them and returns the final result to the user.
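The distribute-and-merge flow might look roughly like the sketch below. It is a single-process simulation under stated assumptions: the two "servers" are plain dictionaries named `SHARDS`, threads stand in for network calls, and each part stores sets of document numbers rather than binary numbers to keep the code short.

```python
from concurrent.futures import ThreadPoolExecutor

# Each hypothetical server holds the index for one range of document numbers:
# keyword -> set of document numbers within that range.
SHARDS = [
    {"computer": {2, 5}, "application": {3, 5}},   # documents 1-5
    {"computer": {9, 10}, "application": {8, 9}},  # documents 6-10
]

def search_shard(shard, keywords):
    """Runs on one server: documents in this part that contain every keyword."""
    postings = [shard.get(word, set()) for word in keywords]
    return set.intersection(*postings) if postings else set()

def search(keywords):
    """Runs on the master: send the query to every server, then merge."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, keywords), SHARDS)
        merged = set().union(*partials)
    return sorted(merged)

print(search(["computer", "application"]))  # [5, 9]
```

Because the parts are split by document number, each server can answer the AND query for its own documents independently, and the master only needs to take the union of the partial results.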
As content on the Internet keeps growing, so does the amount of data to index. It therefore becomes necessary to build indexes at different levels, such as common and uncommon, according to the importance, quality, and access frequency of web pages. This is similar to the relationship between the page table and the TLB (translation lookaside buffer, the "fast table") in a computer. However, no matter how complicated a search engine index becomes in engineering terms, the principle remains very simple: it is equivalent to Boolean operations.
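A two-level index of this kind could be sketched as follows. The text does not say how the levels are consulted, so the policy here (use the common level first and fall back to the uncommon level only when there are too few results) is purely an assumption, as are the names `COMMON_INDEX`, `UNCOMMON_INDEX`, and the `wanted` parameter.

```python
# Hypothetical two-level index: keyword -> set of document numbers.
# Each document is assumed to live in exactly one level.
COMMON_INDEX = {"computer": {2, 5}, "application": {5}}
UNCOMMON_INDEX = {"computer": {9, 10}, "application": {3, 8, 9}}

def lookup(index, keywords):
    """Boolean AND within one level: documents containing every keyword."""
    postings = [index.get(word, set()) for word in keywords]
    return set.intersection(*postings) if postings else set()

def tiered_search(keywords, wanted=2):
    """Query the common level first; use the uncommon level only if needed."""
    results = lookup(COMMON_INDEX, keywords)
    if len(results) < wanted:
        # Assumed policy: fall back when the common level gives too few hits.
        results = results | lookup(UNCOMMON_INDEX, keywords)
    return sorted(results)

print(tiered_search(["computer", "application"]))  # [5, 9]
print(tiered_search(["computer"]))                 # [2, 5]
```

Whichever level is consulted, the lookup itself is still the same Boolean AND.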
Chapter 8: The Beauty of Simplicity - Boolean Algebra and Search Engine Indexes