The Beauty of Mathematics, Series 5 - The Beauty of Simplicity: Boolean Algebra and Search Engine Indexing
Posted by: Wu Jun, Google researcher
[Building a search engine involves roughly the following tasks: automatically downloading as many webpages as possible, building a fast and effective index, and ranking webpages fairly and accurately by relevance. We covered ranking when we introduced Google's PageRank. This installment covers indexing. Later installments will cover how to measure the relevance of webpages and how to download webpages automatically.]
No counting system in the world is simpler than binary, and no operations are simpler than Boolean operations. Although every search engine today advertises how intelligent it is, none of them has fundamentally escaped the framework of Boolean operations.
George Boole was a 19th-century British elementary school mathematics teacher. No one regarded him as a mathematician during his lifetime. Boole enjoyed reading and discussing mathematics and thinking about mathematical problems. His book An Investigation of the Laws of Thought, on which are founded the Mathematical Theories of Logic and Probabilities (1854) showed for the first time how logical problems can be solved with mathematical methods.
Boolean algebra could hardly be simpler. It has only two elements: 1 (TRUE) and 0 (FALSE). It has only three basic operations: AND, OR, and NOT. (All three can in turn be expressed using a single operation, AND-NOT, also known as NAND.) The following truth tables fully describe all of these operations.
AND | 1 0
-----------------------
1 | 1 0
0 | 0 0
This table shows that if either of the two operands of AND is 0, the result is always 0. Only if both operands are 1 is the result 1. For example, the statement "the sun rises in the west" is false (0) and the statement "water can flow" is true (1), so "the sun rises in the west AND water can flow" is false (0).
OR | 1 0
-----------------------
1 | 1 1
0 | 1 0
This table shows that if either of the two operands of OR is 1, the result is always 1. Only if both operands are 0 is the result 0. For example, the statement "Zhang San took first place in the competition" is false (0) and the statement "Li Si took first place in the competition" is true (1), so "Zhang San OR Li Si took first place" is true (1).
NOT |
--------------
1 | 0
0 | 1
This table shows that the NOT operation turns 1 into 0 and 0 into 1. For example, if "ivory is white" is true (1), then "ivory is NOT white" must be false (0).
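The remark that AND, OR, and NOT can all be reduced to AND-NOT can be checked with a few lines of code. The following is a minimal sketch; the function names nand, not_, and_, and or_ are chosen here for illustration and do not come from the article.

```python
def nand(a: int, b: int) -> int:
    """AND-NOT (NAND): the result is 0 only when both inputs are 1."""
    return 0 if (a == 1 and b == 1) else 1

def not_(a: int) -> int:
    # NOT a == a NAND a
    return nand(a, a)

def and_(a: int, b: int) -> int:
    # a AND b == NOT (a NAND b)
    return not_(nand(a, b))

def or_(a: int, b: int) -> int:
    # a OR b == (NOT a) NAND (NOT b)
    return nand(not_(a), not_(b))

# Reproduce the three truth tables, in the same 1-then-0 order as above.
for a in (1, 0):
    for b in (1, 0):
        print(f"{a} AND {b} = {and_(a, b)}    {a} OR {b} = {or_(a, b)}")
for a in (1, 0):
    print(f"NOT {a} = {not_(a)}")
```

The printed rows should agree with the AND, OR, and NOT tables above.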
Readers may ask what practical problems such a simple theory can solve. Mathematicians of Boole's time asked the same question. In fact, for more than 80 years after Boolean algebra was proposed, it had no significant application. Then, in 1938, Shannon pointed out in his master's thesis that Boolean algebra could be used to implement switching circuits, which made Boolean algebra the foundation of digital circuits. All mathematical and logical operations, such as addition, subtraction, multiplication, division, exponentiation, and root extraction, can be converted into binary Boolean operations.
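As an illustration of how arithmetic reduces to Boolean operations, here is a minimal sketch of a ripple-carry adder, the textbook circuit for binary addition. Each bit of the sum is computed with AND, OR, and NOT only; the shifts are just bookkeeping to pick bits out of Python integers. The names full_adder and add, and the 8-bit default width, are assumptions made for this example.

```python
def not_(a: int) -> int:
    # NOT on a single bit (0 or 1).
    return ~a & 1

def xor(a: int, b: int) -> int:
    # XOR built from AND, OR, NOT: (a OR b) AND NOT (a AND b)
    return (a | b) & not_(a & b)

def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Add three bits; return (sum_bit, carry_out)."""
    partial = xor(a, b)
    sum_bit = xor(partial, carry_in)
    carry_out = (a & b) | (carry_in & partial)
    return sum_bit, carry_out

def add(x: int, y: int, width: int = 8) -> int:
    """Add two non-negative integers bit by bit, keeping `width` bits."""
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1          # i-th bit of x
        b = (y >> i) & 1          # i-th bit of y
        sum_bit, carry = full_adder(a, b, carry)
        result |= sum_bit << i
    return result                 # any carry out of the top bit is dropped

print(add(5, 3))      # → 8
print(add(200, 100))  # → 44, because 300 does not fit in 8 bits
```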
Now let's look at the relationship between document retrieval and Boolean operations. For a keyword entered by a user, the search engine must determine whether each document contains that keyword. If a document contains the keyword, we assign it the logical value TRUE (1); otherwise, we assign it the logical value FALSE (0). For example, suppose we are looking for literature on applications of atomic energy, but we are not interested in how to build an atomic bomb. We can write the query "atomic energy AND application AND (NOT atomic bomb)", which says that a matching document must satisfy three conditions at the same time:
- It contains "atomic energy"
- It contains "application"
- It does not contain "atomic bomb"
For any given document, each of these conditions has a TRUE or FALSE answer. Using the truth tables above, we can then work out whether each document is one we are looking for.
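The three conditions can be sketched directly as a Boolean expression evaluated on every document. The sample documents below are invented for this example, and plain substring matching stands in for real keyword detection, which is a simplification.

```python
# Hypothetical sample documents, invented for illustration.
documents = {
    1: "peaceful application of atomic energy in power plants",
    2: "how to build an atomic bomb from atomic energy",
    3: "application of solar panels",
    4: "atomic energy application and atomic bomb history",
}

def contains(text: str, phrase: str) -> bool:
    # Simplification: a substring test instead of real tokenization.
    return phrase in text

def matches(text: str) -> bool:
    # atomic energy AND application AND (NOT atomic bomb)
    return (
        contains(text, "atomic energy")
        and contains(text, "application")
        and not contains(text, "atomic bomb")
    )

# Scan every document; this is the slow approach that an index avoids.
hits = [doc_id for doc_id, text in documents.items() if matches(text)]
print(hits)  # → [1]
```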
Most early document retrieval systems were built on databases and strictly required queries to be written as Boolean expressions. Today's search engines are much smarter by comparison: they automatically convert a user's query into a Boolean expression. Of course, a search engine cannot scan every document for each query to check whether it meets the three conditions above. It needs to build an index.
The simplest index structure uses one long binary number per keyword to indicate whether that keyword appears in each document. The number has as many digits as there are documents, and each digit corresponds to one document: 1 means the document contains the keyword, and 0 means it does not. For example, if the binary number for the keyword "atomic energy" is 0100100001100001..., then the second, fifth, tenth, eleventh, and sixteenth documents contain the keyword. In practice this binary number is very long. Similarly, suppose the binary number for "application" is 0010100110000001.... To find the documents that contain both "atomic energy" and "application", we only need to perform a Boolean AND on these two binary numbers. From the truth table above, the result is 0000100000000001..., which means the fifth and sixteenth documents meet the requirement.
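The bitmap example can be reproduced with Python integers, using only the first 16 digits given in the text. The helper doc_numbers and the bitmap for "atomic bomb" are hypothetical additions for illustration; the article does not give a bitmap for that keyword.

```python
NUM_DOCS = 16  # only the first 16 documents from the example

# Bitmaps from the text: the leftmost digit is document 1.
atomic_energy = int("0100100001100001", 2)
application = int("0010100110000001", 2)

def doc_numbers(bitmap: int, num_docs: int = NUM_DOCS) -> list[int]:
    """Return the 1-based document numbers whose bit is set."""
    bits = format(bitmap, f"0{num_docs}b")
    return [i + 1 for i, bit in enumerate(bits) if bit == "1"]

both = atomic_energy & application
print(format(both, "016b"))        # → 0000100000000001
print(doc_numbers(atomic_energy))  # → [2, 5, 10, 11, 16]
print(doc_numbers(both))           # → [5, 16]

# Hypothetical bitmap: assume only document 5 mentions "atomic bomb".
atomic_bomb = int("0000100000000000", 2)
mask = (1 << NUM_DOCS) - 1         # keep NOT within 16 bits
result = both & (~atomic_bomb & mask)
print(doc_numbers(result))         # → [16]
```

The mask is needed because Python integers have unlimited width, so a bare NOT would set bits beyond the 16 documents.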
Boolean operations are extremely fast on a computer. Even the cheapest microcomputer today can perform a 32-bit Boolean operation in a single instruction and more than a billion such operations per second. Moreover, because most of the digits in these binary numbers are 0, we only need to record the positions of the digits that are 1. The search engine's index therefore becomes one large table: each row corresponds to a keyword, and each keyword is followed by a list of numbers, namely the serial numbers of the documents that contain it.
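A table of this kind is commonly called an inverted index. The sketch below builds one from a few invented documents and performs the AND operation by intersecting two sorted lists of document numbers. The names build_index and intersect, the sample text, and the split-on-whitespace tokenization are all assumptions made for this example.

```python
from collections import defaultdict

def build_index(documents: dict[int, str]) -> dict[str, list[int]]:
    """Map each word to the sorted list of document numbers containing it."""
    index: dict[str, list[int]] = defaultdict(list)
    for doc_id in sorted(documents):
        for word in set(documents[doc_id].lower().split()):
            index[word].append(doc_id)
    return index

def intersect(a: list[int], b: list[int]) -> list[int]:
    """Boolean AND of two sorted document-number lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# The two rows from the example above, stored as document numbers.
print(intersect([2, 5, 10, 11, 16], [3, 5, 8, 9, 16]))  # → [5, 16]

# Hypothetical sample documents, invented for illustration.
docs = {1: "atomic energy application", 2: "atomic bomb", 3: "energy application"}
index = build_index(docs)
print(intersect(index["atomic"], index["application"]))  # → [1]
```

Because both lists are sorted, the intersection needs only one pass over each list.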
For Internet search engines, each webpage is a document. The number of webpages on the Internet is enormous, and so is the number of distinct words used in them. The index is therefore huge, on the order of trillions of bytes. Early search engines (such as all the search engines before AltaVista) were limited by the speed and capacity of computers, so they could only index the important keywords. To this day, many academic journals still ask authors to provide 3-5 keywords. As a result, neither uncommon words nor overly common function words could be searched. Today, to ensure that any query returns relevant webpages, all search engines index every word. To make ranking easier, a large amount of additional information, such as the position and frequency of each word, must also be stored in the index. The entire index therefore becomes too large to fit on a single computer. The common practice is to split the index into many parts (shards) by webpage serial number and store them on different servers. Each incoming query is distributed to many servers, which process it concurrently and send their results to a master server. The master server merges the results and returns them to the user.
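The sharding scheme can be sketched in a few lines. In this toy version, threads stand in for separate servers, documents are assigned to shards by serial number modulo the shard count, and the master simply takes the union of the partial results. The Shard class, build_shards, search, and the sample documents are hypothetical and written only for illustration; a real deployment would differ in many ways.

```python
from concurrent.futures import ThreadPoolExecutor

class Shard:
    """One part of the index, holding only the documents assigned to it."""

    def __init__(self) -> None:
        self.index: dict[str, set[int]] = {}

    def add(self, doc_id: int, text: str) -> None:
        for word in set(text.lower().split()):
            self.index.setdefault(word, set()).add(doc_id)

    def search(self, words: list[str]) -> set[int]:
        # Boolean AND over all query words, within this shard only.
        postings = [self.index.get(w, set()) for w in words]
        return set.intersection(*postings) if postings else set()

def build_shards(documents: dict[int, str], num_shards: int) -> list[Shard]:
    shards = [Shard() for _ in range(num_shards)]
    for doc_id, text in documents.items():
        shards[doc_id % num_shards].add(doc_id, text)  # split by serial number
    return shards

def search(shards: list[Shard], words: list[str]) -> list[int]:
    # The "master" sends the query to every shard concurrently, then merges.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial_results = list(pool.map(lambda s: s.search(words), shards))
    return sorted(set().union(*partial_results))

# Hypothetical sample documents, invented for illustration.
docs = {
    1: "atomic energy application",
    2: "atomic bomb",
    3: "solar energy application",
    4: "application of atomic energy",
    5: "atomic energy",
}
shards = build_shards(docs, num_shards=3)
print(search(shards, ["atomic", "energy", "application"]))  # → [1, 4]
```

Splitting by document rather than by keyword means every shard can answer the whole Boolean query on its own, so the merge step is a simple union.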
No matter how complicated the index becomes, the basic search operation is still a Boolean operation. Boolean operations tie logic to mathematics. Their greatest advantages are ease of implementation and speed, which are crucial when searching massive amounts of information. Their weakness is that they can only give a true-or-false answer, not a quantitative measure. For this reason, after the Boolean lookup is done internally, every search engine ranks the matching webpages by relevance before returning them to the user.
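The two-stage process, a Boolean filter followed by ranking, can be sketched as follows. The score used here, a raw count of query-word occurrences, is only a placeholder assumption and is not the relevance measure that real search engines use. The function names and sample documents are likewise invented for this example.

```python
def boolean_match(text: str, required: list[str]) -> bool:
    """Stage 1: true-or-false test, every query word must appear."""
    words = text.lower().split()
    return all(w in words for w in required)

def score(text: str, required: list[str]) -> int:
    """Stage 2 placeholder: count occurrences of the query words."""
    words = text.lower().split()
    return sum(words.count(w) for w in required)

def search(documents: dict[int, str], required: list[str]) -> list[int]:
    hits = [d for d, t in documents.items() if boolean_match(t, required)]
    # Highest placeholder score first.
    return sorted(hits, key=lambda d: score(documents[d], required), reverse=True)

# Hypothetical sample documents, invented for illustration.
docs = {
    1: "atomic energy",
    2: "atomic energy and more atomic energy",
    3: "solar energy",
}
print(search(docs, ["atomic", "energy"]))  # → [2, 1]
```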