Chapter 8: The Beauty of Simplicity: Boolean Algebra and Search Engine Indexes

Source: Internet
Author: User

In the following chapters, we will cover some basic knowledge about search engines. There is no shortcut to doing search well. The most basic requirement is to analyze 10 to 20 bad search results every day; only after doing this for a while do you develop a feel for the problem. Many engineers, however, fail to do this. The principle behind a search engine is actually very simple: automatically download as many web pages as possible, build a fast and effective index, and rank the pages fairly and accurately by relevance. We will introduce these one by one.

1. Boolean Algebra

Boolean arithmetic is quite simple. It is high-school material and will not be described here. Let's look at how document retrieval relates to Boolean operations. For each keyword the user enters, the search engine determines whether each document contains that keyword and assigns the document a logical value: true (1) if it does, false (0) if it does not. For example, suppose we want documents about computer applications but not about software. The query can be written as "computer AND application AND (NOT software)".
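As a minimal sketch of the query above, the Python snippet below evaluates "computer AND application AND (NOT software)" against each document in turn. The three document texts and the helper names contains and matches are made up for illustration; a real search engine would not scan the text this way.

```python
# Hypothetical toy corpus; the document texts are made up for illustration.
documents = {
    1: "computer application in hospitals",
    2: "computer software application",
    3: "application of statistics",
}

def contains(text, keyword):
    """Return True if keyword appears as a whole word in text."""
    return keyword in text.lower().split()

def matches(text):
    """Evaluate: computer AND application AND (NOT software)."""
    return (contains(text, "computer")
            and contains(text, "application")
            and not contains(text, "software"))

# Keep the documents for which the Boolean expression is true.
result = [doc_id for doc_id, text in documents.items() if matches(text)]
print(result)  # [1]
```

Only document 1 contains both wanted keywords and lacks the unwanted one.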

2. Index

Most people who use a search engine are surprised that it can find 10 million results in a very short time. Obviously it cannot scan the text of every web page for each query, so it relies on a trick: building an index. The simplest index structure uses a long binary number to record whether a keyword appears in each document, with one bit per document. For example, the binary number for "computer" is 01001000110..., indicating that the second, fifth, ninth, tenth... documents contain this keyword. Similarly, suppose the binary number for "application" is 00101001100.... To find documents about "computer application", we only need to AND the two binary numbers; every document whose bit in the result is 1 meets the requirement.
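The bitmap idea can be tried directly on the two example numbers. The sketch below uses only the first 11 bits quoted in the text and assumes the leftmost bit stands for document 1; the helper names to_int and matching_documents are illustrative.

```python
# First 11 bits of each bitmap as given in the text.
# Assumption: the leftmost bit corresponds to document 1.
computer = "01001000110"
application = "00101001100"

def to_int(bits):
    """Convert a string of 0s and 1s to an integer bitmap."""
    return int(bits, 2)

def matching_documents(bitmap, width):
    """List the 1-based document numbers whose bit is set."""
    bits = format(bitmap, "0{}b".format(width))
    return [i + 1 for i, bit in enumerate(bits) if bit == "1"]

width = len(computer)

# A single bitwise AND answers the query "computer AND application".
both = to_int(computer) & to_int(application)

print(matching_documents(to_int(computer), width))  # [2, 5, 9, 10]
print(matching_documents(both, width))              # [5, 9]
```

Among the first 11 documents, the fifth and the ninth contain both keywords.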

Because the number of web pages on the Internet is huge and the number of distinct words is also large, this index is enormous. The common practice is therefore to split the index into many parts by web page serial number and store them on different servers. When a query arrives, it is distributed to these servers, which process it in parallel and send their results to a master server. The master server merges them and returns the final result to the user.
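The fan-out and merge step can be sketched as follows. This is only an illustration under stated assumptions: threads stand in for separate servers, each shard is a small in-memory dictionary covering a range of document numbers, and the names SHARDS, search_shard, and search are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each one indexes a range of document numbers and
# maps a keyword to the set of documents in that range containing it.
SHARDS = [
    {"computer": {2, 5}, "application": {3, 5}},    # documents 1-5
    {"computer": {9, 10}, "application": {8, 9}},   # documents 6-10
]

def search_shard(shard, keywords):
    """Return the documents in one shard that contain every keyword."""
    postings = [shard.get(word, set()) for word in keywords]
    return set.intersection(*postings) if postings else set()

def search(keywords):
    """Send the query to all shards in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, keywords), SHARDS)
        return sorted(set().union(*partials))

print(search(["computer", "application"]))  # [5, 9]
```

Because each document lives in exactly one shard, the shards can answer independently and the merge is a simple union.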

As the content on the Internet grows, so does the amount of data. It therefore becomes necessary to build indexes at different levels, such as common and uncommon, based on the importance, quality, and access frequency of web pages. This is similar to the difference between the page table and the fast table (the translation lookaside buffer) in a computer. However complicated the engineering of a search engine index becomes, the principle remains very simple: it is equivalent to Boolean operations.
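One possible way to use two index levels is sketched below. The text does not say how the levels are queried, so the policy here (try the common index first and consult the uncommon one only when too few results are found) is an assumption, as are the names COMMON_TIER, RARE_TIER, and tiered_search and the sample data.

```python
# Hypothetical two-level index: keyword -> set of document numbers.
# Each document is assumed to live in exactly one level.
COMMON_TIER = {"computer": {5, 9}, "application": {5, 9}}
RARE_TIER = {"computer": {2, 10, 14}, "application": {3, 8, 14}}

def search_tier(tier, keywords):
    """Return the documents in one level that contain every keyword."""
    postings = [tier.get(word, set()) for word in keywords]
    return set.intersection(*postings) if postings else set()

def tiered_search(keywords, wanted):
    """Query the common level first; consult the uncommon level only if needed."""
    hits = search_tier(COMMON_TIER, keywords)
    if len(hits) < wanted:
        hits |= search_tier(RARE_TIER, keywords)
    return sorted(hits)

print(tiered_search(["computer", "application"], wanted=2))  # [5, 9]
print(tiered_search(["computer", "application"], wanted=3))  # [5, 9, 14]
```

Within each level the lookup is still the same Boolean AND over per-keyword document sets.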
