Information Retrieval for Beginners 2: Inverted Table and Storage

Source: Internet
Author: User

This article describes the easiest thing to understand in information retrieval: the inverted table, also called the inverted index. That is just a name, so don't let it worry you; I think everyone can understand what it is. Let's look at what an inverted table looks like!

The inverted table is indexed by words, and the content stored for each word is the list of document numbers that contain it. For example, documents 1, 3, 5, 7, and 9 contain the word "cat", and documents 2, 5, 8, and 10 contain the word "dog". What can you do with such a simple object? In fact, it is the core data structure of a search engine.

How does the search engine find relevant documents for a user query? If you query "cat", it only needs to follow the cat chain and return documents 1, 3, 5, 7, and 9. What if a user wants the documents that contain both "cat" and "dog"? This is similar to merging two sorted runs in merge sort. We point one pointer at the first element of the cat chain and another at the first element of the dog chain, compare the document numbers at the two pointers, and advance the pointers accordingly to find the documents that appear in both chains. The time complexity of the whole process is linear in the total length of the two chains.

The above is a simplified inverted table. The inverted tables used in real search engines are more complex, and a search engine may use several different inverted tables for different search tasks, but they are all similar in nature.
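The two-pointer walk described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `intersect` is made up for this example, and it assumes both chains are already sorted by ascending document number.

```python
def intersect(p1, p2):
    """Return the document numbers that appear in both sorted chains."""
    result = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The same document is in both chains: keep it, advance both pointers.
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1 is behind, so move its pointer forward
        else:
            j += 1  # p2 is behind, so move its pointer forward
    return result


cat = [1, 3, 5, 7, 9]
dog = [2, 5, 8, 10]
print(intersect(cat, dog))  # [5]
```

Each step advances at least one pointer, which is why the cost is linear in the combined length of the two chains.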

So how do we create this inverted table? The general process is very simple: extract every word from each document and append the document number to the chain for that word. The main problem lies in word extraction. In English documents, words are separated by spaces and punctuation marks, but Chinese has no such obvious separator. Natural language processing has a subfield devoted to word segmentation methods, and most current segmentation methods reach an accuracy of more than 90%.
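As a rough sketch of this construction process for English text, the following Python builds a small inverted table. The names `tokenize` and `build_inverted_table` and the three sample documents are invented for illustration, and the tokenizer simply assumes that words are runs of letters or digits; Chinese text would need a real word segmenter instead.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase the text and treat every run of letters/digits as a word.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_table(docs):
    """docs maps document number -> text; returns word -> sorted list of document numbers."""
    table = defaultdict(list)
    for doc_id in sorted(docs):                    # visit documents in ascending order
        for word in set(tokenize(docs[doc_id])):   # each word at most once per document
            table[word].append(doc_id)
    return dict(table)


docs = {
    1: "The cat sat on the mat.",
    2: "A dog barked.",
    3: "The cat chased the dog!",
}
table = build_inverted_table(docs)
print(table["cat"])  # [1, 3]
print(table["dog"])  # [2, 3]
```

Because documents are visited in ascending order, every chain comes out sorted, which is what the merge-style intersection relies on.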

Logically, implementing an inverted table is not hard. The difficulty, apart from word segmentation, is that the inverted table easily grows so large that neither the memory nor even the disk of a single computer can hold all of it. In practice, the number of web pages indexed by a search engine can reach tens of billions. Assume 10 billion pages with an average of 1000 words each, and store each document number as a 4-byte integer. A rough calculation then shows that the inverted table requires at least 10,000,000,000 * 1000 * 4 B = 40 TB of storage space. How to store this inverted table is therefore a genuinely tricky problem.

There are two kinds of solutions for storing big data: distributed storage and compression. The former is easy to understand: if one machine cannot hold the data, store it on several machines, although this requires some fairly complicated technology. And of course, once the inverted table is spread over several machines, processing user requests becomes more complicated too. The latter, compression, reduces the data size by expressing the source data in a more compact form, so that the content is unchanged but the storage space is smaller. We use compression software all the time. For example, gzip on Linux uses the LZ77 algorithm and Huffman coding to compress arbitrary binary data.

I am not familiar with distributed storage, so for an introduction to those technologies you will have to look at the blogs of the experts in the community. Data compression is comparatively simple, and this article introduces compression algorithms for the inverted table. Some people may ask: why not just compress the inverted table with a ready-made compression tool? That does reduce the storage space, but data compressed by such tools is slow to access.

For example, after the sentence "I am Chinese" has been compressed with gzip, you have to decompress all of the data before you can find the word "Chinese" in it. Likewise, to look up one document in an inverted table compressed this way, you have to decompress the entire table. Decompression is time-consuming, so this approach is not cost-effective.

Now let's return to the problem of compressing the inverted table. To get satisfactory results, we have to combine theory with practice.

Before introducing compression algorithms for the inverted table, let's look at a few common techniques that effectively reduce its size.

Here we take the real application as the starting point for optimizing the inverted table. The system is meant for people, so what kinds of query words will a person type in?

In English, will anyone search for words such as "the", "a", and "an"? Obviously these words carry little meaning on their own; people prefer to search for nouns and verbs. The inverted table therefore does not need to index "the", "a", or "an", which significantly reduces its size. Words like "the", "a", and "an" are called stop words. Some people have specifically studied which words can be treated as stop words and have compiled lists of several hundred of them. These words occur very frequently: pick any English article, remove all the stop words, and see how many words are left. So leaving stop words out of the index effectively reduces the size of the inverted table. Stop word lists also exist for Chinese, so the technique applies to Chinese search as well.
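To get a feel for the effect, here is a minimal Python sketch of stop word removal. The `STOP_WORDS` set is a tiny hypothetical list made up for this example; real lists contain hundreds of entries.

```python
# Hypothetical, deliberately tiny stop word list (real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def remove_stop_words(words):
    """Keep only the words that are not in the stop word list."""
    return [w for w in words if w.lower() not in STOP_WORDS]


words = "the cat and the dog sat in a garden".split()
kept = remove_stop_words(words)
print(kept)                         # ['cat', 'dog', 'sat', 'garden']
print(len(words), "->", len(kept))  # 9 -> 4
```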

Another technique is called stemming. It seems to be useful only for Western languages, but we introduce it here anyway. For example, "cars" is the plural of "car", yet for search purposes the two words mean the same thing, so replacing "cars" with "car" is a good idea. "car" can be regarded as the stem of "cars". In fact, the singular and plural forms and the different tenses of a word can all be reduced to one stem. This method is even more effective than removing stop words: take an article, stem all of its words, and see how many distinct words remain. As with stop words, some people specialize in stemming English words, and ready-made stemmers are available.
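The idea can be illustrated with a toy suffix stripper in Python. This is not a real stemming algorithm (production stemmers use far more careful rules); `toy_stem` and its short suffix list are assumptions chosen only to show how several word forms collapse into one index entry.

```python
def toy_stem(word):
    """Very rough suffix stripping, for illustration only."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        # Strip the suffix only if at least three letters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["car", "cars", "walk", "walked", "walking", "walks"]
stems = {toy_stem(w) for w in words}
print(sorted(stems))  # ['car', 'walk']
```

Six distinct words become two stems, so the inverted table needs two chains instead of six.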

Combining these two techniques can greatly reduce the size of the inverted table. They do have some negative effects on search, but it is generally considered that the benefits outweigh the drawbacks.

Now let's get back to the topic. Suppose the inverted table is still very large after applying the two methods above. What should we do? This is where compression algorithms come in. We will not use complicated algorithms such as the ones in gzip here. Because the inverted table has a simple form, a simple problem can have a simple solution.

A document number has to be stored as a 4-byte integer, but the gaps between consecutive document numbers in a chain are usually not large and can be expressed in fewer than 4 bytes. For example, if we know that no gap exceeds the range of two bytes, we can use 4 bytes for the first document number in a chain and 2 bytes for each gap after it, so the inverted table shrinks to about half its original size, for example from 40 TB to 20 TB. This idea is feasible and attractive, but one problem remains: a gap may be of the same order of magnitude as a document number, in which case two bytes are not enough.
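Storing gaps instead of absolute document numbers can be sketched as follows. The helper names `to_gaps` and `from_gaps` and the sample chain are invented for this example, and the chain is assumed to be sorted in ascending order.

```python
def to_gaps(doc_ids):
    """Keep the first document number, then store the difference to each next one."""
    gaps = [doc_ids[0]] if doc_ids else []
    for prev, cur in zip(doc_ids, doc_ids[1:]):
        gaps.append(cur - prev)
    return gaps


def from_gaps(gaps):
    """Rebuild the original document numbers by summing the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


postings = [283042, 283043, 283047, 283154]
gaps = to_gaps(postings)
print(gaps)             # [283042, 1, 4, 107]
print(from_gaps(gaps))  # [283042, 283043, 283047, 283154]
```

Most of the numbers to store are now small, which is exactly what a compact encoding can exploit.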

The following describes variable byte encoding, which encodes each document gap in a whole number of bytes. The first bit of each byte is the continuation bit, and the remaining 7 bits carry the encoded value. If the first bit is 1, the byte is the last byte of the encoding; otherwise it is not.

For example:

5 in variable byte encoding, written in binary: 10000101

214577 in variable byte encoding, written in binary: 00001101 00001100 10110001

For the second example, decoding works as follows: concatenate the lower seven bits of each byte to obtain a 21-bit binary number, which is 214577. Encoding and decoding are both very simple with this method, and the time overhead is very small, much smaller than that of gzip.
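The encoding and decoding steps can be written out as a short Python sketch. The function names are chosen for this example only; the code follows the convention stated above, where a leading 1 bit marks the last byte of a number.

```python
def vb_encode_number(n):
    """Encode one non-negative integer; only the last byte has its high bit set."""
    out = [n % 128 + 128]       # last byte: low 7 bits plus the continuation bit
    n //= 128
    while n > 0:
        out.insert(0, n % 128)  # earlier bytes: 7 bits each, high bit left at 0
        n //= 128
    return bytes(out)


def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(data):
    numbers, n = [], 0
    for byte in data:
        if byte < 128:          # high bit 0: more bytes follow
            n = n * 128 + byte
        else:                   # high bit 1: this byte completes the number
            numbers.append(n * 128 + (byte - 128))
            n = 0
    return numbers


def bits(data):
    return " ".join(f"{b:08b}" for b in data)


print(bits(vb_encode_number(5)))       # 10000101
print(bits(vb_encode_number(214577)))  # 00001101 00001100 10110001
print(vb_decode(vb_encode([5, 214577])))  # [5, 214577]
```

Because each number ends at a byte with the high bit set, a whole chain of gaps can be stored back to back and decoded in a single pass.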

This encoding is so simple that you might doubt how well it works in practice. Someone built an inverted table on an authoritative document collection (about 800,000 documents): the original size was 250 MB, while the table compressed with variable byte encoding was only 116 MB. Explaining this result may require Zipf's law, which you can look up.

Besides variable byte encoding, there are other methods such as gamma encoding. These methods operate on storage units smaller than one byte, so they are more complex than variable byte encoding, but they can achieve better compression both in theory and in practice. On the same 800,000-document collection, however, gamma encoding brings the inverted table down to 101 MB, which is only 13% smaller than with variable byte encoding.
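For readers curious about what a bit-level code looks like, here is a small Python sketch of gamma encoding. The article does not spell out the details, so the convention used here is an assumption: the code is the unary length of the offset (that many 1 bits followed by a 0) followed by the offset, where the offset is the binary form of the number without its leading 1. Bits are kept as a text string of '0' and '1' for readability rather than packed into bytes.

```python
def gamma_encode(n):
    """Return the gamma code of a positive integer as a string of '0' and '1'."""
    if n < 1:
        raise ValueError("gamma codes are defined for positive integers only")
    offset = bin(n)[3:]               # binary digits of n without the leading 1
    length = "1" * len(offset) + "0"  # unary code for the number of offset bits
    return length + offset


def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":         # count the 1 bits of the unary part
            length += 1
            i += 1
        i += 1                        # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        numbers.append(int("1" + offset, 2))  # put the leading 1 back
    return numbers


print(gamma_encode(13))  # 1110101
stream = "".join(gamma_encode(g) for g in [1, 5, 13])
print(gamma_decode(stream))  # [1, 5, 13]
```

A gap of 1 costs a single bit here, compared with a whole byte in variable byte encoding, at the price of bit-level bookkeeping.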

If stemming and stop word removal reduce the size of the inverted table by 1/10, and variable byte encoding then halves it, our inverted table is now about 18 TB. Data of this size still calls for distributed storage, but fewer machines are needed, which saves cost. And for a programmer who wants to build a search engine that indexes a few million web pages on a single machine, this becomes a real possibility.
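The back-of-the-envelope numbers used in this article can be rechecked with a few lines of Python. The variable names are made up for this sketch, the inputs are the article's own rough assumptions (10 billion pages, 1000 words per page, 4 bytes per document number), and a terabyte is taken as 10^12 bytes.

```python
TB = 10**12  # decimal terabyte, matching the rough estimate in the text

pages = 10_000_000_000   # 10 billion web pages
words_per_page = 1000    # average words per page
bytes_per_doc_id = 4     # one 4-byte integer per document number

raw = pages * words_per_page * bytes_per_doc_id
after_stop_and_stem = raw * 9 // 10           # stop words and stemming remove 1/10
after_variable_byte = after_stop_and_stem // 2  # variable byte encoding halves the rest

print(raw // TB, after_stop_and_stem // TB, after_variable_byte // TB)  # 40 36 18
```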
