Inverted index: the cornerstone of the search engine

Source: Internet
Author: User

Article transferred from: http://blog.csdn.net/hguisu/article/details/7969757

1. Overview

 

In relational database systems, an index is the most efficient way to retrieve data. For search, however, a general-purpose database index does not meet the special requirements:

1) Massive data: search engines face massive data volumes. Large commercial search engines such as Google and Baidu index hundreds of millions or even billions of webpages, a volume that a database system has difficulty managing effectively.

2) Simple data operations: a search engine needs only simple operations on its data, generally just add, delete, modify, and query, and the data has a specific format, so a simple and efficient purpose-built application can be designed for it. A general-purpose database system supports a large and comprehensive feature set and pays for it in both speed and space.

Finally, a search engine faces a very large number of user queries, which requires that each search complete within seconds. As much of the computation as possible should be done when the index is created, so that the work done at search time is minimized. A general-purpose database system struggles to withstand such a volume of user requests, and its retrieval response time and retrieval concurrency fall far short of a specially designed index system.

 

 

2. Inverted index

 

Definition from Wikipedia:

An inverted index, also known as a postings file or inverted file, is an index method used in full-text search to store a mapping from a word to its locations in a document or a set of documents. It is the most common data structure in document retrieval systems. Given a word, an inverted index lets you quickly obtain the list of documents that contain it. An inverted index consists mainly of two parts: the "word dictionary" and the "inverted file".
There are two variants of inverted index:
A record-level inverted index (or inverted file index) contains, for each word, a list of the documents that reference it.
A word-level inverted index (or full inverted index) additionally contains the position of each word within a document.
The latter supports more functionality (such as phrase search), but requires more time and space to create.
The indexes used by modern search engines are based on inverted indexes. Compared with index structures such as the "signature file" and the "suffix tree", the inverted index is the best way to map words to documents and the most effective index structure.
For a simple example of an inverted index, see: Search Engine-basic knowledge of inverted index
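The difference between the two variants can be illustrated with a minimal Python sketch. The toy corpus, the whitespace tokenizer, and the function names (`record_level_index`, `word_level_index`, `phrase_search`) are all assumptions made for illustration and do not come from the article or from any library.

```python
from collections import defaultdict

# Hypothetical toy corpus: docid -> text.
docs = {
    1: "search engines build an inverted index",
    2: "an inverted index maps words to documents",
    3: "phrase search needs word positions",
}

def record_level_index(docs):
    """Map each word to the sorted list of docids that contain it."""
    index = defaultdict(set)
    for docid, text in docs.items():
        for word in text.split():
            index[word].add(docid)
    return {word: sorted(ids) for word, ids in index.items()}

def word_level_index(docs):
    """Map each word to {docid: [positions]}, i.e. a full inverted index."""
    index = defaultdict(dict)
    for docid, text in docs.items():
        for pos, word in enumerate(text.split()):
            index[word].setdefault(docid, []).append(pos)
    return dict(index)

def phrase_search(index, first, second):
    """Return docids in which `second` occurs immediately after `first`."""
    hits = []
    for docid, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(docid, []))
        if any(pos + 1 in following for pos in positions):
            hits.append(docid)
    return sorted(hits)

record_index = record_level_index(docs)
full_index = word_level_index(docs)
print(record_index["inverted"])
print(phrase_search(full_index, "inverted", "index"))
```

The record-level index can only answer "which documents contain this word", while the phrase query needs the positions kept by the word-level index, which is why the latter costs more time and space to build.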

 

3. Inverted list

The inverted list records which documents contain a given word. In general, many documents in the document set contain the word. For each such document, the list records the document number (docid), the number of times the word appears in that document (TF), and the positions at which the word appears. The information recorded for one document is called a posting (an inverted index item), and the series of postings for a word forms a list structure, which is the inverted list for that word. Figure 1 shows an inverted list. All the words that appear in the document set, together with their inverted lists, constitute the inverted index.

Figure 1 inverted list

In a real search engine system, the inverted list does not store the actual document numbers; it stores document number differences (D-gaps) instead. A D-gap is the difference between the document numbers of two adjacent postings in the inverted list. Because a posting that appears later in the list always has a larger document number than the one before it, the difference is always an integer greater than 0. In the example in Figure 2, the original three document numbers are 187, 196, and 199. After the differences are calculated, they are stored as 187, 9, and 3.

 
Figure 2 Document ID difference
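The conversion in Figure 2 can be written in a few lines. This is a minimal sketch that assumes the docids are already sorted in ascending order; `to_dgaps` and `from_dgaps` are illustrative names.

```python
def to_dgaps(docids):
    """Convert ascending docids to D-gaps; the first value is stored as-is."""
    gaps = []
    previous = 0
    for docid in docids:
        gaps.append(docid - previous)
        previous = docid
    return gaps

def from_dgaps(gaps):
    """Recover the original docids by taking a running sum of the gaps."""
    docids = []
    current = 0
    for gap in gaps:
        current += gap
        docids.append(current)
    return docids

gaps = to_dgaps([187, 196, 199])
print(gaps)              # [187, 9, 3]
print(from_dgaps(gaps))  # [187, 196, 199]
```

Decoding requires reading the list from the start, since each docid depends on the sum of all gaps before it.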

 

The main reason for storing document number differences is better data compression. Original document numbers are generally large values; taking differences converts them into small values, which improves the compression ratio.
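To see why small values compress better, consider a variable-length byte encoding that spends 7 bits of payload per byte. The article does not name a specific compression scheme, so the encoding below is only an assumption chosen for illustration, and `vbyte_encode` and `vbyte_decode` are hypothetical names.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers with 7 payload bits per byte, low bits first."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
            n >>= 7
        out.append(n)                      # high bit clear: last byte of this number
    return bytes(out)

def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(n)
            n, shift = 0, 0
    return numbers

raw_docids = [187, 196, 199]
dgaps = [187, 9, 3]
raw_bytes = vbyte_encode(raw_docids)
gap_bytes = vbyte_encode(dgaps)
print(len(raw_bytes), len(gap_bytes))  # 6 4
```

Under this scheme every value below 128 fits in one byte, so the two small gaps take one byte each while each of the original docids takes two.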

 

4. Create an inverted index

4.1 Simple index construction

 

Building an index amounts to converting a forward table into an inverted table. After the webpages are analyzed, we have a forward table keyed by webpage. Index construction turns it into an inverted table keyed by word, as shown in Figure 3:

Figure 3 index construction

Process:

1) Analyze each document and split it into terms (words).
2) Use a hash table to remove duplicate terms.
3) Generate the inverted list for each term.
Here the inverted list holds only the document numbers (docids) and no other information (such as word frequency or word position). This is a simple index.
This simple method is adequate for small data sets, such as indexing a few thousand documents. However, it has two limitations:
1) It needs enough memory to hold the whole inverted table. For a search engine this means gigabytes of data, and as the scale keeps growing it becomes impossible to provide that much memory.
2) The algorithm runs sequentially and cannot be processed in parallel.
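The three steps above can be sketched in Python as follows. The lowercase-and-split tokenizer, the sample documents, and the function names are assumptions made for this example; a Python `set` plays the role of the hash used to remove duplicate terms.

```python
def tokenize(text):
    """Step 1: split a document into terms (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_simple_index(docs):
    """Build {term: [docid, ...]} entirely in memory."""
    index = {}
    for docid in sorted(docs):
        unique_terms = set(tokenize(docs[docid]))     # step 2: hash-based dedup
        for term in unique_terms:
            index.setdefault(term, []).append(docid)  # step 3: extend inverted list
    return index

# Hypothetical sample documents.
simple_docs = {
    1: "the cat sat on the mat",
    2: "the dog sat",
    3: "a cat and a dog",
}
simple_index = build_simple_index(simple_docs)
print(simple_index["the"])  # [1, 2]
print(simple_index["cat"])  # [1, 3]
```

Because documents are visited in ascending docid order, each inverted list comes out sorted without an extra sorting pass. The whole `index` dictionary lives in memory, which is exactly the first limitation listed above.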

4.3 Index creation using the merge method
Merge method: each time the data in memory is written to disk, all intermediate results, including the dictionary, are written out as well. Everything in memory can then be cleared, so the same fixed amount of memory is fully available for indexing the next batch of documents.

The merge method is shown in Figure 4:

 

Figure 4: Merge Indexes

The merge process is shown in Figure 5:

1) Analyze the pages to build temporary inverted indexes A and B in memory. When a temporary index fills the available memory, write it to a temporary file, producing a temporary inverted file.
2) Merge the temporary inverted files, in multiple passes if necessary, and output the final inverted file.

Figure 5 merging process

Page analysis, and Chinese word segmentation in particular, is the main time cost of index creation. The second step of the algorithm (merging) is relatively fast. Optimization of index creation therefore concentrates on the efficiency of Chinese word segmentation.
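The two-step merge process can be sketched as follows. This is only an illustration under several assumptions: the memory limit is modeled as a maximum number of buffered postings, temporary inverted files are plain text with one `term<TAB>docid` pair per line, and the final index is returned as a dictionary for easy inspection, whereas a real system would stream it to the final inverted file. All function names are hypothetical.

```python
import heapq
import os
import tempfile

def write_run(postings, directory):
    """Sort buffered (term, docid) pairs and write them to a temporary file."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for term, docid in sorted(postings):
            f.write(f"{term}\t{docid}\n")
    return path

def read_run(path):
    """Stream (term, docid) pairs back from a temporary inverted file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, docid = line.rstrip("\n").split("\t")
            yield term, int(docid)

def build_index_by_merging(docs, max_postings_in_memory, directory):
    buffer, runs = [], []
    # Step 1: fill memory, flush to a temporary file, clear memory, repeat.
    for docid in sorted(docs):
        for term in set(docs[docid].lower().split()):
            buffer.append((term, docid))
        if len(buffer) >= max_postings_in_memory:
            runs.append(write_run(buffer, directory))
            buffer = []
    if buffer:
        runs.append(write_run(buffer, directory))
    # Step 2: merge all sorted temporary files in a single sequential pass.
    final_index = {}
    for term, docid in heapq.merge(*(read_run(path) for path in runs)):
        final_index.setdefault(term, []).append(docid)
    return final_index, len(runs)

# Hypothetical sample documents.
merge_docs = {1: "the cat sat", 2: "the dog sat", 3: "a cat and a dog", 4: "the mat"}
with tempfile.TemporaryDirectory() as tmp:
    merged_index, run_count = build_index_by_merging(
        merge_docs, max_postings_in_memory=4, directory=tmp
    )
print(run_count, merged_index["the"])
```

`heapq.merge` reads the sorted files lazily, so the merge step holds only one pending entry per temporary file in memory at any time.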

4.2 Concurrent and distributed indexing

As mentioned in "Search engine-Web Crawler", for documents held in cloud storage, the MapReduce parallel computing model can be used to generate the inverted lists:

For the task of creating an inverted index, as shown in Figure 6, the input data is again a set of webpages. The docid of a webpage is the key of the input data, and the set of words that appear in the webpage is the value. The map operation converts the input into pairs of the form (word, docid): the word is the key and the docid is the value of the intermediate data, meaning that the word appears in the webpage with that docid. The reduce operation combines the intermediate records that share the same key to obtain the list of webpage IDs for a word: <word, list(docid: pos)>. This is the inverted list for that word. In this way a simple inverted index can be created. More complex processing can also be performed in the reduce stage to obtain a richer inverted index.

Figure 6
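The map and reduce operations described above can be simulated in a single process. The sketch below does not use a real MapReduce framework: `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical names, the shuffle step stands in for the grouping a framework would perform between the two stages, and only docids (no positions) are collected.

```python
from collections import defaultdict

def map_phase(docid, words):
    """Map: emit one (word, docid) pair per distinct word on the page."""
    for word in set(words):
        yield word, docid

def shuffle(pairs):
    """Group intermediate values by key, as the framework would between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, docids):
    """Reduce: combine all docids for one word into its inverted list."""
    return word, sorted(docids)

def run_job(pages):
    intermediate = []
    for docid, words in pages.items():   # each page could be mapped on a different worker
        intermediate.extend(map_phase(docid, words))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(word, docids) for word, docids in grouped.items())

# Hypothetical input: docid -> words that appear on the page.
pages = {
    1: ["inverted", "index", "search"],
    2: ["search", "engine"],
    3: ["index", "engine", "index"],
}
mr_index = run_job(pages)
print(mr_index["index"])  # [1, 3]
```

Each call to `map_phase` depends only on its own page and each call to `reduce_phase` only on its own word, which is what allows both stages to be distributed across machines.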

 

 


 

5. Index Update policy

 

There are four update policies: complete rebuild, re-merge, in-place update, and hybrid.

  1. Complete rebuild policy: when the number of newly added documents reaches a certain level, the new documents are combined with the old documents, and all documents are re-indexed using the static index creation method. Once the new index is built, the old index is discarded. This method is costly, but the current mainstream commercial search engines generally use it to maintain index updates (this statement is quoted from the original source).
  2. Re-merge policy: when a new document enters the system, it is parsed and the temporary index maintained in memory is updated, with a new posting appended to the end of the inverted list of every word in the document. Once the temporary index uses up the specified amount of memory, it is merged with the old index. Inverted lists in the inverted file are stored in ascending dictionary order of their words, so both indexes can be scanned and merged sequentially. The disadvantage is that a whole new inverted index file must be generated: the inverted lists of many words in the old index have not changed, yet they still have to be read from the old index and written to the new one, which wastes disk I/O.
  3. In-place update policy: this tries to improve on the re-merge policy by merging inverted lists in place. It requires some space to be reserved at the end of each inverted list for future insertions, and if the reserved space runs out, the data must be migrated. It has been shown that its index update efficiency is lower than that of the re-merge policy.
  4. Hybrid policy: the idea is to combine the strengths of the different index update policies into a more efficient method.
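As a rough sketch of the re-merge policy, the old index can be modeled as a list of (term, docids) entries sorted by term, and the temporary index as an in-memory dictionary. The function names and sample data are assumptions made for this example, and the "files" are plain Python lists rather than data on disk.

```python
def add_document(temp_index, docid, text):
    """Append a posting for every distinct word of a new document."""
    for term in set(text.lower().split()):
        temp_index.setdefault(term, []).append(docid)

def remerge(old_index, temp_index):
    """Sequentially merge a term-sorted old index with the temporary index."""
    new_entries = sorted(temp_index.items())
    merged, i, j = [], 0, 0
    while i < len(old_index) and j < len(new_entries):
        old_term, old_ids = old_index[i]
        new_term, new_ids = new_entries[j]
        if old_term < new_term:
            merged.append((old_term, old_ids))   # unchanged list, still copied
            i += 1
        elif old_term > new_term:
            merged.append((new_term, new_ids))   # word not seen before
            j += 1
        else:
            merged.append((old_term, old_ids + new_ids))
            i += 1
            j += 1
    merged.extend(old_index[i:])
    merged.extend(new_entries[j:])
    return merged

# Hypothetical old index and newly arrived documents.
old_index = [("cat", [1, 3]), ("dog", [2]), ("mat", [1])]
temp_index = {}
add_document(temp_index, 4, "dog bird")
add_document(temp_index, 5, "dog")
new_index = remerge(old_index, temp_index)
print(new_index)
```

In this run only the list for "dog" actually changes, yet the lists for "cat" and "mat" are copied into the new index as well, which is the wasted work the re-merge policy is criticized for.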

 

 

References:

This is the search engine: detailed explanation of core technologies

Search engine-Information Retrieval practices

