Information Retrieval Notes: Index Construction

Last Update:2018-08-22 Source: Internet

Author: User


How do we build an inverted index? We call this process "index construction." And if the document collection is so large that the index cannot fit in memory, how do we build it then?

Limitations of hardware

RAM is random-access: given an address, the corresponding cell can be read or written almost instantly. Disk is different: each access requires a seek, plus rotational latency. So whenever the disk is involved, the main concern is how to reduce I/O time.

Note: the operating system usually reads and writes data in blocks, because reading a single byte can take as much time as reading a whole block.
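To make the cost of seeking concrete, here is a rough back-of-the-envelope sketch. The two constants are assumed, illustrative figures (they are not from the original notes), and rotational latency is ignored for simplicity.

```python
# Assumed, illustrative disk parameters (not measured values).
SEEK_TIME_S = 5e-3            # average seek time per disk access, in seconds
TRANSFER_S_PER_BYTE = 2e-8    # transfer time per byte, in seconds


def read_time(total_bytes, num_chunks):
    """Estimate the time to read total_bytes stored as num_chunks
    non-contiguous pieces: one seek per piece plus the transfer time."""
    return num_chunks * SEEK_TIME_S + total_bytes * TRANSFER_S_PER_BYTE


contiguous = read_time(10_000_000, 1)     # 10 MB stored in one piece
scattered = read_time(10_000_000, 100)    # same 10 MB split into 100 pieces
print(f"contiguous: {contiguous:.3f} s, scattered: {scattered:.3f} s")
```

Under these assumptions the same 10 MB takes roughly 0.2 s when stored contiguously and roughly 0.7 s when scattered, which is why the algorithms below try to read and write large sequential runs.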

Blocked sort-based indexing (BSBI)

The usual way to create an inverted index: scan the documents to collect all term-document ID pairs, sort the pairs with the term as the primary key and the document ID as the secondary key, and then, if needed, compute statistics such as document frequency and term frequency. For a small collection, all of this can be done in memory, but for a large collection it may not be possible.
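The in-memory procedure just described can be sketched as follows. This is a minimal illustration; the function name, the whitespace tokenizer, and the sample documents are assumptions made for the example.

```python
def build_inverted_index(docs):
    """docs maps document ID -> text. Returns term -> sorted list of doc IDs."""
    # 1. Scan the documents and collect (term, document ID) pairs.
    pairs = []
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenizer
            pairs.append((term, doc_id))

    # 2. Sort with the term as primary key and the document ID as secondary key.
    pairs.sort()

    # 3. Group the sorted pairs into postings lists, skipping duplicates.
    index = {}
    for term, doc_id in pairs:
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index


docs = {1: "new home sales", 2: "home sales rise", 3: "new sales"}
index = build_inverted_index(docs)

# Document frequency is simply the length of each postings list.
doc_freq = {term: len(postings) for term, postings in index.items()}
```

Everything here lives in one list and one dictionary, which is exactly what stops working once the collection no longer fits in memory.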

BSBI, step 1: while collecting the pairs, map each term to a term ID. (Why map to an ID? To improve efficiency: integer IDs are smaller and faster to compare than strings.)

BSBI, step 2: divide the document collection into smaller parts (blocks) of equal size.

BSBI, step 3: sort each block by term ID and document ID, and write the resulting intermediate sorted file to disk.

BSBI, step 4: merge the sorted blocks to obtain the final inverted index. Because memory is insufficient, this requires a disk-based external sorting algorithm (multi-way merge sort, loser tree: http://blog.csdn.net/lsjseu/article/details/11708587).
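The four BSBI steps can be sketched roughly as below. This is a simplified illustration, not a production implementation: the function names, the text block format, and the block size measured in pairs are all assumptions, and the final index is returned as a dictionary rather than written to disk. The multi-way merge uses Python's heapq.merge in place of a hand-written loser tree.

```python
import heapq
import os
import tempfile


def read_block(path):
    """Lazily yield (term ID, doc ID) pairs from one sorted block file."""
    with open(path) as f:
        for line in f:
            term_id, doc_id = line.split()
            yield int(term_id), int(doc_id)


def bsbi_index(doc_stream, block_size, tmp_dir):
    """doc_stream yields (integer doc ID, text); block_size is pairs per block."""
    term_to_id = {}      # step 1: term -> term ID mapping, held in memory
    block_paths = []
    pairs = []

    def flush():
        # Step 3: sort the block by (term ID, doc ID) and write it to disk.
        pairs.sort()
        path = os.path.join(tmp_dir, f"block{len(block_paths)}.txt")
        with open(path, "w") as f:
            for term_id, doc_id in pairs:
                f.write(f"{term_id} {doc_id}\n")
        block_paths.append(path)
        pairs.clear()

    for doc_id, text in doc_stream:
        for term in text.lower().split():
            term_id = term_to_id.setdefault(term, len(term_to_id))
            pairs.append((term_id, doc_id))
            if len(pairs) >= block_size:    # step 2: equal-size blocks
                flush()
    if pairs:
        flush()

    # Step 4: multi-way merge of the sorted blocks, streamed from disk.
    id_to_term = {i: t for t, i in term_to_id.items()}
    index = {}
    merged = heapq.merge(*(read_block(p) for p in block_paths))
    for term_id, doc_id in merged:
        postings = index.setdefault(id_to_term[term_id], [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return index


docs = [(1, "new home sales"), (2, "home sales rise"), (3, "new sales")]
with tempfile.TemporaryDirectory() as tmp:
    bsbi_result = bsbi_index(docs, block_size=4, tmp_dir=tmp)
```

Only one block of pairs plus the term-to-ID mapping is held in memory at a time; the merge reads each block file sequentially.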

Single-pass in-memory indexing (SPIMI)

BSBI scales well, but it requires a data structure that maps terms to IDs, and for a large collection this structure may itself not fit in memory.

SPIMI does not use term IDs; it stores the terms themselves directly.

SPIMI, step 1: even though memory is limited, part of the index can still be built in it, so SPIMI builds the index in memory until memory is full. At that point, SPIMI writes the index built so far to a file on disk.

SPIMI, step 2: before each block from step 1 is written out, its dictionary must be sorted; that is, the terms are sorted first and the block is then stored to disk.

SPIMI, step 3: merge the blocks into the final inverted index.

Note: the difference between SPIMI and BSBI is that BSBI sorts as it processes each block, whereas SPIMI finishes building each block first and then sorts the block as a whole.
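A rough sketch of the SPIMI steps follows. It is illustrative only: "memory is full" is approximated by a hypothetical max_postings limit, blocks are stored as JSON lines, documents are assumed to arrive in increasing document ID order, and the merged index is returned as a dictionary instead of being written to disk.

```python
import heapq
import json
import os
import tempfile
from operator import itemgetter


def read_spimi_block(path):
    """Lazily yield (term, postings list) entries from one block file."""
    with open(path) as f:
        for line in f:
            term, postings = json.loads(line)
            yield term, postings


def spimi_index(doc_stream, max_postings, tmp_dir):
    block_paths = []
    dictionary = {}      # term -> postings list; no term IDs are used
    count = 0            # postings currently held in memory

    def write_block():
        # Step 2: sort the terms only when the block is written out.
        path = os.path.join(tmp_dir, f"spimi{len(block_paths)}.txt")
        with open(path, "w") as f:
            for term in sorted(dictionary):
                f.write(json.dumps([term, dictionary[term]]) + "\n")
        block_paths.append(path)
        dictionary.clear()

    # Step 1: add postings directly to the in-memory dictionary until "full".
    for doc_id, text in doc_stream:
        for term in text.lower().split():
            postings = dictionary.setdefault(term, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
                count += 1
            if count >= max_postings:
                write_block()
                count = 0
    if dictionary:
        write_block()

    # Step 3: merge the blocks term by term into the final index.
    index = {}
    blocks = (read_spimi_block(p) for p in block_paths)
    for term, postings in heapq.merge(*blocks, key=itemgetter(0)):
        merged = index.setdefault(term, [])
        for doc_id in postings:
            if not merged or merged[-1] != doc_id:
                merged.append(doc_id)
    return index


docs = [(1, "new home sales"), (2, "home sales rise"), (3, "new sales")]
with tempfile.TemporaryDirectory() as tmp:
    spimi_result = spimi_index(docs, max_postings=4, tmp_dir=tmp)
```

Compared with the BSBI approach, no global term-to-ID mapping is kept, and no list of individual pairs is ever sorted; only the dictionary keys of each block are.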

Distributed Build Method

In practice, the collection can be so large that a single machine cannot process it. In that case, a distributed system can be used to build the index, and here we use the MapReduce architecture. MapReduce splits a large processing task into subtasks, distributes them to compute nodes, and merges the results once they complete to produce the final result.

Subtasks are not assigned to nodes in advance; the master node allocates them dynamically while the job runs. This way, a fast node ends up with more subtasks (it finishes quickly and can be assigned new ones), while a slower node ends up with fewer. And if a machine stops responding, its tasks can be moved to other nodes.
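The map and reduce phases can be imitated on a single machine as in the sketch below. This is not a real MapReduce framework: a thread pool stands in for the master node (each idle worker picks up the next pending subtask, which mimics dynamic allocation), and the function names and the first-character partitioning rule are assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Map phase: turn one split of documents into (term, doc ID) pairs."""
    return [(term, doc_id)
            for doc_id, text in split
            for term in text.lower().split()]


def reduce_task(pairs):
    """Reduce phase: collect the doc IDs of each term into a postings list."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def partition_of(term, num_partitions):
    # Hypothetical partitioning rule: spread terms by their first character.
    return ord(term[0]) % num_partitions


def build_index(splits, num_workers=2, num_partitions=2):
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Idle workers take the next split, so faster workers process more.
        mapped = list(pool.map(map_task, splits))

        # Route every pair to the partition responsible for its term.
        partitions = [[] for _ in range(num_partitions)]
        for pairs in mapped:
            for term, doc_id in pairs:
                partitions[partition_of(term, num_partitions)].append((term, doc_id))

        reduced = list(pool.map(reduce_task, partitions))

    # Each term belongs to exactly one partition, so the results never clash.
    index = {}
    for part in reduced:
        index.update(part)
    return index


splits = [[(1, "new home sales")], [(2, "home sales rise")], [(3, "new sales")]]
mr_index = build_index(splits)
```

Handling of failed or stalled nodes is left out here; in a real cluster the master would reassign such subtasks to other nodes.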

Dynamic Indexing Method

The methods above all build a static index, which assumes the documents do not change. But documents may be added or modified, in which case the index has to be built dynamically.

To do this, we can maintain two indexes: a large main index and a small auxiliary index kept in memory. At query time, we search both indexes and merge the two sets of results.

When the auxiliary index grows too large, we merge it into the main index.
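The main-plus-auxiliary scheme can be sketched as follows. The class name and the max_aux_postings threshold are hypothetical, and for brevity the main index is an in-memory dictionary here, whereas in practice it would live on disk.

```python
class DynamicIndex:
    """Large main index plus a small in-memory auxiliary index."""

    def __init__(self, max_aux_postings):
        self.main = {}                  # term -> sorted list of doc IDs
        self.aux = {}                   # term -> set of doc IDs (new documents)
        self.aux_size = 0               # number of postings held in aux
        self.max_aux_postings = max_aux_postings

    def add_document(self, doc_id, text):
        # New documents only ever touch the auxiliary index.
        for term in set(text.lower().split()):
            ids = self.aux.setdefault(term, set())
            if doc_id not in ids:
                ids.add(doc_id)
                self.aux_size += 1
        # When the auxiliary index grows too large, fold it into the main one.
        if self.aux_size >= self.max_aux_postings:
            self._merge()

    def _merge(self):
        for term, ids in self.aux.items():
            self.main[term] = sorted(set(self.main.get(term, [])) | ids)
        self.aux.clear()
        self.aux_size = 0

    def search(self, term):
        # Query both indexes and merge the two result lists.
        from_main = set(self.main.get(term, []))
        from_aux = self.aux.get(term, set())
        return sorted(from_main | from_aux)


idx = DynamicIndex(max_aux_postings=5)
idx.add_document(1, "new home sales")
idx.add_document(2, "home sales rise")   # threshold reached: aux merged into main
idx.add_document(3, "new sales")         # stays in the auxiliary index for now
```

After these calls, a search for "sales" has to combine documents 1 and 2 from the main index with document 3 from the auxiliary index.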

Postscript

The inverted index itself can also be very large, so how do we compress it? See: http://blog.csdn.net/lsjseu/article/details/12239967

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
