Information Retrieval Notes: Index Construction

Source: Internet
Author: User

The process of building an inverted index is called "index construction." The question here is how to build the index when the document collection is so large that the index cannot fit in memory.

Hardware limitations

RAM supports random access: given an address, the data in that cell can be read or written almost instantly. A disk cannot do this. Every disk access needs a seek to position the head, plus rotational latency. So it is when disk is involved that we have to think about how to reduce I/O time.

Note: operating systems generally read and write data in whole blocks, because reading a single byte from disk can take about as much time as reading an entire block.

Blocked sort-based indexing (BSBI)

The usual way to build an inverted index is to scan the documents and collect all term-document ID pairs, sort the pairs with the term as the primary key and the document ID as the secondary key, and then, if needed, compute the document frequency and term frequency of each term. For a small collection all of this can be done in memory, but for a large collection it may not fit.
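The in-memory procedure described above can be sketched as follows. This is a minimal illustration, not a production indexer: the function name `build_index`, whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from itertools import groupby

def build_index(docs):
    """docs: dict mapping doc_id -> text. Returns term -> sorted list of doc IDs."""
    # Collect (term, doc_id) pairs; the set drops repeats within one document.
    pairs = {(term, doc_id)
             for doc_id, text in docs.items()
             for term in text.lower().split()}
    # Sort by term first, then by document ID.
    ordered = sorted(pairs)
    # Group the sorted pairs by term to form postings lists.
    return {term: [doc_id for _, doc_id in group]
            for term, group in groupby(ordered, key=lambda pair: pair[0])}

# Hypothetical sample collection.
docs = {1: "new home sales", 2: "home sales rise", 3: "new sales"}
index = build_index(docs)
# Document frequency is the length of each postings list.
doc_freq = {term: len(postings) for term, postings in index.items()}
```

Everything here lives in memory at once, which is exactly the assumption that stops holding for large collections.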

BSBI step 1: as terms are collected, map each term to an integer ID. (Integer IDs are more compact than strings and faster to compare and sort, which improves efficiency.)

BSBI step 2: divide the document collection into blocks of equal size, each small enough to be processed in memory.

BSBI step 3: sort the term ID-document ID pairs of each block, by term ID and then by document ID, and write the sorted intermediate file to disk.

BSBI step 4: merge the sorted blocks to obtain the final inverted index. Because memory is insufficient, this must be a disk-based external sort (a multiway merge sort, for example using a loser tree).
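The four steps can be sketched roughly as below. This is an illustration under several assumptions: `bsbi_index`, `block_size` (a pair count standing in for available memory), and the one-pair-per-line block file format are invented for the example. `heapq.merge` plays the role of the multiway merge, and the final index is returned as a dict only to keep the sketch short; a real system would stream it to disk.

```python
import heapq
import os
import tempfile
from itertools import groupby

def bsbi_index(docs, block_size, tmp_dir):
    """docs: iterable of (doc_id, text); block_size: max pairs held in memory."""
    term_ids = {}        # step 1: term -> integer ID
    block_paths = []
    buffer = []          # (term_id, doc_id) pairs of the current block

    def flush():
        # Step 3: sort the block by (term_id, doc_id) and write it to disk.
        buffer.sort()
        path = os.path.join(tmp_dir, f"block{len(block_paths)}.txt")
        with open(path, "w") as f:
            for term_id, doc_id in buffer:
                f.write(f"{term_id} {doc_id}\n")
        block_paths.append(path)
        buffer.clear()

    for doc_id, text in docs:
        for term in text.lower().split():
            term_id = term_ids.setdefault(term, len(term_ids))
            buffer.append((term_id, doc_id))
            if len(buffer) >= block_size:   # step 2: fixed-size blocks
                flush()
    if buffer:
        flush()

    def read_block(path):
        with open(path) as f:
            for line in f:
                term_id, doc_id = line.split()
                yield int(term_id), int(doc_id)

    # Step 4: multiway merge of the sorted block files.
    merged = heapq.merge(*(read_block(p) for p in block_paths))
    id_to_term = {i: t for t, i in term_ids.items()}
    index = {}
    for term_id, group in groupby(merged, key=lambda pair: pair[0]):
        postings = []
        for _, doc_id in group:
            if not postings or postings[-1] != doc_id:  # drop duplicates
                postings.append(doc_id)
        index[id_to_term[term_id]] = postings
    return index

# Hypothetical sample collection; block_size=4 forces two blocks.
docs = [(1, "new home sales"), (2, "home sales rise"), (3, "new sales")]
with tempfile.TemporaryDirectory() as tmp:
    index = bsbi_index(docs, block_size=4, tmp_dir=tmp)
```

Note that the `term_ids` dictionary stays in memory for the whole run, which is the cost of working with integer IDs.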

Single-pass in-memory indexing (SPIMI)

BSBI scales well, but it requires a data structure that maps terms to IDs, and for a very large collection this structure may not fit in memory.

SPIMI does not store term IDs; it uses the terms themselves directly.

SPIMI step 1: memory is limited, but it can still hold part of the index, so SPIMI builds the index in memory until memory is full. At that point it writes the index built so far to a file on disk as one block.

SPIMI step 2: before a block from step 1 is written out, its dictionary is sorted, so each block is stored on disk in sorted term order.

SPIMI step 3: merge the blocks into the final inverted index.
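A rough sketch of these three steps follows. The names `spimi_invert` and `merge_blocks`, the `max_postings` counter used as a stand-in for "memory is full", and the JSON-lines block format are assumptions for illustration only. The sketch also assumes the token stream arrives in increasing document ID order.

```python
import heapq
import json
import os
import tempfile

def spimi_invert(token_stream, max_postings, tmp_dir):
    """token_stream yields (term, doc_id); returns the paths of the block files."""
    block_paths = []
    dictionary = {}      # term -> postings list, keyed by the term itself
    count = 0

    def write_block():
        path = os.path.join(tmp_dir, f"spimi{len(block_paths)}.jsonl")
        with open(path, "w") as f:
            for term in sorted(dictionary):   # step 2: sort terms on write
                f.write(json.dumps([term, dictionary[term]]) + "\n")
        block_paths.append(path)
        dictionary.clear()

    for term, doc_id in token_stream:
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)           # step 1: add posting directly
            count += 1
        if count >= max_postings:             # "memory is full"
            write_block()
            count = 0
    if dictionary:
        write_block()
    return block_paths

def merge_blocks(block_paths):
    """Step 3: merge the sorted blocks into one index."""
    def read_block(path):
        with open(path) as f:
            for line in f:
                term, postings = json.loads(line)
                yield term, postings

    index = {}
    streams = (read_block(p) for p in block_paths)
    for term, postings in heapq.merge(*streams, key=lambda entry: entry[0]):
        merged = index.setdefault(term, [])
        for doc_id in postings:
            if not merged or merged[-1] != doc_id:
                merged.append(doc_id)
    return index

# Hypothetical sample collection; max_postings=3 forces three blocks.
docs = [(1, "new home sales"), (2, "home sales rise"), (3, "new sales")]
tokens = [(term, doc_id) for doc_id, text in docs for term in text.split()]
with tempfile.TemporaryDirectory() as tmp:
    paths = spimi_invert(tokens, max_postings=3, tmp_dir=tmp)
    index = merge_blocks(paths)
```

No term-to-ID map is kept, and no list of pairs is ever sorted; only the term keys of each block are sorted when the block is written.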

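Note: the difference between SPIMI and BSBI. BSBI collects the term ID-document ID pairs of each block and sorts all of those pairs. SPIMI adds each posting directly to the postings list of its term while it processes the block, and sorts only the terms of the block once the block is finished, just before writing it to disk.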

Distributed index construction

In practice the collection can be so large that a single machine cannot process it. In that case a distributed system can be used to build the index, and the architecture used here is MapReduce. MapReduce splits a large job into subtasks, distributes them to compute nodes, and merges the outputs once they are finished to obtain the final result.

Subtasks are not assigned to nodes in advance. The master node allocates them dynamically while the job is running. A fast node therefore ends up with more subtasks (it finishes quickly, so it can be given new ones), while a slow node gets fewer. If a machine fails or stalls, its subtasks can be reassigned to other nodes.
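The map and reduce phases can be imitated on one machine as below. This is only a single-process sketch: the function names are invented, and a thread pool stands in for the cluster. The pool hands the next split to whichever worker is free, which loosely mirrors the dynamic allocation done by the master node. Reassignment after a node failure is not modeled.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_split(split):
    """Map phase: turn one split of documents into (term, doc_id) pairs."""
    return [(term, doc_id)
            for doc_id, text in split
            for term in text.lower().split()]

def reduce_term(term, doc_ids):
    """Reduce phase: build the sorted postings list of one term."""
    return term, sorted(set(doc_ids))

def mapreduce_index(splits, workers=2):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Free workers pick up the next split as soon as they finish one.
        mapped = list(pool.map(map_split, splits))
        # Shuffle: group the emitted pairs by term.
        grouped = defaultdict(list)
        for pairs in mapped:
            for term, doc_id in pairs:
                grouped[term].append(doc_id)
        # One reduce task per term.
        reduced = pool.map(lambda item: reduce_term(*item), grouped.items())
        return dict(reduced)

# Hypothetical splits of a sample collection.
splits = [[(1, "new home sales")],
          [(2, "home sales rise"), (3, "new sales")]]
index = mapreduce_index(splits)
```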

Dynamic index construction

All of the methods above build a static index, which assumes the document collection does not change. In reality documents may be added or changed over time, and then the index has to be built dynamically.

For this we can keep two indexes: a large main index, and a small auxiliary index held in memory for new documents. At query time both indexes are searched and the two result lists are merged.

When the auxiliary index grows too large, it is merged into the main index.
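A minimal sketch of the two-index scheme, under stated assumptions: the class `DynamicIndex`, its `merge_threshold` parameter, and the posting-count trigger are hypothetical, and the main index is a plain dict here, whereas in practice it would live on disk.

```python
class DynamicIndex:
    def __init__(self, merge_threshold):
        self.main = {}     # large main index (stand-in for an on-disk index)
        self.aux = {}      # small in-memory auxiliary index for new documents
        self.aux_size = 0  # number of postings currently in the auxiliary index
        self.merge_threshold = merge_threshold

    def add_document(self, doc_id, text):
        # New documents only ever touch the auxiliary index.
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)
            self.aux_size += 1
        if self.aux_size >= self.merge_threshold:
            self._merge()

    def _merge(self):
        # Fold the auxiliary index into the main index, then reset it.
        for term, postings in self.aux.items():
            combined = set(self.main.get(term, [])) | set(postings)
            self.main[term] = sorted(combined)
        self.aux = {}
        self.aux_size = 0

    def search(self, term):
        # Query both indexes and merge the two result lists.
        hits = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(hits)

# Hypothetical usage: the second document pushes the auxiliary index
# past the threshold, so it is merged before the third one arrives.
idx = DynamicIndex(merge_threshold=5)
idx.add_document(1, "new home sales")
idx.add_document(2, "home sales rise")
idx.add_document(3, "new sales")
result = idx.search("sales")
```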


The inverted index itself is also very large, which raises the next question: how to compress it.
