Chapter 4 of Introduction to Information Retrieval


I. Factors Influencing Index Construction

Index construction refers to the entire process of converting a document collection into an inverted index;

(1) Hardware factors to consider include memory size and CPU clock frequency. For example, if memory is large enough, the whole collection can be held in memory and the inverted index built quickly;

(2) We should keep as much of the working data as possible in memory;

(3) Disk seek time matters, so data that will be read sequentially must be stored in contiguous blocks;

After the document collection is parsed into term -> docID pairs, the number of term-docID pairs equals the number of tokens;
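As a tiny illustration (a hypothetical three-term-per-document collection, not an example from the book), each token contributes exactly one pair:

```python
# Two hypothetical documents; every token yields one (term, docID) pair,
# so the pair count equals the token count.
docs = {1: "new home sales", 2: "home sales rise"}

pairs = [(term, doc_id) for doc_id, text in docs.items()
         for term in text.split()]

print(len(pairs))      # 6 pairs, one per token
print(sorted(pairs))   # sorting by term groups future postings together
```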


II. BSBI

Here we consider a large document collection that does not fit entirely in memory.

As the name suggests, BSBI (blocked sort-based indexing) works by sorting within blocks. The procedure is as follows:

(1) Split the document collection into parts of equal size (each part exactly fills one block) and read one part into memory;

(2) Parse the part into termID-docID pairs. (To make index construction more efficient we use term IDs instead of the terms themselves; this requires maintaining a term-to-termID mapping table, which is kept in memory.);

(3) Sort the block's termID-docID pairs by termID, and collect the docIDs belonging to the same termID into a temporary postings list;

(4) Write the result back to disk and parse the next block;

(5) After the whole collection has been parsed, merge the per-block inverted indexes into one complete inverted index;

The core of BSBI is steps (2) and (3), so its complexity is O(T log T), where T is the number of termID-docID pairs; this bound covers only parsing the documents and generating the sorted temporary postings, not the external merge sort of step (5);
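The steps above can be sketched as follows. This is a minimal simulation under stated assumptions: documents arrive as (docID, text) pairs, blocks are serialized with pickle into temporary files, and for brevity the merge re-loads each sorted run whole instead of streaming it as a real external merge would:

```python
import heapq
import pickle
import tempfile

term_to_id = {}  # the termID mapping table, kept in memory across blocks

def term_id(term):
    # assign a new termID on first sight, reuse it afterwards
    return term_to_id.setdefault(term, len(term_to_id))

def invert_block(block, block_file):
    """Steps (2)-(4): parse one block of (docID, text) documents into
    (termID, docID) pairs, sort them, and spill the run to disk."""
    pairs = sorted((term_id(term), doc_id)
                   for doc_id, text in block
                   for term in text.split())
    pickle.dump(pairs, block_file)

def merge_blocks(block_files):
    """Step (5): merge of the sorted runs, grouping docIDs
    with the same termID into postings lists."""
    runs = []
    for f in block_files:
        f.seek(0)
        runs.append(pickle.load(f))
    index = {}
    for tid, doc_id in heapq.merge(*runs):
        index.setdefault(tid, []).append(doc_id)
    return index

# hypothetical two-block collection
blocks = [[(1, "new home sales"), (2, "home sales rise")],
          [(3, "new sales top forecasts")]]
block_files = []
for block in blocks:
    f = tempfile.TemporaryFile()
    invert_block(block, f)
    block_files.append(f)
index = merge_blocks(block_files)
print(index[term_to_id["sales"]])  # -> [1, 2, 3]
```

Because every run is already sorted by termID, `heapq.merge` yields the pairs in global order, which is exactly what lets docIDs for the same term be appended consecutively.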


III. SPIMI

A drawback of BSBI is that it must maintain the term-to-termID mapping table; for a large collection this table itself becomes large;

SPIMI (single-pass in-memory indexing) works as follows:

(1) Parse the document collection into term-docID pairs;

(2) Treat this series of term-docID pairs as a stream; as long as memory remains, consume them one by one;

(3) A dictionary (implemented as a hash table) is kept in memory, so each term directly locates the postings list to which its docID should be appended;

(4) When memory is full, sort the dictionary entries by term and write the sorted dictionary together with its postings back to disk;

(5) After the whole collection has been parsed, the disk holds several sorted dictionary-postings blocks; merging them yields the final index;

The SPIMI core is steps (2) and (3), so its complexity is O(T): the incoming pairs are never sorted, only the dictionary entries at spill time;
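The steps above can be sketched as follows (a minimal simulation; the memory budget is faked with a `max_postings` counter, and pickle plus temporary files stand in for real block storage, all assumptions for illustration):

```python
import pickle
import tempfile

def write_block(dictionary):
    """Step (4): sort the dictionary entries by term only when the
    block is spilled, then write dictionary + postings to disk."""
    f = tempfile.TemporaryFile()
    pickle.dump(sorted(dictionary.items()), f)
    f.seek(0)
    return f

def spimi_invert(pair_stream, max_postings=4):
    """Steps (2)-(4): consume the term-docID stream, growing a hash
    dictionary until the (simulated) memory budget is exhausted."""
    block_files, dictionary, used = [], {}, 0
    for term, doc_id in pair_stream:
        # no termID mapping table: the term itself is the hash key
        dictionary.setdefault(term, []).append(doc_id)
        used += 1
        if used >= max_postings:          # "memory full"
            block_files.append(write_block(dictionary))
            dictionary, used = {}, 0
    if dictionary:                        # spill the last partial block
        block_files.append(write_block(dictionary))
    return block_files

def merge_spimi_blocks(block_files):
    """Step (5): merge the sorted dictionary-postings blocks."""
    merged = {}
    for f in block_files:
        for term, postings in pickle.load(f):
            merged.setdefault(term, []).extend(postings)
    return merged

docs = {1: "new home sales", 2: "home sales rise", 3: "new sales top forecasts"}
stream = ((term, doc_id) for doc_id, text in docs.items()
          for term in text.split())
merged = merge_spimi_blocks(spimi_invert(stream))
print(merged["sales"])  # -> [1, 2, 3]
```

Note the contrast with BSBI: no global termID table is kept, and nothing is sorted while the stream is consumed, which is where the O(T) bound comes from.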

IV. MapReduce

Everything above targets an ordinarily large collection; for a truly massive collection, processing on a single machine is clearly infeasible, so we turn to the idea of divide and conquer;

The main idea is to distribute the collection across a computer cluster, coordinated in master-slave fashion: a master node controls the whole process. For example, if a machine crashes while processing its share of the collection, the master hands that share to another machine;

Map phase: split the collection into n data splits; each split is processed by a parser into sorted term-docID pairs, essentially the ordinary BSBI or SPIMI algorithm;

Reduce phase: the term space of each machine's output is partitioned, for example into a-i and j-z; each partition is handed to one inverter, so in this example only two inverters are needed;

Note:

(1) Parsers and inverters are just roles assigned to machines; the same machine can serve as both a parser and an inverter;

 

Details of the map phase:

(1) The collection resides on an independent machine; it is divided into splits that are transferred to the machines acting as parsers (the I/O transfer time must be counted here);

(2) BSBI or SPIMI parses each split into termID-docID pairs (the time spent on comparisons must be counted here);

Details of the reduce phase:

(1) The partition files are transferred via I/O to the inverters (the I/O time must be counted here);

(2) Each inverter sorts its termID-docID pairs, costing O(n log n) where n is the number of pairs it receives;

(3) Write the finished inverted index to an independent machine (note: the dictionary size, i.e. the number of distinct terms, and the postings size, i.e. the number of tokens, must be considered here, along with the I/O time);
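The two phases can be sketched in a single process (a toy simulation: the splits, the a-i / j-z partition function, and routing by a term's first letter are all assumptions for illustration, with the cluster and I/O omitted):

```python
from collections import defaultdict

def map_phase(doc_split):
    """Parser: turn one split of (docID, text) into term-docID pairs
    (in a real system this would be BSBI or SPIMI)."""
    return [(term, doc_id) for doc_id, text in doc_split
            for term in text.split()]

def partition(pairs, ranges=("ai", "jz")):
    """Route each pair to the inverter responsible for its term range,
    keyed here by the term's first letter."""
    buckets = defaultdict(list)
    for term, doc_id in pairs:
        for rng in ranges:
            if rng[0] <= term[0] <= rng[1]:
                buckets[rng].append((term, doc_id))
                break
    return buckets

def reduce_phase(segment):
    """Inverter: sort one term partition and collect postings lists."""
    index = {}
    for term, doc_id in sorted(segment):
        index.setdefault(term, []).append(doc_id)
    return index

splits = [[(1, "new home sales")], [(2, "sales rise")]]
all_pairs = [p for split in splits for p in map_phase(split)]
buckets = partition(all_pairs)
index_ai = reduce_phase(buckets["ai"])  # the a-i inverter's index
index_jz = reduce_phase(buckets["jz"])  # the j-z inverter's index
print(index_ai)  # -> {'home': [1]}
```

Each inverter ends up owning a disjoint slice of the dictionary, which is why the final per-range indexes can simply be concatenated.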

V. Dynamic Index Construction

So far we have assumed the collection stays unchanged, but in reality documents are updated, inserted, and deleted; therefore we need dynamic indexes;

Main idea: keep maintaining the main index, build an auxiliary index (holding newly added documents), and maintain an invalidation bit vector (marking deleted documents); when the auxiliary index grows large enough, merge it with the main index;

At retrieval time, we must combine the results from the main index and the auxiliary index, and filter out documents flagged in the invalidation bit vector;

Improved method (logarithmic merging):

Maintain a series of indexes I0, I1, ..., Ii, ..., where index Ii has size (2^i) * n;

The indexing process:

Maintain an auxiliary index of size n in memory;

(1) The first time the auxiliary index fills, build I0, store it to disk, and clear the auxiliary index;

(2) The second time the auxiliary index fills, merge it with I0 to form I1 (I0 is cleared);

(3) The third time the auxiliary index fills, build I0 again;

(4) The fourth time the auxiliary index fills, merge I0, I1, and the auxiliary index into I2 (I0 and I1 are cleared);

Continuing this way, the index-building process behaves like incrementing a binary counter; that is:

I3 I2 I1 I0
 0  0  0  0
 0  0  0  1
 0  0  1  0
 0  0  1  1
 0  1  0  0
 0  1  0  1
 0  1  1  0
 0  1  1  1

At retrieval time, we must merge results from every index I0 through In and filter by the invalidation bit vector, which makes querying more cumbersome;
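The binary-counter behavior above can be sketched as follows (a simplification: each "index" is just a sorted list of postings, and the auxiliary capacity n is tiny so the carries are visible):

```python
class LogarithmicIndex:
    """Sketch of logarithmic merging: the in-memory auxiliary index
    holds up to n postings; on-disk indexes I0, I1, ... hold n, 2n,
    4n, ... postings and carry upward like a binary counter."""

    def __init__(self, n=2):
        self.n = n
        self.aux = []       # in-memory auxiliary index (simplified to a list)
        self.levels = []    # levels[i] is index I_i, or None if empty

    def add(self, posting):
        self.aux.append(posting)
        if len(self.aux) >= self.n:     # auxiliary index is full
            self._flush()

    def _flush(self):
        carry = sorted(self.aux)
        self.aux = []
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry          # free slot: store I_i
                return
            # slot occupied: merge I_i into the carry and move up a level
            carry = sorted(self.levels[i] + carry)
            self.levels[i] = None
            i += 1

    def search(self, posting):
        """Query must consult the auxiliary index and every level."""
        return posting in self.aux or any(
            lvl is not None and posting in lvl for lvl in self.levels)

idx = LogarithmicIndex(n=2)
for posting in range(8):
    idx.add(posting)
# after 8 postings (binary 1000): I0 and I1 are empty, I2 holds all 8
```

The invalidation bit vector is omitted here; a fuller version would filter `search` results against it, as the section describes.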

VI. Access Authorization

Generally, the postings of an inverted index are sorted by docID, which aids compression and lets new postings simply be appended at the end. Ranked retrieval, however, wants postings sorted by weight, and then inserting a new posting requires scanning the list to find the insertion position;

Generally, access authorization concerns whether a user has the right to access a document, handled through an access control list (ACL); the ACL itself can also be built with an inverted index structure;

The dictionary entries represent the users, while the postings hold the docIDs each user may access; in short, it is a user-document matrix;
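A minimal sketch of this structure (the user names and docIDs are hypothetical; postings are kept sorted by docID as the section recommends):

```python
# ACL stored as an inverted index: dictionary keys are users,
# postings are the sorted docIDs that user may read --
# conceptually one row of the user/document matrix per user.
acl = {
    "alice": [1, 3, 7],
    "bob":   [3, 7],
}

def authorized_results(user, result_docs):
    """Filter a retrieval result list against the user's ACL postings."""
    allowed = set(acl.get(user, []))
    return [doc_id for doc_id in result_docs if doc_id in allowed]

print(authorized_results("bob", [1, 3, 5, 7]))  # -> [3, 7]
```

In a real system this filtering would be an intersection of two sorted postings lists, the same primitive used for Boolean AND queries.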
