MG: Index construction

As the name suggests, this chapter is about index construction: how to build an index file for a large document collection efficiently, with limited memory and time. Index compression and index-based ranking, both of which assume that the index file already exists, were discussed in the previous sections.

Linked lists in memory

Let's take a look at the most common way to build this data structure in memory. It consists of a term dictionary, which can be implemented with an array, a hash table, or a binary search tree; each entry in the dictionary holds a pointer to the inverted list of that term. The inverted list of a term is usually implemented as a singly linked list, because it has to grow dynamically; each node stores a document number, the within-document frequency, and a next pointer.

Then traverse each document. For each term in the document, if the term is already in the dictionary, simply append the document number and frequency to the end of that term's inverted list; if it is not, add the term to the dictionary first and then append.

After all documents have been processed, a complete inverted index sits in memory. The last step is to write this in-memory index out to an inverted file on disk.

This method is easy to understand and simple to implement, but it has one problem: for a large document collection, the memory consumption is enormous.

In that case you could page the linked lists out to a database or to virtual memory. Of course, this is not really feasible, because it is far too slow: while indexing a document you must locate the linked list of every term it contains, and this random access causes a huge number of disk reads and writes.
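
As a minimal sketch of the in-memory method (assuming whitespace tokenization, with a Python list per term standing in for the linked list; the input format and names are illustrative):

```python
from collections import Counter, defaultdict

def build_in_memory_index(docs):
    """Build an inverted index entirely in memory.

    docs: iterable of (doc_id, text) pairs (hypothetical input format).
    Returns a dict mapping each term to its postings list of
    (doc_id, frequency) pairs, playing the role of the linked list.
    """
    index = defaultdict(list)            # term dictionary -> postings
    for doc_id, text in docs:
        freqs = Counter(text.split())    # naive whitespace tokenization
        for term, f_dt in freqs.items():
            # A new term gets an empty list automatically; either way the
            # posting is appended at the end of the term's list.
            index[term].append((doc_id, f_dt))
    return index

docs = [(1, "to be or not to be"), (2, "to index or to search")]
print(build_in_memory_index(docs)["to"])   # [(1, 2), (2, 2)]
```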

 

Sort-based inversion

When indexing a large document collection with limited memory, it is impossible to hold the whole index in memory; it has to live on disk. The problem with disk is that random access is very slow, so if all index entries are stored in files in sorted order, they can be accessed sequentially, which is far more efficient.

So what kind of data structure is suitable for sorting in files? The answer is the triple <t, d, f_{d,t}>, where t is the term id, d is the document number, and f_{d,t} is the frequency of term t in document d.

With this representation, indexing a document requires no list traversal or lookup, because each triple records the complete relationship by itself. Of course, the triples introduce redundancy: the term id t has to be repeated in every triple.

The algorithm is as follows:

1. Create an empty dictionary S and an empty temporary file on disk.

2. Traverse each document. For each term in the document, add it to the dictionary if it is not already there; then look up its term id, form the triple <t, d, f_{d,t}>, and append the triple to the temporary file.

3. Start sorting. Because memory is limited, it is impossible to read in all the triples at once, so they are sorted in segments (runs).

That is, each time read in as many triples as memory can hold, say K of them, and sort them by term id in ascending order; triples with the same term id are sorted by document number in ascending order.

The K sorted triples form a sorted run.

Then write the sorted run back to the temporary file.

Keep reading and sorting in this way until all the triples have been processed.

4. The temporary file now consists entirely of sorted runs, so what remains is to merge them. If there are initially R sorted runs, about log2 R passes of pairwise merging produce a single sorted run, and the sort is complete. This is the standard external merge sort.

5. Finally, read the sorted temporary file sequentially and generate the final compressed inverted file. The temporary file can then be deleted.
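
A compact sketch of steps 1 to 5 under simplifying assumptions: the runs are kept as in-memory lists rather than in a real temporary file, heapq.merge stands in for the pairwise merge passes, and K and the input format are illustrative.

```python
import heapq
from collections import Counter
from itertools import groupby

K = 1000  # triples per in-memory run (illustrative)

def sort_based_inversion(docs, vocab):
    """docs: iterable of (doc_id, text); vocab: dict mapping term -> term id."""
    buffer, runs = [], []
    for doc_id, text in docs:                        # step 2: emit triples
        for term, f_dt in Counter(text.split()).items():
            t = vocab.setdefault(term, len(vocab))   # step 1/2: grow the dictionary
            buffer.append((t, doc_id, f_dt))
            if len(buffer) == K:                     # step 3: sort one run
                runs.append(sorted(buffer))
                buffer = []
    if buffer:
        runs.append(sorted(buffer))
    merged = heapq.merge(*runs)                      # step 4: merge the sorted runs
    inverted = {}                                    # step 5: build the inverted lists
    for t, group in groupby(merged, key=lambda triple: triple[0]):
        inverted[t] = [(d, f) for _, d, f in group]
    return inverted
```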

The advantage of this method is that the disk is only ever read sequentially; there is no random access for lookups. But there is a problem: it consumes a lot of disk space, because while two runs are being merged, the merged result has to be written to a new temporary file, so at its peak the method needs roughly twice the disk space of the original file.

 

Index compression

As mentioned above, sort-based inversion consumes too much disk space, so now we discuss how to minimize disk consumption while creating the inverted file.

Compressing temporary files

The temporary file consists entirely of <t, d, f_{d,t}> triples. For compressing d and f_{d,t} there are many options among the index-compression methods we discussed earlier.

Now let's look at compressing t.

Within each sorted run, the t values never decrease. Differential encoding is therefore a natural choice: record the difference between each t and the previous one. This t-gap is zero or a positive integer.

It can be encoded directly with a γ (gamma) code (encoding gap + 1, since the gap may be zero). As you can imagine, the space needed to store t becomes very small.
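
A minimal sketch of γ-coding the t-gaps, assuming the gap is shifted by one so that zero gaps are representable (γ codes cover only positive integers):

```python
def gamma_encode(n):
    """Elias gamma code of a positive integer n, returned as a bit string."""
    binary = bin(n)[2:]                          # binary form of n, no '0b' prefix
    return "0" * (len(binary) - 1) + binary      # unary length prefix, then binary

def encode_t_gaps(term_ids):
    """Encode a non-decreasing sequence of term ids as gamma-coded gaps."""
    bits, prev = [], 0
    for t in term_ids:
        gap = t - prev                   # zero or positive, since the ids are sorted
        bits.append(gamma_encode(gap + 1))       # +1 so that a zero gap is encodable
        prev = t
    return "".join(bits)

print(encode_t_gaps([3, 3, 3, 5, 9]))   # repeated term ids cost only one bit each
```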

As we can see, to get better compression we have to compress the sorted runs themselves. The algorithm above therefore changes as follows.

Steps 2 and 3 are merged into one:

Traverse each document and extract its terms to form triples <t, d, f_{d,t}>. Instead of writing each triple straight to the temporary file, keep the triples in memory first. When the number of triples in memory reaches K, sort these K triples, compress them, and write the compressed run to the temporary file.

4. Because the runs in the temporary file are now compressed, a merge pass must first decode the runs, and the merged output must be re-encoded before it is written back to the temporary file.
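
A sketch of the combined step, reusing gamma_encode from above; the output file handle, and the choice to gamma-code d and f_{d,t} as well, are placeholders for whichever compression scheme is actually used:

```python
def write_compressed_run(triples, out):
    """Sort one in-memory batch of <t, d, f_dt> triples and append it to the
    temporary file `out` as a compressed run (a sketch: d and f_dt are written
    with the same gamma code purely for brevity)."""
    run = sorted(triples)                          # by term id, then document id
    bits, prev_t = [], 0
    for t, d, f_dt in run:
        bits.append(gamma_encode(t - prev_t + 1))  # t-gap, +1 to allow zero gaps
        bits.append(gamma_encode(d))
        bits.append(gamma_encode(f_dt))
        prev_t = t
    out.write("".join(bits))
```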

 

In-place multi-way merging

With compression, the merge phase becomes processor-intensive rather than disk-intensive, since every pairwise pass must decode and re-encode the runs. The inversion time can therefore be reduced further by replacing the many pairwise passes with a single multi-way merge.

To make this concrete, suppose there are currently R runs and we merge all R of them at once. First read a B-byte block from each run; the size of B depends on the available memory, and each such block serves as that run's input buffer. Whenever a run's buffer has been consumed, the next block of that run is read from disk.

The R-way merge uses a min-heap to locate the smallest element among the R candidates efficiently.

Then we keep taking the minimum off the top of the heap and writing it to a new temporary file until the merge is complete. Done this way, we still need twice the disk space of the original file. Can the extra copy be avoided? Next we introduce the in-place merging algorithm, which is intricate but interesting.
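
A sketch of the R-way merge driven by an explicit min-heap; here each run is just an iterable of triples, standing in for a run whose B-byte buffer is refilled from disk as it empties:

```python
import heapq

def r_way_merge(runs, write_triple):
    """Merge R sorted runs of <t, d, f_dt> triples in a single pass.

    runs: list of iterables, one per sorted run.
    write_triple: callback that appends one triple to the output file.
    """
    iters = [iter(r) for r in runs]
    heap = []
    for i, it in enumerate(iters):               # prime the heap: one triple per run
        first = next(it, None)
        if first is not None:
            heapq.heappush(heap, (first, i))
    while heap:
        triple, i = heapq.heappop(heap)          # smallest candidate over all runs
        write_triple(triple)
        nxt = next(iters[i], None)               # refill from the same run
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
```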

In-place means that no extra disk space is used: once a block of merged output is ready, it is written back into the original temporary file rather than into a new one. Several points have to be handled carefully to achieve this.

1. To perform the R-way merge, we first read a B-byte block from each run into memory. The natural idea is to cut the merged output into B-byte blocks as well, so that each output block can overwrite a block that has already been read; this is what makes the merge in-place.

The problem is that a run is not necessarily an exact multiple of B, so its last block may be shorter than B, and such irregular blocks complicate the algorithm. The solution is simple: padding. Add padding at the end of each run so that its length is an exact multiple of B.
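
For example (a trivial sketch, treating a run as a byte string and padding with zero bytes):

```python
def pad_run(run_bytes, b):
    """Pad a run so that its length is an exact multiple of the block size b."""
    remainder = len(run_bytes) % b
    if remainder:
        run_bytes += b"\x00" * (b - remainder)   # filler that the merge will ignore
    return run_bytes
```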

 

2. As described above, we write each sorted output block back into a block that has already been read. The problem is that these free blocks are scattered, so the output blocks end up out of order in the file. We therefore need a block table to record where each block belongs: block_table[i] gives the final position of the block currently stored at position i of the temporary file. For example, block_table[1] = 3 means that the first block should be moved to position 3.

So at the end we need to rearrange the blocks of the temporary file according to the block table to restore the correct order. This shuffle can be done in linear time.

The algorithm is as follows; its goal is to make block_table[i] = i for every i.

Traverse the block table from 1 to n. If block_table[i] = k with k not equal to i, then the block at position i of the temporary file is not in its final place.

In that case, put the block at position i into a cache, then find j such that block_table[j] = i, that is, find the block whose final position is i, and move it to position i.

Now position j is empty. Check whether k equals j; if it does, the cached block (whose final position is k) is written at position j and the cycle is closed.

If not, continue: find s such that block_table[s] = j and move that block to position j. Keep following the chain in this way until the empty position is k; at that point the cached block can be written there.

When the traversal finishes and every block from 1 to n satisfies block_table[i] = i, the shuffle is complete.
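
A sketch of this cycle-following shuffle with 0-based indices (the text counts from 1); a real implementation would keep an inverse table so that the search for the block destined for the hole is O(1), keeping the whole shuffle linear:

```python
def shuffle_blocks(blocks, block_table):
    """Rearrange blocks in place so that block_table[i] == i for every i.

    block_table[i] is the final position of the block currently stored at
    position i. Each cycle is resolved with a single cached block, so every
    block is moved only once.
    """
    for i in range(len(blocks)):
        if block_table[i] == i:
            continue
        cached, dest = blocks[i], block_table[i]   # lift block i out; it belongs at dest
        hole = i                                   # position i is now empty
        while hole != dest:
            src = block_table.index(hole)          # block whose final position is the hole
            blocks[hole] = blocks[src]             # move it into the hole
            block_table[hole] = hole
            hole = src                             # its old position becomes the new hole
        blocks[hole] = cached                      # the cached block closes the cycle
        block_table[hole] = hole
```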

 

3. Choosing the value of B. If B is too large, the R + 1 blocks of size B (one input buffer per run plus the output buffer) will not fit in memory; if B is too small, too many disk reads and writes are needed.

The approach is to give B an initial value that is a power of 2, so that when B becomes too large it can simply be halved; halving also guarantees that the existing runs are still multiples of the new B.

Whenever B * (R + 1) > M, set B = B / 2.
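
The rule transcribed directly (B and M in bytes, R the number of runs; B is assumed to start as a power of two):

```python
def adjust_block_size(b, r, m):
    """Halve B while the R input buffers plus one output buffer exceed memory M."""
    while b * (r + 1) > m:
        b //= 2
    return b
```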

 

With that, here is the complete in-place inversion algorithm.

1. Initialization

Create an empty dictionary S.

Create an empty temporary file.

Set L to the space occupied by the dictionary.

Set K = (M - L) / W, where K is the number of triples per run, M is the available memory, and W is the space occupied by one triple <t, d, f_{d,t}>.

Set B = 50 KB.

Set R = 0, where R is the number of runs.

2. Process the documents and generate the temporary file

Traverse each document d.

Parse the document into terms.

For each term:

If the term already exists in S, fetch its term id.

If it does not exist in S:

Add the term to S.

Update L and K (the dictionary S has grown, so less memory is left for triples and K becomes smaller).

Append the triple <t, d, f_{d,t}> to the in-memory triple array.

Whenever the number of triples in the array reaches K, a run is produced:

Sort the run by term id (and by document number within a term).

Compress each field.

Pad the run so that its size is a multiple of B and write it to the temporary file.

R = R + 1

Whenever B * (R + 1) > M, set B = B / 2.

3. Merge

Read one block from each run and add the positions of these blocks to the freelist; they have already been read into memory and can be overwritten by output blocks.

Take one triple from each block, R triples in total, and build a min-heap from them.

Repeatedly take the smallest triple from the top of the heap and put it into the output block; then fetch the next triple from the corresponding run and add it to the heap.

When the output block is full, take an empty block from the freelist; if there is none, append a new block to the temporary file.

Write the output block into this empty block, and update the freelist and the block table.

Likewise, whenever a run's input block is exhausted, read the next block of that run and add the newly freed block to the freelist.

4. Rearrange the temporary file according to the block table.

5. Truncate the file to release the unused space (the padding) at the end.
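
A toy sketch of step 3, where the "disk" is a Python list of blocks and each block is a list of triples; compression, padding and real I/O are omitted, and all names are illustrative:

```python
import heapq

def in_place_merge(file_blocks, runs, per_block):
    """R-way merge whose output overwrites blocks that have already been read.

    file_blocks: list of blocks (each a list of triples), the temporary "file".
    runs:        list of runs, each a list of block positions in file_blocks.
    per_block:   triples per output block (stands in for the B-byte block size).
    Returns block_table: block_table[p] is the final position of the block
    now stored at physical position p.
    """
    freelist, block_table = [], {}
    next_block = [0] * len(runs)            # index of each run's next unread block
    buffers = [iter(()) for _ in runs]      # in-memory input buffer per run

    def refill(r):
        """Read run r's next block into memory and mark its space reusable."""
        if next_block[r] < len(runs[r]):
            pos = runs[r][next_block[r]]
            next_block[r] += 1
            buffers[r] = iter(list(file_blocks[pos]))   # copy = the in-memory buffer
            freelist.append(pos)                        # the disk block may be overwritten

    output, logical = [], 0
    def flush():
        nonlocal output, logical
        if freelist:
            pos = freelist.pop(0)           # reuse a block that has been read
        else:
            pos = len(file_blocks)          # no free block: append to the file
            file_blocks.append(None)
        file_blocks[pos] = output           # overwrite the free block on "disk"
        block_table[pos] = logical          # remember where this block belongs
        logical += 1
        output = []

    heap = []
    for r in range(len(runs)):              # prime the heap with one triple per run
        refill(r)
        first = next(buffers[r], None)
        if first is not None:
            heapq.heappush(heap, (first, r))
    while heap:
        triple, r = heapq.heappop(heap)
        output.append(triple)
        if len(output) == per_block:
            flush()
        nxt = next(buffers[r], None)
        if nxt is None:                     # buffer exhausted: read the run's next block
            refill(r)
            nxt = next(buffers[r], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, r))
    if output:
        flush()
    return block_table
```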

 

Compressed in-memory inversion

Now let's set the sort-based method aside and return to the in-memory linked-list method mentioned earlier. That method can be combined with compression to get better results.

The method described here rests on the assumption that the document frequency f_t of every term t is known before the inversion. Of course, for it to be known, the implementation has to make a first pass over the collection just to gather these statistics, and then a second pass to actually build the inverted index.

So what do we gain from the extra effort of counting the document frequency f_t of each term t in advance?

A. Previously, the inverted list of each term was a singly linked list, because we did not know how many documents it would cover and the list had to grow dynamically. Once the document frequency is known, we know exactly how much space to allocate, so an array can replace the linked list, which at the very least saves the space of the next pointers.

B. Knowing the total number of documents N, a document number can be encoded in about log2 N bits instead of a 32-bit integer.

C. Knowing the maximum within-document frequency m, f_{d,t} can likewise be encoded in about log2 m bits.

In short, knowing some information in advance reduces the range of values that has to be encoded; in other words, the entropy decreases, so fewer bits are needed per value and space is saved.
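
For instance, with illustrative numbers:

```python
import math

N = 1_000_000      # total number of documents (illustrative)
max_f = 200        # largest within-document frequency (illustrative)

doc_bits = math.ceil(math.log2(N))        # 20 bits per document number instead of 32
freq_bits = math.ceil(math.log2(max_f))   # 8 bits per f_{d,t}
print(doc_bits, freq_bits)                # 20 8
```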

Of course, this method still uses a lot of memory, so there are two ways to reduce memory consumption:

Dictionary-based partitioning

Text-based partitioning

The principle of both is the same: if all the inverted lists together are too large for memory, the inversion task is split into several smaller tasks, so that only part of the index is kept in memory at any one time.

Dictionary-based partitioning means that each pass builds the inverted lists of only a subset of the terms, and the whole dictionary is covered over several passes.

Text-based partitioning means that each pass indexes only part of the text, and the whole collection is covered over several passes.
