Inverted index compression (lossless compression)

Last Update: 2018-12-04 Source: Internet

Author: User

Index compression (compression ratio vs decompression efficiency)

It is mainly used to encode and compress the inverted list (postings list) in the inverted index.

Encoding method:

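1. d-gaps: Encode an increasing sequence of numbers (such as docids) as the differences between consecutive values (d-gaps). The gaps are small numbers, which need shorter codes and less processing time than the original values. D-gap encoding does not itself define a bit pattern for storing the data, so on its own it does not save any space; the saving comes from applying one of the codes below to the gaps.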
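As a minimal sketch of the d-gap step, assuming 32-bit docids stored in a sorted array: the function names `dgap_encode` and `dgap_decode` are hypothetical and not taken from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace a sorted docid list by its d-gaps, in place.
 * The first value is kept as the gap from an implicit 0. */
void dgap_encode(uint32_t *ids, size_t n)
{
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cur = ids[i];
        assert(cur >= prev);      /* input must be sorted ascending */
        ids[i] = cur - prev;
        prev = cur;
    }
}

/* Inverse operation: a running sum restores the original docids. */
void dgap_decode(uint32_t *gaps, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += gaps[i];
        gaps[i] = sum;
    }
}
```

For example, the docids 3, 7, 8, 15, 100 become the gaps 3, 4, 1, 7, 85. The array is no smaller at this point; the gain appears only once the small gaps are passed to a variable-length code.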

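2. Elias-gamma code: It combines unary and binary encoding. Two values must be calculated for the encoded number $k$: $k_d = \lfloor \log_2 k \rfloor$ and $k_r = k - 2^{k_d}$. $k_d$ is written in unary and $k_r$ in binary using $k_d$ bits.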
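The following is a rough sketch of Elias-gamma coding, not a tuned implementation. It assumes that the unary part is written as $k_d$ ones closed by a single zero (other texts use zeros closed by a one), and it emits the bits as '0'/'1' characters so the result is easy to inspect. The names `gamma_encode` and `gamma_decode` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* floor(log2(k)) for k >= 1 */
static unsigned floor_log2(uint32_t k)
{
    unsigned d = 0;
    while (k >>= 1)
        d++;
    return d;
}

/* Write the Elias-gamma code of k (k >= 1) into out as '0'/'1' characters.
 * out needs room for at most 64 chars including the terminating NUL.
 * Returns the number of bits written. */
size_t gamma_encode(uint32_t k, char *out)
{
    assert(k >= 1);
    unsigned kd = floor_log2(k);            /* k_d */
    uint32_t kr = k - ((uint32_t)1 << kd);  /* k_r */
    size_t n = 0;

    for (unsigned i = 0; i < kd; i++)       /* unary part: kd ones ... */
        out[n++] = '1';
    out[n++] = '0';                         /* ... closed by one zero */
    for (unsigned i = kd; i-- > 0; )        /* kr in exactly kd bits, MSB first */
        out[n++] = ((kr >> i) & 1) ? '1' : '0';
    out[n] = '\0';
    return n;
}

/* Decode one gamma code from the start of a '0'/'1' string. */
uint32_t gamma_decode(const char *bits)
{
    unsigned kd = 0;
    while (*bits == '1') {                  /* count the unary ones */
        kd++;
        bits++;
    }
    bits++;                                 /* skip the closing zero */
    uint32_t kr = 0;
    for (unsigned i = 0; i < kd; i++)
        kr = (kr << 1) | (uint32_t)(*bits++ - '0');
    return ((uint32_t)1 << kd) + kr;
}
```

Under these assumptions the number 9 has $k_d = 3$ and $k_r = 1$, so it is written as 1110 followed by 001, seven bits in total.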

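3. Elias-delta code: With $k_d$ and $k_r$ defined as in the Elias-gamma code ($k_d = \lfloor \log_2 k \rfloor$, $k_r = k - 2^{k_d}$), $k_d$ is decomposed further into $k_{dd} = \lfloor \log_2 (k_d + 1) \rfloor$ and $k_{dr} = (k_d + 1) - 2^{k_{dd}}$.

$k_{dd}$ uses unary encoding, $k_{dr}$ uses binary encoding (in $k_{dd}$ bits), and $k_r$ uses binary encoding (in $k_d$ bits).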
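A small sketch of Elias-delta coding follows, under the same assumptions as usual for illustration code: bits are emitted as '0'/'1' characters, the unary part is written as ones closed by a zero, and the names `delta_encode` and `put_bits` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* floor(log2(k)) for k >= 1 */
static unsigned floor_log2(uint32_t k)
{
    unsigned d = 0;
    while (k >>= 1)
        d++;
    return d;
}

/* Append the low `width` bits of v to out at index n, MSB first.
 * Returns the new index. */
static size_t put_bits(char *out, size_t n, uint32_t v, unsigned width)
{
    for (unsigned i = width; i-- > 0; )
        out[n++] = ((v >> i) & 1) ? '1' : '0';
    return n;
}

/* Write the Elias-delta code of k (k >= 1) into out as '0'/'1' characters.
 * A 32-bit k needs at most 42 bits, so a 64-char buffer is enough.
 * Returns the number of bits written. */
size_t delta_encode(uint32_t k, char *out)
{
    assert(k >= 1);
    unsigned kd  = floor_log2(k);                    /* k_d  */
    uint32_t kr  = k - ((uint32_t)1 << kd);          /* k_r  */
    unsigned kdd = floor_log2(kd + 1);               /* k_dd */
    uint32_t kdr = (kd + 1) - ((uint32_t)1 << kdd);  /* k_dr */
    size_t n = 0;

    for (unsigned i = 0; i < kdd; i++)               /* k_dd in unary */
        out[n++] = '1';
    out[n++] = '0';
    n = put_bits(out, n, kdr, kdd);                  /* k_dr in k_dd bits */
    n = put_bits(out, n, kr, kd);                    /* k_r in k_d bits */
    out[n] = '\0';
    return n;
}
```

For the number 9 this gives $k_d = 3$, $k_r = 1$, $k_{dd} = 2$ and $k_{dr} = 0$, that is 110, 00 and 001. The delta code is longer than the gamma code for small numbers and shorter for large ones, because the length field itself is compressed.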

4. Variable byte code: The low 7 bits of each byte hold binary data, and the highest bit is a flag bit. The flag is 1 in the last byte of an encoded number and 0 in every other byte. Because the processing unit is a whole byte, variable byte coding is fast, but every number takes at least 8 bits, so its compression ratio is not high.

5. Golomb coding: The integer $x$ is expressed in two parts, a quotient $q$ and a remainder $r$. The quotient is $q = \lfloor (x - 1) / k \rfloor$ and the remainder is $r = x - q \cdot k - 1$. Here $k$ is the base of the Golomb coding algorithm.

If $r < P$, the remainder can be stored using $\lfloor \log_2 k \rfloor$ bits; otherwise $\lceil \log_2 k \rceil$ bits are needed. Here $P$ is the cut-off point, calculated as $P = 2^{\lfloor \log_2 k \rfloor + 1} - k$.

When $r < P$, the Golomb code consists of $q$ zeros, a 1, and $r$ in binary. Otherwise, it consists of $q$ zeros, a 1, and $r + P$ in binary. For example, with $k = 3$ the integer 9 gives $q = 2$, $r = 2$ and $P = 1$, so it is encoded as 00, 1, and 11.

The choice of the parameter $k$ is crucial. A poorly chosen $k$ makes the encoded integers very large and slows down decompression. Witten et al. (1994) assume that the integers in an inverted list follow a Bernoulli model, in which case $k$ for a list is calculated as $k \approx 0.69 \times$ the average of the values in that list.

Williams and Zobel describe how to optimize Golomb coding, and report that regular Golomb coding of integers is faster and more space-efficient than Elias-gamma and Elias-delta coding.
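The quotient and remainder rules above can be sketched as follows. This is only an illustration under stated assumptions: bits are emitted as '0'/'1' characters, the quotient is written as $q$ zeros followed by a one (the convention of the worked example above), and the name `golomb_encode` and its `cap` parameter are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* floor(log2(k)) for k >= 1 */
static unsigned floor_log2(uint32_t k)
{
    unsigned d = 0;
    while (k >>= 1)
        d++;
    return d;
}

/* Write the Golomb code of x (x >= 1) with parameter k (k >= 1) into out
 * as '0'/'1' characters. cap is the size of out in chars.
 * Returns the number of bits written, or 0 if out is too small. */
size_t golomb_encode(uint32_t x, uint32_t k, char *out, size_t cap)
{
    assert(x >= 1 && k >= 1);
    uint32_t q = (x - 1) / k;                 /* quotient */
    uint32_t r = x - q * k - 1;               /* remainder, 0 <= r < k */
    unsigned b = floor_log2(k);
    uint32_t p = ((uint32_t)2 << b) - k;      /* cut-off P = 2^(b+1) - k */
    unsigned width = (r < p) ? b : b + 1;     /* short or long remainder */
    uint32_t rem = (r < p) ? r : r + p;
    size_t n = 0;

    if ((size_t)q + 1 + width + 1 > cap)      /* zeros + '1' + remainder + NUL */
        return 0;
    for (uint32_t i = 0; i < q; i++)          /* q zeros ... */
        out[n++] = '0';
    out[n++] = '1';                           /* ... closed by a one */
    for (unsigned i = width; i-- > 0; )       /* remainder, MSB first */
        out[n++] = ((rem >> i) & 1) ? '1' : '0';
    out[n] = '\0';
    return n;
}
```

With `k = 3`, the integer 9 comes out as 00111, which matches the worked example. When `k` is a power of two, $P = k$, so every remainder uses exactly $\log_2 k$ bits.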

6. Binary interpolative coding: Encodes a monotonically increasing integer sequence using the values of neighboring numbers.

If, in an integer sequence $X_1$, the predecessor $x_{i-1}$ and the successor $x_{i+1}$ of a given integer $x_i$ are known, then $x_i$ lies in the range $[x_{i-1} + 1,\ x_{i+1} - 1]$, and the maximum number of bits required to store it is $\lceil \log_2 (x_{i+1} - x_{i-1} - 1) \rceil$. Decoding $x_i$ requires $x_{i-1}$ and $x_{i+1}$, so a second list $X_2$ is derived from the original $X_1$ by taking every second integer of $X_1$, and $X_2$ is in turn encoded recursively.

Binary interpolative coding (BIC) uses the two neighboring numbers to encode a monotonically increasing inverted list compactly and recursively, which gives it not only a high compression rate but also a high decoding speed. The algorithm also takes the frequency distribution of words in documents into account and improves the compression of the inverted list through clustering, which effectively improves the space efficiency of the index.
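The recursion can be sketched as below. This is one common way to organize binary interpolative coding, not necessarily the exact variant the text has in mind: it encodes the middle element first and then both halves, it narrows the interval further by the number of elements that must still fit on each side, and it uses plain fixed-width binary for each offset. The names `bic_encode`, `bic_decode` and `bits_for` are hypothetical, and the sketch assumes `hi - lo` is below $2^{32} - 1$.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits needed to distinguish `range` values: ceil(log2(range)). */
static unsigned bits_for(uint32_t range)
{
    unsigned b = 0;
    while (b < 32 && ((uint32_t)1 << b) < range)
        b++;
    return b;
}

/* Encode sorted, distinct v[0..n-1], all known to lie in [lo, hi].
 * Appends '0'/'1' characters to out and advances *pos. */
void bic_encode(const uint32_t *v, size_t n, uint32_t lo, uint32_t hi,
                char *out, size_t *pos)
{
    if (n == 0)
        return;
    size_t m = n / 2;
    /* m smaller values must fit below v[m], n-1-m larger ones above it */
    uint32_t min = lo + (uint32_t)m;
    uint32_t max = hi - (uint32_t)(n - 1 - m);
    assert(min <= v[m] && v[m] <= max);
    unsigned w = bits_for(max - min + 1);
    uint32_t off = v[m] - min;
    for (unsigned i = w; i-- > 0; )
        out[(*pos)++] = ((off >> i) & 1) ? '1' : '0';
    bic_encode(v, m, lo, v[m] - 1, out, pos);                  /* left half */
    bic_encode(v + m + 1, n - 1 - m, v[m] + 1, hi, out, pos);  /* right half */
}

/* Decode n values from in; n, lo and hi must match the encoder's. */
void bic_decode(uint32_t *v, size_t n, uint32_t lo, uint32_t hi,
                const char *in, size_t *pos)
{
    if (n == 0)
        return;
    size_t m = n / 2;
    uint32_t min = lo + (uint32_t)m;
    uint32_t max = hi - (uint32_t)(n - 1 - m);
    unsigned w = bits_for(max - min + 1);
    uint32_t off = 0;
    for (unsigned i = 0; i < w; i++)
        off = (off << 1) | (uint32_t)(in[(*pos)++] - '0');
    v[m] = min + off;
    bic_decode(v, m, lo, v[m] - 1, in, pos);
    bic_decode(v + m + 1, n - 1 - m, v[m] + 1, hi, in, pos);
}
```

Encoding the list 3, 8, 9, 11, 12, 13, 17 with `lo = 1` and `hi = 20` takes 17 bits in this sketch. The value 12 costs no bits at all, because once 11 and 13 are known it is the only value that fits between them.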

7. Other compression methods:

Inverted index compression using word-aligned binary codes - Anh V. N., Moffat A., 2005 (word-aligned binary codes)

Performance of compressed inverted list caching in search engines - Zhang J., Long X., Suel T., 2008

Inverted index compression and query processing with optimized document ordering - Yan H., Ding S., Suel T., 2009

The word-aligned binary code (WABC) method compresses the index. This encoding keeps the speed advantage of byte-level operations, and its compact binary form not only preserves the compression performance of the index but also improves the decoding speed of inverted lists during queries.

"Performance of compressed inverted list caching in search engines" considers the index compression and index caching mechanisms of a search engine together, and compares the performance of several compression algorithms, such as bit-level encodings, in order to further improve search performance.

"Inverted index compression and query processing with optimized document ordering" studies the ascending order of the document IDs in an inverted list; query efficiency is improved through a more compact representation, a fast intersection algorithm, and an optimized document ID ordering.

"Index compression using 64-bit words", Anh, Moffat (a good recent find that integrates the earlier compression methods and is worth studying further. Open source address: http://ww2.cs.mu.oz.au/~Alistair/coders-64bit/)

Related research by the author Alistair Moffat: http://ww2.cs.mu.oz.au/~Alistair/ABSTRACTS/

Reference: source code for the paper "Compression of inverted indexes for fast query evaluation": http://www.seg.rmit.edu.au/projects.html

For the posting triple <docid, frequency, position>:

1. First, the docid and position are encoded as d-gaps.

2. Then, the three elements of the posting are compressed, each with its own encoding method. For example, dvbyte-fgol-pgamma encodes the docid with variable byte code, the frequency with Golomb code, and the position with Elias-gamma code.

If Elias-gamma code, Elias-delta code, variable byte code, and Golomb code are selected for an experimental comparison, and each of the three elements uses a different code, there are 4 × 3 × 2 = 24 combinations in total.

This article only covers integer coding methods for compressing inverted indexes; for other compression methods, see Wikipedia: http://en.wikipedia.org/wiki/Data_compression

For more information, see:

1. Google: efficient implementation of the Group Varint lossless compression and decompression algorithm

2. Introduction to the doclist compression method

Code of the RVLCompress compression method in Lemur (variable byte encoding, also called v-byte coding):

    /*********************************************************
     @ Editor: weedge  E-mail: weege@126.com
     @ Date: 2011/08/30
     @ Comment:
       1. Compress int32 values and return the compressed length.
       2. Decompress the compressed byte data and return the number of values.
    *********************************************************/
    /* #include "rvlcompress.hpp" is not needed here: both functions are declared below */
    #include <stdlib.h>
    #include <stdio.h>

    #define pow2_7  128
    #define pow2_14 16384
    #define pow2_21 2097152
    #define pow2_28 268435456
    #define pow2_31 2147483648u

    #define rvl_compress_mask ((1 << 7) - 1)     /* 0111 1111 */
    #define rvl_compress_terminate_bit (1 << 7)  /* shift 7 bits left: 1000 0000 */
    /* store the b-th 7-bit group of in */
    #define rvl_compress_byte(d, in, b) d[b] = (char)(((in) >> 7 * (b)) & ((1 << 7) - 1))
    /* store the b-th 7-bit group of in and set the high bit to 1 */
    #define rvl_compress_terminate(d, in, b) d[b] = (char)(((in) >> 7 * (b)) | (1 << 7))

    /* returns number of bytes in result */
    static int compress_ints(int *data_ptr, unsigned char *out_ptr, int size);
    /* returns number of ints decompressed */
    static int decompress_ints(unsigned char *data_ptr, int *out_ptr, int num_bytes);

    /*********************************************************
     Decompress the compressed byte data and return the number of ints.
     @ Editor: weedge
     @ Parameters:
       data_ptr:  byte data to be decompressed
       out_ptr:   decompressed int32 data
       num_bytes: number of bytes to be decompressed
    *********************************************************/
    static int decompress_ints(unsigned char *data_ptr, int *out_ptr, int num_bytes)
    {
        unsigned char *data_end_ptr = data_ptr + num_bytes; /* points to the end of the array */
        unsigned char *data_curr_ptr;
        int *out_ptr_end = out_ptr;

        /* each iteration decodes one int; the input is assumed to be well formed */
        for (data_curr_ptr = data_ptr; data_curr_ptr < data_end_ptr; out_ptr_end++) {
            if (*data_curr_ptr & 128) { /* AND with 1000 0000: is the high bit 1? */
                *out_ptr_end = 127 & *data_curr_ptr;
                data_curr_ptr++;
            } else if (*(data_curr_ptr + 1) & 128) {
                *out_ptr_end = *data_curr_ptr |
                               ((*(data_curr_ptr + 1) & 127) << 7);
                data_curr_ptr += 2;
            } else if (*(data_curr_ptr + 2) & 128) {
                *out_ptr_end = *data_curr_ptr |
                               (*(data_curr_ptr + 1) << 7) |
                               ((*(data_curr_ptr + 2) & 127) << 14);
                data_curr_ptr += 3;
            } else if (*(data_curr_ptr + 3) & 128) {
                *out_ptr_end = *data_curr_ptr |
                               (*(data_curr_ptr + 1) << 7) |
                               (*(data_curr_ptr + 2) << 14) |
                               ((*(data_curr_ptr + 3) & 127) << 21);
                data_curr_ptr += 4;
            } else {
                /* the top group is shifted as unsigned to avoid signed overflow */
                *out_ptr_end = (int)(*data_curr_ptr |
                               (*(data_curr_ptr + 1) << 7) |
                               (*(data_curr_ptr + 2) << 14) |
                               (*(data_curr_ptr + 3) << 21) |
                               ((unsigned int)(*(data_curr_ptr + 4) & 127) << 28));
                data_curr_ptr += 5;
            }
        } /* for */
        return (int)(out_ptr_end - out_ptr);
    }

    /*********************************************************
     Note: apply d-gap encoding to the int32 data before compressing it,
     so that the values to compress are small.
     Compress int32 values and return the compressed length.
     @ Parameters:
       data_ptr: int data to be compressed
       out_ptr:  compressed byte data (up to 5 bytes per int)
       size:     number of ints to be compressed
    *********************************************************/
    static int compress_ints(int *data_ptr, unsigned char *out_ptr, int size)
    {
        int *data_end_ptr = data_ptr + size; /* points to the end of the array */
        int *data_curr_ptr;
        unsigned int n;
        unsigned char *out_ptr_end = out_ptr;

        for (data_curr_ptr = data_ptr; data_curr_ptr < data_end_ptr; data_curr_ptr++) {
            n = (unsigned int)*data_curr_ptr;
            if (n < pow2_7) {           /* less than 128 */
                *out_ptr_end++ = 128 | n; /* set the high bit: single, final byte */
            } else if (n < pow2_14) {   /* 128 <= n < 16384 (2^14) */
                *out_ptr_end = 127 & n;              /* low 7 bits of n */
                *(out_ptr_end + 1) = 128 | (n >> 7); /* next 7 bits, high bit set */
                out_ptr_end += 2;
            } else if (n < pow2_21) {
                *out_ptr_end = 127 & n;
                *(out_ptr_end + 1) = 127 & (n >> 7);
                *(out_ptr_end + 2) = 128 | (n >> 14);
                out_ptr_end += 3;
            } else if (n < pow2_28) {
                *out_ptr_end = 127 & n;
                *(out_ptr_end + 1) = 127 & (n >> 7);
                *(out_ptr_end + 2) = 127 & (n >> 14);
                *(out_ptr_end + 3) = 128 | (n >> 21);
                out_ptr_end += 4;
            } else {
                *out_ptr_end = 127 & n;
                *(out_ptr_end + 1) = 127 & (n >> 7);
                *(out_ptr_end + 2) = 127 & (n >> 14);
                *(out_ptr_end + 3) = 127 & (n >> 21);
                *(out_ptr_end + 4) = 128 | (n >> 28);
                out_ptr_end += 5;
    #if 0
                if (n >= pow2_31) {
                    fprintf(stderr, "Warning: value exceeded int limit in compression\n");
                }
    #endif
            }
        } /* for */
        return (int)(out_ptr_end - out_ptr);
    }

    int main(void)
    {
        int data[] = {1, 127, 128, 16384, 2097152, 268435456};
        unsigned char buf[5 * 6]; /* worst case: 5 bytes per int */
        int out[6];
        int nbytes = compress_ints(data, buf, 6);      /* test compress_ints */
        int nints = decompress_ints(buf, out, nbytes); /* test decompress_ints */

        printf("%d ints -> %d bytes -> %d ints\n", 6, nbytes, nints);
        return 0;
    }
