Lucene Basic Data compression processing

Source: Internet
Author: User

Lucene uses several special techniques to make stored information more compact and faster to access. These techniques can be confusing when reading the Lucene file format, so it is worth extracting them and introducing them on their own. The names given to the rules below are informal, chosen only so the rules can be referred to easily later on; please forgive any inaccuracies.

1. Prefix-suffix rule (Prefix+Suffix)

In its inverted index, Lucene stores the term dictionary, in which all terms are sorted in dictionary (lexicographic) order. The dictionary contains almost every word that appears in the documents, and some of those words are very long, so the index file would become very large. Under the so-called prefix-suffix rule, when a term shares a prefix with the previous term, only the length of the shared prefix (the offset at which the two terms diverge) and the remaining string (called the suffix) are stored.

For example, suppose the following terms are to be stored: term, termagancy, termagant, terminal. Stored in the normal way, they require the following space:

[VInt = 4][t][e][r][m], [VInt = 10][t][e][r][m][a][g][a][n][c][y], [VInt = 9][t][e][r][m][a][g][a][n][t], [VInt = 8][t][e][r][m][i][n][a][l]
A total of 35 bytes is required.
If the prefix-suffix rule is applied, the required space is as follows:
[VInt = 4][t][e][r][m], [VInt = 4 (offset)][VInt = 6][a][g][a][n][c][y], [VInt = 8 (offset)][VInt = 1][t], [VInt = 4 (offset)][VInt = 4][i][n][a][l]
A total of 22 bytes is required. This significantly reduces storage space, especially because the terms are kept in dictionary order, which greatly increases how often adjacent terms share a prefix.
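
The rule can be sketched in a few lines of Python. This is only an illustration of the idea, not Lucene's actual code: the function names are made up for this example, and the size calculation assumes ASCII terms and lengths small enough to fit in a one-byte VInt, with the first term written without an offset field as in the layout above.

```python
def common_prefix_len(a, b):
    """Length of the longest prefix shared by strings a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(terms):
    """Turn a sorted list of terms into (offset, suffix) pairs."""
    pairs = []
    previous = ""
    for term in terms:
        shared = common_prefix_len(previous, term)
        pairs.append((shared, term[shared:]))
        previous = term
    return pairs


def prefix_decode(pairs):
    """Rebuild the original terms from (offset, suffix) pairs."""
    terms = []
    previous = ""
    for shared, suffix in pairs:
        term = previous[:shared] + suffix
        terms.append(term)
        previous = term
    return terms


def encoded_size(pairs):
    """Bytes used, assuming one-byte VInts and one byte per character."""
    size = 0
    for i, (_, suffix) in enumerate(pairs):
        offset_bytes = 0 if i == 0 else 1  # first term carries no offset
        size += offset_bytes + 1 + len(suffix)
    return size


terms = ["term", "termagancy", "termagant", "terminal"]
pairs = prefix_encode(terms)
print(pairs)
print(encoded_size(pairs), "bytes")
```

For the four example terms this yields the pairs (0, "term"), (4, "agancy"), (8, "t"), (4, "inal") and a size of 22 bytes, and `prefix_decode` recovers the original list.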

2. Difference rule (Delta)

Lucene's inverted index has to store many integers, such as document IDs and the positions of a term within a document.
As described above, integers are stored in the VInt format, so the larger a value is, the more bytes it occupies. Under the so-called difference rule (Delta), when a sequence of integers is stored, each integer after the first is stored as the difference between it and the previous integer.

For example, suppose the following integers are to be stored: 16386, 16387, 16388, 16389. Stored in the normal way, they require the following space:

[(1)000 0010][(1)000 0000][(0)000 0001], [(1)000 0011][(1)000 0000][(0)000 0001], [(1)000 0100][(1)000 0000][(0)000 0001], [(1)000 0101][(1)000 0000][(0)000 0001]
A total of 12 bytes is required.
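
The byte patterns above follow from the VInt layout: seven data bits per byte, low-order bits first, with the high bit set on every byte except the last. A minimal Python sketch of that encoding follows; `vint_encode` and `show` are names invented for this example, not Lucene APIs.

```python
def vint_encode(value):
    """Encode a non-negative integer as a VInt (7 data bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while value >= 0x80:
        # low 7 bits, with the continuation bit set
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte, continuation bit clear
    return bytes(out)


def show(encoded):
    """Render bytes in the bracketed bit notation used in this article."""
    parts = []
    for b in encoded:
        bits = format(b, "08b")
        parts.append("[(%s)%s %s]" % (bits[0], bits[1:4], bits[4:]))
    return "".join(parts)


numbers = [16386, 16387, 16388, 16389]
for n in numbers:
    print(n, show(vint_encode(n)))
total_bytes = sum(len(vint_encode(n)) for n in numbers)
print(total_bytes, "bytes")
```

Each of the four numbers needs three bytes because it is at least 2^14 = 16384, which no longer fits in two groups of seven bits.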
If the difference rule is applied, the required space is as follows:

[(1)000 0010][(1)000 0000][(0)000 0001], [(0)000 0001], [(0)000 0001], [(0)000 0001]
A total of 6 bytes is required.
This greatly reduces storage space. The rule works well because both document IDs and term positions within a document are stored in increasing order, so the differences between neighbouring values are small.
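
As a rough sketch of the difference rule in Python (the helper names are hypothetical and this is not Lucene's implementation), the values are turned into deltas before writing, and a running sum restores them on reading. `vint_len` only counts how many bytes a VInt would occupy.

```python
def vint_len(value):
    """Number of bytes a non-negative integer occupies as a VInt."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def delta_encode(values):
    """Keep the first value; replace each later value by its difference
    from the previous one. Assumes the input is in increasing order."""
    deltas = []
    previous = 0
    for v in values:
        deltas.append(v - previous)
        previous = v
    return deltas


def delta_decode(deltas):
    """Restore the original values with a running sum."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


doc_ids = [16386, 16387, 16388, 16389]
deltas = delta_encode(doc_ids)
plain_size = sum(vint_len(v) for v in doc_ids)
delta_size = sum(vint_len(d) for d in deltas)
print(deltas, plain_size, delta_size)
```

For the example sequence the deltas are 16386, 1, 1, 1, which take 3 + 1 + 1 + 1 = 6 bytes instead of 12.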

3. Optional-following rule (A, B?)
