Lucene Learning Summary, Part 4: Lucene Index File Format (1)

Last Update:2014-12-23 Source: Internet

Author: User

This article was reproduced from: http://www.cnblogs.com/forfuture1978/archive/2009/12/14/1623597.html

What a Lucene index stores and how it stores it, that is, the Lucene index file format, is a key to reading the Lucene source code.

Once we actually get into the Lucene source code, we will find that:

Lucene's indexing process writes the inverted lists into this file format, following the basic process of full-text retrieval.
Lucene's search process reads the indexed information back out of this file format and then computes a score for each document.

This article explains in detail the document Apache Lucene - Index File Formats (http://lucene.apache.org/java/2_9_0/fileformats.html).

I. Basic Concepts

The original figure (not reproduced here) showed the files of an index instance produced by Lucene:

Lucene's index structure is hierarchical and consists of the following levels:

Index:
- In Lucene, an index is kept in a single folder.
- In the figure, for example, all the files in the same folder make up one Lucene index.
Segment:
- An index can contain multiple segments. Segments are independent of one another; adding new documents can create new segments, and different segments can be merged.
- In the figure, files with the same prefix belong to the same segment; there are two segments, "_0" and "_1".
- segments.gen and segments_5 are the segment metadata files, that is, they hold the attribute information of the segments.
Document:
- A document is the basic unit of indexing. Different documents are stored in different segments, and one segment can contain multiple documents.
- Newly added documents are stored on their own in a newly created segment; as segments are merged, different documents end up in the same segment.
Field:
- A document contains different kinds of information, such as title, time, body, and author, which can be stored in different fields and indexed separately.
- Different fields can be indexed in different ways; this is explained in detail when the storage of fields is analyzed.
Term:
- A term is the smallest unit of the index: a string that has been through lexical analysis and language processing.

Lucene's index structure stores both forward information and inverted information.

The so-called forward information:

It stores the mapping from the index down to terms, level by level: index --> segment --> document --> field --> term.
That is, which segments the index contains, which documents each segment contains, which fields each document contains, and which terms each field contains.
Since this is a hierarchy, each level stores information about itself plus meta-information (attribute information) about the next level down. It is like a book on Chinese geography, which first gives an overview of China's geography and the number of provinces China contains; each province's chapter then gives an overview of that province and the number of cities it contains; each city's section gives an overview of the city and the number of counties it contains; and each county's entry describes that county in detail.
The files that contain forward information are:
- segments_N stores how many segments this index contains and how many documents each segment contains.
- XXX.fnm stores how many fields this segment contains, and the name and indexing method of each field.
- XXX.fdx and XXX.fdt store all the documents contained in this segment, how many fields each document contains, and what information each field stores.
- XXX.tvx, XXX.tvd, and XXX.tvf store how many documents this segment contains, how many fields each document contains, how many terms each field contains, and each term's string, positions, and other information.

The so-called inverted information:

It stores the mapping from the dictionary to the inverted lists: term --> document.
The files that contain inverted information are:
- XXX.tis and XXX.tii store the term dictionary, that is, all the terms contained in this segment, sorted in dictionary order.
- XXX.frq stores the inverted lists, that is, the list of IDs of the documents that contain each term.
- XXX.prx stores, for each term in the inverted lists, the positions of the term within each document that contains it.

Before looking at the detailed structure of the Lucene index, it helps to understand the basic data types used in it.

II. Basic Types

Lucene index files store information using the following basic types:

Byte: The most basic type, 8 bits long.
UInt32: Consists of 4 bytes.
UInt64: Consists of 8 bytes.
VInt:
- A variable-length integer type that may span several bytes. In each 8-bit byte, the low 7 bits carry part of the value, and the highest bit indicates whether another byte follows: 0 means no, 1 means yes.
- Earlier bytes hold the low-order bits of the value, and later bytes hold the high-order bits.
- For example, 130 in binary is 1000 0010, 8 bits in total, which does not fit in the 7 value bits of a single byte, so two bytes are needed. The first byte holds the low 7 bits, with its highest bit set to 1 to show that another byte follows, giving (1) 0000010. The second byte holds the 8th bit, with its highest bit set to 0 to show that no byte follows, giving (0) 0000001.
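The VInt layout described above can be sketched in a few lines of Python. This is only an illustration of the byte layout, not Lucene's own code; the function names `encode_vint` and `decode_vint` are made up for this example.

```python
from typing import Tuple


def encode_vint(value: int) -> bytes:
    """Encode a non-negative integer as a VInt, low-order 7 bits first."""
    if value < 0:
        raise ValueError("this sketch only encodes non-negative integers")
    out = bytearray()
    while value > 0x7F:
        # Low 7 bits of the value, with the highest bit set: another byte follows.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    # Last byte: highest bit clear, no byte follows.
    out.append(value)
    return bytes(out)


def decode_vint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one VInt starting at pos; return (value, position after it)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


# 130 becomes the two bytes (1)0000010 (0)0000001, as in the example above.
print([format(b, "08b") for b in encode_vint(130)])
```

Decoding returns the position after the value so that several VInts stored back to back can be read one after another.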

Chars: A sequence of UTF-8 encoded bytes.
String: A string is stored as a VInt giving the number of characters the string contains, followed by the UTF-8 encoded character sequence (Chars).

III. Basic Rules

To make stored information take less space and faster to access, Lucene uses a few special tricks. These tricks can easily confuse us when reading the Lucene file format, so it is worth extracting these rules and introducing them first.

The names given to these rules below are my own, chosen so that the rules can be referred to concisely later on; please bear with any inaccuracies.

1. Prefix-Suffix Rule (Prefix+Suffix)

When Lucene stores the term dictionary in the inverted index, all the terms in the dictionary are sorted in dictionary order. The dictionary, however, contains nearly every term that occurs in the documents, and some terms are very long, so the index file would become very large. Under the prefix-suffix rule, when a term shares a prefix with the preceding term, only the length of the shared prefix (the offset) and the remaining string (the suffix) are stored for it.

For example, suppose the following terms are to be stored: term, termagancy, termagant, terminal.

Stored in the normal way, they take the following space:

[VInt = 4][t][e][r][m], [VInt = 10][t][e][r][m][a][g][a][n][c][y], [VInt = 9][t][e][r][m][a][g][a][n][t], [VInt = 8][t][e][r][m][i][n][a][l]

A total of 35 bytes is required.

With the prefix-suffix rule applied, they take the following space:

[VInt = 4][t][e][r][m], [VInt = 4 (offset)][VInt = 6][a][g][a][n][c][y], [VInt = 8 (offset)][VInt = 1][t], [VInt = 4 (offset)][VInt = 4][i][n][a][l]

A total of 22 bytes is required.

This significantly reduces storage space. Because the terms are sorted in dictionary order, adjacent terms share prefixes far more often, which makes the saving even greater.
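The byte counts above can be checked with a short Python sketch. It follows the layout used in the example (the first term stored whole, each later term stored as offset, suffix length, suffix) and assumes ASCII terms shorter than 128 characters, so that every length fits in a one-byte VInt. The function names are hypothetical, and this is not how Lucene's own writer is structured.

```python
def common_prefix_len(a: str, b: str) -> int:
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(sorted_terms):
    """Turn sorted terms into (offset, suffix) pairs relative to the previous term."""
    encoded = []
    previous = ""
    for term in sorted_terms:
        shared = common_prefix_len(previous, term)
        encoded.append((shared, term[shared:]))
        previous = term
    return encoded


def prefix_decode(encoded):
    """Rebuild the original terms from (offset, suffix) pairs."""
    terms = []
    previous = ""
    for shared, suffix in encoded:
        previous = previous[:shared] + suffix
        terms.append(previous)
    return terms


def plain_size(terms):
    # One length byte plus the characters of each term.
    return sum(1 + len(t) for t in terms)


def prefixed_size(terms):
    # First term: length byte + characters (no offset, as in the example above).
    # Later terms: offset byte + suffix-length byte + suffix characters.
    encoded = prefix_encode(terms)
    first = 1 + len(terms[0])
    rest = sum(2 + len(suffix) for _, suffix in encoded[1:])
    return first + rest


terms = ["term", "termagancy", "termagant", "terminal"]
print(prefix_encode(terms))
print(plain_size(terms), prefixed_size(terms))
```

Decoding has to proceed in order, because each term is reconstructed from the one before it.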

2. Difference Rule (Delta)

Lucene's inverted index has to store many integers, such as document IDs and the positions of terms within documents.

As described above, integers are stored in VInt format, so the larger a number is, the more bytes it occupies. Under the difference rule (delta), when a sequence of integers is stored, each integer after the first is stored as its difference from the preceding integer.

For example, suppose the following integers are to be stored: 16386, 16387, 16388, 16389.

Stored in the normal way, they take the following space:

[(1) 000, 0010][(1) 000, 0000][(0) 000, 0001], [(1) 000, 0011][(1) 000, 0000][(0) 000, 0001], [(1) 000, 0100][(1) 000, 0000][(0) 000, 0001], [(1) 000, 0101][(1) 000, 0000][(0) 000, 0001]

A total of 12 bytes is required.

With the difference rule applied, they take the following space:

[(1) 000, 0010][(1) 000, 0000][(0) 000, 0001], [(0) 000, 0001], [(0) 000, 0001], [(0) 000, 0001]

A total of 6 bytes is required.

This greatly reduces storage space. The rule works well because both document IDs and the positions of a term within a document are stored in ascending order, so the differences stay small.
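As a rough check of the figures above, the following Python sketch combines the VInt encoding with the difference rule. The helper names are made up for this illustration, and the input is assumed to be a non-decreasing list of non-negative integers.

```python
def encode_vint(value: int) -> bytes:
    """Encode a non-negative integer as a VInt, low-order 7 bits first."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)  # highest bit set: another byte follows
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_vint(data: bytes, pos: int):
    """Decode one VInt at pos; return (value, position after it)."""
    value = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def plain_encode(numbers) -> bytes:
    """Store every number as its own VInt."""
    return b"".join(encode_vint(n) for n in numbers)


def delta_encode(numbers) -> bytes:
    """Store the first number as is, then each difference from its predecessor."""
    out = bytearray()
    previous = 0
    for n in numbers:
        out += encode_vint(n - previous)
        previous = n
    return bytes(out)


def delta_decode(data: bytes):
    """Rebuild the original numbers by accumulating the stored differences."""
    numbers = []
    previous = 0
    pos = 0
    while pos < len(data):
        delta, pos = decode_vint(data, pos)
        previous += delta
        numbers.append(previous)
    return numbers


doc_ids = [16386, 16387, 16388, 16389]
print(len(plain_encode(doc_ids)), len(delta_encode(doc_ids)))
```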

3. Contingent Following Rule (A, B?)

In Lucene's index structure there are cases where a value A may or may not be followed by a value B, so a flag is needed to indicate whether B follows.

The usual approach is to place a byte after A: 0 means B does not follow and 1 means B follows, or the other way round.

But that wastes a whole byte when a single bit would be enough.

Lucene takes the following approach: the value of A is shifted left by one bit, which frees the lowest bit to serve as the flag indicating whether B follows. In this case the stored value divided by 2 (A/2 in the format description) is the true value of A.
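A minimal Python sketch of this bit trick follows. It works on plain integers rather than bytes (in an index file each stored integer would itself be written as a VInt), and it assumes that a flag bit of 1 means "B follows"; which bit value means what is an assumption of this sketch and is fixed by each individual file format. The function names are hypothetical.

```python
def encode_a_b(a: int, b=None):
    """Return the integers to store for A and an optional B."""
    if b is None:
        return [a << 1]           # lowest bit 0: nothing follows
    return [(a << 1) | 1, b]      # lowest bit 1: B follows


def decode_a_b(stored, pos: int = 0):
    """Read A and an optional B at pos; return (a, b_or_None, next position)."""
    head = stored[pos]
    a = head >> 1                 # the true value of A is the stored value / 2
    if head & 1:
        return a, stored[pos + 1], pos + 2
    return a, None, pos + 1


print(encode_a_b(5))       # A alone
print(encode_a_b(5, 9))    # A followed by B
```

The cost is that A loses one bit of range, which is negligible once A is stored as a variable-length integer.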

If you read the document Apache Lucene - Index File Formats, you will find many formats that follow this rule:

In the .frq file: DocDelta[, Freq?] and DocSkip, PayloadLength?
In the .prx file: PositionDelta, Payload? (though not exactly, as analyzed below)

Of course, some items marked with ? do not follow this rule:

SkipChildLevelPointer? in the .frq file is a pointer to the next level down in a multi-level skip list. On the last level this value naturally does not exist, so no flag is needed.
Positions? and Offsets? in the .tvf file.
- In such cases, whether the item marked with ? exists does not depend on the lowest bit of the preceding value.
- It depends instead on a Lucene configuration setting, which is of course also stored in the Lucene index files.
- For example, whether positions and offsets are stored depends on each field's configuration in the .fnm file (TermVector.WITH_POSITIONS and TermVector.WITH_OFFSETS).

The reason there are two cases is easy to understand:

For the contingent following rule, whether B follows differs for each A, B pair, so a flag is needed for every pair. When there are very many such pairs, shrinking that flag from a byte to a bit, an eightfold saving, is well worthwhile.
For the items that do not follow the rule, the setting that decides whether the value exists applies to the whole field or even the whole index rather than changing each time, so a single flag can be stored once for all of them.

The description of the following format in that document is confusing:

Positions --> <PositionDelta, Payload?> ^ Freq

Payload --> <PayloadLength?, PayloadData>

Do PositionDelta and Payload follow the contingent following rule? And how is it determined whether PayloadLength exists?

In fact, PositionDelta and Payload do not follow the contingent following rule. Whether Payload exists is determined by each field's payload setting in the .fnm file (FieldOption.STORES_PAYLOADS).

When Payload does not exist, PositionDelta by itself does not follow the contingent following rule.

When Payload exists, the format should read: Positions --> <PositionDelta, PayloadLength?, PayloadData> ^ Freq

Thus it is PositionDelta and PayloadLength that together follow the contingent following rule.

4. Skip List Rule (Skip List)

To improve lookup performance, Lucene uses the skip list data structure in many places.

A skip list is a data structure with the following basic features:

Elements are kept in order; in Lucene they are sorted either in dictionary order or from small to large.
Skips have an interval, that is, the number of elements covered by each skip. The interval is configured in advance; in the original figure's skip list the interval is 3.
Skip lists have levels: every interval-th element of one level forms the level above it. The original figure's skip list has 2 levels.

Note that many data structure and algorithm books describe skip lists. The principle is roughly the same, but the definitions differ slightly:

Definition of interval: some take the interval to be 2, the number of elements strictly between two upper-level elements; some take it to be 3, the difference between two upper-level elements, counting the later upper-level element but not the earlier one; and some take it to be 4, counting both upper-level elements as well as the elements between them. Lucene uses the second definition.
Definition of level: some include the original linked list and count from 1, giving 3 levels in total (levels 1, 2, 3); some include the original list and count from 0, giving levels 0, 1, 2; some exclude the original list and count from 1, giving levels 1, 2; and some exclude the original list and count from 0, giving levels 0, 1. Lucene uses the last definition.

Compared with a sequential scan, a skip list greatly speeds up lookup. For example, to find element 72, a sequential scan visits 2, 3, 7, 12, 23, 37, 39, 44, 50, 72, a total of 10 elements. With the skip list, we first visit 50 on the 1st level and find that 72 is greater than 50; since the 1st level has no next node, we then visit 94 on the 2nd level and find that 94 is greater than 72; finally we visit 72 in the original list and find the element. Only 3 elements are visited in total.
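The lookup just described can be sketched in Python as follows. This is a textbook-style illustration and not Lucene's implementation; the function names are made up. The element values up to 72, and the value 94, come from the text above, while 80 is a hypothetical filler so that 94 falls on an interval boundary. In this sketch level 0 is the original list and higher numbers are sparser levels, which is only one of the numbering conventions listed above.

```python
def build_levels(count: int, interval: int):
    """Return the levels as lists of positions in the original list.

    Level 0 holds every position; each higher level holds every
    interval-th node of the level below it.
    """
    if interval < 2:
        raise ValueError("interval must be at least 2")
    levels = [list(range(count))]
    while True:
        upper = levels[-1][interval - 1::interval]
        if not upper:
            break
        levels.append(upper)
    return levels


def skip_search(values, levels, interval: int, target):
    """Search from the sparsest level down; return (found, visited values)."""
    visited = []
    pos = -1  # index, within the current level, of the last node <= target
    for depth in range(len(levels) - 1, -1, -1):
        level = levels[depth]
        while pos + 1 < len(level):
            candidate = values[level[pos + 1]]
            visited.append(candidate)
            if candidate > target:
                break             # overshoot: drop down one level
            pos += 1
            if candidate == target:
                return True, visited
        if depth > 0:
            # The same node sits at this index one level further down.
            pos = (pos + 1) * interval - 1
    return False, visited


values = [2, 3, 7, 12, 23, 37, 39, 44, 50, 72, 80, 94]  # 80 is hypothetical
levels = build_levels(len(values), 3)
print(skip_search(values, levels, 3, 72))
```

With an interval of 3 the sketch builds two skip levels above the original list, and the search for 72 touches 50, then 94, then 72, matching the three visits counted above.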

However, Lucene's concrete implementation differs from the theory; this will be described in detail when the specific formats are covered.
