Solr4.8.0 source code analysis (8) Lucene index file (1)

Source: Internet
Author: User

Note: I recently came across the blog of a Lucene veteran and realized how superficial my earlier study and work had been, so I decided to follow that blog to learn the principles of Lucene. Because the blog covers the 3.x series, I studied the 4.x series by combining it with the source code. Some content therefore differs from the blog, and some of it is my own understanding.

http://www.cnblogs.com/forfuture1978/archive/2009/12/14/1623597.html

I. Basic Types

Lucene index files store their information using the following basic types:

  • Byte: the most basic type, 8 bits long. It is the smallest unit Lucene works with.
  • Short: two bytes.
  • Int: four bytes.
  • Long: eight bytes.
  • VInt:
    • A variable-length integer that may span several bytes. In each byte, the low seven bits hold part of the value and the highest bit indicates whether another byte follows: 0 means no, 1 means yes.
    • Takes between 1 and 5 bytes.
    • Earlier bytes hold the low-order bits of the value; later bytes hold the high-order bits.
    • For example, 130 is 10000010 in binary, which needs 8 bits. One byte has only 7 value bits and cannot hold it, so two bytes are required. The first byte holds the low 7 bits, with its highest bit set to 1 to indicate that another byte follows: (1)0000010. The second byte holds the 8th bit, with its highest bit set to 0 to indicate that no byte follows: (0)0000001.
      Value     Byte 1      Byte 2      Byte 3
      0         00000000
      1         00000001
      2         00000010
      ...
      127       01111111
      128       10000000    00000001
      129       10000001    00000001
      130       10000010    00000001
      ...
      16,383    11111111    01111111
      16,384    10000000    10000000    00000001
      16,385    10000001    10000000    00000001
      ...
  • VLong:
    • Follows the same rule as VInt; only the maximum number of bytes differs.
    • Takes between 1 and 9 bytes.
  • Chars: a sequence of bytes encoded in UTF-8.
  • String: a VInt giving the length of the string, followed by its character sequence (Chars) encoded in UTF-8. In 4.x the length is the number of UTF-8 bytes, not the number of characters.
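The byte ranges quoted for VInt and VLong can be checked with a small sketch. The class and method names below (VarIntSize, vIntSize, vLongSize) are made up for illustration and are not part of Lucene; the sketch only counts how many 7-bit groups a value needs, assuming non-negative values for the VLong case.

```java
final class VarIntSize {
    /** Bytes needed to store value as a VInt: 7 payload bits per byte. */
    static int vIntSize(int value) {
        int bytes = 1;
        // While any bit above the low seven is set, one more byte is needed.
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Same rule for a VLong; assumes a non-negative value. */
    static int vLongSize(long value) {
        int bytes = 1;
        while ((value & ~0x7FL) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }
}
```

With this counting rule, 127 fits in one byte, 128 and 16,383 need two, 16,384 needs three, Integer.MAX_VALUE needs five, and Long.MAX_VALUE needs nine, which matches the table and the ranges above.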

These types can be found in DataOutput.java and DataInput.java. The following uses DataOutput.java as an example.

  • Int storage: the value is shifted and written out as four separate bytes, high-order byte first.
    /** Writes an int as four bytes.
     * <p>
     * 32-bit unsigned integer written as four bytes, high-order bytes first.
     *
     * @see DataInput#readInt()
     */
    public void writeInt(int i) throws IOException {
      writeByte((byte)(i >> 24));
      writeByte((byte)(i >> 16));
      writeByte((byte)(i >>  8));
      writeByte((byte) i);
    }
  • VInt storage: the value is first ANDed with ~0x7F to test whether any bit above the low seven is set, that is, whether the value is greater than 127. If it is, the low seven bits are written with the 8th bit set to 1 (indicating that another byte follows) and the value is shifted right by 7 bits (an unsigned division by 128). Otherwise the byte is written directly (indicating that no byte follows).
    public final void writeVInt(int i) throws IOException {
      while ((i & ~0x7F) != 0) {
        writeByte((byte)((i & 0x7F) | 0x80));
        i >>>= 7;
      }
      writeByte((byte)i);
    }
  • Long and VLong work the same way as Int and VInt.
  • String storage: the string is first converted to UTF-8, then the length of the UTF-8 encoding is written as a VInt (utf8Result.length, a byte count), and finally the encoded bytes themselves are written.
    public void writeString(String s) throws IOException {
      final BytesRef utf8Result = new BytesRef(10);
      UnicodeUtil.UTF16toUTF8(s, 0, s.length(), utf8Result);
      writeVInt(utf8Result.length);
      writeBytes(utf8Result.bytes, 0, utf8Result.length);
    }
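
To see the reading side as well, here is a minimal round-trip sketch over an in-memory byte array. It is not Lucene's DataInput: the class VIntRoundTrip and its nested Reader are hypothetical names, and the JDK's StandardCharsets.UTF_8 is swapped in for Lucene's UnicodeUtil. The write methods mirror the logic quoted above; the read methods are an assumption about how the inverse would look.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class VIntRoundTrip {
    /** Same loop as writeVInt above, targeting an in-memory stream. */
    static void writeVInt(ByteArrayOutputStream out, int i) {
        while ((i & ~0x7F) != 0) {
            out.write((i & 0x7F) | 0x80); // low 7 bits, continuation bit set
            i >>>= 7;
        }
        out.write(i); // last byte, continuation bit clear
    }

    /** VInt byte length followed by the UTF-8 bytes. */
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeVInt(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    /** Hypothetical minimal reader; pos is the next byte to read. */
    static final class Reader {
        final byte[] buf;
        int pos;

        Reader(byte[] buf) {
            this.buf = buf;
        }

        int readVInt() {
            int value = 0;
            int shift = 0;
            byte b;
            do {
                b = buf[pos++];
                value |= (b & 0x7F) << shift; // earlier bytes carry lower bits
                shift += 7;
            } while ((b & 0x80) != 0);        // stop when the high bit is 0
            return value;
        }

        String readString() {
            int length = readVInt();          // length in UTF-8 bytes
            String s = new String(buf, pos, length, StandardCharsets.UTF_8);
            pos += length;
            return s;
        }
    }
}
```

Writing 130 with this sketch produces the two bytes 10000010 00000001 from the table, and a string containing a non-ASCII character shows that the stored length counts bytes rather than characters.
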
II. Basic Rules

Lucene uses a number of special techniques to make stored information take less space and be faster to access. When reading the Lucene file formats, however, these techniques can easily be confusing, so it is worth pulling them out and introducing them separately.

The names of the rules below follow the blog linked above.

1. Prefix and suffix rules (prefix + suffix)

In its inverted index, Lucene has to store the term dictionary. All terms are listed in the dictionary in alphabetical order. However, the dictionary contains almost every word in the documents, and some words are very long, so the index file would become very large. The prefix and suffix rule says that when a term shares a prefix with the previous term, the later term stores only the length of the shared prefix (the offset) and the rest of the string (the suffix). The benefit is a much smaller storage footprint.

For example, to store the following terms: term, termagancy, termagant, terminal.

If they are stored the normal way, the space required is as follows (a String is stored as VInt + chars):

[VInt = 4] [t] [e] [r] [m], [VInt = 10] [t] [e] [r] [m] [a] [g] [a] [n] [c] [y], [VInt = 9] [t] [e] [r] [m] [a] [g] [a] [n] [t], [VInt = 8] [t] [e] [r] [m] [i] [n] [a] [l]

A total of 35 bytes are required.

If prefix and suffix rules are applied, the required space is as follows:

[VInt = 4] [t] [e] [r] [m],

[VInt = 4 (offset)] [VInt = 6] [a] [g] [a] [n] [c] [y], meaning 6 characters are stored and the first 4 are taken from the previous term,

[VInt = 8 (offset)] [VInt = 1] [t], meaning 1 character is stored and the first 8 are taken from the previous term,

[VInt = 4 (offset)] [VInt = 4] [i] [n] [a] [l], meaning 4 characters are stored and the first 4 are taken from the previous term.

A total of 22 bytes are required.

The storage space is greatly reduced. The saving is especially large because the terms are sorted alphabetically, which makes shared prefixes very common.
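
The rule can be sketched in a few lines of Java. This is an illustration only, not Lucene's term dictionary code: PrefixCoding, Entry, encode and decode are hypothetical names, and unlike the layout above the sketch also stores a shared-prefix length of 0 for the first term to keep every entry uniform.

```java
import java.util.ArrayList;
import java.util.List;

final class PrefixCoding {
    /** One encoded term: length of the prefix shared with the previous term, plus the suffix. */
    static final class Entry {
        final int prefixLength;
        final String suffix;

        Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes terms that are already sorted alphabetically. */
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int max = Math.min(previous.length(), term.length());
            int shared = 0;
            while (shared < max && previous.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, term.substring(shared)));
            previous = term;
        }
        return out;
    }

    /** Rebuilds each term from the previous term's prefix and the stored suffix. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String previous = "";
        for (Entry e : entries) {
            String term = previous.substring(0, e.prefixLength) + e.suffix;
            out.add(term);
            previous = term;
        }
        return out;
    }
}
```

For the four example terms, encode yields (0, "term"), (4, "agancy"), (8, "t"), (4, "inal"), and decode restores the original list. Note that decoding must proceed in order, because every term depends on the one before it.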

2. Difference rule (DELTA)

The prefix and suffix rule applies to strings; the delta rule applies to numbers.

Lucene's inverted index has to store a great many integers, such as document IDs and the positions of a term within a document.

As described above, integers are stored in VInt format, so the larger a value is, the more bytes it occupies. The delta rule stores each integer as its difference from the previous one instead of the full value.

For example, to store the following integers: 16386, 16387, 16388, 16389

If they are stored the normal way, the space required is as follows:

[(1)0000010] [(1)0000000] [(0)0000001], [(1)0000011] [(1)0000000] [(0)0000001], [(1)0000100] [(1)0000000] [(0)0000001], [(1)0000101] [(1)0000000] [(0)0000001]

A total of 12 bytes are required.

If the delta rule is applied, the space required is as follows:

[(1)0000010] [(1)0000000] [(0)0000001], [(0)0000001], [(0)0000001], [(0)0000001]

A total of 6 bytes are required.

The storage space is greatly reduced. The rule fits well because both document IDs and term positions are stored in increasing order, so the differences are small and non-negative.
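
The byte counts above can be reproduced with a short sketch. The names DeltaCoding, toDeltas, fromDeltas and encodedSize are hypothetical and chosen for this example; the first value is stored as its difference from 0, which matches the layout shown above where the first integer is written in full.

```java
final class DeltaCoding {
    /** Replaces each value of an ascending array by its difference from the previous one. */
    static int[] toDeltas(int[] ascending) {
        int[] deltas = new int[ascending.length];
        int previous = 0;
        for (int i = 0; i < ascending.length; i++) {
            deltas[i] = ascending[i] - previous;
            previous = ascending[i];
        }
        return deltas;
    }

    /** Inverse of toDeltas: a running sum restores the original values. */
    static int[] fromDeltas(int[] deltas) {
        int[] values = new int[deltas.length];
        int previous = 0;
        for (int i = 0; i < deltas.length; i++) {
            previous += deltas[i];
            values[i] = previous;
        }
        return values;
    }

    /** Bytes a single value occupies as a VInt (7 payload bits per byte). */
    static int vIntSize(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    /** Total VInt size of all values in the array. */
    static int encodedSize(int[] values) {
        int total = 0;
        for (int v : values) {
            total += vIntSize(v);
        }
        return total;
    }
}
```

For 16386, 16387, 16388, 16389 the deltas are 16386, 1, 1, 1; encodedSize gives 12 for the raw values and 6 for the deltas.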

3. Optional-follow rule (A, B?)

4. Skip list rule (Skip List)

To improve search performance, Lucene uses the skip list data structure in many places.

A skip list is a data structure with the following basic features:

  • Elements are kept in order; in Lucene, either alphabetically or in ascending numeric order.
  • Skips happen at a fixed interval, that is, the number of elements covered by each hop. The interval is configured in advance; in the example skip list it is 3.
  • A skip list has levels. The elements taken at the specified interval from each level make up the level above it. The example skip list has two levels.

Note that skip lists are described in many data-structure and algorithm texts. The principles are roughly the same, but the definitions differ slightly:

  • Definition of interval: in the example, some count the interval as 2, that is, the number of elements strictly between two upper-level elements; some count it as 3, that is, the distance between two upper-level elements, including the later upper-level element but not the earlier one; some count it as 4, that is, the elements in between plus both upper-level elements. Lucene uses the second definition.
  • Definition of level: in the example, some include the original list and count from 1, giving three levels: 1, 2 and 3; some include the original list and count from 0, giving levels 0, 1 and 2; some exclude the original list and count from 1, giving levels 1 and 2; some exclude the original list and count from 0, giving levels 0 and 1. Lucene uses the last definition.

Compared with a sequential scan, a skip list greatly speeds up lookups. Take element 72 as an example: a sequential scan has to access 10 elements in total. With the skip list, you first access 50 on level 1 and find that 72 is greater than 50 and that level 1 has no next node; you then drop down one level and access 94, finding that 94 is greater than 72; finally you access 72 in the original list and find the element. Only three elements are accessed.
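
The lookup idea can be sketched with plain arrays. This is an illustrative toy, not Lucene's skip list implementation: SkipListSketch, build and search are hypothetical names, the sample values in the usage note are made up because the original figure is not available, and the list numbering here (index 0 is the original list) is a convenience of the sketch rather than Lucene's level numbering.

```java
import java.util.ArrayList;
import java.util.List;

final class SkipListSketch {
    /**
     * lists.get(0) is the original sorted list; lists.get(k) for k >= 1 holds
     * every interval-th element of lists.get(k - 1).
     */
    static List<int[]> build(int[] sorted, int interval) {
        if (interval < 2) {
            throw new IllegalArgumentException("interval must be at least 2");
        }
        List<int[]> lists = new ArrayList<>();
        lists.add(sorted);
        int[] below = sorted;
        while (below.length >= interval) {
            int[] level = new int[below.length / interval];
            for (int j = 0; j < level.length; j++) {
                level[j] = below[(j + 1) * interval - 1];
            }
            lists.add(level);
            below = level;
        }
        return lists;
    }

    /** Returns the index of target in the original list, or -1 if it is absent. */
    static int search(List<int[]> lists, int interval, int target) {
        int pos = -1; // index in the current list of the last element <= target
        for (int l = lists.size() - 1; l >= 0; l--) {
            int[] current = lists.get(l);
            // Walk right on this level while the next element does not overshoot.
            while (pos + 1 < current.length && current[pos + 1] <= target) {
                pos++;
            }
            // Drop down: entry pos on this level is entry (pos + 1) * interval - 1 below.
            if (l > 0 && pos >= 0) {
                pos = (pos + 1) * interval - 1;
            }
        }
        int[] base = lists.get(0);
        return (pos >= 0 && base[pos] == target) ? pos : -1;
    }
}
```

For the made-up list 2, 3, 5, 7, 12, 17, 21, 34, 50, 72, 94 with interval 3, build produces the upper lists {5, 17, 50} and {50}. Searching for 72 starts at 50 on the top list, drops straight down to position 8 of the original list, and then needs only one more step to reach 72.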

III. Index File Types

The index file types mainly include:

File Name               Extension                   Description
Segments File           segments.gen, segments_N    Information about a commit
Lock File               write.lock                  Write lock that prevents multiple IndexWriters from modifying the same files at once
Segment Info            .si                         Stores segment information
Compound File           .cfs, .cfe                  Compound file that packs other index files together
Fields                  .fnm                        Stores field information
Field Index             .fdx                        Stores pointers into the field data
Field Data              .fdt                        Stores the stored field values of documents
Term Dictionary         .tim                        Stores the term dictionary and term information
Term Index              .tip                        Index into the term dictionary
Frequencies             .doc                        Stores, for each term, the documents that contain it and how often it occurs in them
Positions               .pos                        Stores term position information
Payloads                .pay
Norms                   .nvd, .nvm                  Stores length and boost factors for documents and fields
Per-Document Values     .dvd, .dvm
Term Vector Index       .tvx
Term Vector Documents   .tvd
Term Vector Fields      .tvf
Deleted Documents       .del                        Stores information about deleted documents

 

IV. Lock File

write.lock ensures that only one IndexWriter writes to the index files at a time. If the write.lock file is not stored in the same directory as the index files, its name is prefixed with a value derived from the absolute path of the index directory: XXXX-write.lock.

 

V. Forward and Inverted Indexes

Lucene index files hold both forward and inverted information. The forward information follows the hierarchy from index to segment, to document, to field, to term; the forward index files mainly store the index's own information and data. The inverted information is the mapping from term to document; it provides the ability to find documents by term.

Forward information:

  • Stores the containment chain from the index down to the term, level by level: index -> segment -> document -> field -> term.
  • In other words, the index records which segments it contains, each segment which documents it contains, each document which fields it contains, and each field which terms it contains.
  • Because the structure is hierarchical, each level stores its own information plus meta information (attributes) about the next level down. Compare a book on the geography of China: it first gives an overview of China's geography and the number of provinces; each province's chapter covers the basics of that province and the number of cities in it; each city's section covers the basics of that city and the number of counties in it; and each county's entry describes that county in detail.

Inverted information:

  • Stores the mapping from the dictionary to the posting lists: term -> document.
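
The relationship between the two directions can be illustrated with a toy example. This is not how Lucene stores anything on disk: TinyInvertedIndex and invert are hypothetical names, documents are reduced to arrays of terms (segments and fields are left out), and document IDs are simply array positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TinyInvertedIndex {
    /**
     * Forward information: docs[docId] lists the terms of that document.
     * Inverted information: term -> ascending list of docIds containing it.
     */
    static Map<String, List<Integer>> invert(String[][] docs) {
        // TreeMap keeps the terms sorted, like a dictionary.
        Map<String, List<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId]) {
                List<Integer> list = postings.get(term);
                if (list == null) {
                    list = new ArrayList<>();
                    postings.put(term, list);
                }
                // Record each document at most once per term.
                if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                    list.add(docId);
                }
            }
        }
        return postings;
    }
}
```

Given the documents {"lucene", "index"}, {"solr", "index", "index"} and {"lucene"}, the result maps "index" to [0, 1], "lucene" to [0, 2] and "solr" to [1]. Because document IDs are added in ascending order, each posting list is a natural candidate for the delta rule described earlier.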
