Lucene Structure Description

Last Update:2018-12-05 Source: Internet

Author: User

Source: www.matrix.com.cn

This document defines the index file formats used by Lucene version 1.3.

Jakarta Lucene is written in Java, but several efforts are underway to rewrite it in other programming languages. If these versions are to remain compatible with Jakarta Lucene, a language-independent definition of the Lucene index file format is required. This document attempts to provide a complete, language-independent specification of the Jakarta Lucene 1.3 index file formats.

As Lucene evolves, this document should be updated as well. Implementations of Lucene in different languages should agree on the file formats and produce new versions of this document as the formats change.

This document also provides compatibility notes that describe how the file formats differ from those of previous versions.

Definitions

The fundamental concepts in Lucene are index, document, field, and term.

· An index contains a sequence of documents.

· A document is a sequence of fields.

· A field is a sequence of terms.

· A term is a string.

The same string in two different fields is considered a different term. A term is therefore represented as a pair of strings: the first names the field, and the second is the text within the field.

Inverted index

To make term-based search more efficient, the index stores statistics about terms. A Lucene index is an inverted index: for a given term, it can list the documents that contain it. This is the inverse of the natural relationship, in which documents list their terms.
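As an illustration only, the following minimal sketch builds an in-memory inverted index in which each term is a (field name, text) pair mapped to the ascending list of document numbers that contain it. The function name and the sample documents are hypothetical, and the sketch says nothing about how Lucene itself lays this data out on disk.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each (field name, term text) pair to the documents containing it.

    documents: list of dicts mapping a field name to a list of term strings.
    Document numbers are simply the list positions, starting at 0.
    """
    postings = defaultdict(list)
    for doc_number, fields in enumerate(documents):
        for field_name, terms in fields.items():
            # set() ensures each document is listed once per term.
            for text in set(terms):
                postings[(field_name, text)].append(doc_number)
    return postings


# Hypothetical sample documents.
docs = [
    {"title": ["lucene"], "body": ["index", "file", "format"]},
    {"title": ["index"], "body": ["lucene", "index"]},
]
index = build_inverted_index(docs)
```

Note that "index" in the title field and "index" in the body field are separate keys, which mirrors the rule that the same string in different fields is a different term.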

Types of fields

In Lucene, a field may be stored, in which case its text is kept in the index literally, in a non-inverted manner. Fields that are inverted are called indexed. A field may be both stored and indexed.

The text of a field may be tokenized into terms to be indexed, or it may be used literally as a single term. Most fields are tokenized, but it is sometimes useful to index certain identifier fields literally.

Segments

A Lucene index may be composed of multiple sub-indexes, called segments. Each segment is a complete, independent index that can be searched separately. An index evolves by:

1. Creating new segments for newly added documents.

2. Merging existing segments.

A search may involve multiple segments and/or multiple indexes, each index potentially composed of several segments.

Document numbers

Internally, Lucene refers to documents by an integer document number. The first document added to an index is numbered 0, and each subsequent document receives a number one greater than the previous one.

Note that a document's number may change, so take care when storing these numbers outside of Lucene. In particular, numbers may change in the following situations:

· The numbers stored in each segment are unique only within that segment, so they must be converted before they can be used in a wider context. The standard technique is to allocate each segment a range of values, based on the range of numbers used in that segment. To convert a document number from a segment to an external value, add the segment's base value. To convert an external value back to a segment-specific number, identify the segment by the range that contains the external value, and subtract that segment's base value. For example, if two segments of five documents each are combined, the first segment has a base value of 0 and the second a base value of 5. Document number 3 in the second segment then has the external value 8.

· When documents are deleted, gaps are created in the numbering. These gaps are eventually removed as the index evolves through merging: deleted documents are dropped when segments are merged, so a freshly merged segment has no gaps in its numbering.
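The base-value conversion described above can be sketched as follows. This is a minimal illustration, not Lucene's code; the function names are hypothetical, and the sample uses the two five-document segments from the example.

```python
from bisect import bisect_right


def segment_bases(segment_sizes):
    """Return the base value (first external number) of each segment."""
    bases, total = [], 0
    for size in segment_sizes:
        bases.append(total)
        total += size
    return bases


def to_external(bases, segment, local_doc):
    """Convert a segment-relative document number to an external value."""
    return bases[segment] + local_doc


def to_local(bases, external_doc):
    """Find the segment whose range holds external_doc, then subtract its base."""
    segment = bisect_right(bases, external_doc) - 1
    return segment, external_doc - bases[segment]


bases = segment_bases([5, 5])      # two segments of five documents each
print(to_external(bases, 1, 3))    # document 3 of the second segment
print(to_local(bases, 8))          # and back again
```

Because the bases are sorted, a binary search is enough to find the owning segment for any external value.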

Overview

The index segment maintains the following information:

· Field names. The set of field names used in the index.

· Stored field values. For each document, a list of attribute-value pairs, where the attributes are field names. These store auxiliary information about the document, such as its title, its URL, or an identifier used to access a database. The set of stored fields is what is returned for each hit when searching. This table is keyed by document number.

· Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. For each term it also contains the number of documents that contain the term, and pointers to the term's frequency and position data.

· Term frequency data. For each term in the dictionary, the numbers of all the documents that contain the term, and the frequency of the term in each of those documents.

· Term position data. For each term in the dictionary, the positions at which the term occurs in each document.

· Normalization factors. For each field in each document, a value is stored that is multiplied into the score for hits on that field.

· Deleted documents. An optional file indicating which documents have been deleted.

The following sections describe the information in detail.

File naming

All files belonging to a segment have the same name and differ only in extension. The extensions correspond to the different file formats described below.

Typically, all segments of an index are stored in a single directory, although this is not required.

Primitive types

Byte

The most primitive type is an eight-bit byte. Files are accessed as sequences of bytes, and all other data types are defined as sequences of bytes, so the file formats are byte-order independent.

Uint32

A 32-bit unsigned integer, written as four bytes, high-order byte first.

Uint32 --> <byte>^4

Uint64

A 64-bit unsigned integer, written as eight bytes, high-order byte first.

Uint64 --> <byte>^8

(Here <byte>^4 denotes four consecutive bytes, and <byte>^8 denotes eight.)
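A minimal sketch of the two fixed-width types, using Python's standard struct module with the ">" (big-endian, high-order byte first) format prefix. The helper names are hypothetical.

```python
import struct


def write_uint32(value):
    """Encode a 32-bit unsigned integer as four bytes, high-order byte first."""
    return struct.pack(">I", value)


def read_uint32(data, offset=0):
    """Decode the Uint32 stored at the given byte offset."""
    return struct.unpack_from(">I", data, offset)[0]


def write_uint64(value):
    """Encode a 64-bit unsigned integer as eight bytes, high-order byte first."""
    return struct.pack(">Q", value)


def read_uint64(data, offset=0):
    """Decode the Uint64 stored at the given byte offset."""
    return struct.unpack_from(">Q", data, offset)[0]
```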

Vint

A variable-length format for non-negative integers. The high-order bit of each byte indicates whether more bytes remain to be read. The low-order seven bits of each byte hold part of the integer value, with the least significant seven bits stored in the first byte. Thus values from 0 to 127 are stored in a single byte, and values from 128 to 16,383 are stored in two bytes.

Example of Vint Encoding

Value     First byte   Second byte   Third byte
0         00000000
1         00000001
2         00000010
...
127       01111111
128       10000000     00000001
129       10000001     00000001
130       10000010     00000001
...
16,383    11111111     01111111
16,384    10000000     10000000      00000001
16,385    10000001     10000000      00000001
...

This encoding provides compression while remaining efficient to decode.
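The Vint encoding can be sketched in a few lines. This is an illustrative implementation written from the description and the table of examples, with hypothetical function names; it is not taken from the Lucene source.

```python
def encode_vint(value):
    """Encode a non-negative integer, least significant seven bits first."""
    if value < 0:
        raise ValueError("Vint values must be non-negative")
    out = bytearray()
    while value > 0x7F:
        # Low seven bits, with the high bit set to signal that more bytes follow.
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)  # final byte has the high bit clear
    return bytes(out)


def decode_vint(data, offset=0):
    """Return (value, offset of the byte after the Vint)."""
    value = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, offset
        shift += 7


print(encode_vint(16384).hex())
```

Returning the next offset from the decoder makes it easy to read several Vints in a row from one buffer.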

Chars

Lucene writes Unicode character sequences using the standard UTF-8 encoding.

String

Lucene writes a string as a Vint giving the string length, followed by the character data.

String --> Vint, Chars
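A sketch of reading and writing the String type follows. The text does not say whether the length counts characters or bytes; this sketch assumes it counts characters (the two coincide for ASCII text), so the reader walks the UTF-8 lead bytes to find where the string ends. All function names are hypothetical.

```python
def encode_vint(value):
    """Encode a non-negative integer as a Vint."""
    out = bytearray()
    while value > 0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def decode_vint(data, offset=0):
    """Return (value, offset of the byte after the Vint)."""
    value = shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, offset
        shift += 7


def write_string(text):
    # Assumption: the length prefix is the number of characters, not bytes.
    return encode_vint(len(text)) + text.encode("utf-8")


def read_string(data, offset=0):
    """Return (text, offset of the byte after the string)."""
    length, offset = decode_vint(data, offset)
    start = offset
    for _ in range(length):
        lead = data[offset]
        # The lead byte of a UTF-8 sequence determines how many bytes it spans.
        if lead < 0x80:
            offset += 1
        elif lead < 0xE0:
            offset += 2
        elif lead < 0xF0:
            offset += 3
        else:
            offset += 4
    return data[start:offset].decode("utf-8"), offset
```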

Per-index files

This section describes the files that exist once per index.

Segments File

The active segments in the index are recorded in the segments file. Each index contains a single file in this format, named "segments", which lists the name and size of each segment in order.

Segments --> segcount, <segname, segsize>^segcount

Segcount, segsize --> uint32

Segname --> string

Segname is the name of the segment and is used as the file name prefix for all of the files that make up the segment's index.

Segsize is the number of documents contained in the segment index.
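The layout of the segments file can be sketched as a small parser. This is an illustration under stated assumptions: segment names are taken to be ASCII, so the string length equals the byte count, and the names "_1" and "_2" in the sample are made up for the example.

```python
import struct


def read_vint(data, offset):
    """Return (value, offset of the byte after the Vint)."""
    value = shift = 0
    while True:
        byte = data[offset]
        offset += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, offset
        shift += 7


def read_string(data, offset):
    # Assumption: ASCII names, so character count equals byte count.
    length, offset = read_vint(data, offset)
    return data[offset:offset + length].decode("ascii"), offset + length


def parse_segments(data):
    """Return a list of (segment name, document count) pairs."""
    (seg_count,) = struct.unpack_from(">I", data, 0)
    offset = 4
    segments = []
    for _ in range(seg_count):
        name, offset = read_string(data, offset)
        (size,) = struct.unpack_from(">I", data, offset)
        offset += 4
        segments.append((name, size))
    return segments


# Hypothetical file contents: two segments of five documents each.
sample = (struct.pack(">I", 2)
          + b"\x02_1" + struct.pack(">I", 5)
          + b"\x02_2" + struct.pack(">I", 5))
print(parse_segments(sample))
```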

Lock Files

Several files are used to indicate that another process is using an index.

· When a file named "commit.lock" is present, a process is either writing the "segments" file and deleting obsolete segment index files, or reading the "segments" file and opening the files of the segments it names. This lock file prevents another process from deleting files after a process has read the "segments" file but before it has managed to open all of the files of the segments named there.

· When a file named "index.lock" is present, a process is adding documents to the index or removing documents from it. This lock file prevents several processes from modifying an index at the same time.

Deletable File

A file named "deletable" contains the names of files that are no longer used by the index but could not be deleted. This happens only on the Win32 platform, where a file cannot be deleted while it is still open.

Deletable --> delablecount, <delablename>^delablecount

Delablecount --> uint32

Delablename --> string

Per-segment files

The remaining files all exist once per segment and are therefore identified by their suffix.

Fields

Field info

Field names are stored in the field info file, which has the suffix .fnm.

Fieldinfos (.fnm) --> fieldscount, <fieldname, fieldbits>^fieldscount

Fieldscount --> Vint

Fieldname --> string

Fieldbits --> byte

Currently only the low-order bit of fieldbits is used. It is 1 for indexed fields and 0 for non-indexed fields.

Fields are numbered by their order in this file: field 0 is the first field in the file, field 1 is the next, and so on. Like document numbers, field numbers are relative to the segment.

Stored fields

Stored fields are represented by two files:

1. The field index (.fdx file).

For each document, this file contains a pointer to its field data:

Fieldindex (.fdx) --> <fieldvaluesposition>^segsize

Fieldvaluesposition --> uint64

Fieldvaluesposition gives the location, within the field data file, of the fields of a particular document. Because the field index contains only fixed-length data, it is easy to access randomly: the position of document N's field data is the uint64 at offset N * 8 in this file.

2. The field data (.fdt file).

This file contains the stored fields of each document, as follows:

Fielddata (.fdt) --> <docfielddata>^segsize

Docfielddata --> fieldcount, <fieldnum, bits, value>^fieldcount

Fieldcount --> Vint

Fieldnum --> Vint

Bits --> byte

Value --> string

Currently only the low-order bit of bits is used. It is 1 for tokenized fields and 0 for non-tokenized fields.
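Because every entry in the .fdx file is a fixed-size uint64, finding a document's stored fields takes a single read at offset N * 8. A minimal sketch, with a hypothetical function name and made-up positions in the sample:

```python
import struct


def field_data_position(fdx_bytes, doc_number):
    """Return the offset in the .fdt file where this document's fields start."""
    (position,) = struct.unpack_from(">Q", fdx_bytes, doc_number * 8)
    return position


# Hypothetical .fdx contents for a segment of three documents.
fdx = struct.pack(">3Q", 0, 17, 42)
print(field_data_position(fdx, 2))
```

A reader would then seek to that position in the .fdt file and parse one docfielddata record.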

Term dictionary

The term dictionary is represented by two files:

1. The term infos (.tis file).

Terminfofile (.tis) --> termcount, terminfos

Termcount --> uint32

Terminfos --> <terminfo>^termcount

Terminfo --> <term, docfreq, freqdelta, proxdelta>

Term --> <prefixlength, suffix, fieldnum>

Suffix --> string

Prefixlength, docfreq, freqdelta, proxdelta --> Vint

Terminfo entries are ordered by term: first lexicographically by the name of the term's field, and within a field lexicographically by the term's text.

Term text prefixes are shared. Prefixlength is the number of initial characters of the previous term's text that must be prepended to the suffix to form this term's text. Thus, if the previous term's text is "bone" and the current term is "boy", prefixlength is 2 and suffix is "y".

Fieldnum determines the term's field, whose name is stored in the .fnm file.

Docfreq is the number of documents that contain the term.

Freqdelta determines the position of this term's termfreqs within the .frq file. Specifically, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero for the first term in the file).

Proxdelta determines the position of this term's termpositions within the .prx file. Specifically, it is the difference between the position of this term's data in that file and the position of the previous term's data (or zero for the first term in the file).

2. The term info index (.tii file).

This file contains every 128th entry from the .tis file, in the same order, along with its location in the .tis file. It is designed to be read entirely into memory and then used to provide random access to the .tis file.

The structure of this file is very similar to that of the .tis file, with one additional value per entry, indexdelta.

Terminfoindex (.tii) --> indextermcount, termindices

Indextermcount --> uint32

Termindices --> <terminfo, indexdelta>^indextermcount

Indexdelta --> Vint

Indexdelta determines the position of this term's terminfo within the .tis file. Specifically, it is the difference between the position of this term's entry in that file and the position of the previous indexed term's entry (or zero for the first entry in the file).
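The prefix sharing and delta-coded pointers used in the .tis file can be sketched as follows. The sketch works on already-decoded tuples rather than raw bytes, and the function name, tuple layout, and pointer values are hypothetical; only the "bone" and "boy" pair comes from the example in the text.

```python
def decode_terms(entries):
    """Expand prefix-compressed, delta-coded term entries.

    entries: iterable of tuples
        (prefix_length, suffix, field_num, doc_freq, freq_delta, prox_delta)
    Returns tuples (field_num, text, doc_freq, freq_pointer, prox_pointer).
    """
    previous_text = ""
    freq_pointer = 0
    prox_pointer = 0
    terms = []
    for prefix_length, suffix, field_num, doc_freq, freq_delta, prox_delta in entries:
        # Reuse the shared prefix from the previous term's text.
        text = previous_text[:prefix_length] + suffix
        # Deltas accumulate into absolute positions in the .frq and .prx files.
        freq_pointer += freq_delta
        prox_pointer += prox_delta
        terms.append((field_num, text, doc_freq, freq_pointer, prox_pointer))
        previous_text = text
    return terms


entries = [
    (0, "bone", 0, 2, 0, 0),
    (2, "y", 0, 1, 3, 6),   # shares "bo" with the previous term
]
for term in decode_terms(entries):
    print(term)
```

Because both the text and the pointers depend on the previous entry, the .tis file can only be decoded sequentially, which is why the .tii file is needed for random access.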

Frequencies

The .frq file contains, for each term, the list of documents that contain the term, along with the frequency of the term in each document.

Freqfile (.frq) --> <termfreqs>^termcount

Termfreqs --> <termfreq>^docfreq

Termfreq --> docdelta, freq?

Docdelta, freq --> Vint

Termfreqs are ordered by term (the term is implicit, taken from the .tis file).

Termfreq entries are ordered by increasing document number.

Docdelta determines both the document number and the frequency. Specifically, docdelta/2 (rounded down) is the difference between this document number and the previous document number (the previous number is taken as zero for the first document in a termfreqs). When docdelta is odd, the frequency is one. When docdelta is even, the frequency is read as the next Vint (freq).

For example, the termfreqs for a term that occurs once in document 7 and three times in document 11 would be the following sequence of Vints:

15, 8, 3

Here 15 = 2 * 7 + 1 (document 7, frequency one), and 8 = 2 * (11 - 7) is followed by the explicit frequency 3.
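The docdelta rule can be checked with a short sketch that works on already-decoded Vint values. The function names are hypothetical. Applying the rule to a term that occurs once in document 7 and three times in document 11 gives the sequence 15, 8, 3, since the second delta is 2 * (11 - 7) = 8.

```python
def encode_term_freqs(postings):
    """postings: list of (document number, frequency), ascending by document."""
    vints = []
    previous = 0
    for doc, freq in postings:
        delta = (doc - previous) << 1
        if freq == 1:
            vints.append(delta | 1)       # odd docdelta: frequency is one
        else:
            vints.extend([delta, freq])   # even docdelta: frequency follows
        previous = doc
    return vints


def decode_term_freqs(vints):
    """Return the list of (document number, frequency) pairs."""
    postings = []
    doc = 0
    values = iter(vints)
    for doc_delta in values:
        doc += doc_delta >> 1
        freq = 1 if doc_delta & 1 else next(values)
        postings.append((doc, freq))
    return postings


print(encode_term_freqs([(7, 1), (11, 3)]))
print(decode_term_freqs([15, 8, 3]))
```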

Positions

The .prx file contains the lists of positions at which each term occurs within documents.

Proxfile (.prx) --> <termpositions>^termcount

Termpositions --> <positions>^docfreq

Positions --> <positiondelta>^freq

Positiondelta --> Vint

Termpositions are ordered by term (the term is implicit, taken from the .tis file).

Positions entries are ordered by increasing document number.

Positiondelta is the difference between the position of the current occurrence in the document and the position of the previous occurrence (the previous position is taken as zero for the first occurrence in a document).

For example, the termpositions for a term that occurs as the fourth term in one document and as the fifth and ninth terms in a subsequent document would be the following sequence of Vints:

4, 5, 4
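Decoding the position deltas requires the per-document frequencies, which come from the .frq data, because the .prx stream itself does not mark where one document ends and the next begins. A minimal sketch over already-decoded Vint values, with a hypothetical function name:

```python
def decode_positions(vints, freqs):
    """Rebuild absolute positions from position deltas.

    vints: the positiondelta values for one term, in file order.
    freqs: the term's frequency in each document, in document order.
    Returns one list of positions per document.
    """
    values = iter(vints)
    result = []
    for freq in freqs:
        position = 0            # deltas restart from zero in each document
        positions = []
        for _ in range(freq):
            position += next(values)
            positions.append(position)
        result.append(positions)
    return result


# One occurrence in the first document, two in the second.
print(decode_positions([4, 5, 4], [1, 2]))
```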

Normalization factors

The .nrm file contains, for each document, a byte that encodes the normalization factor, a value that is multiplied into the score for hits on that field.

Norms (.nrm) --> <byte>^segsize

Each byte encodes a floating point value. Bits 0-2 contain the 3-bit mantissa, and bits 3-7 contain the 5-bit exponent.

These bytes are converted to an IEEE single-precision floating point value as follows:

1. If the byte is 0, the value is a floating point 0;

2. Otherwise, set the sign bit of the float to 0;

3. Add 48 to the exponent from the byte and use the result as the float's exponent;

4. Map the mantissa from the byte to the high-order 3 bits of the float's mantissa;

5. Set the low-order 21 bits of the float's mantissa to 0.
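The steps above leave some room for interpretation about exactly where the bits land. The sketch below takes one reading, stated here as an assumption: the 32-bit pattern is built from 21 zero bits at the bottom, then the 3 mantissa bits, then the exponent plus 48 above them, with the sign bit clear. The function name is hypothetical. Under this reading the byte 124 (exponent 15, mantissa 4) decodes to 1.0.

```python
import struct


def norm_byte_to_float(b):
    """Decode one norm byte into a float, under the bit layout assumed above."""
    if b == 0:
        return 0.0
    mantissa = b & 0x07          # bits 0-2
    exponent = (b >> 3) & 0x1F   # bits 3-7
    # Assumed packing: low 21 bits zero, mantissa in bits 21-23,
    # biased exponent from bit 24 upward, sign bit left clear.
    bits = ((exponent + 48) << 24) | (mantissa << 21)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(norm_byte_to_float(124))
```

With only eight bits the representable values are coarse, which trades precision for a one-byte-per-document storage cost.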

Deleted documents

The .del file is optional and exists only when a segment contains deletions:

Deletions (.del) --> bytecount, bitcount, bits

Bytecount, bitcount --> uint32

Bits --> <byte>^bytecount

Bytecount is the number of bytes in bits. It is typically (segsize/8) + 1.

Bitcount is the number of bits that are currently set in bits.

Bits contains one bit for each document in the segment, in order. When the bit corresponding to a document number is set, that document is marked as deleted. Bits are ordered from least to most significant within each byte. Thus, if bits contains the two bytes 0x00 and 0x02, document 9 is marked as deleted.
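The bit ordering can be checked with a short sketch: document N is found in byte N / 8, at bit N mod 8 counting from the least significant bit. The function names are hypothetical, and the sample bytes are the 0x00, 0x02 pair from the text.

```python
def is_deleted(bits, doc_number):
    """Test the bit for doc_number, least significant bit first in each byte."""
    return bool(bits[doc_number >> 3] & (1 << (doc_number & 7)))


def deleted_documents(bits, seg_size):
    """List the document numbers marked as deleted in a segment."""
    return [n for n in range(seg_size) if is_deleted(bits, n)]


bits = bytes([0x00, 0x02])
print(deleted_documents(bits, 16))
```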

Limitations

In several places these file formats limit the maximum number of terms and documents to a 32-bit quantity, or approximately 4 billion. This is not a problem today, but in the long term it probably will be. These values should therefore be replaced with uint64 values or, better yet, with Vint values, which have no upper limit.

There are only two places where the code requires a value to be of fixed size:

1. The fieldvaluesposition (in the field index file, .fdx). This is already a uint64, so it is not a problem.

2. The termcount (in the term info file, .tis). This value is written last but read first, because it sits at the front of the file. The indexing code first writes a zero here and then overwrites it after the rest of the file has been written. Unless it is stored elsewhere, it must therefore be of fixed size, and it should be changed to a uint64.

Other than these, all uint values could be converted to Vint to remove the limits.
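The reason termcount must be fixed-size is easiest to see in code: the writer reserves the slot, streams the entries, then seeks back and patches the count. The sketch below shows that pattern with a deliberately simplified entry layout (a one-byte length followed by the text, which assumes terms shorter than 128 bytes); the function name and entry format are hypothetical and do not reproduce the real .tis layout.

```python
import io
import struct


def write_term_file(out, terms):
    """Write a count-prefixed list of terms to a seekable binary stream."""
    start = out.tell()
    out.write(struct.pack(">I", 0))          # placeholder for termcount
    count = 0
    for term in terms:                        # terms may be a one-pass iterator
        data = term.encode("utf-8")
        out.write(bytes([len(data)]) + data)  # simplified entry, not real .tis
        count += 1
    end = out.tell()
    out.seek(start)
    out.write(struct.pack(">I", count))      # overwrite the placeholder
    out.seek(end)
    return count


buf = io.BytesIO()
write_term_file(buf, iter(["bone", "boy"]))
print(buf.getvalue())
```

A variable-length count could not be patched in place this way, because the final value might need more bytes than the placeholder reserved.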

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
