"Lucene" Lucene Learning Index file structure

Source: Internet
Author: User

Lucene Index file structure

Basic concepts

    • Index
      • A Lucene index consists of a number of files, all stored in the same directory.
    • Segment
      • A Lucene index consists of multiple segments that are independent of each other. Adding new documents can create a new segment; once a threshold is reached (number of segments, number of documents in a segment, and so on), different segments can be merged.
      • Within the index directory, files with the same prefix belong to the same segment.
      • segments.gen and segments_N (N is a specific number, e.g. segments_5) are the segment metadata files; they hold the attribute information for the segments.
    • Document
      • The document is the basic unit of indexing, and a segment can contain multiple documents.
      • Newly added documents are saved in a newly generated segment of their own; as segments are merged, different documents end up in the same segment.
    • Field
      • A document can consist of multiple fields. A news article, for example, has a title, an author, a body and so on, each of which can be treated as a field of the document.
      • Different fields can specify different indexing options, such as which tokenizer (analyzer) to use, whether to index the field, and whether to store it.
    • Term
      • A term is the smallest unit of the index: a string produced by tokenization and language processing.

Forward information: index -> document -> field -> term
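The forward hierarchy can be pictured as nested containers, with segments as the intermediate container described in the list above. The following is a minimal Java sketch of that idea. The class names (ForwardModel, Doc, Segment, Index) are hypothetical and are not Lucene classes, and a lowercase whitespace split stands in for real analysis.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative model only: these types are hypothetical, not Lucene's API.
class ForwardModel {
    // A document maps field names to the terms produced for that field.
    static class Doc {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        // Assumption: lowercasing plus a whitespace split stands in for an analyzer.
        void addField(String name, String text) {
            List<String> terms = new ArrayList<>();
            for (String t : text.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!t.isEmpty()) {
                    terms.add(t);
                }
            }
            fields.put(name, terms);
        }
    }

    // A segment holds documents.
    static class Segment {
        final List<Doc> docs = new ArrayList<>();
    }

    // An index holds segments.
    static class Index {
        final List<Segment> segments = new ArrayList<>();

        // Forward lookup: index -> segment -> document -> field -> terms.
        List<String> terms(int segment, int doc, String field) {
            return segments.get(segment).docs.get(doc).fields.get(field);
        }
    }

    public static void main(String[] args) {
        Doc news = new Doc();
        news.addField("title", "Lucene index files");
        news.addField("author", "User");
        Segment seg = new Segment();
        seg.docs.add(news);
        Index index = new Index();
        index.segments.add(seg);
        System.out.println(index.terms(0, 0, "title"));
    }
}
```

Reading in this direction answers "which terms does this document contain?".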

Overall structure

A Lucene index is stored in a single folder and consists of multiple files.

segments.gen: helps locate the latest segments_N file.

segments.gen is read as follows: read the version and check that it is correct, then read gen0 and gen1; if the two values are equal, set genB = gen0.

Separately, Lucene lists the segments_N files in the index folder and takes the largest N as genA. It then compares genA and genB, uses the larger of the two as the current generation, and finally opens the corresponding segments_N file.

    IndexInput genInput = directory.openInput(IndexFileNames.SEGMENTS_GEN); // "segments.gen"
    int version = genInput.readInt(); // read the version number
    if (version == FORMAT_LOCKLESS) { // the version number is correct
        long gen0 = genInput.readLong(); // read the first N
        long gen1 = genInput.readLong(); // read the second N
        if (gen0 == gen1) { // the two copies agree, so use the value as genB
            genB = gen0;
        }
    }
    if (genA > genB)
        gen = genA;
    else
        gen = genB;
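The fragment above depends on Lucene internals and does not show how genA is obtained. As a rough, self-contained illustration of the whole procedure, the sketch below uses plain java.io streams in place of Lucene's Directory and IndexInput. The class GenerationResolver and its method names are hypothetical. The value -2 for FORMAT_LOCKLESS and the radix-36 encoding of N in segments_N are assumptions that should be checked against the Lucene version in use.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch only: plain java.io streams stand in for Lucene's Directory and IndexInput.
class GenerationResolver {
    // Assumed marker for the lockless format; verify against your Lucene version.
    static final int FORMAT_LOCKLESS = -2;
    static final String PREFIX = "segments_";

    // genA: the largest N among the segments_N names in a directory listing,
    // or -1 if there is none. Assumes N is written in radix 36.
    static long genFromListing(String[] fileNames) {
        long max = -1;
        for (String name : fileNames) {
            if (name.startsWith(PREFIX)) {
                long gen = Long.parseLong(name.substring(PREFIX.length()), Character.MAX_RADIX);
                max = Math.max(max, gen);
            }
        }
        return max;
    }

    // genB: the generation stored in segments.gen, trusted only when the
    // version matches and both copies agree; otherwise -1.
    static long genFromGenFile(byte[] genFile) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(genFile))) {
            if (in.readInt() != FORMAT_LOCKLESS) {
                return -1;
            }
            long gen0 = in.readLong();
            long gen1 = in.readLong();
            return gen0 == gen1 ? gen0 : -1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The larger of genA and genB names the segments_N file to open.
    static String currentSegmentsFile(String[] fileNames, byte[] genFile) {
        long gen = Math.max(genFromListing(fileNames), genFromGenFile(genFile));
        if (gen < 0) {
            throw new IllegalStateException("no segments_N file found");
        }
        return PREFIX + Long.toString(gen, Character.MAX_RADIX);
    }

    // Builds the bytes of a segments.gen file for the demo.
    static byte[] genFileBytes(long gen0, long gen1) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(FORMAT_LOCKLESS);
            out.writeLong(gen0);
            out.writeLong(gen1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String[] listing = {"segments.gen", "segments_4", "segments_5", "_0.fdt"};
        System.out.println(currentSegmentsFile(listing, genFileBytes(6, 6)));
        System.out.println(currentSegmentsFile(listing, genFileBytes(6, 7)));
    }
}
```

In the second call the two copies in segments.gen disagree, so genB is ignored and the directory listing decides. Storing the generation twice lets a reader detect a partially written segments.gen.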

segments_N: the segment metadata file. It records information such as how many segments the index contains and how many documents each segment holds.

    • Format
      • The version number of the index file format. Because Lucene is under continuous development, different Lucene versions may use different index file formats, so the file format version is recorded.
    • Version
      • The version number of the index.
    • NameCount
      • Used to generate the name of the next new segment.
    • SegCount
      • The number of segments.
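To make the header layout concrete, here is a small Java sketch that writes and reads the four header values in the order listed above. It is an illustration under assumptions, not Lucene's reader: the field widths (32-bit Format, 64-bit Version, 32-bit NameCount and SegCount) and the naming scheme in nextSegmentName (an underscore plus the counter in radix 36) are assumed, and the class SegmentsHeader is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Sketch only: field widths and the naming scheme below are assumptions.
class SegmentsHeader {
    final int format;     // index file format version
    final long version;   // index version
    final int nameCount;  // counter used to name the next new segment
    final int segCount;   // number of segments

    SegmentsHeader(int format, long version, int nameCount, int segCount) {
        this.format = format;
        this.version = version;
        this.nameCount = nameCount;
        this.segCount = segCount;
    }

    // Reads the four header values in the order Format, Version, NameCount, SegCount.
    static SegmentsHeader read(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            int format = in.readInt();
            long version = in.readLong();
            int nameCount = in.readInt();
            int segCount = in.readInt();
            return new SegmentsHeader(format, version, nameCount, segCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the header in the same order, so that read(write()) round-trips.
    byte[] write() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(format);
            out.writeLong(version);
            out.writeInt(nameCount);
            out.writeInt(segCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Assumed naming scheme: an underscore plus the counter in radix 36.
    String nextSegmentName() {
        return "_" + Integer.toString(nameCount, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        SegmentsHeader header = read(new SegmentsHeader(-9, 1234L, 11, 3).write());
        System.out.println(header.segCount + " segments, next name " + header.nextSegmentName());
    }
}
```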

The header is followed by SegCount entries of per-segment metadata:

  • Segment:
    • SegName
      • The segment name.
    • SegSize
      • The number of documents contained in this segment.
      • This count includes documents that have been deleted but not yet removed by optimize. Before optimize, a Lucene segment still contains every document that was indexed; deleted documents are only recorded in the .del file. During a search, documents are first read from the segment and the deleted ones are then filtered out using the .del file. Optimize triggers a merge of segments, and deleted documents are not carried over into the new segment.
    • DelGen
      • The version number of the .del file.
      • Before optimize, deletions are recorded in the .del file.
      • There are several ways to delete a document (through either IndexReader or IndexWriter):
        • IndexReader.deleteDocument(int docID) deletes the document with the given document number.
        • IndexReader.deleteDocuments(Term) deletes the documents containing this term.
        • IndexWriter.deleteDocuments(Term) deletes the documents containing this term (using IndexWriter).
        • IndexWriter.deleteDocuments(Term[] terms) deletes the documents containing these terms.
        • IndexWriter.deleteDocuments(Query query) deletes the documents matching this query.
        • IndexWriter.deleteDocuments(Query[] queries) deletes the documents matching these queries.
        • In early versions of Lucene, deletion was done by IndexReader. Deletion through IndexWriter was added later, but the actual work is still done by an IndexReader: IndexWriter keeps IndexReader instances in a readerPool and takes one out to perform the deletion.
        • DelGen is incremented by 1 each time IndexWriter commits a delete operation to the index files, and a new .del file is generated.
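The interaction between SegSize, the .del file and merging can be illustrated with a small in-memory model. This is a sketch, not Lucene code: the class SegmentDeletes is hypothetical, a BitSet stands in for the .del file, and the starting value of delGen (0 here) is an assumption.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// In-memory model of one segment; hypothetical, not a Lucene class.
class SegmentDeletes {
    final List<String> docs = new ArrayList<>(); // every document ever added to the segment
    final BitSet deleted = new BitSet();         // stands in for the .del file
    long delGen = 0;                             // assumed starting value

    // Adds a document and returns its document number.
    int addDocument(String doc) {
        docs.add(doc);
        return docs.size() - 1;
    }

    // Marks a document as deleted; its data stays in the segment.
    void deleteDocument(int docId) {
        deleted.set(docId);
    }

    // Committing deletions bumps DelGen, i.e. a new .del file would be written.
    void commitDeletes() {
        delGen++;
    }

    // SegSize still counts deleted documents.
    int segSize() {
        return docs.size();
    }

    // Search-time view: read all documents, then filter out the deleted ones.
    List<String> liveDocs() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        return live;
    }

    // A merge copies only live documents into the new segment.
    SegmentDeletes merge() {
        SegmentDeletes merged = new SegmentDeletes();
        merged.docs.addAll(liveDocs());
        return merged;
    }

    public static void main(String[] args) {
        SegmentDeletes seg = new SegmentDeletes();
        seg.addDocument("a");
        int b = seg.addDocument("b");
        seg.addDocument("c");
        seg.deleteDocument(b);
        seg.commitDeletes();
        System.out.println(seg.segSize() + " stored, " + seg.liveDocs().size() + " live");
        System.out.println(seg.merge().segSize() + " after merge");
    }
}
```

Marking deletions in a separate structure keeps the segment's other files immutable; the space is reclaimed only when segments are merged.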

segments_N and segment file formats:

A segment contains multiple fields. The metadata for each field is stored in the .fnm file.

The field data is stored in the .fdt and .fdx files:

The .fdx file holds FieldValuesPosition entries that point into the .fdt file; the actual field data is stored in the .fdt file.

    • Field data file (.fdt):
      • The .fdt file is where stored field information is actually kept.
      • A segment contains SegSize documents in total, so the .fdt file has SegSize entries, each holding the stored fields of one document.
      • Each document entry starts with a FieldCount, the number of stored fields in the document, followed by FieldCount items, each holding the information for one field.
      • Each field item starts with FieldNum, the field number, followed by one byte (8 bits) of flags: the lowest bit indicates whether the field is tokenized, the second-lowest bit indicates whether the field holds string data or binary data, and the third-lowest bit indicates whether the field is compressed. The value of the stored field comes last.

          

    • Field index file (.fdx):
      • The field data file format shows that the number of fields and the size of each stored value differ from document to document, so the SegSize documents in the .fdt file each occupy a different amount of space. How, then, can the start and end address of each document be identified in the .fdt file, and how can the stored fields of the nth document be found quickly? This is what the field index file is for.
      • The field index file also has SegSize entries, one per document. Each entry is a fixed-size long holding the start offset of the corresponding document in the .fdt file.
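The pairing of a variable-length data file with a fixed-width index file can be sketched in a few lines of Java. This is a simplified illustration, not Lucene's on-disk encoding: it assumes fixed-width ints for FieldCount and FieldNum and writeUTF for values, and it assumes that a set second bit means binary data. The class StoredFieldsSketch and its members are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

// Simplified model of the .fdt/.fdx pair; encodings are assumptions, not Lucene's.
class StoredFieldsSketch {
    // Flag bits as described in the text (lowest, second-lowest, third-lowest).
    static final int TOKENIZED = 1;
    static final int BINARY = 2;      // assumed: bit set means binary data
    static final int COMPRESSED = 4;

    final ByteArrayOutputStream fdt = new ByteArrayOutputStream(); // variable-size entries
    final ByteArrayOutputStream fdx = new ByteArrayOutputStream(); // one 8-byte long per doc

    // Appends one document: its start offset goes to .fdx, its fields to .fdt.
    void addDocument(int[] fieldNums, int[] flags, String[] values) {
        try {
            new DataOutputStream(fdx).writeLong(fdt.size());
            DataOutputStream data = new DataOutputStream(fdt);
            data.writeInt(fieldNums.length);      // FieldCount
            for (int i = 0; i < fieldNums.length; i++) {
                data.writeInt(fieldNums[i]);      // FieldNum
                data.writeByte(flags[i]);         // flag byte
                data.writeUTF(values[i]);         // stored value
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int docCount() {
        return fdx.size() / Long.BYTES;
    }

    // Entry n of .fdx sits at byte n * 8, so no scan of .fdt is needed.
    long startOffset(int n) {
        return ByteBuffer.wrap(fdx.toByteArray()).getLong(n * Long.BYTES);
    }

    // A document ends where the next one starts, or at the end of .fdt.
    long endOffset(int n) {
        return n + 1 < docCount() ? startOffset(n + 1) : fdt.size();
    }

    // Jumps straight to document n and reads the value of its first field.
    String firstValue(int n) {
        byte[] bytes = fdt.toByteArray();
        int start = (int) startOffset(n);
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes, start, bytes.length - start))) {
            in.readInt();   // FieldCount
            in.readInt();   // FieldNum
            in.readByte();  // flag byte
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static boolean isTokenized(int flags) {
        return (flags & TOKENIZED) != 0;
    }

    public static void main(String[] args) {
        StoredFieldsSketch s = new StoredFieldsSketch();
        s.addDocument(new int[]{0}, new int[]{TOKENIZED}, new String[]{"hello"});
        s.addDocument(new int[]{0, 1}, new int[]{TOKENIZED, 0}, new String[]{"world", "x"});
        System.out.println("doc 1 spans " + s.startOffset(1) + ".." + s.endOffset(1));
        System.out.println(s.firstValue(1));
    }
}
```

Because every .fdx entry has the same size, locating document n costs one multiplication and one read, regardless of how large the earlier documents are.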

Term vector data (.tvx, .tvd, .tvf)

Term vector information is forward information, from index to document to field to term. With term vectors you can find out which terms a document contains.

    • Term vector index file (.tvx):
      • A segment contains N documents, and this file has N entries, one per document.
      • Each entry holds two pieces of information: the offset of the document in the term vector document file (.tvd), and the offset of the document's first field in the term vector field file (.tvf).
    • Term vector document file (.tvd):
      • Each entry starts with NumFields, the number of fields in the document, followed by an array of NumFields field numbers and then an array of (NumFields - 1) offsets. The offset of each document's first field in the .tvf file is stored in the .tvx file; the offsets of the other (NumFields - 1) fields in the .tvf file are obtained by adding the values of this (NumFields - 1) array to the offset of the first field.
    • Term vector field file (.tvf):
      • This file contains all the fields in the segment without separating them by document. Which fields belong to which document is determined by the offset of the first field in the .tvx file and the (NumFields - 1) offsets in the .tvd file.
      • Each field starts with NumTerms, the number of terms in the field, followed by one byte (8 bits) of flags: the lowest bit indicates whether position information is stored, and the second-lowest bit indicates whether offset information is stored. Then comes an array of NumTerms items, one per term. Each item consists of the term text (TermText), the term frequency (TermFreq, the number of occurrences of the term in the document), the term's position information, and the term's offset information.
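The way the three files cooperate to locate a document's fields can be sketched as follows. The sketch assumes that each of the (NumFields - 1) values in the .tvd file is a delta relative to the previous field, so the offsets accumulate; this is one reading of the description above and should be checked against the Lucene version in use. The class TermVectorOffsets and its method names are hypothetical.

```java
// Sketch of resolving .tvf field offsets; hypothetical helper, not a Lucene class.
class TermVectorOffsets {
    // firstFieldOffset comes from the document's .tvx entry,
    // deltas are the (NumFields - 1) values from its .tvd entry.
    // Assumption: each delta is relative to the previous field's offset.
    static long[] fieldOffsets(long firstFieldOffset, long[] deltas) {
        long[] offsets = new long[deltas.length + 1];
        offsets[0] = firstFieldOffset;
        for (int i = 0; i < deltas.length; i++) {
            offsets[i + 1] = offsets[i] + deltas[i];
        }
        return offsets;
    }

    // Lowest bit of the .tvf flag byte: position information is stored.
    static boolean storesPositions(int flags) {
        return (flags & 1) != 0;
    }

    // Second-lowest bit of the .tvf flag byte: offset information is stored.
    static boolean storesOffsets(int flags) {
        return (flags & 2) != 0;
    }

    public static void main(String[] args) {
        long[] offsets = fieldOffsets(100, new long[]{20, 35});
        System.out.println(java.util.Arrays.toString(offsets));
        System.out.println(storesPositions(1) + " " + storesOffsets(1));
    }
}
```

Storing small deltas instead of absolute offsets keeps the .tvd entries compact, while the single absolute offset in the .tvx entry gives each document a fixed entry point.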

Reference: http://www.cnblogs.com/forfuture1978/archive/2009/12/14/1623599.html