Index file format for Lucene003_lucene

Source: Internet
Author: User

The most important question about a Lucene index is how it is stored internally, that is, what the format of the Lucene index files is.
    • Lucene's indexing process is the indexing step of full-text retrieval: it writes the inverted index (the posting lists) to files in this format.
    • Lucene's search process reads the indexed information back according to this file format and then scores the documents.

When you write an index-and-search demo and run it, the index files are generated in the index directory you specified.

Format of the index:

Lucene's index structure is hierarchical and consists of the following levels:

  • Index:
    • In Lucene, an index lives in a single folder, which the developer chooses.
    • All the files in that folder together form one Lucene index.
  • Segment:
    • An index can contain multiple segments. Segments are independent of one another; adding new documents can create a new segment, and different segments can be merged.
    • Files with the same prefix belong to the same segment. The example index has two segments, "_0" and "_1".
    • segments.gen and segments_5 are the segment metadata files; they hold the attribute information for the segments.
  • Document:
    • A document is the basic unit of indexing. Documents can be stored in different segments, and one segment can contain multiple documents.
    • Newly added documents are saved in a newly generated segment of their own; as segments are merged, documents from different segments end up in the same segment.
  • Field:
    • A document contains different types of information that can be indexed separately, such as title, time, body, and author; each can be stored in its own field.
    • Different fields can be indexed in different ways; this is explained in detail in the analysis of how fields are stored.
  • Term:
    • A term is the smallest unit of the index: a string produced by lexical analysis and language processing.
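
The prefix rule for segments can be illustrated with a minimal sketch. It assumes per-segment files are named "<prefix>.<extension>" and that any file whose name starts with "segments" is index-level metadata; the directory listing and the function name `group_by_segment` are hypothetical, and this is not how Lucene itself reads a directory.

```python
from collections import defaultdict

def group_by_segment(filenames):
    """Group index file names by segment prefix.

    Assumption for this sketch: per-segment files look like
    "<prefix>.<ext>" (for example "_0.fnm"), and anything starting
    with "segments" is index-level metadata, not part of one segment.
    """
    segments = defaultdict(list)
    index_level = []
    for name in sorted(filenames):
        if name.startswith("segments"):
            index_level.append(name)
        else:
            # Everything before the first "." is the segment prefix.
            prefix, _, _ext = name.partition(".")
            segments[prefix].append(name)
    return dict(segments), index_level

# Hypothetical directory listing with two segments, "_0" and "_1".
listing = ["_0.fnm", "_0.fdt", "_0.fdx", "_1.fnm", "_1.tis",
           "segments.gen", "segments_5"]
per_segment, index_level = group_by_segment(listing)
print(per_segment)
print(index_level)
```

Grouping by the text before the first dot is enough here because the segment prefix itself contains no dot.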

Lucene's index structure stores both forward information and inverted information.

The so-called forward information:

  • Stored by hierarchy, from the index down to the term: index –> segment –> document –> field –> term
  • That is, the index records which segments it contains, each segment which documents it contains, each document which fields it contains, and each field which terms it contains.
  • Because this is a hierarchy, each level stores its own information plus the meta-information (attribute information) of the level below. Compare a book on Chinese geography: it first introduces the geography of China as a whole and how many provinces it has; each province chapter gives the profile of the province and how many cities it contains; each city section gives an overview of the city and how many counties it contains; and each county entry describes that county in detail.
  • For example, the files that contain forward information are:
    • segments_N records how many segments this index contains and how many documents each segment contains.
    • XXX.fnm records how many fields this segment contains, the name of each field, and how each field is indexed.
    • XXX.fdx and XXX.fdt store all the documents contained in this segment, how many fields each document contains, and what information is stored for each field.
    • XXX.tvx, XXX.tvd, and XXX.tvf store how many documents this segment contains, how many fields each document contains, how many terms each field contains, and the string, positions, and related information of each term.
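
The idea that each level records the counts of the level below can be sketched with a toy nested structure. This is only an in-memory illustration of the hierarchy index –> segment –> document –> field –> term; the data, the function name `summarize`, and the layout are all made up and say nothing about the actual byte format of the files.

```python
# Toy forward structure: segment name -> list of documents,
# document -> {field name: list of terms}. All contents are invented.
index = {
    "_0": [  # segment "_0" holds two documents
        {"title": ["lucene", "index"], "body": ["index", "file", "format"]},
        {"title": ["segments"], "body": ["segments", "can", "be", "merged"]},
    ],
    "_1": [  # segment "_1" holds one document
        {"title": ["search"], "body": ["read", "the", "index"]},
    ],
}

def summarize(index):
    """Walk the hierarchy top-down and collect the count kept at each level."""
    summary = {
        "segments": len(index),      # index level: how many segments
        "docs_per_segment": {},      # segment level: how many documents
        "terms_per_field": [],       # field level: how many terms
    }
    for seg_name, docs in index.items():
        summary["docs_per_segment"][seg_name] = len(docs)
        for doc_id, doc in enumerate(docs):
            for field_name, terms in doc.items():
                summary["terms_per_field"].append(
                    (seg_name, doc_id, field_name, len(terms)))
    return summary

summary = summarize(index)
print(summary["segments"])
print(summary["docs_per_segment"])
print(summary["terms_per_field"][0])
```

Reading top-down like this answers "what does document N contain?", which is the question forward information is organized around.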

The so-called inverted information:

  • Stores the mapping from the dictionary to the posting lists: term –> document
  • For example, the files that contain inverted information are:
    • XXX.tis and XXX.tii store the term dictionary, that is, all the terms contained in this segment, sorted in dictionary order.
    • XXX.frq stores the posting lists, that is, the list of IDs of the documents that contain each term.
    • XXX.prx stores, for each term in the posting lists, its positions within the documents that contain it.
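
To make the term –> document direction concrete, here is a minimal in-memory sketch of an inverted index. The three returned structures only loosely mirror the roles of the dictionary, posting-list, and position files described above; the function name `build_inverted` and the sample documents are hypothetical, and storing a per-document frequency next to each document ID is an assumption of this sketch, not something the text above states.

```python
from collections import defaultdict

def build_inverted(docs):
    """Build a toy inverted index from {doc_id: [term, ...]}.

    Returns three structures:
      dictionary -> sorted list of all terms
      postings   -> term -> [(doc_id, frequency), ...]
      positions  -> term -> {doc_id: [position, ...]}
    """
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):                 # ascending document IDs
        for pos, term in enumerate(docs[doc_id]):
            positions[term][doc_id].append(pos)
    dictionary = sorted(positions)              # dictionary order
    postings = {t: [(d, len(p)) for d, p in positions[t].items()]
                for t in dictionary}
    plain_positions = {t: dict(positions[t]) for t in dictionary}
    return dictionary, postings, plain_positions

# Two invented documents, already split into terms.
docs = {0: ["lucene", "index", "file"], 1: ["index", "segment", "index"]}
dictionary, postings, positions = build_inverted(docs)
print(dictionary)           # ['file', 'index', 'lucene', 'segment']
print(postings["index"])    # [(0, 1), (1, 2)]
print(positions["index"])   # {0: [1], 1: [0, 2]}
```

A lookup starts from the sorted dictionary, follows the term to its posting list to find the matching documents, and consults the positions only when it needs to know where in a document the term occurs.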

Index –> segment (segments.gen, segments_N) –> field (.fnm, .fdx, .fdt) –> term (.tvx, .tvd, .tvf).

segments.gen and segments_N store the metadata of the segments, and there is one of each per index. The actual data of each segment is stored in the field and term files.
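
As a reading aid, the file roles listed in this article can be collected into a small lookup. The role strings simply restate the descriptions given above; the table name `ROLE_BY_EXTENSION` and the helper `describe` are hypothetical, and the table is not a complete list of Lucene file types.

```python
# Roles as described in this article, keyed by file extension.
ROLE_BY_EXTENSION = {
    "fnm": "forward: field names and how each field is indexed",
    "fdx": "forward: documents and their fields",
    "fdt": "forward: documents and their fields",
    "tvx": "forward: per-document, per-field term information",
    "tvd": "forward: per-document, per-field term information",
    "tvf": "forward: per-document, per-field term information",
    "tis": "inverted: term dictionary",
    "tii": "inverted: term dictionary",
    "frq": "inverted: document IDs for each term",
    "prx": "inverted: positions of each term",
}

def describe(filename):
    """Return the role of an index file according to the table above."""
    name = filename.lower()
    if name.startswith("segments"):
        return "index-level: segment metadata"
    # Text after the last "." is treated as the extension.
    _, _, ext = name.rpartition(".")
    return ROLE_BY_EXTENSION.get(ext, "unknown")

print(describe("segments_5"))
print(describe("_0.fnm"))
print(describe("_1.prx"))
```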
