Lucene index file format (notes to be organized)

Source: Internet
Author: User

This is the index format generated by Lucene 3.0.

[Figure A]

[Figure B]

[Figure C: the index after the segments shown in the two figures above were merged]


Index created by Lucene 4.9:

 

Index:
In Lucene, an index lives in a single folder.
All the files in that folder together make up one Lucene index.
Segment:
An index can contain multiple segments, which are independent of each other. Adding new documents can create new segments, and different segments can be merged.
Files that share the same prefix belong to the same segment. The index shown here has two segments, "_0" and "_1".
segments.gen and segments_5 are the metadata files for the segments, that is, they store attribute information about the segments.
Document:
The document is the basic unit of indexing. Different documents can be stored in different segments, and one segment can contain multiple documents.
Newly added documents are stored in a new segment of their own. As segments are merged, documents from different segments end up in the same segment.
Field:
A document contains different kinds of information that can be indexed separately, such as the title, time, body, and author. Each kind is stored in its own field.
Different fields can be indexed in different ways. This is explained in detail where the field storage is actually parsed.
Term:
The term is the smallest unit of the index: a string produced by lexical analysis and language processing.
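
The prefix rule described under Segment can be illustrated with a small sketch. It is not Lucene code: the function name `group_by_segment` and the sample file list are made up for illustration, and it assumes simple file names of the form `_N.ext` plus the index-level `segments*` files.

```python
from collections import defaultdict


def group_by_segment(file_names):
    """Group index file names by segment prefix (hypothetical helper).

    Assumption: a segment file looks like "_0.fdt", so the segment name
    is everything before the first dot. Files starting with "segments"
    are index-level metadata and are returned separately.
    """
    segments = defaultdict(list)
    index_level = []
    for name in file_names:
        if name.startswith("segments"):
            index_level.append(name)  # segments.gen, segments_N
        elif name.startswith("_"):
            prefix = name.split(".", 1)[0]
            segments[prefix].append(name)
    return dict(segments), index_level


# Illustrative file listing, not taken from a real index directory.
files = ["_0.fdt", "_0.fdx", "_0.tis", "_1.fdt", "_1.fdx",
         "segments.gen", "segments_5"]
segments, index_level = group_by_segment(files)
print(segments)
print(index_level)
```

With this sample listing, the files fall into the two segments "_0" and "_1", and the two `segments*` files are reported as index-level metadata.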

 

File format abbreviations

.fdt field data

.fdx field index

.fnm field names

.frq frequencies

.nrm norms

.prx prox file (positions)

.tii term info index

.tis term infos

segments.gen

segments_N
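
As a quick reference, the list above can be turned into a lookup table. This is only a sketch: the `EXTENSIONS` dictionary and the `describe` function are hypothetical names, and the descriptions are copied from the list rather than from any Lucene API.

```python
# Extension -> short description, taken from the list above.
EXTENSIONS = {
    ".fdt": "field data",
    ".fdx": "field index",
    ".fnm": "field names",
    ".frq": "frequencies",
    ".nrm": "norms",
    ".prx": "prox file (positions)",
    ".tii": "term info index",
    ".tis": "term infos",
}


def describe(file_name):
    """Return a short description of an index file (hypothetical helper)."""
    if file_name.startswith("segments"):
        return "segment metadata"  # segments.gen, segments_N
    dot = file_name.rfind(".")
    ext = file_name[dot:].lower() if dot != -1 else ""
    return EXTENSIONS.get(ext, "unknown")


for name in ["_0.fdt", "_0.tii", "segments_5", "notes.txt"]:
    print(name, "->", describe(name))
```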

 

Forward information:

  • The chain from the index down to the terms it contains is stored hierarchically: index -> segment -> document -> field -> term.
  • That is, the index records which segments it contains, each segment which documents it contains, each document which fields it contains, and each field which terms it contains.
  • Because the structure is hierarchical, each level stores its own information plus meta information (attribute information) about the next level down. Compare a book on Chinese geography: it first describes the geography of China as a whole and how many provinces there are; each province chapter describes that province and how many cities it has; each city section describes that city and how many counties it has; and each county entry gives the details of that county.
  • The following files contain forward information:
    • segments_N stores how many segments the index contains and how many documents each segment contains.
    • .fnm stores how many fields this segment contains, the name of each field, and how each field is indexed.
    • .fdx and .fdt store all the documents in this segment, the number of fields in each document, and the information stored in each field.
    • .tvx, .tvd, and .tvf store how many documents this segment contains, how many fields each document contains, how many terms each field contains, and the string, positions, and other information for each term.
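
The "own information plus a count of the next level" idea can be sketched as a nested in-memory structure. This is a toy model, not the on-disk layout: `build_forward` and its dictionary keys are invented for illustration, and splitting on whitespace stands in for real lexical analysis.

```python
def build_forward(segments):
    """Build a toy forward structure: index -> segment -> document -> field -> term.

    `segments` maps a segment name to a list of documents, and each
    document maps a field name to its text (hypothetical input shape).
    Every level stores a count describing the level below it.
    """
    index = {"segment_count": len(segments), "segments": {}}
    for seg_name, docs in segments.items():
        seg = {"doc_count": len(docs), "docs": []}
        for doc in docs:
            entry = {"field_count": len(doc), "fields": {}}
            for field_name, text in doc.items():
                terms = text.lower().split()  # stand-in for an analyzer
                entry["fields"][field_name] = {
                    "term_count": len(terms),
                    "terms": terms,
                }
            seg["docs"].append(entry)
        index["segments"][seg_name] = seg
    return index


# Made-up sample data with two segments.
forward = build_forward({
    "_0": [{"title": "Lucene index", "body": "an index has segments"}],
    "_1": [{"title": "Segments", "body": "segments can be merged"},
           {"title": "Terms", "body": "terms are strings"}],
})
print(forward["segment_count"])
print(forward["segments"]["_1"]["doc_count"])
```

Reading the structure top-down mirrors the file list above: the index knows its segment count, each segment its document count, each document its field count, and each field its terms.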

Reverse (inverted) information:

  • Stores the mapping from the dictionary to the postings lists: term -> document.
  • The following files contain reverse information:
    • XXX.tis and XXX.tii store the term dictionary, that is, all terms contained in this segment, sorted in lexicographic order.
    • XXX.frq stores the postings lists, that is, for each term, the list of IDs of the documents that contain it.
    • XXX.prx stores, for each term in the postings lists, its positions within each document that contains it.
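
A minimal in-memory sketch can show how the three kinds of reverse information relate. It is not Lucene's file encoding: `build_inverted` is a hypothetical function, documents are plain strings, document IDs are list positions, and whitespace splitting stands in for analysis.

```python
from collections import defaultdict


def build_inverted(docs):
    """Build a toy inverted index from a list of document strings.

    Returns three pieces, loosely corresponding to the files above:
      dictionary: sorted list of terms            (like .tis/.tii)
      postings:   term -> [(doc_id, frequency)]   (like .frq)
      prox:       term -> {doc_id: [positions]}   (like .prx)
    """
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            positions[term][doc_id].append(pos)

    dictionary = sorted(positions)
    postings = {t: [(d, len(p)) for d, p in sorted(positions[t].items())]
                for t in dictionary}
    prox = {t: {d: list(p) for d, p in sorted(positions[t].items())}
            for t in dictionary}
    return dictionary, postings, prox


# Made-up sample documents.
docs = ["lucene stores an index",
        "an index has segments",
        "lucene merges segments"]
dictionary, postings, prox = build_inverted(docs)
print(dictionary)
print(postings["lucene"])
print(prox["index"])
```

Looking up a term in the sorted dictionary leads to its postings list, and the position data is only needed for queries that care where in the document the term occurs, such as phrase queries.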

 
