Lucene Learning Four: The Lucene Index File Format (3)


This article is reproduced from http://www.cnblogs.com/forfuture1978/archive/2010/02/02/1661436.html, with slight edits and annotations.

IV. Specific Format

4.2. Reverse Information

The reverse information is the core of the index files: it is the inverted index itself.

The inverted index consists of two parts: on the left is the term dictionary (Term Dictionary), and on the right are the posting lists (Posting List).

In Lucene these two parts are stored in separate files. The dictionary is stored in .tii and .tis, while each posting list itself consists of two parts: one holding document numbers and term frequencies, saved in .frq, and one holding the positions of the term, saved in .prx.

    • Term Dictionary (.tii, .tis)
      • -> frequencies (.frq)
      • -> positions (.prx)
4.2.1. Dictionary (tis) and dictionary index (tii) information

In the dictionary, all the words are sorted in dictionary order.

  • Dictionary file (tis)
    • TermCount: the total number of terms contained in the dictionary.
    • IndexInterval: to speed up term lookup, a structure similar to a skip list is applied. Assuming IndexInterval is 4, the 4th, 8th, 12th... terms are also saved in the dictionary index (tii) file, so that finding a term in the dictionary file can be accelerated.
    • SkipInterval: the posting lists, both the document-number/frequency data and the position data, are stored as skip lists; SkipInterval is the step size of the skips.
    • MaxSkipLevels: a skip list can have multiple levels; this value is the maximum number of levels in a skip list.
    • An array of TermCount entries, each representing a term. For each term, the prefix-suffix rule stores the text of the term (PrefixLength + Suffix), the number of the field the term belongs to (FieldNum), how many documents contain the term (DocFreq), the offsets of this term's posting list within .frq and .prx (FreqDelta, ProxDelta), and the offset of this term's skip list within .frq (SkipDelta). All the deltas here apply the difference rule. (A small sketch of the prefix-suffix rule follows this list.)
  • Dictionary index file (tii)
    • The dictionary index file exists to speed up locating terms in the dictionary file; it saves every IndexInterval-th term.
    • The dictionary index file is loaded into memory.
    • IndexTermCount = TermCount / IndexInterval: the number of terms contained in the dictionary index file.
    • IndexInterval: the same as IndexInterval in the dictionary file.
    • SkipInterval: the same as SkipInterval in the dictionary file.
    • MaxSkipLevels: the same as MaxSkipLevels in the dictionary file.
    • An array of IndexTermCount entries, each representing a term and consisting of two parts: the term itself (TermInfo) and its offset in the dictionary file (IndexDelta). Assuming IndexInterval is 4, this array holds the 4th, 8th, 12th... terms.
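To make the prefix-suffix rule concrete, here is a minimal, self-contained sketch; the class and the sample terms are illustrative, not Lucene's actual writer code:

// Sketch: terms are sorted, so each term shares a prefix with its predecessor;
// only (PrefixLength, Suffix) needs to be stored instead of the whole term.
public class PrefixSuffixSketch {
    public static void main(String[] args) {
        String prev = "term", cur = "termagant";
        int prefixLength = 0;
        while (prefixLength < Math.min(prev.length(), cur.length())
                && prev.charAt(prefixLength) == cur.charAt(prefixLength)) {
            prefixLength++;
        }
        // Store PrefixLength plus the remaining suffix.
        System.out.println(prefixLength + ", \"" + cur.substring(prefixLength) + "\"");
        // prints: 4, "agant"
    }
}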
  • The code that reads the dictionary and dictionary index files is as follows:

origEnum = new SegmentTermEnum(directory.openInput(segment + "." + IndexFileNames.TERMS_EXTENSION, readBufferSize), fieldInfos, false); // for reading the .tis file
    // inside the SegmentTermEnum constructor:
    int firstInt = input.readInt();
    size = input.readLong();
    indexInterval = input.readInt();
    skipInterval = input.readInt();
    maxSkipLevels = input.readInt();

SegmentTermEnum indexEnum = new SegmentTermEnum(directory.openInput(segment + "." + IndexFileNames.TERMS_INDEX_EXTENSION, readBufferSize), fieldInfos, true); // for reading the .tii file

indexTerms = new Term[indexSize];
indexInfos = new TermInfo[indexSize];
indexPointers = new long[indexSize];
for (int i = 0; indexEnum.next(); i++) {
    indexTerms[i] = indexEnum.term();
    indexInfos[i] = indexEnum.termInfo();
    indexPointers[i] = indexEnum.indexPointer;
}
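To see why loading the tii into memory speeds lookups up, here is a minimal sketch of the two-level search, with illustrative names and data (not Lucene's actual TermInfosReader code): binary-search the in-memory index for the greatest indexed term not greater than the target, seek to its offset in .tis, then scan at most IndexInterval entries.

import java.util.Arrays;

// Sketch: two-level term lookup using the in-memory index (stands in for .tii).
public class TermIndexSketch {
    // Every IndexInterval-th term, loaded into memory.
    static final String[] indexTerms = {"apple", "grape", "mango", "peach"};
    // Offsets of those terms in the on-disk dictionary (stands in for .tis).
    static final long[] indexPointers = {0L, 400L, 800L, 1200L};

    // Returns the .tis offset from which at most IndexInterval entries are scanned.
    static long seekOffset(String target) {
        int pos = Arrays.binarySearch(indexTerms, target);
        if (pos < 0) pos = -pos - 2;   // greatest indexed term <= target
        return pos < 0 ? 0L : indexPointers[pos];
    }

    public static void main(String[] args) {
        System.out.println(seekOffset("lucene")); // 400: scan forward from "grape"
    }
}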
4.2.2. Document number and Word frequency (FRQ) information

Document numbers and term frequencies are stored in this file as posting lists, in the form of skip lists.

    • This file contains TermCount items, one item per term, because each term has its own posting list.
    • The posting list of each term includes two parts: one is the posting list itself, i.e., the array of document numbers and term frequencies; the other is the skip list, used to access and locate document numbers and frequencies within the posting list faster.
    • The storage of document numbers and frequencies applies the difference rule and the probable following (A, B?) rule. Lucene's own documentation describes this in the following words, which are rather hard to understand, so an explanation is given here:

For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven, with omitTf false, would be the following sequence of VInts:

15, 8, 3

If omitTf were true it would be this sequence of VInts instead:

7, 4

First let us look at omitTf = false, i.e., the case where the number of occurrences of a term in a document is stored in the index.

In the example, a term that appears once in document 7 and three times in document 11 is represented by the following sequence: 15, 8, 3.

So where do these three numbers come from?

First, by definition TermFreq --> DocDelta[, Freq?]: a TermFreq structure consists of a DocDelta optionally followed by Freq, which is the A, B? structure we described earlier.

DocDelta naturally stores the ID of a document containing this term, and Freq is the number of occurrences within that document.

So, following the example, the complete information to store would be [DocID = 7, Freq = 1] [DocID = 11, Freq = 3] (see the basic principles of full-text search).

However, to save space, Lucene expresses such numbers as differences, per rule 2, the delta rule; so the document ID is not stored as complete information, but as follows:

[DocIDDelta = 7, Freq = 1] [DocIDDelta = 4 (11 - 7), Freq = 3]

Moreover, Lucene stores the result of A, B? in a special way, per rule 3, the A, B? rule: if the Freq following DocDelta is 1, this is expressed by setting the last bit of DocDelta to 1.

If the Freq following DocDelta is greater than 1, the last bit of DocDelta is set to 0 and the real value follows. Thus, for the first document, since Freq is 1, it is folded into the last bit of DocDelta: DocIDDelta = 7 in binary is 000 0111; shifting left one bit and setting the last bit to 1 gives 000 1111 = 15. For the second document, since Freq is greater than 1, the last bit of DocDelta is set to 0: DocIDDelta = 4 in binary is 0000 0100; shifting left one bit and leaving the last bit 0 gives 0000 1000 = 8, followed by the real Freq = 3.

This yields the sequence [DocDelta = 15] [DocDelta = 8, Freq = 3], that is, the sequence 15, 8, 3.

If omitTf = true, i.e., the number of occurrences of a term in a document is not stored in the index, then only the DocIDs are saved, so the A, B? rule does not apply.

Starting from [DocID = 7] [DocID = 11] and applying rule 2, the delta rule, we get the sequence [DocDelta = 7] [DocDelta = 4 (11 - 7)], that is, the sequence 7, 4.
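Both cases can be reproduced with a short sketch; the class and method names are illustrative, not Lucene's actual FormatPostingsDocsWriter, but the encoding rules are exactly the ones described above:

import java.util.ArrayList;
import java.util.List;

// Sketch: encode (docID, freq) postings as the ints that would later be written as VInts.
public class FrqEncodingSketch {
    static List<Integer> encode(int[][] postings, boolean omitTf) {
        List<Integer> out = new ArrayList<>();
        int lastDocID = 0;
        for (int[] p : postings) {
            int docDelta = p[0] - lastDocID;   // rule 2: the delta rule
            lastDocID = p[0];
            if (omitTf) {
                out.add(docDelta);             // only document IDs are stored
            } else if (p[1] == 1) {
                out.add(docDelta << 1 | 1);    // rule 3: last bit 1 means Freq == 1
            } else {
                out.add(docDelta << 1);        // last bit 0: the real Freq follows
                out.add(p[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] postings = {{7, 1}, {11, 3}};
        System.out.println(encode(postings, false)); // [15, 8, 3]
        System.out.println(encode(postings, true));  // [7, 4]
    }
}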

    • The following points about the storage of the skip lists need explanation:
      • A skip list is divided into several levels according to the length of the posting list itself (DocFreq) and the skip amplitude (SkipInterval). The number of levels is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq) / log(SkipInterval))); for the example below (SkipInterval = 4, DocFreq = 35) this gives floor(log(35) / log(4)) = 2 levels.
      • The number of nodes at level Level is DocFreq / (SkipInterval^(Level + 1)), with levels counted from zero.
      • Except for the lowest level, each level has a SkipLevelLength giving the length in bytes of that level (not its node count), which makes it easy to read a whole level of the skip list into a cache.
      • Higher levels come first and lower levels after; once all the higher levels have been read, what remains is the lowest level, so the last level needs no SkipLevelLength. This is why the Lucene documentation describes the format as NumSkipLevels-1 (SkipLevelLength, SkipLevel) entries followed by a final SkipLevel: the first NumSkipLevels-1 levels have a SkipLevelLength, while the last level has only its SkipLevel data and no SkipLevelLength.
      • Except for the lowest level, each node has a SkipChildLevelPointer pointing to the corresponding node in the next level down.
      • Each skip node contains the following information: the document number, the length of the payload, the offset within .frq of the corresponding posting-list entry for that document number (FreqSkip), and the offset within .prx of the corresponding position data (ProxSkip).
      • Although Lucene's documentation contains the following description, experiments show it is not entirely accurate:

Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries, containing the 3rd, 7th, 11th, 15th, 19th, 23rd, 27th, and 31st document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the 15th and 31st document numbers in TermFreqs.

As described, when SkipInterval is 4 and there are 35 documents, skip level 0 should include the 3rd, 7th, 11th, 15th, 19th, 23rd, 27th, and 31st documents, and skip level 1 should include the 15th and 31st documents.

In the real implementation, however, the skip-list nodes are offset one document forward; the reason lies in the following code:

FormatPostingsDocsWriter.addDoc(int docID, int termDocFreq)
    final int delta = docID - lastDocID;
    if ((++df % skipInterval) == 0) {
        skipListWriter.setSkipData(lastDocID, storePayloads, posWriter.lastPayloadLength);
        skipListWriter.bufferSkip(df);
    }

From this code we can see that when SkipInterval is 4: when docID = 0, ++df is 1 and 1 % 4 != 0, so no skip node is buffered; when docID = 3, ++df = 4 and 4 % 4 == 0, so a skip node is buffered, but the SkipData it holds is lastDocID, which is 2.

So the information actually saved by the posting list and the skip list differs from the documentation's description (the original article illustrates this with a figure, omitted here).
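To make the forward offset concrete, here is a minimal simulation of the buffering logic above, assuming the documentation example's SkipInterval = 4 and DocFreq = 35 (illustrative code, not Lucene's):

// Sketch: which document numbers actually end up in the skip nodes.
public class SkipPointSketch {
    public static void main(String[] args) {
        final int skipInterval = 4;
        int df = 0, lastDocID = 0;
        for (int docID = 0; docID < 35; docID++) {  // DocFreq = 35, docIDs 0..34
            if ((++df % skipInterval) == 0) {
                System.out.print(lastDocID + " ");  // what setSkipData records
            }
            lastDocID = docID;
        }
        // prints: 2 6 10 14 18 22 26 30
        // (one earlier than the 3rd, 7th, ... documents the documentation describes)
    }
}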

4.2.3. Word position (PRX) Information

The word position information also forms posting lists, again stored in the form of skip lists.

    • This file contains TermCount items, one per term, since each term has its own position posting list.
    • For each term there is an array of size DocFreq, each element representing a document and recording where the term appears in that document. This document array is also tied to the skip list in the .frq file: from the previous section we know that the .frq skip nodes contain a ProxSkip; when SkipInterval is 3, the .frq skip nodes point to the 1st, 4th, 7th, 10th, 13th, and 16th documents of this array in the .prx file.
    • Each document may contain the term several times, so for each document there is an array of size Freq, each element representing one occurrence of the term in this document and holding one piece of position information.
    • Each piece of position information contains: PositionDelta (applying the difference rule), and optionally the payload, applying the probable following (A, B?) rule. (A small sketch of the position deltas follows this list.)
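As a small illustration of the difference rule applied to positions (payloads ignored; illustrative code, not Lucene's writer):

// Sketch: positions within one document are stored as deltas (PositionDelta).
public class PrxDeltaSketch {
    public static void main(String[] args) {
        int[] positions = {4, 10, 28};            // Freq == 3 occurrences in one document
        int last = 0;
        for (int pos : positions) {
            System.out.print((pos - last) + " "); // prints: 4 6 18
            last = pos;
        }
    }
}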
4.3. Other information

4.3.1. Normalization factor file (NRM)

Why is there a normalization factor? From the description in the first chapter we know that during search, the documents found are ranked by their relevance to the query statement: documents with a high relevance score (score) come first. The relevance score uses the vector space model, which, before computing relevance, computes the term weight, i.e., the importance of a term relative to a document. Two main factors enter the term weight: one is the number of times the term appears in the document, the other is how common the term is in general. Clearly, the more times a term appears in a document, the more important the term is to that document.

This way of computing the term weight is the most common one; however, it has several problems:

    • Different documents have different importance. Some documents are important and some are not: for example, when indexing books about software, I may want computer books to rank first in searches and literature books to come later.
    • Different fields have different importance. Some fields are important, such as keywords or the title, and some are not, such as an attachment. The same term appearing in the keywords should score higher than one appearing in the attachment.
    • Judging a term's importance to a document by the absolute number of occurrences is unreasonable. For example, long documents contain more occurrences of a word, so short documents are put at a disadvantage. Say a word appears 10 times in a thick tome but 9 times in an article of fewer than 100 words: should the tome rank first? It should not; clearly, a word that manages to appear 9 times in an article of fewer than 100 words shows its importance to that article.

For these reasons, Lucene multiplies a normalization factor (normalization factor) into the term-weight computation to reduce the impact of the above three problems.

The normalization factor (normalization factor) affects the subsequent scoring (score). Part of Lucene's score computation happens during the indexing process, generally the parameters unrelated to the query statement, such as the normalization factor; the larger part happens during the search process, as detailed in the code analysis of the search process.

The overall computation of the normalization factor during indexing (presented as a formula figure in the original article) is the product of three parameters:

    • Document boost: the larger this value, the more important the document.
    • Field boost: the larger this value, the more important the field.
    • lengthNorm(field) = 1.0 / Math.sqrt(numTerms): the more terms a field contains, i.e., the longer the document, the smaller this value; the shorter the document, the larger the value.
    • (How to set these boosts in Lucene, i.e., how to intervene in the normalization factor, is treated later; a small sketch follows this list.)
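As a sketch of how the first two parameters can be set, assuming the legacy (pre-4.0) Lucene API that matches the index format described in this article:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: setting document and field boosts at index time (legacy Lucene API).
public class BoostSketch {
    public static void main(String[] args) {
        Document doc = new Document();
        doc.setBoost(2.0f);                       // document boost: this document is more important
        Field title = new Field("title", "Lucene in Action",
                Field.Store.YES, Field.Index.ANALYZED);
        title.setBoost(1.5f);                     // field boost: the title field is more important
        doc.add(title);
        // Both boosts are folded into the normalization factor at index time.
    }
}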

From the above formula we know that the normalization factor differs for a term (term) appearing in different documents or different fields. For example, given two documents, each with two fields, and ignoring document length, there are four combinations: an important field of an important document, an unimportant field of an important document, an important field of an unimportant document, and an unimportant field of an unimportant document; each combination has a different normalization factor.

Therefore Lucene saves (number of documents times number of fields) normalization factors, in the following format:

    • Normalization factor file (normalization factor file: NRM):
      • NormsHeader: the string "NRM" plus a version that differs with the Lucene version.
      • Next is an array of size NumFields, one Norms entry for each field that stores norms.
      • Norms is itself an array of size SegSize, i.e., the number of documents in this segment; each element is one byte representing a float, where bits 0~2 are the mantissa and bits 3~7 are the exponent. (A decoding sketch follows this list.)
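The expansion of that byte back into a float can be sketched after SmallFloat.byte315ToFloat from Lucene 2.x/3.x (3-bit mantissa, 5-bit exponent, zero-exponent point at 15; reconstructed from memory, so treat it as a sketch):

// Sketch: decoding the one-byte norm (3-bit mantissa, 5-bit exponent).
public class NormDecodeSketch {
    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;             // zero is a special case
        int bits = (b & 0xff) << (24 - 3);   // move mantissa/exponent into float position
        bits += (63 - 15) << 24;             // re-bias the 5-bit exponent
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        System.out.println(byte315ToFloat((byte) 124)); // 1.0
    }
}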
4.3.2. Deleted documents file (DEL)

    • Deleted documents file (deleted document: .del)
      • Format: in this file, only one of Bits and DGaps is saved. -1 means DGaps is saved; a non-negative value means Bits is saved.
      • ByteCount: as many bits are saved as there are documents in this segment, but counted in bytes, i.e., the number of bits rounded up to a multiple of 8.
      • BitCount: how many bits in Bits are set to 1, i.e., how many documents have been deleted.
      • Bits: an array of bytes of size ByteCount, treated as ByteCount * 8 bits when used.
      • DGaps: if only a few documents are deleted, most of Bits is 0, which wastes space. DGaps instead saves a sparse array, as follows. Suppose the 10th, 12th, and 32nd documents are deleted, so bits 10, 12, and 32 are set to 1. DGaps is also counted in bytes, and only the non-zero bytes are saved, here the 1st byte and the 4th byte; the 1st byte is 20 in decimal and the 4th byte is 1. They are saved to DGaps as follows: for the 1st byte, the position 1 is saved as a VInt (variable-length positive integer) and the value 20 as a raw byte; for the 4th byte, the position is saved as the difference 3 (4 - 1) as a VInt and the value 1 as a raw byte. The byte values themselves are not delta-encoded. (This worked example is checked in the sketch below.)
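A short sketch reproducing the worked example (illustrative code, not Lucene's BitVector):

import java.util.ArrayList;
import java.util.List;

// Sketch: compute the (position gap, byte value) pairs of the DGaps encoding.
public class DGapsSketch {
    public static void main(String[] args) {
        int[] deleted = {10, 12, 32};
        byte[] bits = new byte[5];                          // ByteCount = 5 here
        for (int d : deleted) bits[d >> 3] |= 1 << (d & 7); // set bit d

        List<int[]> dgaps = new ArrayList<>();
        int last = 0;
        for (int i = 0; i < bits.length; i++) {
            if (bits[i] != 0) {                             // only non-zero bytes are saved
                dgaps.add(new int[]{i - last, bits[i] & 0xff}); // gap as VInt, value as raw byte
                last = i;
            }
        }
        for (int[] g : dgaps) System.out.println(g[0] + " -> " + g[1]);
        // prints: 1 -> 20  (byte 1: bits 10 and 12, i.e., 4 + 16)
        //         3 -> 1   (byte 4, gap 4 - 1 = 3: bit 32)
    }
}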
V. Overall Structure

    • The overall structure of a Lucene index is as follows (shown as a figure in the original article):
      • segments.gen and segments_N belong to the whole index (index) and hold the metadata of the segments (segment); the data itself is then saved in multiple segments, and files sharing the same prefix belong to the same segment.
      • Each segment contains field information, term information, and other information (normalization factors, deleted documents).
      • Field information further includes field metadata (in .fnm) and field data (in .fdx, .fdt).
      • Term information is the reverse information, which includes the dictionary (.tis, .tii), the document number and term frequency posting lists (.frq), and the position posting lists (.prx).

Reading the source code of the corresponding readers and writers alongside these file structures will make the understanding more thorough.
