IV. Specific Format
4.2. Inverted Index Information
The inverted index information is the core of the index file: it is the inverted index itself.
The inverted index consists of two parts: on the left is the dictionary (Term Dictionary), and on the right is the inverted table (Posting List).
In Lucene these two parts are stored in separate files: the dictionary is stored in tii and tis, while the inverted table is split into two parts, one holding the document numbers and term frequencies, saved in frq, and the other holding the term position information, saved in prx.
- Term Dictionary (tii, tis)
  - -> Frequencies (.frq)
  - -> Positions (.prx)
4.2.1. Dictionary (tis) and Dictionary Index (tii) Information
In the dictionary, all terms are sorted alphabetically.
- Dictionary file (tis)
  - TermCount: the total number of terms in the dictionary.
  - IndexInterval: to speed up term lookup, a structure similar to the skip list is applied. If IndexInterval is 4, the 4th, 8th, 12th, ... terms are saved in the dictionary index (tii) file, which speeds up locating terms in the dictionary file.
  - SkipInterval: the document numbers, term frequencies, and position information in the inverted table are organized as skip lists; SkipInterval is the skip step size.
  - MaxSkipLevels: a skip list can have multiple levels; this value is the maximum number of levels in the skip list.
  - An array of TermCount entries, one per term. For each term it stores: the text of the term as PrefixLength + Suffix (for example, if the previous term is "term" and the current term is "terms", PrefixLength is 4 and Suffix is "s"); the number of the field the term belongs to (FieldNum); the number of documents containing the term (DocFreq); the offsets of the term's data in frq and prx (FreqDelta, ProxDelta); and the offset within frq of the skip list in this term's inverted table (SkipDelta). "Delta" indicates that the difference rule is applied.
- Dictionary index file (tii)
  - The dictionary index file is used to speed up locating terms in the dictionary file; it stores one term out of every IndexInterval terms.
  - The dictionary index file is loaded entirely into memory.
  - IndexTermCount = TermCount / IndexInterval: the number of terms contained in the dictionary index file.
  - IndexInterval: the same as IndexInterval in the dictionary file.
  - SkipInterval: the same as SkipInterval in the dictionary file.
  - MaxSkipLevels: the same as MaxSkipLevels in the dictionary file.
  - An array of IndexTermCount entries, one per term. Each entry consists of two parts: the term itself (TermInfo) and its offset in the dictionary file (IndexDelta). If IndexInterval is 4, the array contains the 4th, 8th, 12th, ... terms.
- The code for reading dictionary and dictionary index files is as follows:
```java
// Open the tis file; this enumerator reads the dictionary itself.
origEnum = new SegmentTermEnum(
    directory.openInput(segment + "." + IndexFileNames.TERMS_EXTENSION, readBufferSize),
    fieldInfos, false);
// Inside SegmentTermEnum, the header fields described above are read:
int firstInt = input.readInt();   // format/version
size = input.readLong();          // TermCount
indexInterval = input.readInt();  // IndexInterval
skipInterval = input.readInt();   // SkipInterval
maxSkipLevels = input.readInt();  // MaxSkipLevels

// Open the tii file; this enumerator reads the dictionary index,
// which is loaded entirely into the three in-memory arrays below.
SegmentTermEnum indexEnum = new SegmentTermEnum(
    directory.openInput(segment + "." + IndexFileNames.TERMS_INDEX_EXTENSION, readBufferSize),
    fieldInfos, true);
indexTerms = new Term[indexSize];        // indexSize: number of terms in the tii file
indexInfos = new TermInfo[indexSize];
indexPointers = new long[indexSize];
for (int i = 0; indexEnum.next(); i++) {
    indexTerms[i] = indexEnum.term();
    indexInfos[i] = indexEnum.termInfo();
    indexPointers[i] = indexEnum.indexPointer;
}
```
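Once these arrays are in memory, finding a term takes two steps: a binary search over the index terms for the greatest indexed term not larger than the target, followed by a sequential scan of at most IndexInterval entries in the dictionary file. The following self-contained sketch illustrates the idea with plain string arrays; the names, and the assumption that the index holds every IndexInterval-th term starting from the first, are illustrative rather than Lucene's actual API:

```java
// Sketch: two-step term lookup, assuming indexTerms holds every
// indexInterval-th entry of terms (the dictionary), starting at index 0.
static int lookup(String target, String[] indexTerms, int indexInterval, String[] terms) {
    // 1. Binary search for the greatest indexed term <= target.
    int lo = 0, hi = indexTerms.length - 1, pos = -1;
    while (lo <= hi) {
        int mid = (lo + hi) >>> 1;
        if (indexTerms[mid].compareTo(target) <= 0) { pos = mid; lo = mid + 1; }
        else hi = mid - 1;
    }
    if (pos < 0) return -1;                    // target precedes every term
    // 2. Scan forward at most indexInterval entries in the dictionary.
    int start = pos * indexInterval;           // where indexTerms[pos] sits in terms
    for (int i = start; i < terms.length && i < start + indexInterval; i++) {
        int cmp = terms[i].compareTo(target);
        if (cmp == 0) return i;                // found the i-th dictionary entry
        if (cmp > 0) break;                    // passed it: not in the dictionary
    }
    return -1;
}
```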
4.2.2. Document Number and Term Frequency (frq) Information
The document number and term frequency file stores the inverted tables, which exist in the form of skip lists.
- This file contains TermCount entries; each term has one entry, because each term has its own inverted table.
- The inverted table of each term contains two parts: one is the inverted table itself, i.e., an array of document numbers and term frequencies; the other is a skip list, used to quickly access and locate a document number and term frequency in the inverted table.
- The storage of document numbers and term frequencies applies the difference rule and the optional-following (A+B?) rule. Lucene's own documentation describes this in a way that is hard to understand, so it is explained here:
Lucene's documentation says: "For example, the TermFreqs for a term which occurs once in document seven and three times in document eleven, with omitTf false, would be the following sequence of VInts: 15, 8, 3. If omitTf were true it would be this sequence of VInts instead: 7, 4."

First, consider the case omitTf = false, i.e., the number of times a term appears in each document is stored in the index. As the example says, a term that appears once in document 7 and three times in document 11 is stored as the sequence 15, 8, 3. How are these three numbers obtained?

The format defines TermFreq --> DocDelta[, Freq?]: a TermFreq structure is a DocDelta optionally followed by a Freq, which is exactly the A+B? structure mentioned above. DocDelta stores the ID of a document containing the term, and Freq is the number of occurrences in that document. So the complete information to be stored is [DocID = 7, Freq = 1] [DocID = 11, Freq = 3] (see the basic principles of full-text search).

To save space, however, Lucene applies the difference rule (rule 2, the Delta rule) to numbers such as document IDs, so the document IDs are not stored in full: they become [DocIDDelta = 7, Freq = 1] [DocIDDelta = 4 (11 - 7), Freq = 3].

How is the A+B? structure stored? By rule 3, the A+B? rule: if the Freq following a DocDelta is 1, this is expressed by setting the last bit of DocDelta to 1; if the Freq is greater than 1, the last bit of DocDelta is set to 0 and the real Freq value follows.

For the first posting, Freq is 1, so it is folded into the last bit of DocDelta: DocIDDelta = 7 is 000 0111 in binary; shifted left by one with the last bit set to 1, it becomes 000 1111 = 15. For the second posting, Freq is greater than 1, so the last bit of DocDelta is set to 0: DocIDDelta = 4 is 0000 0100 in binary; shifted left by one with the last bit 0, it becomes 0000 1000 = 8, followed by the real Freq = 3. The stored sequence is therefore [DocDelta = 15] [DocDelta = 8, Freq = 3], i.e., 15, 8, 3.

If omitTf = true, i.e., the per-document occurrence counts are not stored in the index, only the DocIDs are saved and the A+B? rule does not apply. Starting from [DocID = 7] [DocID = 11] and applying the difference rule (rule 2) gives [DocDelta = 7] [DocDelta = 4 (11 - 7)], i.e., the sequence 7, 4.
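The following minimal sketch reproduces this encoding for the example above. It is an illustration of the two rules, not Lucene's actual writer code; in the real file each resulting integer is further written as a VInt.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the frq encoding for postings such as [doc=7,freq=1][doc=11,freq=3],
// combining the difference rule with the A+B? rule described above.
// Illustration only, not Lucene's actual writer code.
public class FrqEncodingSketch {
    static List<Integer> encode(int[][] postings, boolean omitTf) {
        List<Integer> out = new ArrayList<>();
        int lastDocID = 0;
        for (int[] p : postings) {
            int docID = p[0], freq = p[1];
            int delta = docID - lastDocID;     // difference rule
            if (omitTf) {
                out.add(delta);                // plain DocDelta, no freq stored
            } else if (freq == 1) {
                out.add((delta << 1) | 1);     // last bit 1: Freq == 1 is implied
            } else {
                out.add(delta << 1);           // last bit 0: the real Freq follows
                out.add(freq);
            }
            lastDocID = docID;
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] postings = { { 7, 1 }, { 11, 3 } };
        System.out.println(encode(postings, false)); // [15, 8, 3]
        System.out.println(encode(postings, true));  // [7, 4]
    }
}
```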
- For the storage of the skip lists, the following points need explanation:
- A skip list is divided into levels according to the length of the inverted table (DocFreq) and the skip step (SkipInterval). The number of levels is NumSkipLevels = Min(MaxSkipLevels, floor(log(DocFreq) / log(SkipInterval))).
- The number of nodes at level Level is DocFreq / (SkipInterval ^ (Level + 1)), with levels counted from zero.
- Except for the lowest level, every level has a SkipLevelLength giving the length in bytes of that level (rather than its number of nodes), so that a whole level of the skip list can conveniently be read into a buffer.
- Higher levels come first and lower levels after them; once all the higher levels have been read, what remains is the lowest level, so the last level needs no SkipLevelLength. This is also why the Lucene documentation describes the format as NumSkipLevels-1 SkipLevels followed by a final SkipLevel: the first NumSkipLevels-1 levels each have a SkipLevelLength, while the last level has only the SkipLevel data and no SkipLevelLength.
- Except for the lowest level, every node has a SkipChildLevelPointer pointing to the corresponding node in the level below.
- Each skip node contains the following information: a document number, the payload length, the offset in frq of that document number's entry in the inverted table, and the offset in prx of that document number's entry in the inverted table.
- Although the Lucene documentation gives the following description, experiments show it is not entirely accurate:
"Example: SkipInterval = 4, MaxSkipLevels = 2, DocFreq = 35. Then skip level 0 has 8 SkipData entries, containing the 3rd, 7th, 11th, 15th, 19th, 23rd, 27th, and 31st document numbers in TermFreqs. Skip level 1 has 2 SkipData entries, containing the 15th and 31st document numbers in TermFreqs."

According to this description, with SkipInterval = 4 and 35 documents, skip level 0 should contain the 3rd, 7th, 11th, 15th, 19th, 23rd, 27th, and 31st documents, and skip level 1 should contain the 15th and 31st documents. In the real implementation, however, the skip node is shifted one document forward, because of the following code:
```java
// FormatPostingsDocsWriter.addDoc(int docID, int termDocFreq)
final int delta = docID - lastDocID;
if ((++df % skipInterval) == 0) {
    // Note: it is lastDocID (the previous document), not docID, that is
    // recorded as skip data, which shifts the skip node one document forward.
    skipListWriter.setSkipData(lastDocID, storePayloads, posWriter.lastPayloadLength);
    skipListWriter.bufferSkip(df);
}
```
From this code we can see that with SkipInterval = 4: when docID = 0, ++df is 1, and 1 % 4 is not 0, so this is not a skip node; when docID = 3, ++df is 4, and 4 % 4 is 0, so this is a skip node. However, what is saved in the skip data is lastDocID, which is 2. So the inverted table and the skip list actually store skip data for the document one position before the one the documentation describes.
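The level arithmetic described above can be checked with a small sketch (illustrative only): for DocFreq = 35, SkipInterval = 4, and MaxSkipLevels = 2, it reports 2 levels, with 8 nodes at level 0 and 2 nodes at level 1, matching the example.

```java
// Sketch: number of skip levels and nodes per level, per the formulas above.
static void describeSkipList(int docFreq, int skipInterval, int maxSkipLevels) {
    int numSkipLevels = Math.min(maxSkipLevels,
            (int) Math.floor(Math.log(docFreq) / Math.log(skipInterval)));
    System.out.println("NumSkipLevels = " + numSkipLevels);
    for (int level = 0; level < numSkipLevels; level++) {
        int nodes = docFreq / (int) Math.pow(skipInterval, level + 1);
        System.out.println("level " + level + ": " + nodes + " nodes");
    }
}
// describeSkipList(35, 4, 2) prints: NumSkipLevels = 2, level 0: 8 nodes, level 1: 2 nodes
```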
4.2.3. Term Position (prx) Information
Term position information is also an inverted table, and it too exists in the form of a skip list.
- This file contains TermCount entries; each term has one entry, because each term has its own position inverted table.
- For each term there is an array of DocFreq entries, one per document containing the term, recording the positions at which the term appears. This document array is also tied to the skip list in the frq file: as described above, a skip node in frq carries a ProxSkip, and when SkipInterval is 3, the skip nodes in frq point to the 1st, 4th, 7th, 10th, 13th, 16th, ... entries of this document array in the prx file.
- Each document may contain the term several times, so for each document there is an array of Freq entries, one per occurrence of the term in the document, each holding one piece of position information.
- Each piece of position information contains a PositionDelta (the difference rule applies; see the sketch after this list). The payload may also be stored here, to which the optional-following (A+B?) rule applies.
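As a small illustration of the difference rule for positions (a sketch, not Lucene's actual writer code): if a term occurs at positions 4, 10, and 17 within a document, the stored PositionDelta values are 4, 6, and 7.

```java
// Sketch: positions within one document are stored as deltas from the
// previous position (the difference rule). Illustration only.
static int[] encodePositions(int[] positions) {
    int[] deltas = new int[positions.length];
    int last = 0;
    for (int i = 0; i < positions.length; i++) {
        deltas[i] = positions[i] - last;    // PositionDelta
        last = positions[i];
    }
    return deltas;                          // {4, 10, 17} -> {4, 6, 7}
}
```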
4.3. Other Information
4.3.1. Normalization Factor File (nrm)
Why is there a normalization factor? From the description in Chapter 1 we know that, during search, the matching documents are sorted by their relevance to the query statement, and the documents with higher relevance scores are ranked first. The relevance score uses the Vector Space Model. Before computing the relevance, the Term Weight must be computed, i.e., the importance of a term to a document. Two main factors influence the Term Weight: the number of times the term appears in the document, and how common the term is across documents. Obviously, the more often a term appears in a document, the more important the term is to that document.
This way of computing the Term Weight is the most common, but it has the following problems:
- Different documents have different importance. Some documents are more important than others. For example, when indexing books about software, I may want computer books to be easier to find and literature books to rank further back in the results.
- Different fields have different importance. Some fields are important, such as keywords and titles; some are not, such as attachments. For the same term, a match in a keyword field should score higher than one in an attachment field.
- Judging the importance of a term to a document by the absolute number of occurrences favors long documents: words simply appear more often in them, so short documents suffer. For example, if a word appears 10 times in a brick-sized book and 9 times in an article of fewer than 100 words, does that mean the book should be ranked first? No. A word that manages to appear 9 times in fewer than 100 words clearly demonstrates its importance to that article.
For the reasons above, Lucene multiplies in a normalization factor when computing the Term Weight, to reduce the impact of these three problems.
The normalization factor affects the subsequent score computation. Part of Lucene's score computation happens during the indexing process, generally for the parameters unrelated to the query statement, such as the normalization factor; it is described further in the code analysis of the search process.
During indexing, the normalization factor is computed, in general, as

norm(t, d) = Document boost * Field boost * lengthNorm(field)

which involves three parameters:
- Document boost: the larger this value, the more important the document.
- Field boost: the larger this value, the more important the field.
- lengthNorm(field) = 1.0 / Math.sqrt(numTerms): the more terms a field contains, i.e., the longer the document, the smaller this value; the shorter the document, the larger the value.
From the formula above, the same term has different normalization factors in different documents and in different fields. For example, take two documents, each with two fields, and ignore document length: there are four combinations, namely an important field in an important document, an unimportant field in an important document, an important field in an unimportant document, and an unimportant field in an unimportant document. Each of the four combinations has a different normalization factor.
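A tiny sketch of this computation (the parameter names are illustrative; in Lucene the actual computation lives in its Similarity class):

```java
// Sketch of the normalization factor computed at index time, per the formula above.
// docBoost, fieldBoost, and numTerms are illustrative parameters.
static float norm(float docBoost, float fieldBoost, int numTerms) {
    float lengthNorm = (float) (1.0 / Math.sqrt(numTerms)); // shorter field => larger value
    return docBoost * fieldBoost * lengthNorm;
}
// norm(2.0f, 1.5f, 100) -> 0.3: an important field of an important document
```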
Therefore, Lucene saves (number of documents multiplied by number of fields) normalization factors, in the following format:
- Normalization factor file (.nrm):
- NormsHeader: the string "NRM" plus a Version, which depends on the Lucene version.
- Next comes an array of NumFields entries; each field has one Norms entry.
- Norms is itself an array of SegSize, i.e., the number of documents in this segment; each entry is one byte encoding a floating-point number, in which bits 0-2 are the mantissa and bits 3-7 are the exponent (a decoding sketch follows this list).
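Decoding the one-byte float back into a full float can be sketched as follows, modeled on the byte315 format (3 mantissa bits, 5 exponent bits) used by Lucene's SmallFloat; treat the exact bit manipulation as a sketch to be checked against the source:

```java
// Sketch: decode a one-byte norm (3 mantissa bits, 5 exponent bits) into a float,
// modeled on Lucene's SmallFloat.byte315ToFloat.
static float byteToFloat(byte b) {
    if (b == 0) return 0.0f;               // zero is encoded specially
    int bits = (b & 0xff) << (24 - 3);     // move mantissa/exponent into float layout
    bits += (63 - 15) << 24;               // re-bias the 5-bit exponent
    return Float.intBitsToFloat(bits);
}
```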
4.3.2. Deleted Documents File (del)
- Deleted documents file (.del):
- Format: this file stores either Bits or DGaps, but not both. A value of -1 indicates that DGaps is stored; a non-negative value indicates that Bits is stored.
- ByteCount.
- BitCount: the number of bits in Bits that are set to 1; a bit set to 1 indicates that the corresponding document has been deleted.
- Bits: an array of bytes whose size is ByteCount; when used, it is treated as ByteCount * 8 bits.
- DGaps: if only a few documents are deleted, most of the bits in Bits are 0, which wastes space. DGaps stores such a sparse array as follows. Suppose the 10th, 12th, and 32nd documents are deleted; then bits 10, 12, and 32 are set to 1. DGaps is also measured in bytes, and only the non-zero bytes are stored: byte 1, whose decimal value is 20 (bits 10 and 12), and byte 4, whose decimal value is 1 (bit 32). So the DGaps are saved as: the position of the first byte, 1, as a variable-length positive integer, followed by its value, 20, in binary; then the position of the second byte as a variable-length positive integer using the difference rule, 4 - 1 = 3, followed by its value, 1, in binary. The difference rule is not applied to the binary byte values. (A sketch of this encoding follows.)
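A sketch of this sparse encoding (illustrative only; in the real file the positions are written as variable-length integers and the byte values as raw bytes):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: encode a sparse deletion bitmap as DGaps, per the description above.
// Each output pair is (byte-position delta, raw byte value); the difference
// rule is applied to positions only, never to the byte values themselves.
static List<int[]> toDGaps(byte[] bits) {
    List<int[]> dgaps = new ArrayList<>();
    int lastPos = 0;
    for (int i = 0; i < bits.length; i++) {
        if (bits[i] != 0) {
            dgaps.add(new int[] { i - lastPos, bits[i] & 0xff });
            lastPos = i;
        }
    }
    return dgaps;
}
// For deleted documents 10, 12, and 32: bits[1] == 20 and bits[4] == 1,
// so the output pairs are (1, 20) and (3, 1).
```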
V. Overall Structure
- The figure below shows the overall structure of the Lucene index file:
- At the level of the whole index are segments.gen and segments_N, which store the metadata of the segments; the data itself is stored in multiple segments, and files belonging to the same segment share the same file-name prefix.
- Each segment contains field information, term information, and other information (normalization factors, deleted documents).
- The field information in turn includes the metadata of the fields, in fnm, and the data of the fields, in fdx and fdt.
- The term information is the inverted information, which includes the dictionary (tis, tii), the document number and term frequency inverted table (frq), and the term position inverted table (prx).
Reading the source code of the corresponding Readers and Writers alongside this description will give a more thorough understanding of the file structure.