Lucene 4.x Inverted Index Principles and Implementation: (3) Term Dictionary and Term Index Files

Source: Internet
Author: User

The most complex part is the term dictionary and term index files. The term dictionary file has the extension .tim, and the term index file has the extension .tip.

 

The term dictionary file starts with a Header, followed by a PostingsHeader. The two have the same format but store different information. SkipInterval is the interval of the skip list, MaxSkipLevels is the number of levels in the skip list, and SkipMinimum is the minimum postings list length for which a skip list is used. The next part is the terms.

In the tim file, terms are stored in blocks. How the terms are divided into blocks depends on the tip file. The term index file stores one FSTIndex per field, which helps locate a term of that field quickly in the tim file. Because the FSTIndex entries vary in length, a pointer list is also stored so that a field can be located quickly: for each field, it saves a pointer to that field's FSTIndex.

What is confusing here is: what is an FST, and how is it used to divide terms into blocks?

FST stands for finite state transducer: a finite state machine with output. As the earlier discussion of finite state machines showed, the logical structure of a finite state machine is a tree. In the tree in Figure 3-71, input character a in the initial state leads to state A, input character b in state A leads to state B, and input character d in state B leads to state D. The difference from a plain finite state machine is that state D has an output. Here the output is a pointer to a position in the tim file.

Terms in the tim file are divided into blocks according to the FST. In Figure 3-71, all the terms in Block 0 share the prefix abd, and all the terms in Block 1 share the prefix abe. Each block has a block header that records how many terms the block contains; call this number N. Suffix then contains N suffixes: for example, if Block 0 contains the terms "abdi" and "abdj", it saves "i" and "j". Stats contains N sets of statistics, each consisting of docFreq and totalTermFreq. Metadata contains pointers into the postings files frq and prx.
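The suffix idea can be illustrated with a few lines of plain Java. This is only a sketch of the concept, not Lucene's block writer; the class and method names are made up for illustration, and the real block also carries the stats and metadata sections described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockSuffixSketch {
    /** Returns the suffixes a block with the given shared prefix would store. */
    static List<String> suffixes(String prefix, List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            if (!term.startsWith(prefix)) {
                throw new IllegalArgumentException(term + " does not start with " + prefix);
            }
            // Only the part after the shared prefix is stored in the block.
            out.add(term.substring(prefix.length()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> block0 = Arrays.asList("abdi", "abdj");
        // N is the number of terms recorded in the block header.
        System.out.println("N = " + block0.size() + ", suffixes = " + suffixes("abd", block0));
    }
}
```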

The tim and tip files are written by org.apache.lucene.codecs.BlockTreeTermsWriter. Its constructor creates the two output streams and writes all information except the blocks and the FSTIndex.

The start function of Lucene40PostingsWriter writes the PostingsHeader, including SkipInterval, MaxSkipLevels, and SkipMinimum.

Next we discuss in detail how terms are divided into blocks, how a block is written, and how the FSTIndex is constructed.

First, let's take a simple example to see how an ordinary FST is constructed. The Lucene documentation provides an example similar to the following.

Here, inputValues is the input used to construct the FST; the tree in Figure 3-71 is built from these strings.

outputValues is the output of the finite state machine. In practice the output is a pointer into the tim file, usually of type byte[], so here we also use three byte[] values as the outputs.

Builder is the constructor of the finite state machine and supports multiple output types. Because the output here is byte[], we choose BytesRef as the output type, which is a wrapper around byte[].

The next step is to call Builder's add function to associate each input with its output. Builder's input must be of type IntsRef, so each string has to be converted to IntsRef; the output byte[] is likewise wrapped in a BytesRef.

Builder's finish function constructs the FST and forms a binary structure in memory. The FST can then be used to look up an output quickly from an input; for example, if "acf" is input in the program, the output is [5 6].

Viewed from the outside, an FST can even be treated as a map from input to output. This is exactly what a term dictionary needs: given a string, it immediately finds the position of the postings list.
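The map-like behavior can be sketched with a plain trie whose final states carry an output. This is not Lucene's FST: it shares no suffixes, has no binary encoding, and uses none of Lucene's classes. The inputs abd, abe, and acf and the output [5, 6] for acf come from the example in the text; the other two outputs are assumed values chosen for illustration.

```java
import java.util.Arrays;
import java.util.TreeMap;

public class TrieTransducerSketch {
    /** One state: labeled arcs to child states, plus an output if a term ends here. */
    static final class State {
        final TreeMap<Character, State> arcs = new TreeMap<>();
        byte[] output; // stays null unless an input ends at this state
    }

    private final State root = new State();

    /** Associates an input string with an output, creating states as needed. */
    void add(String input, byte[] output) {
        State s = root;
        for (char c : input.toCharArray()) {
            s = s.arcs.computeIfAbsent(c, k -> new State());
        }
        s.output = output;
    }

    /** Follows one arc per input character; returns null if the input is not accepted. */
    byte[] get(String input) {
        State s = root;
        for (char c : input.toCharArray()) {
            s = s.arcs.get(c);
            if (s == null) {
                return null;
            }
        }
        return s.output;
    }

    public static void main(String[] args) {
        TrieTransducerSketch fst = new TrieTransducerSketch();
        fst.add("abd", new byte[] {1, 2}); // assumed output
        fst.add("abe", new byte[] {3, 4}); // assumed output
        fst.add("acf", new byte[] {5, 6}); // output given in the text
        System.out.println(Arrays.toString(fst.get("acf"))); // prints [5, 6]
    }
}
```

A real FST reaches the same lookup result with far less memory, which is why the binary form matters for a term index.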

Builder has a very important member variable, UnCompiledNode<T>[] frontier, which maintains the whole FST tree during construction. frontier directly holds the state nodes formed by the string currently being added; the state nodes formed by previously added strings are linked to them through pointers.

The Builder.add function consists of four parts:

After the first string abd is added, frontier has the structure shown in Figure 3-72 (the blue nodes in the figure).

 

When the new string abe is added, step (1) finds the common prefix ab, so prefixLenPlus1 = 3. Step (2) then calls freezeTail to freeze the last node, Sd. Why "freeze" (a vivid term)? Because node Sd will no longer change. In practice, strings are added in alphabetical order. If the previous string is abd, the next string could be abdm and the one after that abdn, and these would change node Sd. Once abe appears, however, no further abd* string can appear, so state Sd cannot gain new child nodes. Sd is therefore final and should be frozen. Does node Sb need to be frozen too? No, because new children of Sb, such as abf and abg, may still appear. This is why the common prefix is computed: the state nodes after the common prefix can be frozen, and they are frozen starting from the tail. Hence the name freezeTail.
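The rule for which states can be frozen follows directly from the common prefix. The sketch below is plain Java with hypothetical names; it identifies each state by the prefix that leads to it (so "abd" stands for Sd and "ab" for Sb) and does not touch Lucene's Builder.

```java
import java.util.ArrayList;
import java.util.List;

public class FreezeTailSketch {
    /** Length of the common prefix of the two inputs, plus one. */
    static int prefixLenPlus1(String previous, String next) {
        int i = 0;
        int stop = Math.min(previous.length(), next.length());
        while (i < stop && previous.charAt(i) == next.charAt(i)) {
            i++;
        }
        return i + 1;
    }

    /** States of the previous input that can be frozen, listed from the tail. */
    static List<String> frozenStates(String previous, String next) {
        List<String> frozen = new ArrayList<>();
        int downTo = prefixLenPlus1(previous, next);
        // Walk from the deepest state back toward the common prefix.
        for (int depth = previous.length(); depth >= downTo; depth--) {
            frozen.add(previous.substring(0, depth));
        }
        return frozen;
    }

    public static void main(String[] args) {
        System.out.println(prefixLenPlus1("abd", "abe")); // prints 3
        System.out.println(frozenStates("abd", "abe"));   // prints [abd]
        System.out.println(frozenStates("abe", "acf"));   // prints [abe, ab]
    }
}
```

Adding abe after abd freezes only Sd, while adding acf after abe freezes Se and then Sb, because no further string starting with ab can arrive.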

The implementation of freezeTail is as follows:

freezeTail has two main branches. When constructing the Builder, the user can pass in a custom FreezeTail. If one is specified, its freeze function is called; otherwise the default behavior in the else branch is executed. Here we use the default behavior; a custom FreezeTail appears in the code analysis later.

By default, the compileNode function is called for each state node from the tail node back to the common prefix node. Before this step, the nodes held in frontier are all UnCompiledNodes. After compileNode, a node becomes a CompiledNode and is removed from frontier, and the parent.replaceLast function points the parent node's pointer at the new CompiledNode. Compiling turns the in-memory data structure into binary form.

compileNode eventually calls org.apache.lucene.util.fst.FST.addNode(UnCompiledNode<T>). The code is as follows:

Then step (3) adds the new input to frontier, which becomes the data structure shown in Figure 3-73.

 

And so on. After acf is added, frontier becomes the following data structure.

 

Finally, Builder's finish function is called to generate the FST. The code is as follows:

The resulting binary array is shown in Figure 3-75. Because the content is reversed, it must be parsed from right to left.

 

With the basic principle of FSTs understood, let's step through the code to see how the blocks and the FSTIndex of the tim and tip files are generated.

We use Figure 3-76 as the example. By default, BlockTreeTermsWriter has two static variables: DEFAULT_MIN_BLOCK_SIZE = 25 and DEFAULT_MAX_BLOCK_SIZE = 48. Min means that once a state node has more than 25 child nodes, they can be written as a block. Max means that once the number exceeds 48, they are written as several blocks, which together form a floor block. To make the code easier to follow, we set DEFAULT_MIN_BLOCK_SIZE = 2 and DEFAULT_MAX_BLOCK_SIZE = 4. We add only one document, whose terms are abc abdf abdg abdh abei abej abek abel abem aben. With these Min and Max settings, in the state tree f, g, and h are written as one block; i, j, k, l, m, and n are written as a floor block; and c, d, and e are written as one block. The decimal and hexadecimal values of the characters a to n are listed as well, because Eclipse sometimes displays characters in decimal and sometimes in hexadecimal; when you see these values, you only need to recognize which characters they are.

 

Writing the tim and tip files is a complex process. The flowchart in Figure 3-77 serves as a guide.

 

Each time a new term arrives, finishTerm is called.

The blockBuilder used in finishTerm has no output: blockBuilder is used to divide terms into blocks, not to generate the FSTIndex. The blockBuilder.add function works in basically the same way as described in the FST basics above. The difference is that blockBuilder uses a user-specified FreezeTail, namely org.apache.lucene.codecs.BlockTreeTermsWriter.TermsWriter.FindBlocks, so freezeTail calls the FindBlocks.freeze function. The freeze function only processes nodes whose number of child nodes is greater than min, calling writeBlocks to write those child nodes as blocks. For nodes that do not meet this condition, it only removes the node from frontier and does nothing else.

Two member variables are maintained throughout the process. One is List<PendingEntry> pending, which holds the terms or blocks not yet processed; for a term, it stores the term's text, docFreq, and totalTermFreq. The other is pendingTerms, which holds the freqStart and proxStart of the unprocessed terms.

After abc, abdf, abdg, and abdh are added, frontier becomes the following structure. During this process FindBlocks.freeze does nothing. pending and pendingTerms at this point are as follows.

 

When abei is added, Sd is frozen. Its out-degree is found to be 3, which is greater than min, so the BlockTreeTermsWriter.TermsWriter.writeBlocks(IntsRef, int, int) function is called.

Since the out-degree is smaller than max, it is written as a non-floor block.
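The decision described so far can be summarized as a small classification function. This is a sketch of the rule as stated in the text, not Lucene's code: the names are hypothetical, the behavior when the out-degree equals min or max exactly is an assumption, and the special handling of the root state is left out.

```java
public class BlockKindSketch {
    enum Kind { NONE, BLOCK, FLOOR_BLOCK }

    /** Classifies a frozen state by its out-degree, following the rule in the text. */
    static Kind classify(int outDegree, int min, int max) {
        if (outDegree <= min) {
            return Kind.NONE;        // not greater than min: nothing is written (boundary assumed)
        }
        if (outDegree <= max) {
            return Kind.BLOCK;       // fits in one non-floor block (boundary assumed)
        }
        return Kind.FLOOR_BLOCK;     // exceeds max: split into several blocks
    }

    public static void main(String[] args) {
        int min = 2;
        int max = 4;
        System.out.println(classify(3, min, max)); // Sd with f, g, h: BLOCK
        System.out.println(classify(6, min, max)); // Se with i..n: FLOOR_BLOCK
        System.out.println(classify(1, min, max)); // Sa with a single child: NONE
    }
}
```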

The function for writing a block is as follows:

For each block written, an FSTIndex is generated for that block. This is done by the BlockTreeTermsWriter.PendingBlock.compileIndex function.

Once the block has been written and its FSTIndex generated, the state of frontier, pending, and pendingTerms is as shown in the following figure.

 

Here we need to explain how the mapping [-38, 2] in the FSTIndex of block:abd is obtained. It is calculated by the following function. With fp = 86, hasTerms = true, and isFloor = false, the value in binary is 101011010, which written as a VInt is the two bytes 11011010 00000010, that is, [-38, 2]. The -38 is simply the first byte read as a two's-complement value.
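The arithmetic can be checked with a short stand-alone program. The bit layout used here (fp shifted left by two bits, hasTerms in bit 1, isFloor in bit 0) is inferred from the worked numbers in the text, and the class and method names are placeholders rather than Lucene's own; the variable-length encoding stores seven bits per byte, low bits first, with the high bit set on every byte except the last.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class EncodeOutputSketch {
    /** Packs the file pointer and the two flags into one long (layout inferred from the text). */
    static long encodeOutput(long fp, boolean hasTerms, boolean isFloor) {
        return (fp << 2) | (hasTerms ? 2 : 0) | (isFloor ? 1 : 0);
    }

    /** Variable-length encoding: 7 data bits per byte, high bit marks "more bytes follow". */
    static byte[] toVLong(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long packed = encodeOutput(86, true, false);
        System.out.println(Long.toBinaryString(packed));      // prints 101011010
        System.out.println(Arrays.toString(toVLong(packed))); // prints [-38, 2]
    }
}
```

Java bytes are signed, so the first byte 11011010 (218 unsigned) prints as -38.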

After abei, abej, abek, abel, abem, and aben are added, the state of frontier, pending, and pendingTerms is as shown in Figure 3-80.

 

After all the terms are added, BlockTreeTermsWriter.TermsWriter.finish is called.

The call to freezeTail(0) again invokes the FindBlocks.freeze function. When state Se is frozen, its out-degree is 6 > min, so writeBlocks is called. Because 6 > max, it is written as a floor block.

firstBlock and the floorBlocks are written with the same writeBlock function that was called above for the non-floor block. The values of some major variables are listed below.

The following shows the state of frontier, pending, and pendingTerms after the floor block is written and its FSTIndex is generated.

 

What does [-77, 3, 1, 107, 33] mean? First, abe points to a floor block whose firstBlock starts at address 108, so fp = 108, hasTerms = true, and isFloor = true. The value in binary is 110110011, which written as a VInt is the two bytes 10110011 00000011, that is, [-77, 3]. This is followed by the floorBlock information.

The BlockTreeTermsWriter.PendingBlock.compileIndex function contains the following section:

First, the number of floorBlocks is written, which is 1. Then the first character of the floorBlock, k (107), is written. Finally, the difference between the start address of the floorBlock and that of firstBlock is written together with the has-terms flag: sub.fp = 124, fp = 108, and sub.hasTerms = true, which gives 33. Therefore, the output of [abe] is [-77, 3, 1, 107, 33].
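The byte sequence for the floor block can be reproduced with a stand-alone sketch. The layout is reconstructed from the numbers in the text: the packed fp and flags first, then the floor block count, then for each floor block its lead character and the address delta shifted left by one bit with the has-terms flag in the low bit. The names are hypothetical and nothing here calls Lucene.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

public class FloorOutputSketch {
    /** Variable-length encoding: 7 data bits per byte, high bit marks "more bytes follow". */
    static void writeVLong(ByteArrayOutputStream out, long value) {
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
    }

    /** Builds the index output for a floor block (layout reconstructed from the text). */
    static byte[] floorOutput(long fp, boolean hasTerms,
                              int[] leadBytes, long[] subFps, boolean[] subHasTerms) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVLong(out, (fp << 2) | (hasTerms ? 2 : 0) | 1); // isFloor is always set here
        writeVLong(out, leadBytes.length);                   // number of floor blocks
        for (int i = 0; i < leadBytes.length; i++) {
            out.write(leadBytes[i]);                         // first character of the floor block
            writeVLong(out, ((subFps[i] - fp) << 1) | (subHasTerms[i] ? 1 : 0));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] output = floorOutput(108, true,
                new int[] {'k'}, new long[] {124}, new boolean[] {true});
        System.out.println(Arrays.toString(output)); // prints [-77, 3, 1, 107, 33]
    }
}
```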

After state Se is frozen, state Sb is frozen next. Its out-degree is 3, so writeBlock is first called to write a non-floor block, and then compileIndex is called to generate a new FSTIndex for that block.

The important variables while this block is written are shown in the following table.

Table 3-17 writeBlock variables when freezing state Sb

 

When compileIndex generates the FSTIndex of the current block, in addition to the output corresponding to prefix = ab, the FSTIndex of the sub-blocks block:abd and block:abe are also added, forming one complete FSTIndex.

The following shows the state of frontier, pending, and pendingTerms after Sb is frozen.

 

Here only one pending entry remains. The FSTIndex of all sub-blocks has been merged into block:ab, and the output of [ab] is [-30, 4], which encodes fp = 152, hasTerms = true, isFloor = false.
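How pending shrinks to a single entry over the whole example can be traced with a simplified model. The sketch assumes that writing a block replaces the last count pending entries with one block entry; the string labels and method names are illustrative only, and pendingTerms, the file output, and the FSTIndex merging are all left out.

```java
import java.util.ArrayList;
import java.util.List;

public class PendingCollapseSketch {
    final List<String> pending = new ArrayList<>();

    /** A finished term is appended to the pending list. */
    void addTerm(String term) {
        pending.add("term:" + term);
    }

    /** Assumed behavior: the last 'count' entries collapse into one block entry. */
    void writeBlock(String prefix, int count) {
        int from = pending.size() - count;
        pending.subList(from, pending.size()).clear();
        pending.add("block:" + prefix);
    }

    public static void main(String[] args) {
        PendingCollapseSketch p = new PendingCollapseSketch();
        for (String t : new String[] {"abc", "abdf", "abdg", "abdh"}) {
            p.addTerm(t);
        }
        p.writeBlock("abd", 3);        // Sd frozen when abei arrives
        for (String t : new String[] {"abei", "abej", "abek", "abel", "abem", "aben"}) {
            p.addTerm(t);
        }
        p.writeBlock("abe", 6);        // Se frozen in finish (floor block)
        System.out.println(p.pending); // prints [term:abc, block:abd, block:abe]
        p.writeBlock("ab", 3);         // Sb frozen: c, block:abd, block:abe
        System.out.println(p.pending); // prints [block:ab]
    }
}
```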

Next, state Sa has out-degree 1, so nothing is done. The initial state S0 also has out-degree 1 and would normally be skipped as well, but the FindBlocks.freeze function contains the following code:

Besides checking whether the out-degree is greater than min, it also checks whether idx == 0. For S0, therefore, writeBlocks is still called to write block:ab to the tim file.

The BlockTreeTermsWriter.TermsWriter.finish function first calls blockBuilder.finish. It then takes the FSTIndex of the root node from pending.get(0). Because compileIndex adds the FSTIndex of every child node to its parent, the FSTIndex of the root node is the FSTIndex of the whole state machine. It is then written to indexOut, that is, the tip file.

Finally, the format and relationship of the blocks and the FSTIndex in the tip and tim files are shown in Figure 3-83.

 

Finally, let's take a look at the binary content of the FSTIndex, as shown in Figure 3-84.
