In the previous articles about Lucene, I briefly explained how to use Lucene for word segmentation, indexing, and search. Recently, most of my time has gone into data queries. The subject seems complicated, yet I never feel I have gone deep enough. Fortunately, most of what I am missing is conceptual and does not affect day-to-day programming. Sometimes I feel powerless: I worry too much about whatever is fashionable on the surface, and risk being curious and enthusiastic without being diligent in practice. After all, what I need to focus on is solving the problem, not memorizing a few concepts and professional terms. So I will get something done first, even if it is not very professional. This article briefly records several IndexWriter parameters that are useful when creating indexes in Lucene, together with my working knowledge and a summary of several common formats of index files. I hope it helps you as well.
I. Several useful parameters for improving the indexing speed
For a given indexing algorithm, the three parameters that most affect Lucene's indexing speed are mergeFactor, maxMergeDocs, and ramBufferSizeMB on IndexWriter. They mainly control how often data is moved from memory to disk and how often index segments are merged, and tuning them can improve indexing speed. Of course, suitable values also depend on the hardware of the machine.
1. mergeFactor
mergeFactor is the so-called "merge factor". It mainly governs the merging of sub-indexes (segments). As you know, Lucene first writes index data to memory and, once certain limits are reached, writes it to the hard disk as an independent sub-index (segment). Normally the segments are merged into a single index by optimizing (Optimize()); otherwise, a large number of segments slows down retrieval and may occupy a lot of disk space. The mergeFactor parameter controls how many segments may accumulate on disk before Lucene merges them into a larger one. I am not very clear about the internal implementation details of the merge, but the default merging mechanism in Lucene is not a two-into-one merge: generally, several segments are merged into a larger segment at a time. Therefore, the larger the mergeFactor, the more memory is consumed and the faster indexing becomes.
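To make the effect of the merge factor more concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes a simplified level-based model: every flush creates one small segment, and as soon as mergeFactor segments of the same level exist they are merged into one segment of the next level. The class and method names are hypothetical, and Lucene's real merge policy is more involved than this.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of level-based segment merging; NOT Lucene's actual merge policy. */
final class MergeFactorSketch {

    /**
     * Simulates `flushes` buffer flushes. Each flush adds one level-0 segment;
     * whenever `mergeFactor` segments share a level, they are merged into a
     * single segment one level higher.
     * Returns {segments left on disk, merges performed}.
     */
    static int[] simulate(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Integer> perLevel = new ArrayList<>(); // perLevel.get(k) = segments at level k
        int merges = 0;
        for (int i = 0; i < flushes; i++) {
            int level = 0;
            while (true) {
                if (perLevel.size() <= level) {
                    perLevel.add(0);
                }
                int count = perLevel.get(level) + 1;
                if (count < mergeFactor) {
                    perLevel.set(level, count);
                    break;
                }
                // mergeFactor segments at this level: merge them and carry upward
                perLevel.set(level, 0);
                merges++;
                level++;
            }
        }
        int segments = 0;
        for (int c : perLevel) {
            segments += c;
        }
        return new int[] {segments, merges};
    }

    public static void main(String[] args) {
        for (int mergeFactor : new int[] {2, 10, 30}) {
            int[] r = simulate(1000, mergeFactor);
            System.out.println("mergeFactor=" + mergeFactor
                    + " segments=" + r[0] + " merges=" + r[1]);
        }
    }
}
```

In this toy model, 1000 flushes with a merge factor of 10 cause 111 merges and leave 1 segment, while a merge factor of 30 causes only 34 merges but leaves 14 segments. That is the trade-off in miniature: fewer merges while indexing, at the price of more segments on disk until the index is optimized.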
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
Console.WriteLine(writer.GetMergeFactor()); // The default value of mergeFactor is 10
writer.SetMergeFactor(30);
Console.WriteLine(writer.GetMergeFactor()); // After setting, the current value is 30
The default value of mergeFactor is 10. Setting a reasonable merge factor can speed up index construction.
2. maxMergeDocs
mergeFactor tunes performance at the level of segments, whereas maxMergeDocs, as the name suggests, works at the level of the documents that make up a segment. The maxMergeDocs parameter sets the largest segment, measured in documents, that may still be merged with other segments; a segment that has grown beyond this size is left alone by later merges. Small values keep individual merges, and the pauses they cause while indexing, short, which suits interactive indexing; large values suit batch indexing and give faster searches. The number of documents buffered in memory before a new segment file is written to the hard disk is controlled by a different parameter, maxBufferedDocs.
Note that maxBufferedDocs is disabled by default, because Lucene uses another parameter (ramBufferSizeMB) to decide when this buffer is written out. In fact, maxBufferedDocs and ramBufferSizeMB can be used together: as soon as either trigger condition is met, the buffered documents are written to the hard disk and a new segment file is generated.
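The "whichever limit is reached first" rule can be sketched as follows, in plain Java with no Lucene dependency. This is only a toy model: the class, the DISABLED constant, and the per-document byte sizes are assumptions made for illustration, and Lucene's real accounting of buffer memory is more detailed.

```java
/** Toy model of the two flush triggers; NOT Lucene's real RAM accounting. */
final class FlushTriggerSketch {

    /** Hypothetical marker meaning "this trigger is switched off". */
    static final int DISABLED = -1;

    /**
     * Counts how many flushes occur while buffering documents of the given
     * sizes (in bytes). A flush happens as soon as either enabled limit is
     * reached; the buffer is then emptied.
     */
    static int countFlushes(long[] docSizesBytes, int maxBufferedDocs, double ramBufferSizeMB) {
        long ramLimitBytes = (long) (ramBufferSizeMB * 1024 * 1024);
        int bufferedDocs = 0;
        long bufferedBytes = 0;
        int flushes = 0;
        for (long size : docSizesBytes) {
            bufferedDocs++;
            bufferedBytes += size;
            boolean docLimitHit = maxBufferedDocs != DISABLED && bufferedDocs >= maxBufferedDocs;
            boolean ramLimitHit = ramBufferSizeMB != DISABLED && bufferedBytes >= ramLimitBytes;
            if (docLimitHit || ramLimitHit) {
                flushes++;          // a new segment would be written here
                bufferedDocs = 0;
                bufferedBytes = 0;
            }
        }
        return flushes;
    }

    public static void main(String[] args) {
        long[] docs = new long[100];
        java.util.Arrays.fill(docs, 1024L * 1024L); // 100 documents of 1 MB each
        System.out.println("RAM limit only:   " + countFlushes(docs, DISABLED, 16));
        System.out.println("both limits set:  " + countFlushes(docs, 10, 16));
    }
}
```

With 100 documents of 1 MB each, a 16 MB buffer alone flushes 6 times; adding a limit of 10 buffered documents makes the document count the tighter condition and raises that to 10 flushes. Fewer flushes mean fewer small segments and less disk I/O, which is why larger buffers tend to index faster.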
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
Console.WriteLine(writer.GetMaxMergeDocs()); // The default value of maxMergeDocs is 2147483647 (that is, int.MaxValue)
writer.SetMaxMergeDocs(1024);
Console.WriteLine(writer.GetMaxMergeDocs()); // After setting, the current value is 1024
The default value of maxMergeDocs is the constant int.MaxValue (2147483647).
3. ramBufferSizeMB
This parameter works alongside maxBufferedDocs: it sets the maximum amount of memory, in megabytes, used to buffer indexed documents. When the buffered documents occupy this much memory, they are written to the hard disk. In general, the larger the value, the more memory is used and the faster the indexing.
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
Console.WriteLine(writer.GetRAMBufferSizeMB()); // The default value of ramBufferSizeMB is 16
writer.SetRAMBufferSizeMB(1024);
Console.WriteLine(writer.GetRAMBufferSizeMB()); // After setting, the current value is 1024
The default value of ramBufferSizeMB is 16.
Summary: from the above analysis, setting these three parameters appropriately lets indexing make full use of memory (in theory, the larger the values, the more memory is used, but they must be matched to the actual performance of the machine), which avoids frequent I/O operations and improves indexing speed.
II. Several common formats of index files
Here I mainly refer to this article and this very patient and detailed article. Although both have sat in my bookmarks for a long time, I had not fully understood the relationships between the files until now. Rereading them was still very rewarding; attitude determines everything. I will write more detailed study notes if I have time. Below I record several file formats that have been helpful in my actual programming.
1. .gen format and segments_N
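In a segments_N file name, the suffix N is a generation number that Lucene writes in base 36, and the file with the highest generation describes the current state of the index. A minimal sketch of picking that file from a directory listing, in plain Java with no Lucene dependency, might look like this. The class and method names are hypothetical, and Lucene's own lookup does more work (retries, and segments.gen as a fallback).

```java
import java.util.Arrays;
import java.util.List;

/** Toy helper for segments_N file names; NOT Lucene's own lookup logic. */
final class SegmentsFileSketch {

    private static final String PREFIX = "segments_";

    /** Returns the generation encoded in a segments_N name, or -1 if the name is not one. */
    static long generationOf(String fileName) {
        if (!fileName.startsWith(PREFIX) || fileName.length() == PREFIX.length()) {
            return -1;
        }
        try {
            // The suffix is a base-36 number (digits 0-9 followed by letters a-z).
            return Long.parseLong(fileName.substring(PREFIX.length()), Character.MAX_RADIX);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Returns the segments_N name with the highest generation, or null if there is none. */
    static String newest(List<String> fileNames) {
        String best = null;
        long bestGeneration = -1;
        for (String name : fileNames) {
            long generation = generationOf(name);
            if (generation > bestGeneration) {
                bestGeneration = generation;
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList("segments_9", "segments_a", "segments.gen", "_0.cfs");
        System.out.println(newest(listing));
    }
}
```

Because the suffix is base 36, segments_a (generation 10) is newer than segments_9, and segments_10 (generation 36) is newer than segments_z (generation 35); comparing the names as plain strings would get the second case wrong.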
A segments_N file records the segments that make up the index at a given commit, and at least one such file must exist once the index has been created. One index can contain several segments_N files at the same time, and opening the index means choosing one of them. The segments.gen file helps with that choice: it records the current generation, which is the N in segments_N (for the specific logic, refer to section 4.1.1 of this article).
2. .cfs format
The compound index file format. Index content can be very large and spread over many files, so the system has to keep many files open, which consumes a lot of system resources. Lucene therefore provides a single-file index format, the so-called compound file format. To store an index in compound format, call the SetUseCompoundFile(bool) method after the IndexWriter object is initialized; the .cfs file is produced when useCompoundFile is set to true.
IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
Console.WriteLine(writer.GetUseCompoundFile()); // The default value is true
writer.SetUseCompoundFile(false);
Console.WriteLine(writer.GetUseCompoundFile());
By default, IndexWriter sets useCompoundFile to true.
3. .lock format
As the name implies, .lock is the lock file type. After a normal optimize, no lock file is left among the index files. While you insert, modify, or delete documents through IndexWriter or IndexModifier, a .lock file is present. With earlier versions of the Lucene libraries, it is easy to run into exceptions when modifying an index from multiple threads, usually because of the lock file. Since we are talking about the lock mechanism, it is worth understanding Lucene's concurrency rules:
| Operation | Allowed? |
|---|---|
| Run multiple parallel searches on the same index | Yes |
| Run multiple parallel searches on an index that is being built, optimized, or merged with another index, or that is having documents deleted or updated | Yes |
| Add and update documents in the same index with multiple IndexWriter objects | No |
| Open an IndexWriter to add new documents to an index while an IndexReader that deleted documents from it has not yet been closed | No |
| Open an IndexReader to delete documents from an index while an IndexWriter that added new documents to it has not yet been closed | No |
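The "No" rows all come down to a single rule: only one object at a time may modify the index. A lock file can enforce this because creating a file that must not already exist is an atomic operation. The following is a minimal sketch of that idea in plain Java; it is not Lucene's lock implementation, and the class, the method names, and the file name write.lock are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Toy lock-file model: whoever creates the lock file first is the only writer. */
final class WriteLockSketch {

    /** Assumed lock file name, used here only for illustration. */
    static final String LOCK_NAME = "write.lock";

    /** Tries to take the lock; returns false if another writer already holds it. */
    static boolean tryObtain(Path indexDir) {
        try {
            // createFile is atomic: it fails if the file already exists.
            Files.createFile(indexDir.resolve(LOCK_NAME));
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Releases the lock by deleting the lock file (a no-op if it is absent). */
    static void release(Path indexDir) {
        try {
            Files.deleteIfExists(indexDir.resolve(LOCK_NAME));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("index-sketch");
        System.out.println("first writer:  " + tryObtain(dir));
        System.out.println("second writer: " + tryObtain(dir));
        release(dir);
        System.out.println("after release: " + tryObtain(dir));
        release(dir);
        Files.delete(dir);
    }
}
```

The sketch also shows why a writer that is not closed properly causes trouble: if release is never called, the lock file stays behind and every later attempt to obtain the lock fails until the file is removed.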
We can see that the concurrency rules place no restrictions on searching, but other operations on the index can still affect searches. For example, an index that has not been optimized after updates may be slower to search, and the search results may also be less reliable. Fortunately, real projects follow some policy for using indexes: for example, keep two copies of the same index, one for searching and the other for additions, deletions, modifications, and optimization, and switch between them regularly.
3. .fnm format
Contains the names of all fields in the documents.
4. .fdx and .fdt formats
The .fdt file stores the values of fields that have the Field.Store.YES attribute. The .fdx file is an index into the .fdt file, used to locate the stored field data of each document.
5. .tii and .tis formats
The .tis file stores the terms produced by word segmentation. The .tii file is an index into it, recording the position of entries in the .tis file.
6. deletable format
In a Lucene index, documents are not removed immediately when they are deleted. They are only marked as deleted and are physically removed the next time the index is merged or optimized. The per-segment deletion marks are kept in .del files. The deletable file itself, in older Lucene versions, lists index files that are no longer in use but could not yet be deleted, for example because they were still open; they are removed once that becomes possible.
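Deferred deletion can be illustrated with a toy segment that only marks documents as deleted and drops them when it is rewritten. This is a sketch in plain Java, not Lucene's code or on-disk format: the class is hypothetical, and the method names maxDoc and numDocs are borrowed from the IndexReader methods of the same names purely as an analogy.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy segment showing mark-then-purge deletion; NOT Lucene's on-disk format. */
final class DeferredDeleteSketch {

    private List<String> docs = new ArrayList<>();
    private BitSet deleted = new BitSet(); // bit i set = document i is marked deleted

    void add(String doc) {
        docs.add(doc);
    }

    /** Marks a document as deleted; the stored data is left untouched. */
    void delete(int docId) {
        if (docId < 0 || docId >= docs.size()) {
            throw new IndexOutOfBoundsException("no such document: " + docId);
        }
        deleted.set(docId);
    }

    /** Number of documents physically stored, including those marked deleted. */
    int maxDoc() {
        return docs.size();
    }

    /** Number of documents still visible to searches. */
    int numDocs() {
        return docs.size() - deleted.cardinality();
    }

    /** Stands in for a merge or optimize: rewrites the segment without deleted documents. */
    void merge() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) {
                live.add(docs.get(i));
            }
        }
        docs = live;
        deleted = new BitSet();
    }

    public static void main(String[] args) {
        DeferredDeleteSketch segment = new DeferredDeleteSketch();
        segment.add("a");
        segment.add("b");
        segment.add("c");
        segment.delete(1);
        System.out.println("before merge: stored=" + segment.maxDoc() + " live=" + segment.numDocs());
        segment.merge();
        System.out.println("after merge:  stored=" + segment.maxDoc() + " live=" + segment.numDocs());
    }
}
```

Until merge is called, the deleted document still occupies space, which is one reason an index that has seen many deletions benefits from being merged or optimized.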
A concise diagram circulating on the Internet is attached to illustrate the relationship between an index and its common file formats: