I came across an overseas article online that presents tips for improving Lucene indexing speed, and I am sharing it with you here.
First, let's look at the main parameters that affect indexing speed:
MaxBufferedDocs
This parameter determines how many documents are buffered in the in-memory index. When this number is reached, the in-memory index is written to disk as a new index segment file.
So this parameter acts as a memory buffer. In general, the larger the value, the faster the indexing.
MaxBufferedDocs is disabled by default, because Lucene uses another parameter (RAMBufferSizeMB) to control how much is held in this buffer.
In fact, MaxBufferedDocs and RAMBufferSizeMB can be used together. Whichever trigger condition is met first causes the buffer to be written to disk as a new index segment file.
RAMBufferSizeMB
Controls the upper limit on the memory used to buffer indexed documents. When the buffered documents reach this limit, they are written to disk. Again, in general, the larger the value, the faster the indexing.
This parameter is especially useful when document sizes are unpredictable, because it helps avoid OutOfMemoryError.
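To make the "whichever condition is met first" rule concrete, here is a minimal sketch in plain Java. It does not call Lucene at all: FlushTriggerSketch, countFlushes and DISABLED are names made up for illustration, and counting raw document bytes is only a rough stand-in for how a real buffer measures its memory use.

```java
import java.util.Collections;
import java.util.List;

/** Toy model (not Lucene code) of a buffer that flushes on whichever limit is hit first. */
class FlushTriggerSketch {
    static final int DISABLED = -1; // hypothetical marker for "this trigger is switched off"

    /** Returns how many segments would be written for the given document sizes in bytes. */
    static int countFlushes(List<Integer> docSizes, int maxBufferedDocs, double ramBufferSizeMB) {
        long ramLimitBytes = (long) (ramBufferSizeMB * 1024 * 1024);
        int bufferedDocs = 0;
        long bufferedBytes = 0;
        int flushes = 0;
        for (int size : docSizes) {
            bufferedDocs++;
            bufferedBytes += size;
            boolean docLimitHit = maxBufferedDocs > 0 && bufferedDocs >= maxBufferedDocs;
            boolean ramLimitHit = ramBufferSizeMB > 0 && bufferedBytes >= ramLimitBytes;
            if (docLimitHit || ramLimitHit) { // either trigger is enough
                flushes++;
                bufferedDocs = 0;
                bufferedBytes = 0;
            }
        }
        if (bufferedDocs > 0) {
            flushes++; // whatever is still buffered is written when indexing ends
        }
        return flushes;
    }

    public static void main(String[] args) {
        List<Integer> docs = Collections.nCopies(10, 512 * 1024); // ten documents of 512 KB
        System.out.println("doc count only:  " + countFlushes(docs, 4, DISABLED));
        System.out.println("memory only:     " + countFlushes(docs, DISABLED, 1.0));
        System.out.println("both configured: " + countFlushes(docs, 4, 1.0));
    }
}
```

With ten 512 KB documents, a limit of 4 documents alone gives 3 segments, a 1 MB limit alone gives 5, and with both set the memory limit always fires first, so the result is again 5. This is why mixing the two settings can make flush behaviour hard to predict.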
MergeFactor
This parameter governs the merging of sub-indexes (segments).
In Lucene, index data is first written to memory. Once certain limits are reached, it is written to disk as an independent sub-index, which Lucene calls a segment. In general, these sub-indexes need to be merged into a single index, that is, optimize(). Otherwise search speed may suffer and you may also run into "too many open files" errors.
MergeFactor controls how many sub-indexes may accumulate on disk before they are merged into a somewhat larger index.
MergeFactor must not be set too large, especially when MaxBufferedDocs is relatively small (which produces more segments); otherwise you may get "too many open files" errors, or even errors outside the virtual machine.
Note: Lucene's default merge mechanism does not merge segments two at a time; it appears to merge several segments at once into a larger index. So the larger MergeFactor is, the more memory is consumed and the faster indexing goes. In my experience, though, with a very large value such as 300 the final merge is still slow. For batch indexing, MergeFactor should be greater than 10.
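The trade-off can be illustrated with a rough model in plain Java. This is only a sketch under a simplifying assumption, namely that whenever mergeFactor segments pile up at one level they are merged into a single segment at the next level. It is not Lucene's actual merge policy, and MergeFactorSketch and simulate are made-up names.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Lucene code) of level-by-level segment merging. */
class MergeFactorSketch {
    /**
     * Simulates the given number of flushes.
     * Returns {segments left on disk, merges performed, largest settled segment count}.
     */
    static int[] simulate(int flushes, int mergeFactor) {
        if (mergeFactor < 2) {
            throw new IllegalArgumentException("mergeFactor must be at least 2");
        }
        List<Integer> perLevel = new ArrayList<>(); // perLevel.get(i) = segments at level i
        int onDisk = 0;
        int merges = 0;
        int peak = 0;
        for (int i = 0; i < flushes; i++) {
            onDisk++; // each flush writes one new level-0 segment
            int level = 0;
            while (true) {
                if (perLevel.size() <= level) {
                    perLevel.add(0);
                }
                int count = perLevel.get(level) + 1;
                if (count < mergeFactor) {
                    perLevel.set(level, count);
                    break;
                }
                // mergeFactor segments at this level become one segment at the next level
                perLevel.set(level, 0);
                onDisk -= mergeFactor - 1;
                merges++;
                level++;
            }
            peak = Math.max(peak, onDisk); // measured after the merges triggered by this flush
        }
        return new int[] {onDisk, merges, peak};
    }

    public static void main(String[] args) {
        for (int mergeFactor : new int[] {10, 50}) {
            int[] r = simulate(100, mergeFactor);
            System.out.println("mergeFactor=" + mergeFactor + " segments=" + r[0]
                    + " merges=" + r[1] + " peak=" + r[2]);
        }
    }
}
```

In this model, 100 flushes with a merge factor of 10 need 11 merges and never leave more than 18 segments on disk, while a merge factor of 50 needs only 2 merges but lets 50 segments accumulate. Fewer merges means faster indexing, and more segments means more open files and slower searches.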
Tips for accelerating indexing:
• Make sure you are using the latest Lucene version.
• Use a local file system where possible.
Remote file systems generally slow down indexing. If the index must live on a remote server, build it locally first and then copy it to the remote server.
• Use faster hardware, especially faster I/O devices.
• Reuse a single IndexWriter instance while indexing.
• Flush based on memory consumption rather than document count.
In Lucene 2.2 and earlier, you can call the ramSizeInBytes method after each document is added and call flush() when the index is using too much memory. This is especially effective when indexing many small documents or documents of varying size. You must first set maxBufferedDocs to a value large enough to keep the writer from flushing based on document count. But do not set it too large, or you will run into the LUCENE-845 bug. That bug has been fixed in version 2.3.
In Lucene 2.3 and later, IndexWriter can call flush() automatically based on memory consumption. Use writer.setRAMBufferSizeMB() to set the buffer size. If you plan to flush based on memory size, make sure MaxBufferedDocs is not set anywhere else; otherwise the flush condition becomes unpredictable (whichever condition is met first triggers the flush).
• Use as much memory as you can afford.
Using more memory before each flush means Lucene writes larger segments during indexing and performs fewer merges. In the tests for LUCENE-843, about 48 MB turned out to be a good value. The best value for your program may differ, and it also depends on the machine. Run your own tests and pick a suitable trade-off.
• Disable the compound file format.
Call setUseCompoundFile(false) to turn off the compound file option. Building compound files takes extra time (the tests for LUCENE-888 measured an increase of roughly 7% to 33%). Note, however, that disabling it greatly increases the number of file handles used for searching and indexing. If the merge factor is also large, you may run out of file handles.
• Reuse Document and Field instances.
Lucene 2.3 added a setValue method that lets you change the value of a Field. The benefit is that you can reuse a Field instance throughout the indexing process, which greatly reduces the GC burden.
It is best to create a single Document instance and add the desired fields to it, keeping references to the Field instances you added. For each new document, call the appropriate setValue method on each Field to change its value, then add the Document to the index again.
Note: you cannot share one Field instance among multiple fields in a document, and a Field's value must not be changed until the document has been indexed. In other words, if you have three fields, you must create three Field instances and then reuse them for each Document you add.
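The reuse pattern looks roughly like this. The sketch is plain Java with no Lucene dependency: MutableField and RecordingWriter are hypothetical stand-ins for Field and IndexWriter, and the stand-in writer simply copies the values out at the moment a document is added.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model (not Lucene code) of reusing one document and its fields for every row. */
class FieldReuseSketch {
    /** Hypothetical stand-in for a field whose value can be replaced in place. */
    static final class MutableField {
        final String name;
        String value;

        MutableField(String name) { this.name = name; }

        void setValue(String value) { this.value = value; }
    }

    /** Hypothetical stand-in for a writer; it copies field values out on addDocument. */
    static final class RecordingWriter {
        final List<Map<String, String>> indexed = new ArrayList<>();

        void addDocument(List<MutableField> doc) {
            Map<String, String> snapshot = new LinkedHashMap<>();
            for (MutableField field : doc) {
                snapshot.put(field.name, field.value);
            }
            indexed.add(snapshot);
        }
    }

    /** Indexes rows of {id, title} using exactly two field objects in total. */
    static List<Map<String, String>> indexRows(String[][] rows) {
        // One instance per field name, created once before the loop.
        MutableField id = new MutableField("id");
        MutableField title = new MutableField("title");
        List<MutableField> doc = new ArrayList<>();
        doc.add(id);
        doc.add(title);

        RecordingWriter writer = new RecordingWriter();
        for (String[] row : rows) {
            id.setValue(row[0]);     // values change only between addDocument calls
            title.setValue(row[1]);
            writer.addDocument(doc); // the same document object is added again
        }
        return writer.indexed;
    }

    public static void main(String[] args) {
        String[][] rows = {{"1", "first"}, {"2", "second"}};
        System.out.println(indexRows(rows));
    }
}
```

Note that id and title each get their own instance, which matches the rule above that one Field instance cannot serve two fields of the same document.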
• Use a single Token instance in your Analyzer.
Sharing a single Token instance in the analyzer also relieves GC pressure.
• Use the char[] interface of Token instead of the String interface to represent data.
In Lucene 2.3, Token can represent its data as a char array. This avoids the cost of constructing Strings and garbage-collecting them. By using a single Token instance together with the char[] interface, you can avoid creating new objects.
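As a rough illustration of both ideas together, the following plain-Java sketch refills one token object, backed by a char array, for every term. It is not Lucene's Token class: CharToken and WhitespaceScanner are hypothetical names, and splitting on whitespace with lowercasing is just an arbitrary example of analysis.

```java
import java.util.Arrays;

/** Toy model (not Lucene code) of a single reusable token backed by a char array. */
class ReusableTokenSketch {
    /** Hypothetical stand-in for a token that exposes its text as char[] plus a length. */
    static final class CharToken {
        char[] buffer = new char[16];
        int length;

        void append(char c) {
            if (length == buffer.length) {
                buffer = Arrays.copyOf(buffer, buffer.length * 2); // grows rarely, then is reused
            }
            buffer[length++] = c;
        }
    }

    /** Splits on whitespace and lowercases, refilling the same CharToken for every term. */
    static final class WhitespaceScanner {
        private final char[] text;
        private int pos;
        private final CharToken token = new CharToken(); // the single shared instance

        WhitespaceScanner(String input) { this.text = input.toCharArray(); }

        /** Returns the shared token holding the next term, or null at end of input. */
        CharToken next() {
            while (pos < text.length && Character.isWhitespace(text[pos])) {
                pos++;
            }
            if (pos == text.length) {
                return null;
            }
            token.length = 0; // reset instead of allocating a new token
            while (pos < text.length && !Character.isWhitespace(text[pos])) {
                token.append(Character.toLowerCase(text[pos++]));
            }
            return token;
        }
    }

    /** Returns {number of terms, total characters} without building a String per term. */
    static int[] countTermsAndChars(String input) {
        WhitespaceScanner scanner = new WhitespaceScanner(input);
        int terms = 0;
        int chars = 0;
        for (CharToken t = scanner.next(); t != null; t = scanner.next()) {
            terms++;
            chars += t.length;
        }
        return new int[] {terms, chars};
    }

    public static void main(String[] args) {
        int[] r = countTermsAndChars("Fast Lucene indexing");
        System.out.println(r[0] + " terms, " + r[1] + " chars");
    }
}
```

Because the token is shared, a consumer has to use or copy its contents before asking for the next term.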
• Set autoCommit to false.
Lucene 2.3 includes substantial optimizations for documents with stored fields and term vectors, which save time during large index merges. To see the benefit, set autoCommit to false on your single reused IndexWriter instance. Note that searchers will then not see any index updates until the IndexWriter is closed. If that matters to you, you can keep autoCommit set to true, or close and reopen your writer periodically.
• If you are indexing many small text fields, consider combining them into one large contents field and indexing only that field. (You can still store the individual fields.)
• Increase mergeFactor, but bigger is not always better.
A larger merge factor postpones segment merges, which speeds up indexing because merging is a very time-consuming part of indexing. However, it slows down searching. You may also run out of file handles if the merge factor is too large. A value that is too large can even slow indexing down, because more segments are merged at the same time, which greatly increases the load on the hard disk.
• Turn off any features you are not actually using.
If you store fields but never use them at query time, do not store them. The same goes for term vectors. If you index many fields, turning off unneeded features for those fields helps indexing speed considerably.
• Use a faster Analyzer.
Analyzing a document can take a long time. StandardAnalyzer, for example, is quite time-consuming, especially before Lucene 2.3. Try a simpler, faster analyzer that still meets your needs.
• Speed up document construction.
The data for a document often comes from an external source (a database, a file system, or a crawler fetching pages from a website). These sources are usually slow, so try to optimize their performance.
• Do not optimize the index unless you really need to (only when you need faster searches).
• Share one IndexWriter across multiple threads.
Modern hardware is well suited to high concurrency (multi-core CPUs, multi-channel memory architectures, and so on), so adding documents from multiple threads greatly improves performance. Even on a very old machine, adding documents concurrently makes better use of I/O and CPU. Test with different numbers of threads to find the optimal value.
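The threading pattern can be sketched as follows in plain Java. CountingWriter is a hypothetical, thread-safe stand-in for the shared writer and only counts the documents it receives; the point is that every worker uses the same instance rather than opening its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model (not Lucene code) of several threads feeding one shared writer. */
class SharedWriterSketch {
    /** Hypothetical thread-safe stand-in for the shared writer. */
    static final class CountingWriter {
        private final AtomicInteger added = new AtomicInteger();

        void addDocument(String doc) { added.incrementAndGet(); }

        int docCount() { return added.get(); }
    }

    /** Adds every document through one shared writer and returns how many were added. */
    static int indexAll(List<String> docs, int threads) {
        CountingWriter writer = new CountingWriter(); // one instance for all workers
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String doc : docs) {
            pool.submit(() -> writer.addDocument(doc));
        }
        pool.shutdown(); // accept no new tasks, let the queued ones finish
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return writer.docCount();
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            docs.add("doc-" + i);
        }
        System.out.println(indexAll(docs, 4) + " documents added");
    }
}
```

The thread count is passed in as a parameter so that it is easy to rerun the same workload with different values and compare timings.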
• Split the documents into groups, index them on different machines, and then merge the results.
If you have a very large number of text documents to index, you can divide them into several groups, index each group on a separate machine, and then use writer.addIndexesNoOptimize to merge them into the final index.
• Run a profiler.
If none of the suggestions above helps, run a profiler to find out which part of your program is taking the time. The result is often surprising.
http://wiki.apache.org/jakarta-lucene/ImproveIndexingSpeed