Lucene.Net (4.8.0) learning notes, part three: creating an index with IndexWriter and optimizing indexing speed

Source: Internet
Author: User

Foreword: I am currently using Lucene.Net and the Pangu word segmenter to implement full-text search, in a project that someone else built and that I am now migrating. The project is being moved to ASP.NET Core 2.0, but it uses Lucene 3.6.0, and its Pangu word segmenter also targets Lucene 3.6.0. Fortunately, Lucene.Net already has a version that supports Core 2.0 (4.8.0 beta), and someone is working on a matching Pangu port; it appears to be finished, but I have not tested it yet. I will mark the changes introduced by the Lucene upgrade in bold.

Lucene.Net 4.8.0

https://github.com/apache/lucenenet

Pangu word segmenter

https://github.com/LonghronShen/Lucene.Net.Analysis.PanGu/tree/netcore2.0

There are quite a lot of changes between Lucene.Net 3.6.0 and Lucene.Net 4.8.0. Here I record the problems I ran into during development, in the hope of helping others who need to upgrade Lucene.Net. This is also my first contact with Lucene, so I hope these notes are useful to other beginners as well.

One. Creating an index in Lucene: IndexWriter

1. Introduction to IndexWriter

IndexWriter is used to create and maintain indexes. Creating an IndexWriter: in Lucene 4.8.0, constructing an IndexWriter object requires an IndexWriterConfig parameter, which is used to set the IndexWriter's properties:

new IndexWriter(dir, _indexWriterConfig)

The code above creates a basic IndexWriter object. Every IndexWriter needs two things: 1. the index directory it operates on, dir; 2. a word breaker, the analyzer. Note that the analyzer used by IndexWriter and the analyzer used when searching with IndexSearcher should be the same; otherwise the search results will be affected.
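As a minimal sketch of the two required pieces, the following C# program opens a directory, picks an analyzer, and writes one document. It assumes the Lucene.Net 4.8.0 beta and Lucene.Net.Analysis.Common packages are referenced. StandardAnalyzer stands in for the Pangu analyzer, and the "index" path and the field names are hypothetical.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class BasicWriterExample
{
    public static void Main()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        // 1. The index directory to operate on ("index" is a hypothetical path).
        using (var dir = FSDirectory.Open("index"))
        {
            // 2. The analyzer; use the same kind of analyzer when searching.
            var analyzer = new StandardAnalyzer(version);
            var config = new IndexWriterConfig(version, analyzer);

            using (var writer = new IndexWriter(dir, config))
            {
                var doc = new Document();
                // Hypothetical fields: an exact-match id and an analyzed body.
                doc.Add(new StringField("id", "1", Field.Store.YES));
                doc.Add(new TextField("body", "hello lucene", Field.Store.YES));
                writer.AddDocument(doc);
                writer.Commit();
            }
        }
    }
}
```

Swapping in the Pangu analyzer should only change the line that constructs the analyzer, provided the port exposes a standard Analyzer subclass.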

Through IndexWriterConfig we can set IndexWriter properties so that the index is built the way we need. The following properties affect how fast IndexWriter writes the index:

IndexWriterConfig.SetRAMBufferSizeMB(double); IndexWriterConfig.SetMaxBufferedDocs(int); IndexWriterConfig.SetMergePolicy(MergePolicy)

SetRAMBufferSizeMB() sets a memory threshold: when the documents buffered by IndexWriter use more than RAMBufferSizeMB of memory, IndexWriter flushes the buffered operations to the hard disk. Specifically, IndexWriter's AddDocuments (add documents), DeleteDocuments (delete documents), and UpdateDocuments (update documents) operations are first buffered in memory. In other words, right after these methods run, the stored index directory has not changed; only when the buffered data exceeds the threshold above are the operations written to the disk where the index is stored. The default, DEFAULT_RAM_BUFFER_SIZE_MB, is 16 MB.
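The buffering described above can be observed with a small sketch that lists the directory's files before and after a commit. It uses RAMDirectory only to stay self-contained (an assumption; the original works with an on-disk directory), and the field name is hypothetical. With a single tiny document the buffer threshold is not reached, so the first listing is expected to show no segment files, although the exact file names depend on the codec.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class BufferedWritesExample
{
    public static void Main()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        using (var dir = new RAMDirectory())
        using (var writer = new IndexWriter(dir,
            new IndexWriterConfig(version, new StandardAnalyzer(version))))
        {
            var doc = new Document();
            doc.Add(new TextField("body", "buffered in memory first", Field.Store.NO));
            writer.AddDocument(doc);

            // The document is still held in the writer's RAM buffer here.
            Console.WriteLine("Files before Commit: " + string.Join(", ", dir.ListAll()));

            // Commit flushes the buffer and writes a new commit point.
            writer.Commit();
            Console.WriteLine("Files after Commit:  " + string.Join(", ", dir.ListAll()));
        }
    }
}
```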

SetMaxBufferedDocs() sets a document-count threshold: when IndexWriter has buffered more documents than MaxBufferedDocs, it writes the in-memory documents to the hard disk and generates a new index segment. Lucene's index structure is described below.

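SetMergePolicy() sets the policy used for merging index segments. MergePolicy also defines a constant, DEFAULT_MAX_CFS_SEGMENT_SIZE, which is the default upper bound on the size of a segment that may be stored in compound-file (CFS) format; it does not limit the number of segment files in the index.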

1.1 Increase the speed of the index

The previous section introduced three IndexWriterConfig properties. IndexWriter only starts writing buffered operations to the hard disk when the buffer reaches a certain limit, so the higher we set that limit, the faster indexing is. If RAMBufferSizeMB and MaxBufferedDocs are set larger, IndexWriter writes to the hard disk fewer times, and less time is spent writing the index to disk.
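A sketch of raising both limits for a bulk load is shown below. It uses the property form (RAMBufferSizeMB, MaxBufferedDocs) that Lucene.Net 4.8.0 exposes on IndexWriterConfig. The values 256 MB and 100,000 documents are illustrative assumptions, not recommendations from the original post, and StandardAnalyzer again stands in for the Pangu analyzer.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class BulkIndexingConfig
{
    // Builds a config that flushes less often than the defaults.
    public static IndexWriterConfig Create()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;
        var config = new IndexWriterConfig(version, new StandardAnalyzer(version));

        // Flush when buffered data uses about 256 MB (the default is 16 MB).
        config.RAMBufferSizeMB = 256.0;

        // Also flush after 100,000 buffered documents; the writer flushes
        // on whichever of the two limits is reached first.
        config.MaxBufferedDocs = 100000;

        return config;
    }
}
```

A larger buffer trades memory for fewer flushes, so the limit should stay well below the memory actually available to the process.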

After IndexWriter has written the index, there will be many segment files in the index directory. When the number of segment files reaches MergeFactor (the merge factor), IndexWriter merges them into a new segment, similar to compaction. The more segment files there are in the index directory, the slower searches are; the fewer segment files, the faster searches are. So a smaller MergeFactor gives faster searches but more frequent segment merging and therefore slower indexing, while a larger MergeFactor gives faster indexing but slower searches.
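In Lucene.Net 4.8.0 the merge factor is a property of the log-based merge policies rather than of IndexWriter itself, so one way to set it is sketched below with LogByteSizeMergePolicy. The default policy in 4.x is TieredMergePolicy, which has its own settings (SegmentsPerTier, MaxMergeAtOnce), so choosing a log policy here is an assumption made to match the text.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Util;

public static class MergeFactorConfig
{
    // A smaller mergeFactor keeps fewer segments (faster search, slower
    // indexing); a larger one merges less often (faster indexing).
    public static IndexWriterConfig Create(int mergeFactor)
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;
        var config = new IndexWriterConfig(version, new StandardAnalyzer(version));

        config.MergePolicy = new LogByteSizeMergePolicy
        {
            MergeFactor = mergeFactor
        };

        return config;
    }
}
```

For example, MergeFactorConfig.Create(10) keeps the log policy's usual default of 10, while a value such as 30 favors bulk indexing speed.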

2. Structure of the index file

The files under an index directory (shown as a picture in the original post) are organized as follows:

Index -> Segment -> Document -> Field -> Term

In the picture above there is only one segment: _v6.fdt, _v6.fdx, and so on all belong to segment _v6. segments_5u and segments.gen are the segments metadata files; they hold attribute information about the segments.

    • xxx.fnm saves how many fields this segment contains, and the name and indexing options of each field.
    • xxx.fdx and xxx.fdt save all the documents contained in this segment, how many fields each document contains, and what information is stored for each field.
    • xxx.tvx, xxx.tvd, and xxx.tvf save how many documents this segment contains, how many fields each document contains, how many terms each field contains, and the string, position, and related information of each term.

The files above hold the forward information; the inverted (reverse) information is not covered in detail here.
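To make the naming convention concrete, the sketch below groups file names by the segment they belong to. It is plain C# with no Lucene dependency. The file list is hard-coded from the names mentioned above, except `_v6.fnm`, which is an assumed example of the .fnm file, and the helper name SegmentNameOf is hypothetical.

```csharp
using System;
using System.Linq;

public static class SegmentFilesExample
{
    // Returns the segment name for a per-segment file such as "_v6.fdt",
    // or null for index-level files such as "segments_5u".
    public static string SegmentNameOf(string fileName)
    {
        if (fileName.Length == 0 || fileName[0] != '_') return null;

        // The name ends at the next '_' (codec suffix) or, failing that, at the '.'.
        int end = fileName.IndexOf('_', 1);
        if (end < 0) end = fileName.IndexOf('.');
        return end < 0 ? fileName : fileName.Substring(0, end);
    }

    public static void Main()
    {
        string[] files = { "_v6.fdt", "_v6.fdx", "_v6.fnm", "segments_5u", "segments.gen" };

        foreach (var group in files.GroupBy(f => SegmentNameOf(f) ?? "(index-level)"))
            Console.WriteLine(group.Key + ": " + string.Join(", ", group));
    }
}
```

With the list above this prints one line for `_v6` with its three files and one line for the two index-level segments files.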

3. Optimization with IndexWriter

In earlier versions of Lucene, IndexWriter.Optimize was used to optimize the index. In Lucene 4.8.0 the method is named ForceMerge, a name chosen to discourage calling it casually. Optimizing with IndexWriter really means merging segment files. You can pass a parameter, ForceMerge(segments), which merges the index directory down to at most that many segments. The smaller the parameter, the more files are merged and the more time and space are consumed. The purpose of merging is to make searches faster.

During optimization you need about twice the current index size in disk space. For example, if your index is 40 GB, it will grow to more than 80 GB while the merge runs, and then shrink to a little over 30 GB once merging finishes. When the index is not updated very often, you can optimize it. If it is updated very frequently, calling ForceMerge is inefficient; in that case, set the MergeFactor mentioned above so that the index has fewer segment files.
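The following sketch measures the current index size, prints the rough headroom suggested by the twice-the-size rule above, and then merges down to a single segment. The helper names IndexSizeBytes and MergeDown are hypothetical, RAMDirectory is used only to keep the example self-contained, and the tiny documents are placeholders.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class ForceMergeExample
{
    // Sums the length of every file currently in the directory.
    public static long IndexSizeBytes(Lucene.Net.Store.Directory dir)
    {
        long total = 0;
        foreach (string name in dir.ListAll())
            total += dir.FileLength(name);
        return total;
    }

    // Merges the index down to at most maxSegments segments and commits.
    public static void MergeDown(IndexWriter writer, int maxSegments)
    {
        writer.ForceMerge(maxSegments);
        writer.Commit();
    }

    public static void Main()
    {
        const LuceneVersion version = LuceneVersion.LUCENE_48;

        using (var dir = new RAMDirectory())
        using (var writer = new IndexWriter(dir,
            new IndexWriterConfig(version, new StandardAnalyzer(version))))
        {
            // Commit after each document so that several segments exist.
            for (int i = 0; i < 5; i++)
            {
                var doc = new Document();
                doc.Add(new TextField("body", "document number " + i, Field.Store.YES));
                writer.AddDocument(doc);
                writer.Commit();
            }

            long before = IndexSizeBytes(dir);
            Console.WriteLine("Size before merge: " + before + " bytes");
            Console.WriteLine("Rough space needed during merge: " + (2 * before) + " bytes");

            MergeDown(writer, 1);
            Console.WriteLine("Size after merge: " + IndexSizeBytes(dir) + " bytes");
        }
    }
}
```

On a real on-disk index, the headroom figure would be compared against the free space of the volume before calling ForceMerge.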

4. Precautions when using IndexWriter

1. IndexWriter creates a lock file, write.lock, when operating on an index. If another IndexWriter then tries to open the same directory, an error (LockObtainFailedException) is raised.

2. IndexWriter instances are fully thread-safe; any of their methods can be called by multiple threads at the same time.
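Both precautions can be illustrated in one sketch: several threads add documents through a single shared IndexWriter, and a second IndexWriter on the same directory is rejected while the first is open. RAMDirectory, the document count, and the field name are assumptions for illustration. Each writer gets its own IndexWriterConfig because a config instance cannot be shared between writers.

```csharp
using System;
using System.Threading.Tasks;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Util;

public static class WriterPrecautionsExample
{
    private const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;

    // A fresh config per writer; configs must not be reused across writers.
    private static IndexWriterConfig NewConfig() =>
        new IndexWriterConfig(MatchVersion, new StandardAnalyzer(MatchVersion));

    public static void Main()
    {
        using (var dir = new RAMDirectory())
        using (var writer = new IndexWriter(dir, NewConfig()))
        {
            // Precaution 2: many threads may call the same writer concurrently.
            Parallel.For(0, 1000, i =>
            {
                var doc = new Document();
                doc.Add(new StringField("id", i.ToString(), Field.Store.YES));
                writer.AddDocument(doc);
            });
            writer.Commit();

            using (var reader = DirectoryReader.Open(dir))
                Console.WriteLine("Documents indexed: " + reader.NumDocs);

            // Precaution 1: a second writer cannot obtain the write lock.
            try
            {
                using (new IndexWriter(dir, NewConfig())) { }
            }
            catch (LockObtainFailedException ex)
            {
                Console.WriteLine("Second writer rejected: " + ex.Message);
            }
        }
    }
}
```

The second constructor call waits for the write-lock timeout before failing, so a short pause before the exception is expected. In an application the usual design is one long-lived IndexWriter per index directory, shared by all threads.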
