Lucene optimization measures and research inspiration

4.4.1 Index Process Optimization

Indexing generally falls into two cases: small-batch index extension and large-batch index rebuilding. During indexing, Lucene does not rewrite the index files every time a new Document is added, because file I/O is a very resource-consuming operation.

Lucene builds the index in memory and writes files out in batches. The larger the batch interval, the fewer file writes, but the more memory is used; conversely, a small interval uses little memory but makes file I/O frequent and indexing slow. IndexWriter has a MERGE_FACTOR parameter that lets you make full use of memory and reduce file operations according to your application environment. In my experience, the default indexer writes to disk once every 20 records; increasing MERGE_FACTOR fifty-fold roughly doubles indexing speed.
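As a rough illustration, the sketch below tunes these knobs through the older IndexWriter API this article describes; the index path and parameter values are example assumptions, and newer Lucene releases expose the same settings through IndexWriterConfig and LogMergePolicy instead.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Create a fresh index on disk; the path is only an example.
        IndexWriter writer = new IndexWriter("/tmp/demo-index", new StandardAnalyzer(), true);

        // Trade memory for fewer flushes: a larger merge factor keeps more
        // work buffered in memory before the writer touches the file system.
        writer.setMergeFactor(50);          // example value, tune for your heap
        writer.setMaxBufferedDocs(1000);    // buffer more documents per flush

        Document doc = new Document();
        doc.add(new Field("body", "sample text to index",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);

        writer.optimize();  // merge segments once, after the whole batch
        writer.close();
    }
}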

4.4.2 Search Process Optimization

Lucene supports in-memory indexes: such searches are an order of magnitude faster than file-based I/O. It is also worth minimizing the creation of IndexSearcher objects and caching search results at the front end.
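A minimal sketch of both ideas, assuming the older Lucene API and an example index path: the on-disk index is loaded into a RAMDirectory so queries avoid file I/O, and a single IndexSearcher is built once and reused rather than recreated per query.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class MemorySearch {
    // Build once and share across requests instead of creating a new
    // IndexSearcher for every query.
    private static IndexSearcher searcher;

    public static void main(String[] args) throws Exception {
        // Load the on-disk index into memory; searches then avoid file I/O.
        RAMDirectory ramDir = new RAMDirectory("/tmp/demo-index");
        searcher = new IndexSearcher(ramDir);

        Query query = new QueryParser("body", new StandardAnalyzer()).parse("sample");
        Hits hits = searcher.search(query);
        System.out.println("matches: " + hits.length());
    }
}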

Lucene's optimization for full-text search is that after the first search of the index it does not read out the full content of every matching record (Document); it only puts the IDs of the 100 best-matching results (TopDocs) into a result-set cache and returns them. Compare this with a database search: for a result set of 10,000 records, the database must fetch the content of all of them before returning the result set to the application. So even when the total number of matches is large, Lucene's result set does not occupy much memory. Typical fuzzy-search applications never use that many results anyway; the first 100 entries already satisfy more than 90% of search needs.

When the first batch of cached results is used up and results further down are needed, the Searcher searches again and builds a cache twice the size of the previous one, then fetches forward again. So if you construct a Searcher to read results 1-120, the Searcher actually performs two searches: after the first 100 entries are consumed the cache is exhausted, so the Searcher re-searches and builds a cache of 200 results, and so on to 400 and then 800 cached entries. Because these caches become unreachable once each Searcher object is discarded, you may want to cache the result records yourself. Keep the number of cached records at 100 or below to make full use of the first result cache and keep Lucene from wasting effort on repeated searches; results can also be cached in tiers.
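A minimal sketch of that advice, again assuming the older Hits-based API (class and method names such as cacheTopIds are illustrative only): fetch at most the first 100 document IDs from a query, cache them on the application side, and load the stored fields of a Document only when it is actually displayed.

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

import java.util.ArrayList;
import java.util.List;

public class FirstPageCache {
    // Staying at or below 100 entries keeps us inside the first internal
    // cache fill, so Lucene never has to re-run the search.
    private static final int PAGE_LIMIT = 100;

    static List<Integer> cacheTopIds(IndexSearcher searcher, Query query) throws Exception {
        Hits hits = searcher.search(query);
        int n = Math.min(hits.length(), PAGE_LIMIT);

        List<Integer> ids = new ArrayList<Integer>(n);
        for (int i = 0; i < n; i++) {
            ids.add(hits.id(i));          // cache document ids, not full Documents
        }
        return ids;
    }

    static Document load(IndexSearcher searcher, int docId) throws Exception {
        return searcher.doc(docId);       // read stored fields only when displaying
    }
}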

4.5 Research Inspiration from Lucene

Lucene is a model of object-oriented design, which shows mainly in the following aspects:

Every concern is wrapped in an extra abstraction layer for later extension and reuse: you can reach your goal by re-implementing a component, without touching the other modules;

Simple application entry points (Searcher, Indexer) call a series of underlying components that cooperate to complete the search task;

Every object has a very specific task. In the search process, for example, QueryParser parses a query statement into a combination of precise queries (Query), the low-level index-reading structure IndexReader reads the index, and the corresponding Scorer scores and sorts the search results. All functional modules are highly atomic, so each can be re-implemented without modifying the other modules.
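To make that division of labor concrete, here is a sketch of a complete search pass through those objects, assuming the older Lucene API and an example index path:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchPipeline {
    public static void main(String[] args) throws Exception {
        // 1. QueryParser turns the user's query string into a Query tree.
        Query query = new QueryParser("body", new StandardAnalyzer())
                .parse("lucene AND optimization");

        // 2. IndexReader is the low-level structure that reads the index files.
        IndexReader reader = IndexReader.open("/tmp/demo-index");

        // 3. IndexSearcher runs the Query against the reader; scoring and
        //    sorting of matches happen inside the search call.
        IndexSearcher searcher = new IndexSearcher(reader);
        Hits hits = searcher.search(query);
        System.out.println("hits: " + hits.length());

        searcher.close();
        reader.close();
    }
}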

In addition to its flexible application interface design, Lucene also provides language analyzer implementations suitable for most applications (SimpleAnalyzer, StandardAnalyzer), which is one of the main reasons new users can get started quickly. These strengths are well worth learning from in future development. As a general-purpose toolkit, Lucene is indeed convenient for developers who need to embed full-text search in an application.

In addition, learning and using Lucene gave me a deeper understanding of why many database optimization guidelines exist. For example, indexing fields improves query speed, but too many indexes slow down updates to database tables, and sort conditions over large result sets are often performance killers. Many commercial databases provide optimization parameters for bulk insert operations whose role is similar to the indexer's merge_factor. A larger number of results does not mean better quality: especially for a large returned result set, optimizing the quality of the first few results is usually what matters most. And try to have the application fetch a small result set from the database, because even for large databases, random access to a result set is a resource-consuming operation.
