Improvements in Apache Lucene 2.9

Source: Internet
Author: User

Most of the work in Lucene 2.9 focuses on performance, from low-level internal infrastructure up to the way indexes are managed. A Lucene index is made up of a series of separate "segments", each stored independently on disk. Adding documents to an index creates new segments, which can later be merged. Lucene caches field information in the FieldCache. In Lucene 2.4 and earlier, loading the field cache was expensive, in particular because the entire field cache was reloaded over and over as the index changed. While preparing the 2.9 release, the Lucene team realized that newer segments change frequently through merges and deletions, whereas older segments tend to remain unchanged. The field cache was therefore modified to load only the segments that have been updated.

Loading the FieldCache across segments is inefficient, so version 2.9 manages the FieldCache for each segment separately and avoids loading it across segments. The effect of this change is dramatic. Mark Miller of Lucid Imagination ran a simple performance test showing that, with 5,000,000 unique strings, Lucene 2.9 is about 15 times faster than Lucene 2.4:
Lucene 2.4: 150.726 s
Lucene 2.9: 9.695 s
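
The per-segment idea can be sketched in a few lines of plain Java. This is a toy model under stated assumptions, not Lucene's FieldCache implementation: the `Segment` class, the `refresh` method, and the segment names are all hypothetical. It only shows why keying the cache by segment means that unchanged segments are never loaded twice.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a per-segment field cache; not Lucene's FieldCache API. */
class PerSegmentCacheSketch {
    /** Hypothetical stand-in for an index segment: a name plus one field value per document. */
    static final class Segment {
        final String name;
        final String[] fieldValues;

        Segment(String name, String... fieldValues) {
            this.name = name;
            this.fieldValues = fieldValues;
        }
    }

    private Map<String, String[]> cache = new HashMap<>();
    private int loads = 0; // how many segments have been loaded in total

    /** Brings the cache in line with the current segments; returns how many had to be loaded. */
    int refresh(List<Segment> segments) {
        int before = loads;
        Map<String, String[]> next = new HashMap<>();
        for (Segment s : segments) {
            String[] values = cache.get(s.name);
            if (values == null) {
                values = s.fieldValues.clone(); // stands in for the expensive load
                loads++;
            }
            next.put(s.name, values);
        }
        cache = next; // entries for segments that were merged away are dropped
        return loads - before;
    }

    int totalLoads() {
        return loads;
    }

    public static void main(String[] args) {
        PerSegmentCacheSketch cache = new PerSegmentCacheSketch();
        Segment old = new Segment("_0", "apple", "banana");
        Segment recent = new Segment("_1", "cherry");
        System.out.println("initial loads: " + cache.refresh(Arrays.asList(old, recent)));

        // A new segment appears; the two existing segments are served from the cache.
        Segment added = new Segment("_2", "date");
        System.out.println("loads after add: " + cache.refresh(Arrays.asList(old, recent, added)));

        // The two small segments are merged; only the merged segment is loaded.
        Segment merged = new Segment("_3", "cherry", "date");
        System.out.println("loads after merge: " + cache.refresh(Arrays.asList(old, merged)));
    }
}
```

A cache keyed on the whole index would pay the full loading cost on each of the three refreshes, whereas here the old segment `_0` is loaded exactly once.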

Another notable performance improvement concerns reopening readers for search. Lucene 2.9 introduces a new IndexWriter.getReader() method, which returns a reader over the complete current index, including changes that have not yet been committed in the current IndexWriter session. This brings near-real-time search capabilities. In addition, you can call IndexWriter.setMergedSegmentWarmer() to "warm" newly merged segments so that they can be put to use immediately.
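
The visibility rule behind near-real-time search can be illustrated with a small toy model. It deliberately does not use Lucene itself: the class below and its `addDocument`, `commit`, `openCommittedReader`, and `getReader` methods are hypothetical stand-ins, and a "reader" is modeled as nothing more than an immutable snapshot of document names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of near-real-time visibility; not Lucene's IndexWriter API. */
class NearRealTimeSketch {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>(); // added but not yet committed

    void addDocument(String doc) {
        pending.add(doc);
    }

    void commit() {
        committed.addAll(pending);
        pending.clear();
    }

    /** What a reader opened on the committed index sees. */
    List<String> openCommittedReader() {
        return Collections.unmodifiableList(new ArrayList<>(committed));
    }

    /** Near-real-time view: committed documents plus pending ones, as a point-in-time snapshot. */
    List<String> getReader() {
        List<String> snapshot = new ArrayList<>(committed);
        snapshot.addAll(pending);
        return Collections.unmodifiableList(snapshot);
    }

    public static void main(String[] args) {
        NearRealTimeSketch writer = new NearRealTimeSketch();
        writer.addDocument("doc-1");
        writer.commit();
        writer.addDocument("doc-2"); // not committed yet
        System.out.println("committed view: " + writer.openCommittedReader());
        System.out.println("near-real-time view: " + writer.getReader());
    }
}
```

The point of the sketch is only the difference between the two views: the near-real-time snapshot already contains the uncommitted document, and it stays fixed even if more documents are added afterwards.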

Another major change is the way numbers are handled, especially in range queries (for example, "show me all CDs with a price between 0.5 and 9.99"). Before version 2.9, Lucene's queries were entirely text-based, so numbers had to be encoded as strings at full precision. This approach often produces a large number of distinct terms, all of which Lucene has to traverse to build the result set. Many developers used custom encoding schemes to cope with this, but Lucene 2.9 now ships with built-in support for numeric values. The new field and query classes (NumericField and NumericRangeQuery) index and search values at several levels of precision, which greatly reduces the number of terms that have to be visited and significantly improves query response times.
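
The effect of indexing at several precisions can be shown with a simplified sketch in plain Java. It is not Lucene's implementation: the 16-bit value width, the precision step of 4 bits, the `"shift:prefix"` term format, and all class and method names are assumptions chosen for illustration. Each value is indexed once per precision level, and a range is covered greedily with the coarsest prefix terms that fit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Simplified multi-precision numeric index; not Lucene's NumericRangeQuery. */
class PrecisionStepSketch {
    static final int BITS = 16; // assumed value width, e.g. prices in cents from 0 to 65535
    static final int STEP = 4;  // assumed precision step: drop 4 more bits per level

    private final Map<String, List<Integer>> postings = new HashMap<>();

    /** A term is a value with its lowest `shift` bits removed, tagged with the shift. */
    static String term(int shift, long prefix) {
        return shift + ":" + prefix;
    }

    /** Indexes the value once per precision level. */
    void add(int docId, int value) {
        for (int shift = 0; shift < BITS; shift += STEP) {
            postings.computeIfAbsent(term(shift, value >>> shift), k -> new ArrayList<>()).add(docId);
        }
    }

    /** Covers [lo, hi] with as few prefix terms as this greedy scheme allows. */
    static List<String> rangeTerms(int lo, int hi) {
        List<String> terms = new ArrayList<>();
        long cur = lo;
        while (cur <= hi) {
            int shift = 0;
            // Move to a coarser level while the block is aligned and still fits inside the range.
            while (shift + STEP < BITS
                    && (cur & ((1L << (shift + STEP)) - 1)) == 0
                    && cur + (1L << (shift + STEP)) - 1 <= hi) {
                shift += STEP;
            }
            terms.add(term(shift, cur >>> shift));
            cur += 1L << shift;
        }
        return terms;
    }

    /** Returns the ids of all documents whose value lies in [lo, hi]. */
    TreeSet<Integer> search(int lo, int hi) {
        TreeSet<Integer> hits = new TreeSet<>();
        for (String t : rangeTerms(lo, hi)) {
            hits.addAll(postings.getOrDefault(t, new ArrayList<>()));
        }
        return hits;
    }

    public static void main(String[] args) {
        PrecisionStepSketch index = new PrecisionStepSketch();
        int[] pricesInCents = {49, 50, 500, 999, 1000};
        for (int doc = 0; doc < pricesInCents.length; doc++) {
            index.add(doc, pricesInCents[doc]);
        }
        System.out.println("terms visited: " + rangeTerms(50, 999).size());
        System.out.println("matching docs: " + index.search(50, 999));
    }
}
```

With these settings, the price range from 0.5 to 9.99 (50 to 999 cents) is covered by 50 terms, whereas a full-precision encoding could have to visit up to 950 distinct values. The price is a larger index, since each value is stored at four precision levels.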

Version 2.9 also introduces new query types and broader support for multi-term queries (wildcard, prefix, and so on), as well as new analyzers for Persian, Arabic, and Chinese. The release further includes better Unicode support, a new query and analysis framework, and geographic search, which allows documents to be filtered and sorted by distance (for example, "find all dry cleaners within 5 miles of my home"). You can find the complete list of improvements here.
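
Filtering and sorting by distance, as in the dry-cleaner example, can be sketched with the haversine great-circle formula. This is a minimal illustration in plain Java and not Lucene's spatial API: the `Place` class, the `within` method, the coordinates, and the Earth radius of 3958.8 miles are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Distance filter and sort using the haversine formula; not Lucene's spatial API. */
class DistanceFilterSketch {
    static final double EARTH_RADIUS_MILES = 3958.8; // assumed mean Earth radius

    /** Hypothetical document with a name and a location in degrees. */
    static final class Place {
        final String name;
        final double lat;
        final double lon;

        Place(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Great-circle distance in miles between two points given in degrees. */
    static double haversineMiles(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a));
    }

    /** Keeps the places within the radius of the origin, nearest first. */
    static List<Place> within(List<Place> places, double lat, double lon, double radiusMiles) {
        List<Place> hits = new ArrayList<>();
        for (Place p : places) {
            if (haversineMiles(lat, lon, p.lat, p.lon) <= radiusMiles) {
                hits.add(p);
            }
        }
        hits.sort(Comparator.comparingDouble((Place p) -> haversineMiles(lat, lon, p.lat, p.lon)));
        return hits;
    }

    public static void main(String[] args) {
        // Made-up coordinates near the equator, where 0.01 degrees of longitude is roughly 0.7 miles.
        List<Place> cleaners = Arrays.asList(
                new Place("far", 0.0, 0.10),
                new Place("middle", 0.0, 0.05),
                new Place("near", 0.0, 0.01));
        for (Place p : within(cleaners, 0.0, 0.0, 5.0)) {
            System.out.println(p.name);
        }
    }
}
```

A real search engine would avoid computing the distance for every document, for instance by first narrowing the candidates to a bounding box; the sketch skips that step for brevity.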

In general, Lucene releases maintain backward compatibility, but the "backward compatibility policy" section of CHANGES.txt lists a number of places where Lucene 2.9 breaks compatibility. Upgrading to version 2.9 may therefore require recompiling your code, followed by a full regression test and similar precautions. Recompiling against version 2.9 also produces warnings for every deprecated method, so that developers can update their applications and prepare for version 3.0. This is a sensible step, because Lucene 3.0 will drop support for Java 1.4 and remove all features marked as "deprecated" in version 2.9.
