Improvements in Lucene 3.x

Source: Internet
Author: User
Contents
  • 1. Index file improvements
  • 2. Near-real-time search
  • 3. Numeric processing
  • 4. Other optimizations
  • 1. Create an index
  • 2. Query
December 10, April 20-hikrock
I. Overview

Lucene 3.0 (hereinafter "3.0") has been released. It is a major version with substantial changes: the API has been reworked extensively, many deprecated methods and classes have been removed, and many Java 5 features are now supported, including generics, variable-length argument lists (varargs), enumerations, and autoboxing.

As a result, 3.0 is not compatible with 2.x. It is best to adopt 3.0 in a new project rather than upgrade an existing project from 2.x or earlier.

II. Version 2.9

Because the new version changes so much, upgrading an existing project from an old version is not recommended; the required changes would be large.
In fact, 2.9 already contains most of these changes, because it was prepared as the stepping stone to 3.0. To stay backward compatible, however, 2.9 only deprecates the old methods instead of removing them. Version 2.9 focuses mainly on performance, including improvements to Lucene's low-level internal structure and to the way indexes are managed.

1. Index file improvements

Lucene stores its index data in independent files, the separately stored "segments" of the index. As documents are added, new segments are continually created and can later be merged. Because reading and writing files is expensive, Lucene does not write field information to the index files on every addition. Instead, it buffers the information and writes it out once a certain amount has accumulated. Since 2.9, Lucene manages the FieldCache per segment, so the FieldCache no longer has to be loaded across segments. This removes the inefficiency of cross-segment FieldCache loading and greatly improves performance. Mark Miller of Lucid Imagination ran a simple performance test showing that, with 5,000,000 distinct strings, Lucene 2.9 is roughly 15 times faster than 2.4:

  Lucene 2.4: 150.726 s
  Lucene 2.9: 9.695 s
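
The benefit of caching per segment can be sketched with a toy model. This is not Lucene code: the `Segment` class, the `valuesFor` method, and the `loads` counter are all hypothetical names invented for illustration. The sketch only shows that, when the cache is keyed by segment, adding a new segment triggers a load for that segment alone rather than for the whole index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of per-segment field caching; hypothetical, not Lucene code. */
final class SegmentCacheSketch {

    /** A hypothetical immutable segment: a name plus one field value per document. */
    static final class Segment {
        final String name;
        final String[] values;

        Segment(String name, String... values) {
            this.name = name;
            this.values = values;
        }
    }

    /** Cache keyed by segment name, so unchanged segments are never reloaded. */
    private final Map<String, String[]> cache = new HashMap<String, String[]>();

    /** Number of segments whose values had to be loaded so far. */
    int loads = 0;

    /** Returns the field values of every segment, loading only uncached segments. */
    List<String[]> valuesFor(List<Segment> segments) {
        List<String[]> result = new ArrayList<String[]>();
        for (Segment s : segments) {
            String[] cached = cache.get(s.name);
            if (cached == null) {
                cached = s.values.clone(); // stands in for an expensive read from disk
                cache.put(s.name, cached);
                loads++;
            }
            result.add(cached);
        }
        return result;
    }

    public static void main(String[] args) {
        SegmentCacheSketch sketch = new SegmentCacheSketch();
        List<Segment> segments = new ArrayList<Segment>();
        segments.add(new Segment("_0", "a", "b"));
        segments.add(new Segment("_1", "c"));
        sketch.valuesFor(segments);               // loads _0 and _1
        segments.add(new Segment("_2", "d"));     // a new segment appears
        sketch.valuesFor(segments);               // loads only _2
        System.out.println("segment loads: " + sketch.loads);
    }
}
```

With a single index-wide cache, the second call would have had to reload the values of all three segments instead of one.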

2. Near-real-time search

The new version introduces the IndexWriter.getReader() method, which returns a reader over the complete current index, including changes that have not yet been committed in the current IndexWriter session. This brings search close to real time. In addition, you can call IndexWriter.setMergedSegmentWarmer() to "warm" newly merged segments so that they can be put to use immediately.
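
A minimal sketch of the two calls, assuming lucene-core 3.0 is on the classpath. The class name, the field name "name", and the in-memory RAMDirectory are illustrative choices, and the warmer body is left as a placeholder.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Illustrative sketch only; requires lucene-core 3.0 on the classpath.
public class NearRealTimeSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // in-memory index, chosen for the example
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                true, IndexWriter.MaxFieldLength.UNLIMITED);

        // Called for each newly merged segment before it becomes visible to readers.
        writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
            @Override
            public void warm(IndexReader reader) throws IOException {
                // Placeholder: e.g. run typical queries or load caches here.
            }
        });

        Document doc = new Document();
        doc.add(new Field("name", "lucene", Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc); // not committed yet

        // The reader obtained from the writer also sees uncommitted changes.
        IndexReader reader = writer.getReader();
        IndexSearcher searcher = new IndexSearcher(reader);
        int hits = searcher.search(new TermQuery(new Term("name", "lucene")), 10).totalHits;
        System.out.println("hits before commit: " + hits);

        searcher.close();
        reader.close();
        writer.close();
    }
}
```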

3. Numeric processing

Versions before 2.9 are purely text based, which makes handling numbers a headache. Many of the problems in our project, for example, came from numbers being treated as text:

  1. A search for price 5 also matches values that merely contain 5, such as .5;
  2. When sorting in descending order, 800 is ranked before 5000.

All of these stem from Lucene treating every value as text. Lucene 2.9 and later provide dedicated handling for numbers: the numeric field and query classes (NumericField and NumericRangeQuery) index and search with a suitable precision, which greatly reduces the number of terms that must be searched and significantly improves query response time.
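
The sorting problem can be reproduced without Lucene. The following sketch uses only the Java standard library; the class and method names are hypothetical. It sorts the same prices once as text and once as numbers.

```java
import java.util.Arrays;
import java.util.Collections;

/** Shows why numbers sorted as text come out in the wrong order. */
final class NumericOrderSketch {

    /** Sorts the prices in descending order, comparing them as strings. */
    static String[] sortDescendingAsText(String... prices) {
        String[] copy = prices.clone();
        Arrays.sort(copy, Collections.reverseOrder());
        return copy;
    }

    /** Sorts the prices in descending order, comparing them as integers. */
    static Integer[] sortDescendingAsNumbers(String... prices) {
        Integer[] numbers = new Integer[prices.length];
        for (int i = 0; i < prices.length; i++) {
            numbers[i] = Integer.valueOf(prices[i]);
        }
        Arrays.sort(numbers, Collections.reverseOrder());
        return numbers;
    }

    public static void main(String[] args) {
        // Text comparison looks at characters left to right, so "800" > "5000".
        System.out.println(Arrays.toString(sortDescendingAsText("5000", "800", "95")));
        // prints [95, 800, 5000]
        System.out.println(Arrays.toString(sortDescendingAsNumbers("5000", "800", "95")));
        // prints [5000, 800, 95]
    }
}
```

A common workaround before 2.9 was to zero-pad numbers so that text order matched numeric order; the numeric classes make that unnecessary.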

4. Other optimizations

A new query type and broader multi-term queries (wildcard, prefix, and so on) are introduced, along with new analyzers for Persian, Arabic, and Chinese. This release also brings better Unicode support, a new query and analysis framework, and geographic queries, which allow documents to be filtered and sorted by distance (for example, "find all supermarkets within 5 km of my home").

III. Comparison between version 2.9 and version 3.0

Although 2.9 was prepared as the stepping stone to 3.0, the two versions still differ considerably, mainly in the following ways:

  1. 3.0 removes the methods that 2.9 declared deprecated, so 3.0 is not backward compatible;
  2. 3.0 drops support for Java 1.4 and requires Java 1.5 or later and Ant 1.7.0 or later;
  3. There are some other core changes, such as Lock.isLocked(), which now throws an IOException, and changes to some static variables.

IV. Changes to the main methods in 3.0

This section covers how creating an index and searching differ in the latest version.

1. Create an index

The new version removes many obsolete methods used when creating an index. In particular, all IndexWriter constructors that had been declared deprecated are deleted in 3.0.

IndexWriter constructor in 3.0:
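
The original listing is missing here, so the following is only a sketch of one constructor that remains in 3.0, assuming lucene-core 3.0 on the classpath. The class name and the "index-dir" path are hypothetical.

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Illustrative sketch only; requires lucene-core 3.0 on the classpath.
public class OpenWriterSketch {
    public static void main(String[] args) throws Exception {
        // In 3.0 the writer takes a Directory; the String and File path
        // variants of the constructor are among those that were removed.
        Directory dir = FSDirectory.open(new File("index-dir")); // hypothetical path
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                true,                                // true = create or overwrite the index
                IndexWriter.MaxFieldLength.LIMITED); // explicit field-length limit
        try {
            System.out.println("documents in index: " + writer.numDocs());
        } finally {
            writer.close();
        }
    }
}
```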

The constants used for each field when adding documents to the index have also changed.
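
The original listing is missing here as well. As a sketch, assuming lucene-core 3.0 on the classpath, fields are declared in 3.0 with the Field.Index.ANALYZED and Field.Index.NOT_ANALYZED constants. To my understanding these replace the 2.x names Field.Index.TOKENIZED and Field.Index.UN_TOKENIZED, which no longer exist in 3.0. The class name, field names, and values below are hypothetical.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Illustrative sketch only; requires lucene-core 3.0 on the classpath.
public class FieldConstantsSketch {

    /** Builds a document using the field constants available in 3.0. */
    static Document buildDocument() {
        Document doc = new Document();
        // Stored and indexed as a single term (no analysis), e.g. an identifier.
        doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Stored and run through the analyzer, e.g. a title.
        doc.add(new Field("name", "Lucene 3.0 notes", Field.Store.YES, Field.Index.ANALYZED));
        // Analyzed for searching but not stored, e.g. a large body text.
        doc.add(new Field("body", "deprecated methods were removed",
                Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(buildDocument().get("name"));
    }
}
```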

2. Query

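    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    // field (String), q (String) and file (java.io.File) are supplied by the caller.
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, field,
            new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse(q);
    // In 3.0, create() takes the hit count and whether documents are scored in order.
    TopScoreDocCollector collector = TopScoreDocCollector.create(100, true);
    IndexSearcher is = new IndexSearcher(FSDirectory.open(file), true); // true = read-only
    is.search(query, collector);
    ScoreDoc[] docs = collector.topDocs().scoreDocs;
    for (int i = 0; i < docs.length; i++) {
        Document doc = is.doc(docs[i].doc); // new: is.doc()
        System.out.println(doc.getField("name") + " " + docs[i].toString());
    }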

Searcher constructor in 3.0:
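
The listing is missing in the source. A minimal sketch of the 3.0 form, assuming lucene-core 3.0 on the classpath and a hypothetical "index-dir" path:

```java
import java.io.File;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// Illustrative sketch only; requires lucene-core 3.0 on the classpath.
public class OpenSearcherSketch {
    public static void main(String[] args) throws Exception {
        // 3.0: pass a Directory plus a read-only flag instead of a path string.
        IndexSearcher searcher =
                new IndexSearcher(FSDirectory.open(new File("index-dir")), true);
        try {
            System.out.println("documents: " + searcher.maxDoc());
        } finally {
            searcher.close();
        }
    }
}
```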

Searcher constructor before 3.0:
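
The listing is missing in the source. For comparison, here is a sketch of the older style, which compiles only against a 2.x release of lucene-core (not 3.0). The String-path constructor and the Hits class were part of the 2.x API and are gone in 3.0. The class name, path, and query term are hypothetical.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

// Illustrative sketch only; compiles against Lucene 2.x, NOT against 3.0.
public class LegacySearchSketch {
    public static void main(String[] args) throws Exception {
        // 2.x: the searcher could be opened directly from a path string.
        IndexSearcher searcher = new IndexSearcher("index-dir"); // hypothetical path
        // 2.x: search(Query) returned a Hits object instead of using a collector.
        Hits hits = searcher.search(new TermQuery(new Term("name", "lucene")));
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(doc.get("name"));
        }
        searcher.close();
    }
}
```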

V. Overall structure of 3.0

Compared with earlier versions, the program structure gains only one additional package since 2.9: the message package, which handles internationalization.

As you can see, 3.0 is otherwise the same as the previous version. It consists of eight modules (packages) organized into three layers: external interfaces, index core, and infrastructure. For details, see Appendix 1. We can also see the call relationships during a Lucene search. To query a term, the query module (search) first calls the syntax analyzer (queryParser) to parse the query statement. The syntax analysis module calls the lexical analyzer (analysis) for lexical analysis, such as word segmentation and filtering of the search keywords. The lexical analyzer calls the message module (message) as needed for internationalization. Once these preparations are complete, the search core proper begins: the index module (index) is called to read the data in the index files through the underlying storage classes (store), and the results are returned to the query module. The other modules serve as shared utility classes throughout the search process.

Address: http://www.ourys.com/post/lucene3-0_about.html
