Some basic usage and concepts of e.net

Source: Internet
Author: User
Lucene is an open-source Apache project that implements a full-text search engine in Java. It is very powerful, yet its API is quite simple. It mainly does two things: creating an index and searching it. (For details, refer to the introductions available online.)
1. The most important terms for creating an index

* Document: the unit to be indexed, equivalent to a database record. Any data to be indexed must first be converted into a Document object.
* Field: a field in a Document, equivalent to a column in a database. Field touches on many Lucene concepts; see section 1.1 for details.
* IndexWriter: writes Documents into the index files. In general, the IndexWriter constructor takes three parameters: the path where the index is stored, the analyzer, and whether to re-create the index. Note that after calling addDocument on an IndexWriter, you must call its close method. Only when close is called does the writer flush all content to disk and close the output stream.
* Analyzer: mainly used for word segmentation of text. Commonly used analyzers include StandardAnalyzer, StopAnalyzer, and WhitespaceAnalyzer.
* Directory: the location where the index is stored. Lucene provides two kinds of index storage: disk and memory. Indexes are generally stored on disk. Correspondingly, Lucene provides two classes: FSDirectory and RAMDirectory.
* Segment: the most basic unit of Lucene's index files. Lucene keeps adding new segments and then merges segments according to certain rules to form new ones.
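
To make the role of an analyzer concrete, here is a minimal sketch in plain Java that does not use Lucene at all. The class ToyAnalyzers, its two methods, and the stop-word list are hypothetical names chosen for illustration; they only imitate the idea of whitespace splitting versus lowercasing with stop-word removal, and should not be read as how Lucene's own analyzers are implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy analyzers: they imitate the idea behind word segmentation,
// not the code of Lucene's WhitespaceAnalyzer or StopAnalyzer.
class ToyAnalyzers {
    // Illustrative stop-word list (an assumption, not Lucene's actual list).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is"));

    // Splits on whitespace only, keeping case and punctuation untouched.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Lowercases, splits on anything that is not a letter a-z, drops stop words.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Index of Lucene";
        System.out.println(whitespace(text)); // [The, Index, of, Lucene]
        System.out.println(stop(text));       // [index, lucene]
    }
}
```

The same input yields different terms depending on the analyzer, which is why the analyzer used at search time normally has to match the one used at indexing time.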

Indexing in Lucene means converting each object to be indexed into a Lucene Document and using an IndexWriter to write it into index files in Lucene's own format. The objects to be indexed can come from files, databases, or other sources: you write the code that reads files from a directory or queries database tables to obtain a ResultSet. The Lucene API itself only deals with strings.
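
The idea behind this process can be sketched with a tiny in-memory inverted index. This is a teaching sketch under simplifying assumptions (lowercasing and whitespace splitting only); the class ToyInvertedIndex and its methods are hypothetical and have nothing to do with Lucene's real index file format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory inverted index; not Lucene's implementation.
class ToyInvertedIndex {
    // term -> ids of the documents that contain it, in ascending order
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    // original text of every document, addressed by document id
    private final List<String> stored = new ArrayList<>();

    // Stores the text as a new "document" and indexes each of its terms.
    int addDocument(String text) {
        int docId = stored.size();
        stored.add(text);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> ids = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Skip the id if this term already occurred in the same document.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != docId) {
                ids.add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of all documents containing the given term.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    // Returns the stored original text of a document.
    String doc(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.addDocument("lighter springside com");
        index.addDocument("lighter blog");
        System.out.println(index.search("lighter"));    // [0, 1]
        System.out.println(index.search("springside")); // [0]
    }
}
```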
1.1 Field explained

From the source code, we can see that the Field constructors are as follows:

Field(String name, byte[] value, Field.Store store)
Field(String name, Reader reader)
Field(String name, Reader reader, Field.TermVector termVector)
Field(String name, String value, Field.Store store, Field.Index index)
Field(String name, String value, Field.Store store, Field.Index index, Field.TermVector termVector)

Field has three inner classes: Field.Index, Field.Store, and Field.TermVector, where

* Field.Index has four attributes:
Field.Index.TOKENIZED: indexed with word segmentation;
Field.Index.UN_TOKENIZED: indexed without word segmentation, for values such as an author name or a date. "Rod Johnson" is itself a single term and needs no word segmentation;
Field.Index.NO: not indexed. Used for content that does not need to be searchable, such as additional attributes of a document like its type or URL;
Field.Index.NO_NORMS: indexed without word segmentation and without storing norms;
* Field.Store has three attributes:
Field.Store.YES: an index file would normally hold only index data; with this setting the original content is also stored directly in the index file, for example the document title;
Field.Store.NO: the original text is not stored in the index file. After a search hit, the original text is located through other attributes such as the file path or the database primary key. This suits situations where the original content is large;
Field.Store.COMPRESS: compressed storage;
* TermVector was newly added in Lucene 1.4.3. It provides a vector mechanism for fuzzy search and is rarely used.
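
The difference between "indexed" and "stored" can be sketched without Lucene. In the following plain-Java example the class ToyField and its nested enums are hypothetical stand-ins that mirror only the names of the options above; word segmentation is assumed to be simple whitespace splitting.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the Index and Store options; not Lucene's Field class.
class ToyField {
    enum Store { YES, NO }
    enum Index { TOKENIZED, UN_TOKENIZED, NO }

    // Returns the terms that would become searchable for this value.
    static List<String> indexedTerms(String value, Index index) {
        switch (index) {
            case TOKENIZED:
                // word segmentation, assumed here to be whitespace splitting
                return Arrays.asList(value.trim().split("\\s+"));
            case UN_TOKENIZED:
                // the whole value is kept as one single term
                return Collections.singletonList(value);
            default:
                // Index.NO: nothing is searchable
                return Collections.emptyList();
        }
    }

    // Returns what could be read back from the index after a hit.
    static String storedValue(String value, Store store) {
        return store == Store.YES ? value : null;
    }

    public static void main(String[] args) {
        System.out.println(indexedTerms("Rod Johnson", Index.TOKENIZED));    // [Rod, Johnson]
        System.out.println(indexedTerms("Rod Johnson", Index.UN_TOKENIZED)); // [Rod Johnson]
        System.out.println(storedValue("Rod Johnson", Store.NO));            // null
    }
}
```

The two settings are independent: a URL can be stored but not indexed, while a large body text can be indexed but not stored.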

The Field attributes described above are quite different from those in Lucene 1.4.3. In the earlier 1.4.3 version, Lucene used Field.Keyword(...), Field.UnIndexed(...), Field.UnStored(...), and Field.Text(...) to create fields for different purposes. The current version achieves the same effect through different combinations of Field.Index and Field.Store.
Also note that the following two constructors default to Field.Store.NO and Field.Index.TOKENIZED:

Field(String name, Reader reader)
Field(String name, Reader reader, Field.TermVector termVector)

* Limiting field length:
The IndexWriter class provides a setMaxFieldLength method to limit the length of a field. The source code shows that the default value is 10000, and you can change it as needed. With the default value, Lucene indexes only the first 10000 terms of a field; terms beyond that limit are not indexed.
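
The effect of such a limit can be sketched in a few lines of plain Java. The class ToyMaxFieldLength and its method are hypothetical and assume whitespace splitting; the sketch only shows that terms past the limit are silently dropped, not how IndexWriter enforces the limit internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a maximum field length; not Lucene code.
class ToyMaxFieldLength {
    // Returns only the first maxFieldLength terms of the text.
    static List<String> indexableTerms(String text, int maxFieldLength) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        String[] all = trimmed.split("\\s+");
        int kept = Math.min(all.length, maxFieldLength);
        // Everything after the limit is silently ignored.
        return new ArrayList<>(Arrays.asList(all).subList(0, kept));
    }

    public static void main(String[] args) {
        System.out.println(indexableTerms("a b c d e", 3)); // [a, b, c]
    }
}
```

A search for a term that only appears after the cut-off would find nothing, so the limit is worth raising when indexing very long documents.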

1.2 Merging, deleting, and optimizing indexes

* The addIndexes method of IndexWriter merges indexes. It is useful when indexes are built in different places and need to be combined afterwards.
* You can use the IndexReader class to delete documents from an index. IndexReader is a rather special class: the source code shows that instances are obtained mainly through its own static methods. Example:

IndexReader reader = IndexReader.open("C:\\springside");
reader.deleteDocument(x); // x is an int document number. This way of deleting is not recommended.
reader.deleteDocuments(new Term("name", "springside")); // The other way: delete every document matching a term. Deleting by field value is recommended.
reader.close();

* Optimizing indexes: use the optimize method of the IndexWriter class. It merges multiple segments into a new segment, which speeds up searching once the index has been built. On the other hand, calling optimize slows down indexing and requires additional disk space.
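
Why merging helps searching can be sketched as follows. Each segment is modelled as a sorted map from term to document ids, which is a deliberate simplification; the class ToySegmentMerge is hypothetical and does not reflect Lucene's actual merge algorithm or file layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch of merging several small segments into one.
class ToySegmentMerge {
    // Combines the postings of all segments into one new segment.
    static SortedMap<String, List<Integer>> merge(List<SortedMap<String, List<Integer>>> segments) {
        SortedMap<String, List<Integer>> merged = new TreeMap<>();
        for (SortedMap<String, List<Integer>> segment : segments) {
            for (Map.Entry<String, List<Integer>> entry : segment.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                      .addAll(entry.getValue());
            }
        }
        // Keep document ids in ascending order within each term.
        for (List<Integer> ids : merged.values()) {
            Collections.sort(ids);
        }
        return merged;
    }

    public static void main(String[] args) {
        SortedMap<String, List<Integer>> first = new TreeMap<>();
        first.put("lighter", Arrays.asList(0));
        first.put("springside", Arrays.asList(0));
        SortedMap<String, List<Integer>> second = new TreeMap<>();
        second.put("lighter", Arrays.asList(1));
        second.put("blog", Arrays.asList(1));
        System.out.println(merge(Arrays.asList(first, second)));
    }
}
```

Before the merge, a search for a term has to consult every segment; afterwards a single lookup is enough. The cost is that all postings are rewritten, which is where the extra time and disk space go.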

2. Several frequently used terms for searching

* IndexSearcher: the most basic search tool in Lucene; every search goes through an IndexSearcher. To initialize one, you supply the index storage path so that the searcher can locate the index.
* Query: Lucene supports fuzzy queries, semantic queries, phrase queries, and combined queries, through classes such as TermQuery, BooleanQuery, RangeQuery, and WildcardQuery.
* QueryParser: a tool that parses user input. It scans the string entered by the user and produces a Query object.
* Hits: once a search completes, the results have to be returned and shown to the user. In Lucene, the set of search results is represented by an instance of the Hits class. The main methods of a Hits object are:

length(): returns the total number of search results. This method is used in the simple example below.
doc(int n): returns the nth document.
iterator(): returns an iterator.

Hits deserves a special mention, as it is one of Lucene's nicer features. Anyone familiar with Hibernate knows that it supports lazy loading, and Lucene does too. A Hits object returns its results lazily: only when you access a particular document does the Hits object go back to the Lucene index internally to retrieve it, and only then is the result shown on the page.
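
The lazy-loading idea can be sketched like this in plain Java. The class ToyLazyHits, its loader function, and the load counter are hypothetical names for illustration; the sketch shows only the general pattern of fetching a document on first access and caching it, not the internals of the Hits class.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Hypothetical lazily loading result set; not Lucene's Hits class.
class ToyLazyHits {
    private final int[] docIds;                 // ids of the matching documents
    private final IntFunction<String> loader;   // stands in for reading from the index
    private final Map<Integer, String> cache = new HashMap<>();
    private int loads = 0;                      // how often the loader was called

    ToyLazyHits(int[] docIds, IntFunction<String> loader) {
        this.docIds = docIds;
        this.loader = loader;
    }

    // Knowing the number of results requires no document to be loaded.
    int length() {
        return docIds.length;
    }

    // Loads the nth document on first access and serves it from the cache afterwards.
    String doc(int n) {
        String document = cache.get(n);
        if (document == null) {
            document = loader.apply(docIds[n]);
            cache.put(n, document);
            loads++;
        }
        return document;
    }

    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        ToyLazyHits hits = new ToyLazyHits(new int[] {3, 7}, id -> "doc-" + id);
        System.out.println(hits.length());    // 2
        System.out.println(hits.loadCount()); // 0
        System.out.println(hits.doc(1));      // doc-7
        System.out.println(hits.loadCount()); // 1
    }
}
```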

3. A simple example:

First, put the Lucene jar on the classpath, then write the following simple class:

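import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;

public class FSDirectoryTest {
    // Index creation path
    public static final String PATH = "C:\\index2";

    public static void main(String[] args) throws Exception {
        // Build two documents, each with one stored, tokenized "name" field
        Document doc1 = new Document();
        doc1.add(new Field("name", "lighter springside com", Field.Store.YES, Field.Index.TOKENIZED));

        Document doc2 = new Document();
        doc2.add(new Field("name", "lighter blog", Field.Store.YES, Field.Index.TOKENIZED));

        // Create a fresh index on disk; close() flushes everything to disk
        IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(PATH, true), new StandardAnalyzer(), true);
        writer.addDocument(doc1);
        writer.addDocument(doc2);
        writer.close();

        // Search the "name" field with the same analyzer used for indexing
        IndexSearcher searcher = new IndexSearcher(PATH);
        Hits hits = null;
        Query query = null;
        QueryParser qp = new QueryParser("name", new StandardAnalyzer());

        query = qp.parse("lighter");
        hits = searcher.search(query);
        System.out.println("Search \"lighter\" Total " + hits.length() + " result");

        query = qp.parse("springside");
        hits = searcher.search(query);
        System.out.println("Search \"springside\" Total " + hits.length() + " result");

        searcher.close();
    }
}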
