Initial knowledge of Lucene

Source: Internet
Author: User

This came out of a business need: although I am not writing search functionality directly, I need to assemble search conditions and call a search interface, and an earlier investigation of a JVM crash also involved Lucene, so I wanted to get a basic understanding of it.

Reference Documentation:

http://www.iteye.com/topic/839504

http://www.cnblogs.com/xing901022/p/3933675.html

I. Introduction to Lucene

Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application; rather, it provides indexing and search capabilities for your application. Lucene is an open source project in the Apache Jakarta family, and it is the most popular open source full-text search toolkit for Java.

There are already many applications based on Lucene, such as the search function of Eclipse's help system. Lucene can index text data, so you can index and search your documents as long as you can convert them into a text format. For example, to index HTML or PDF documents, you first convert them to text, hand the converted content to Lucene for indexing, and save the created index to disk or memory; finally, queries are run against the index based on the criteria the user enters. Because Lucene does not require any particular format for the documents being indexed, it can be applied to almost any search application.

Take Lucene 4.0 as an example; the official documentation is at http://lucene.apache.org/core/4_0_0/core/overview-summary.html

These are the five most commonly used jar files:

The first, and most important, lucene-core-4.0.0.jar, includes the core code for documents, indexing, searching, storage, and so on.

The second, lucene-analyzers-common-4.0.0.jar, contains lexical analyzers for various languages, used to tokenize and extract the contents of a file.

The third, lucene-highlighter-4.0.0.jar, is used primarily for highlighting matched content in search results.

The fourth and fifth, including lucene-queryparser-4.0.0.jar, provide code for various kinds of queries, such as fuzzy search and range search.

II. Indexing and Search

The index is the core of a modern search engine, and indexing is the process of turning the source data into an index file that is very efficient to query. Why is indexing so important? Imagine searching a large collection of documents for those containing a certain keyword. Without an index, you would have to read each document into memory sequentially and check whether it contains the keyword, which takes a great deal of time; yet a search engine returns its results in milliseconds.

You can think of an index as a data structure that lets you quickly and randomly access the keywords stored in it and find the documents associated with each keyword. Lucene uses a mechanism called an inverted index: we maintain a table of words/phrases, and for each word/phrase in the table there is a list describing which documents contain it. This allows search results to be obtained very quickly when the user enters a query. We'll cover Lucene's index mechanism in detail in the second part of this series; since Lucene provides an easy-to-use API, you can easily index your documents with Lucene even if you are new to full-text indexing.

Once you have indexed your documents, you can search over those indexes. The search engine first analyzes the keywords being searched for, then looks them up in the established index, and finally returns the documents associated with the keywords the user entered.
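The inverted-index mechanism described above can be sketched in plain Java. This is an illustrative toy (the three-document corpus and the map-of-lists data structure are invented for the example and are much simpler than Lucene's actual index format):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        String[] docs = {
            "lucene is a search library",              // doc 0
            "lucene builds an inverted index",         // doc 1
            "search engines use an inverted index"     // doc 2
        };
        // term -> sorted list of ids of the documents containing the term
        Map<String, List<Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId].split("\\s+")) {
                List<Integer> postings =
                    index.computeIfAbsent(term, k -> new ArrayList<>());
                // avoid duplicate entries when a term repeats in one document
                if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                    postings.add(docId);
                }
            }
        }
        // A query becomes a table lookup instead of a scan over every document.
        System.out.println("inverted -> " + index.get("inverted"));
        System.out.println("lucene -> " + index.get("lucene"));
    }
}
```

Answering a query is now a single map lookup, which is why the index, once built, can return results in milliseconds regardless of how many documents were scanned at indexing time.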

Figure 1 shows the relationship between a search application and Lucene, and also reflects the process of building a search application with Lucene:

III. Code Walkthrough

Below, the example from the official documentation mentioned above is analyzed:

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
// Directory directory = FSDirectory.open(new File("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

// Now search the index:
DirectoryReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}
ireader.close();
directory.close();

  

Index creation:

  First, we need to define a lexical analyzer.

Take a sentence such as "I love our China!": how do we split it, drop the stop words, and extract keywords such as "I", "our", and "China"? This is accomplished with the help of the lexical analyzer (Analyzer). The standard analyzer is used here; for Chinese specifically, an analyzer such as Paoding can also be used.

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);

The parameter Version.LUCENE_CURRENT indicates the current Lucene version; in this context it could also be written as Version.LUCENE_40.
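Conceptually, an analyzer turns raw text into a stream of index terms. The following plain-Java sketch imitates a small part of what StandardAnalyzer does for English text; the tokenization rule and stop-word list here are simplified assumptions for illustration, not Lucene's actual implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class AnalyzerSketch {
    // A tiny stand-in for a real stop-word list (assumed for this example).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "the", "is", "this", "to", "be", "i"));

    // Split on non-alphanumeric characters, lowercase, drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This is the text to be indexed."));
    }
}
```

Only the surviving terms ("text", "indexed") are entered into the index, which is why a later search for "text" in the official example finds the document.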

  

  The second step is to determine where the index files are stored. Lucene provides us with two options:

1. Local file storage

Directory directory = FSDirectory.open(new File("/tmp/testindex"));

2. Memory storage

Directory directory = new RAMDirectory();

Choose whichever suits your needs.

   

 The third step is to create an IndexWriter and write the index.

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_CURRENT, analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);

According to the official documentation, IndexWriterConfig holds the configuration for the IndexWriter. It takes two parameters: the first is the Lucene version, and the second is the lexical analyzer (Analyzer).

  

  The fourth step: extract the content and store it in the index.

Document doc = new Document();
String text = "This is the text to be indexed.";
doc.add(new Field("fieldname", text, TextField.TYPE_STORED));
iwriter.addDocument(doc);
iwriter.close();

The first line creates a Document object, which is similar to a row in a database table.

The second line is the string we are about to index.

The third line stores the string under the field name "fieldname". Because TextField.TYPE_STORED is set, the original value is stored as well; if you do not want to store it, other field types can be used (refer to the official documentation for details).

The fourth line adds the doc object to the index for creation.

The fifth line closes the IndexWriter and commits the created content.

  

This is the process of index creation.

Keyword query:

  The first step is to open the index storage location:

DirectoryReader ireader = DirectoryReader.open(directory);

The second step: create the searcher.

IndexSearcher isearcher = new IndexSearcher(ireader);

The third step: query by keyword, somewhat like an SQL query.

QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "fieldname", analyzer);
Query query = parser.parse("text");
ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
assertEquals(1, hits.length);
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = isearcher.doc(hits[i].doc);
    assertEquals("This is the text to be indexed.", hitDoc.get("fieldname"));
}

Here we create a query parser, set its lexical analyzer, and give it the default field "fieldname", which is roughly analogous to a table name in SQL. The query results come back as a collection somewhat like an SQL ResultSet, from which we can extract the stored content.
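Under the hood, answering a multi-term query comes down to operations on the posting lists of the inverted index. The following self-contained sketch shows the classic sorted-list intersection used to evaluate an AND query over two terms; the posting lists are hypothetical, and Lucene's real query execution is far more elaborate (it also scores and ranks the results):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PostingIntersection {
    // Intersect two sorted posting lists: an AND query over two terms
    // returns only the documents that appear in both lists.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) {          // document contains both terms
                out.add(a.get(i));
                i++;
                j++;
            } else if (cmp < 0) {
                i++;                 // advance the list with the smaller id
            } else {
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical postings: docs containing "lucene" and docs containing "index".
        List<Integer> lucene = Arrays.asList(0, 1, 4, 7);
        List<Integer> index  = Arrays.asList(1, 2, 4, 8);
        System.out.println(intersect(lucene, index));
    }
}
```

Because both lists are kept sorted by document id, the intersection runs in linear time in the lengths of the lists, which is one reason index lookups stay fast even over large collections.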

For the various other query types, refer to the official manual or the recommended slides.

 The fourth step: close the reader and the directory.

ireader.close();
directory.close();

