Lucene in Practice, Part 1: Getting to Know Lucene (repost)

Last Update:2016-10-21 Source: Internet

Author: User

Tags create index

http://www.ibm.com/developerworks/cn/Java/j-lo-lucene1/

***************************************************

About Lucene

Lucene is a Java-based full-text information retrieval toolkit. It is not a complete search application; rather, it provides indexing and search capabilities for your application. Lucene is currently an open source project in the Apache Jakarta family, and it is the most popular Java-based open source full-text search toolkit.

Many applications are already based on Lucene, such as the search function of Eclipse's help system. Lucene indexes text data, so you can index and search your documents as long as you can convert the data you want to index into text. For example, to index HTML or PDF documents, you first convert them to plain text, hand the converted content to Lucene for indexing, and save the resulting index files to disk or in memory. Queries are then run against the index based on the criteria the user enters. Because Lucene does not prescribe the format of the documents to be indexed, it can be applied to almost any search application.

Figure 1 represents the relationship between the search application and Lucene, and also reflects the process of building a search application using Lucene:

Figure 1. The relationship between the search application and Lucene

Indexing and searching

The index is the core of a modern search engine. Indexing is the process of converting source data into index files that are very convenient to query. Why is indexing so important? Imagine searching a large number of documents for those that contain a keyword. Without an index, you would have to read the documents into memory one by one and check whether each contains the keyword, which takes a long time, whereas a search engine returns its results within milliseconds. You can think of an index as a data structure that allows fast random access to the keywords stored in it, and through them to the documents associated with each keyword. Lucene uses a mechanism called an inverted index: it maintains a table of words or phrases, and for each entry in the table there is a list describing which documents contain that word or phrase. This allows results to be obtained very quickly once the user enters a query. The second part of this series covers Lucene's indexing mechanism in detail. Because Lucene provides an easy-to-use API, you can index your documents with it easily even if you are new to full-text indexing.
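The word table described above can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Lucene's actual index format: the class name InvertedIndexSketch and its methods are made up for this example, and splitting on non-alphanumeric characters is a simplifying assumption.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, not Lucene code): word -> ids of documents containing it. */
final class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Lowercases the text, splits it on non-alphanumeric characters, and records docId under every word. */
    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
    }

    /** Returns the ids of the documents that contain the word, in ascending order. */
    SortedSet<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Lucene is a search toolkit");
        index.add(1, "An index makes search fast");
        index.add(2, "Lucene builds an inverted index");
        System.out.println(index.lookup("lucene")); // documents 0 and 2
        System.out.println(index.lookup("search")); // documents 0 and 1
    }
}
```

A lookup is a single hash-table access followed by reading a ready-made list, so no document has to be scanned at query time.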

Once your documents are indexed, you can search the index. The search engine first analyzes the search keywords, then looks them up in the index, and finally returns the documents associated with the keywords the user entered.

Lucene Package Analysis

Lucene is distributed as a JAR file. Let's go through the main Java packages inside it so that the reader gets a preliminary understanding of them.

Package: org.apache.lucene.document

This package provides the classes needed to encapsulate the documents to be indexed, such as Document and Field. Each document is ultimately encapsulated as a Document object.

Package: org.apache.lucene.analysis

The main function of this package is to tokenize documents. Because a document must be split into words before it can be indexed, the work done by this package can be regarded as preparation for indexing.

Package: org.apache.lucene.index

This package provides the classes for creating indexes and updating existing ones. There are two basic classes: IndexWriter and IndexReader. IndexWriter creates an index and adds documents to it, and IndexReader is used to delete documents from the index.

Package: org.apache.lucene.search

This package provides the classes needed to search an existing index, for example IndexSearcher and Hits. IndexSearcher defines the methods for searching a given index, and Hits holds the results of a search.

A simple Search Application

Suppose a directory on our computer contains many text documents and we need to find which of them contain a given keyword. To do this, we first use Lucene to index the documents in the directory and then search the resulting index for the documents we want. Through this example, readers will get a clearer idea of how to build their own search applications with Lucene.

Build an index

To index documents, Lucene provides five basic classes: Document, Field, IndexWriter, Analyzer, and Directory. Let's look at the purpose of each of them:

Document

A Document describes a document, which may be an HTML page, an e-mail message, or a text file. A Document object is made up of multiple Field objects. You can think of a Document object as a record in a database, with each Field object being one field of that record.

Field

A Field object describes one property of a document. For example, the title and the content of an e-mail message can be described by two separate Field objects.

Analyzer

Before a document is indexed, its content must first be tokenized, and this is the job of the Analyzer. Analyzer is an abstract class with multiple implementations; choose the appropriate Analyzer for your language and application. The Analyzer hands the tokenized content to IndexWriter for indexing.
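To make the tokenizing step concrete, here is a minimal sketch in plain Java. It is not Lucene's StandardAnalyzer, which does considerably more; the class SimpleTokenizerSketch and its rule (lowercase, then split on anything that is not a letter or digit) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Simplified, hypothetical stand-in for an analyzer; NOT Lucene's StandardAnalyzer. */
final class SimpleTokenizerSketch {
    /** Lowercases the text and splits it on any run of characters that are not letters or digits. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints the lowercased words without punctuation.
        System.out.println(tokenize("Lucene indexes TEXT, not PDF or HTML!"));
    }
}
```

Lowercasing at this stage is also why Listing 2 lowercases the query string before building its Term: a TermQuery is compared with the indexed tokens as they are, and StandardAnalyzer lowercases tokens during indexing.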

IndexWriter

IndexWriter is the core class Lucene uses to create an index. Its role is to add Document objects to the index one by one.

Directory

This class represents the location where a Lucene index is stored. It is an abstract class that currently has two implementations: FSDirectory, which represents an index stored in the file system, and RAMDirectory, which represents an index stored in memory.

Now that we are familiar with the classes needed to build an index, we can index the text files in a directory. Listing 1 shows the source code that does this.

Listing 1. Indexing text files

    package testlucene;

    import java.io.File;
    import java.io.FileReader;
    import java.io.Reader;
    import java.util.Date;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    /**
     * This class demonstrates the process of creating an index with Lucene
     * for text files.
     */
    public class TxtFileIndexer {
        public static void main(String[] args) throws Exception {
            // indexDir is the directory that hosts Lucene's index files
            File indexDir = new File("D:\\luceneindex");
            // dataDir is the directory that hosts the text files to be indexed
            File dataDir = new File("D:\\lucenedata");
            Analyzer luceneAnalyzer = new StandardAnalyzer();
            File[] dataFiles = dataDir.listFiles();
            // true: create a new index instead of opening an existing one
            IndexWriter indexWriter = new IndexWriter(indexDir, luceneAnalyzer, true);
            long startTime = new Date().getTime();
            for (int i = 0; i < dataFiles.length; i++) {
                if (dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")) {
                    System.out.println("Indexing file " + dataFiles[i].getCanonicalPath());
                    Document document = new Document();
                    Reader txtReader = new FileReader(dataFiles[i]);
                    // Field names are case-sensitive; Listing 2 uses "path" and "contents"
                    document.add(Field.Text("path", dataFiles[i].getCanonicalPath()));
                    document.add(Field.Text("contents", txtReader));
                    indexWriter.addDocument(document);
                }
            }
            indexWriter.optimize();
            indexWriter.close();
            long endTime = new Date().getTime();
            System.out.println("It takes " + (endTime - startTime)
                + " milliseconds to create index for the files in directory "
                + dataDir.getPath());
        }
    }

In Listing 1, the constructor of IndexWriter takes three arguments. The first specifies where the index is stored; it can be a File object, an FSDirectory object, or a RAMDirectory object. The second specifies an implementation of the Analyzer class, that is, which tokenizer is used to split the text content. The third is a boolean: true means a new index is created, and false means the writer operates on an existing index. The program then iterates over all the text files in the directory and creates a Document object for each one. The two properties of each text file, its path and its contents, are wrapped in two Field objects, the two Field objects are added to the Document object, and finally the document is added to the index with the addDocument method of IndexWriter. This completes the creation of the index. Next, we turn to searching the index we have built.

Search Documents

Searching with Lucene is just as easy as building an index. In the previous section we indexed the text documents in a directory; now we will search that index for documents that contain a keyword or phrase. Lucene provides several basic classes for this: IndexSearcher, Term, Query, TermQuery, and Hits. Here is what each of them does.

Query

This is an abstract class with multiple implementations, such as TermQuery, BooleanQuery, and PrefixQuery. Its purpose is to encapsulate the query string entered by the user in a form that Lucene can recognize.
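The idea of one abstraction with several kinds of query behind it can be sketched in plain Java as follows. None of this is Lucene code: the interface SketchQuery and the factory methods term, prefix, and and are hypothetical names that only loosely mirror what TermQuery, PrefixQuery, and a two-clause BooleanQuery do, and the index is simply a sorted map from words to document ids.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch of "one query abstraction, several implementations"; not Lucene's classes. */
final class QueryKindsSketch {
    /** Every kind of query turns an index (word -> document ids) into a set of matching ids. */
    interface SketchQuery {
        Set<Integer> run(NavigableMap<String, Set<Integer>> index);
    }

    /** Matches documents that contain exactly this word. */
    static SketchQuery term(String word) {
        return index -> new TreeSet<Integer>(index.getOrDefault(word, Collections.<Integer>emptySet()));
    }

    /** Matches documents that contain any word starting with the prefix. */
    static SketchQuery prefix(String prefix) {
        return index -> {
            Set<Integer> hits = new TreeSet<Integer>();
            // In a sorted map, all words with this prefix form one contiguous key range.
            for (Set<Integer> docs : index.subMap(prefix, true, prefix + Character.MAX_VALUE, false).values()) {
                hits.addAll(docs);
            }
            return hits;
        };
    }

    /** Matches documents that satisfy both sub-queries. */
    static SketchQuery and(SketchQuery left, SketchQuery right) {
        return index -> {
            Set<Integer> hits = new TreeSet<Integer>(left.run(index));
            hits.retainAll(right.run(index));
            return hits;
        };
    }

    public static void main(String[] args) {
        NavigableMap<String, Set<Integer>> index = new TreeMap<>();
        index.put("index", new TreeSet<Integer>(Arrays.asList(1, 2)));
        index.put("indexing", new TreeSet<Integer>(Arrays.asList(3)));
        index.put("lucene", new TreeSet<Integer>(Arrays.asList(0, 2)));
        System.out.println(term("lucene").run(index));                     // documents 0 and 2
        System.out.println(prefix("index").run(index));                    // documents 1, 2 and 3
        System.out.println(and(term("lucene"), term("index")).run(index)); // document 2
    }
}
```

Because every query kind exposes the same method, the code that runs a search does not need to know which kind it was given.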

Term

Term is the basic unit of search. A Term object consists of two fields of type String. A Term object can be created with a statement such as: Term term = new Term("fieldName", "queryWord"); The first argument names the Field of the document to search in, and the second is the keyword to look for.
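Why a term carries a field name as well as a word can be shown with a small plain-Java sketch. The FieldTerm class below is a hypothetical stand-in, not Lucene's Term, and the map-based index is an assumption made only to keep the example self-contained.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: an index keyed by (field, word) pairs; not Lucene code. */
final class FieldTermSketch {
    /** Stand-in for a term: the field to search in plus the word to look for. */
    static final class FieldTerm {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof FieldTerm)) {
                return false;
            }
            FieldTerm other = (FieldTerm) o;
            return field.equals(other.field) && text.equals(other.text);
        }

        @Override
        public int hashCode() {
            return Objects.hash(field, text);
        }
    }

    private final Map<FieldTerm, Set<String>> postings = new HashMap<>();

    /** Records that the named document has this word in this field. */
    void add(String field, String text, String docName) {
        postings.computeIfAbsent(new FieldTerm(field, text), k -> new TreeSet<>()).add(docName);
    }

    /** Returns the names of the documents that match the term exactly. */
    Set<String> search(FieldTerm term) {
        return postings.getOrDefault(term, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        FieldTermSketch index = new FieldTermSketch();
        index.add("contents", "lucene", "a.txt");
        index.add("path", "lucene", "b.txt");
        // The same word in a different field is a different term.
        System.out.println(index.search(new FieldTerm("contents", "lucene"))); // only a.txt
        System.out.println(index.search(new FieldTerm("path", "lucene")));     // only b.txt
    }
}
```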

TermQuery

TermQuery is a subclass of the abstract class Query and is the most basic query class Lucene supports. A TermQuery object can be created with a statement such as: TermQuery termQuery = new TermQuery(new Term("fieldName", "queryWord")); Its constructor accepts a single argument, a Term object.

IndexSearcher

IndexSearcher is used to search an existing index. It opens the index in read-only mode only, so multiple IndexSearcher instances can operate on the same index.

Hits

Hits is used to save the results of a search.

Having described the classes needed for searching, we can now search the index we built earlier. Listing 2 shows the code that performs the search.

Listing 2. Searching an existing index

    package testlucene;

    import java.io.File;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.FSDirectory;

    /**
     * This class is used to demonstrate the
     * process of searching on an existing
     * Lucene index.
     */
    public class TxtFileSearcher {
        public static void main(String[] args) throws Exception {
            String queryStr = "Lucene";
            // This is the directory that hosts the Lucene index
            File indexDir = new File("D:\\luceneindex");
            // Check for the index before trying to open it
            if (!indexDir.exists()) {
                System.out.println("The Lucene index does not exist");
                return;
            }
            FSDirectory directory = FSDirectory.getDirectory(indexDir, false);
            IndexSearcher searcher = new IndexSearcher(directory);
            // Lowercase the query word so that it matches the indexed tokens
            Term term = new Term("contents", queryStr.toLowerCase());
            TermQuery luceneQuery = new TermQuery(term);
            Hits hits = searcher.search(luceneQuery);
            for (int i = 0; i < hits.length(); i++) {
                Document document = hits.doc(i);
                System.out.println("File: " + document.get("path"));
            }
        }
    }

In Listing 2, the constructor of IndexSearcher accepts an object of type Directory. Directory is an abstract class that currently has two subclasses: FSDirectory and RAMDirectory. Our program passes in an FSDirectory object, which represents an index stored on disk. Once the constructor has run, the IndexSearcher has opened the index in read-only mode. The program then constructs a Term object that specifies a search of the document contents for the word "lucene". The Term is used to build a TermQuery, which is passed to the search method of IndexSearcher, and the result is stored in a Hits object. Finally, a loop prints the paths of the matching documents. Our search application is now complete. As you can see, developing a search application with Lucene is quite simple.

Summary

This article first introduced some basic concepts of Lucene and then developed an application that demonstrates indexing with Lucene and searching that index. I hope it helps readers who are learning Lucene.
