Lucene and its indexing and searching processes

Source: Internet
Author: User
Tags: createindex

Lucene owes its popularity and success to its simplicity.

You do not need a deep understanding of how Lucene indexes and retrieves information in order to use it.

Lucene provides simple but powerful core APIs for full-text indexing and retrieval. You only need to master a few classes to integrate Lucene into applications.

People who are new to Lucene may mistakenly think that it is a file search tool, a web crawler, or a web search engine. In fact, Lucene is a software library, not a full-featured search application. It handles full-text indexing and search, and it does that job well. Lucene hides the complexity of indexing and search operations behind simple APIs, so your application can concentrate on its own problem domain and business rules. You can think of Lucene as a layer that your application sits on top of.

Lucene lets you add indexing and search functions to your applications. It does not care where the data comes from: it can index and search any data that can be converted to text. This means you can use Lucene to index and search web pages on remote web servers, documents stored in local file systems, plain text files, Microsoft Word documents, HTML or PDF files, or any other file format from which text can be extracted.

At the core of every search engine is the concept of an index: the original data is processed into an efficient cross-reference structure that supports fast lookups. Let's take a look at the indexing and searching processes.

1. What is an index and why is it so important?

Suppose you need to search a large number of files to find those that contain a given word or phrase. How would you write a program to do this? One approach is to scan each file in turn, looking for the word or phrase. This method has many drawbacks, the most obvious being that it becomes unacceptably slow when there are many files. This is the problem that indexes solve. To search a large amount of text quickly, you first convert the text into a specific structure that supports fast lookups and eliminates the slow sequential scan. This structure is called an index, and the process of converting text into it is called indexing.
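To make the sequential approach concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class name SequentialScanSketch, the in-memory map standing in for files on disk, and the sample texts are all made up for illustration. Note that every query has to read every document, so the cost of each search grows with the size of the whole collection.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class SequentialScanSketch {

    // Scans every document in turn and returns the names of those
    // whose text contains the phrase (case-insensitive substring match).
    static List<String> scan(Map<String, String> docs, String phrase) {
        String needle = phrase.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            if (doc.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical file names and contents, held in memory for simplicity.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "Lucene is a search library");
        docs.put("b.txt", "Java is a programming language");
        docs.put("c.txt", "Lucene is written in Java");
        System.out.println(scan(docs, "java")); // prints [b.txt, c.txt]
    }
}
```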

As a data structure, an index gives you fast random access to the words stored in it. It works like the index at the back of a book: each word points to the pages where it appears, so you can go straight to those pages instead of reading the book page by page. A Lucene index is a specially designed data structure that is usually stored in the file system as a set of index files.
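The idea can be sketched in a few lines of plain Java as a map from each word to the ids of the documents that contain it, a structure commonly called an inverted index. This only illustrates the concept under simplifying assumptions and is not Lucene's actual index format; the class InvertedIndexSketch and its tokenization rule (lowercase, split on non-alphanumeric characters) are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class InvertedIndexSketch {

    // term -> sorted ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index creation: split the text into lowercase terms and
    // record the document id under each term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Search: one map lookup, with no scan over the documents themselves.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "Lucene is a search library");
        index.add(1, "Java is a programming language");
        index.add(2, "Lucene is written in Java");
        System.out.println(index.search("lucene")); // prints [0, 2]
        System.out.println(index.search("java"));   // prints [1, 2]
    }
}
```

The work of reading the text is done once, when a document is added; each later search is a lookup whose cost does not depend on how much text was indexed.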

2. What is search?

Searching is the process of looking up keywords in the index to find the documents that contain them. Search quality is generally described in terms of precision and recall. Recall is the ratio of relevant documents in the result set to the total number of documents relevant to the user's query; precision is the ratio of relevant documents in the result set to the total number of results returned. Other factors matter as well, such as speed and the ability to search large amounts of text quickly. Support for single-term and multi-term queries, phrase queries, wildcards, and ranking and sorting of results is also important.
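Both ratios can be computed directly from two sets of document ids: the documents a search returned and the documents that are actually relevant to the query. The sketch below is plain Java with made-up ids; the class SearchQualitySketch is hypothetical, and returning 0.0 when a set is empty is a convention chosen for this example only. The share of returned results that are relevant is the measure usually called precision.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SearchQualitySketch {

    // Number of returned documents that are actually relevant.
    static int relevantReturned(Set<Integer> returned, Set<Integer> relevant) {
        Set<Integer> both = new HashSet<>(returned);
        both.retainAll(relevant);
        return both.size();
    }

    // Precision: relevant results returned / all results returned.
    static double precision(Set<Integer> returned, Set<Integer> relevant) {
        if (returned.isEmpty()) {
            return 0.0; // convention for this sketch
        }
        return (double) relevantReturned(returned, relevant) / returned.size();
    }

    // Recall: relevant results returned / all relevant documents.
    static double recall(Set<Integer> returned, Set<Integer> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // convention for this sketch
        }
        return (double) relevantReturned(returned, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<Integer> returned = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> relevant = new HashSet<>(Arrays.asList(2, 4, 6, 8, 10));
        // Documents 2 and 4 are both returned and relevant.
        System.out.println(precision(returned, relevant)); // prints 0.5
        System.out.println(recall(returned, relevant));    // prints 0.4
    }
}
```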

3. Lucene in action

Suppose we need to index and search the files stored in a directory.

Before we can search with Lucene, we need to create an index. The Lucene version used here is 3.6.

3.1 Create an index

1) Create a directory for storing the index.

2) Create the index writer configuration class, IndexWriterConfig.

3) Create an IndexWriter from the index directory and the configuration.

4) Use the IndexWriter to write documents to the index.

Indexer class:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.index.LogMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Indexer
 * @author luxh
 */
public class Indexer {

    /**
     * Create an index.
     * @param filePath path of the directory holding the files to index
     * @throws IOException
     */
    public static void createIndex(String filePath) throws IOException {
        // Store the index in a directory named indexDir under the current path
        File indexDir = new File("./indexDir");
        Directory directory = FSDirectory.open(indexDir);

        // Create the analyzer
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

        // Create the index writer configuration
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_36, analyzer);

        LogMergePolicy mergePolicy = new LogByteSizeMergePolicy();
        // How often segments are merged as documents are added.
        // A small value slows index creation; a large value speeds it up.
        // Values above 10 suit batch indexing.
        mergePolicy.setMergeFactor(50);
        // Maximum number of documents in a merged segment.
        // Smaller values make appending to the index faster;
        // larger values suit batch indexing and faster searching.
        mergePolicy.setMaxMergeDocs(5000);
        // Use the compound file format, which combines a segment's files
        mergePolicy.setUseCompoundFile(true);
        indexWriterConfig.setMergePolicy(mergePolicy);

        // Create the index if it does not exist, otherwise append to it
        indexWriterConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);

        // Create the index writer
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);
        try {
            File[] files = new File(filePath).listFiles();
            if (files == null) {
                throw new IOException("Not a readable directory: " + filePath);
            }
            for (File file : files) {
                // Everything to be indexed must first be converted to a Document
                Document document = new Document();
                // File name: searchable, analyzed, stored in the index
                document.add(new Field("name", getFileName(file), Store.YES, Index.ANALYZED));
                // File path: searchable, not analyzed, stored in the index
                document.add(new Field("path", file.getAbsolutePath(), Store.YES, Index.NOT_ANALYZED));
                // Large text content: searchable but not stored; the actual
                // text can be read back from the file path.
                // document.add(new Field("content", new FileReader(file)));
                // Small text content: can be stored in the index
                document.add(new Field("content", getFileContent(file), Store.YES, Index.ANALYZED));
                // Add the document to the index
                indexWriter.addDocument(document);
            }
        } finally {
            // Commit the index to disk and close the writer
            indexWriter.close();
        }
    }

    /**
     * Get the file name without its extension.
     */
    public static String getFileName(File file) {
        String fileName = "";
        if (file != null) {
            String name = file.getName();
            int dot = name.lastIndexOf('.');
            // Names without a dot are returned unchanged
            fileName = dot >= 0 ? name.substring(0, dot) : name;
        }
        return fileName;
    }

    /**
     * Get the text content of a file.
     * @param file
     */
    public static String getFileContent(File file) {
        BufferedReader br = null;
        String content = "";
        try {
            br = new BufferedReader(new FileReader(file));
            StringBuilder sb = new StringBuilder();
            String line = br.readLine();
            while (line != null) {
                // Keep a separator so words on adjacent lines do not run together
                sb.append(line).append('\n');
                line = br.readLine();
            }
            content = sb.toString();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (br != null) {
                    br.close(); // also closes the underlying FileReader
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return content;
    }
}

IndexWriter: the class used to create and maintain an index.

In Lucene 3.6, only one constructor is recommended, IndexWriter(Directory d, IndexWriterConfig conf); the other constructors are deprecated. All IndexWriter settings are managed through IndexWriterConfig.

IndexWriterConfig: the configuration class that manages all settings related to the index writer. It has only one constructor, IndexWriterConfig(Version matchVersion, Analyzer analyzer). The matchVersion parameter is the Lucene version, and analyzer is the analyzer to use.

Next, run the indexer to create an index.

import java.io.IOException;

import org.junit.Test;

public class TestIndexer {

    /**
     * Create the index.
     * @throws IOException
     */
    @Test
    public void testCreateIndex() throws IOException {
        // Directory holding the files to be indexed
        String filePath = "./fileDir";
        // Call the indexer's index creation method
        Indexer.createIndex(filePath);
    }
}

This creates an index for the files in the fileDir directory under the current path.

3.2 Search

Searching in Lucene is as simple and fast as creating an index. Now, we create a searcher to search for files containing specific text.

1) Use QueryParser to parse the query keywords into a Lucene Query object. Creating a QueryParser requires an analyzer, which must be the same as the one used when the index was created.

2) Use FSDirectory to open the directory where the index is located.

3) Use IndexReader to read the index directory and IndexSearcher to search it.

4) The search returns a TopDocs object as the result. TopDocs contains the total number of hits and a scoreDocs array holding the results.

5) Iterate over the scoreDocs array and retrieve each document by the document number of its ScoreDoc.
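The requirement in step 1, that the same text analysis be applied at query time as at indexing time, can be illustrated without Lucene. In the plain-Java sketch below, the two "analyzers" are hypothetical toy functions (one lowercases before splitting on whitespace, the other keeps case); they are not Lucene's Analyzer API. When the query is processed differently from the indexed text, a term that is present in the index is not found.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class AnalyzerMismatchSketch {

    // Toy analyzer A: lowercases the text, then splits on whitespace.
    static Set<String> lowercasing(String text) {
        return caseSensitive(text.toLowerCase(Locale.ROOT));
    }

    // Toy analyzer B: splits on whitespace and keeps the original case.
    static Set<String> caseSensitive(String text) {
        return new HashSet<>(Arrays.asList(text.trim().split("\\s+")));
    }

    public static void main(String[] args) {
        // Terms produced at indexing time with analyzer A.
        Set<String> indexedTerms = lowercasing("Lucene is written in Java");

        // Same analyzer at query time: "Java" becomes "java" and matches.
        System.out.println(indexedTerms.containsAll(lowercasing("Java")));   // prints true

        // Different analyzer at query time: "Java" stays as is and is not found.
        System.out.println(indexedTerms.containsAll(caseSensitive("Java"))); // prints false
    }
}
```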

The searcher's code:

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 * Searcher
 * @author luxh
 */
public class Searcher {

    /**
     * Search.
     * @param keyword search keyword
     * @param indexDirPath index directory path
     * @return matching documents
     */
    public static List<Document> search(String keyword, String indexDirPath)
            throws ParseException, IOException {
        String[] fields = {"name", "content"};
        // Same analyzer as the one used for indexing
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        // Create the query parser
        QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_36, fields, analyzer);
        // Parse the keyword into a Lucene Query object
        Query query = queryParser.parse(keyword);

        // Open the index directory
        Directory directory = FSDirectory.open(new File(indexDirPath));
        // Open the index for reading and searching
        IndexReader indexReader = IndexReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        // Return only the top 100 results
        TopDocs topDocs = indexSearcher.search(query, 100);
        // Total number of hits
        int totalCount = topDocs.totalHits;
        System.out.println("Total number of hits: " + totalCount);

        // Collect the matching documents
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        List<Document> docs = new ArrayList<Document>();
        for (ScoreDoc scoreDoc : scoreDocs) {
            // Look up each document by its document number
            docs.add(indexSearcher.doc(scoreDoc.doc));
        }

        // Close the searcher before the reader it wraps
        indexSearcher.close();
        indexReader.close();
        return docs;
    }
}

Next, we run the searcher:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.ParseException;
import org.junit.Test;

public class TestSearcher {

    /**
     * Search.
     */
    @Test
    public void testSearch() throws IOException, ParseException {
        // Search keyword
        String keyword = "Java";
        // Index directory path
        String indexDirPath = "./indexDir";
        // Call the searcher
        List<Document> docs = Searcher.search(keyword, indexDirPath);
        for (Document doc : docs) {
            System.out.println("File name: " + doc.get("name"));
            System.out.println("Path: " + doc.get("path"));
            System.out.println("Content: " + doc.get("content"));
        }
    }
}

Any file that contains the keyword is returned in the search results.