Lucene Getting Started Example

Source: Internet
Author: User
Tags: commit, data structures, object model

I. Introduction

What Lucene is: Lucene is a subproject of the Apache Software Foundation's Jakarta project group. It is an open-source full-text search engine toolkit: not a complete full-text search engine, but a full-text search engine architecture that provides a complete query engine and index engine, plus part of a text-analysis engine (for two Western languages, English and German). Lucene's goal is to give software developers a simple, easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it. Lucene is Java-based full-text retrieval: not a finished search application, but a code library and API that make it easy to add search functionality to your application. In practice, Lucene indexes a set of strings supplied by the developer and then provides a full-text search service: the user submits a keyword to the search service, and the service reports which strings contain that keyword.

II. The Basic Process

Lucene consists of two parts: indexing and the search service. Indexing writes a source (essentially a string) into the index, or deletes a source from the index; the search service lets the user locate sources by keyword through full-text search.

1. The process of establishing an index

Use an analyzer to process the source string. This includes tokenization, which splits the string into words, and optionally removing stopwords. The valid information in the source is then added to a document as a set of fields, and the document is indexed so that its valid fields are recorded in the index. Finally, the index is written to storage (memory or disk).

2. Process of retrieval

The user provides search keywords, which are processed by the analyzer. The processed keywords are used to search the index and find the matching documents. The user then extracts the required fields from the documents that were found.

III. Basic Concepts

1. Analyzer

The analyzer acts as a tokenizer and removes invalid (stop) words from the string.

The purpose of tokenization is to divide a string into words according to semantic rules. Tokenization is easy to implement for English, because English text is already separated into words by spaces, whereas a Chinese sentence must be explicitly segmented into words in some way. Invalid words, such as "of" and "the" in English, or "的", "了", and "地" in Chinese, appear extensively in text but carry no key information; removing them shrinks the index file and improves the hit rate and execution efficiency.
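As a small illustration of tokenization and stopword removal, the sketch below (assuming lucene-core 3.6.x on the classpath; the class name AnalyzerDemo is invented for this example) feeds a sentence through StandardAnalyzer and prints the tokens that survive. StandardAnalyzer lowercases the input and drops common English stopwords such as "the".

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        // Tokenize a sample sentence against a (hypothetical) "body" field
        TokenStream ts = analyzer.tokenStream("body",
                new StringReader("The quick brown fox"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        while (ts.incrementToken()) {
            // "the" is removed as a stopword; the rest are lowercased
            System.out.println(term.toString());
        }
        ts.close();
    }
}
```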

2. Document

A user-supplied source can be a text file, a string, or a record in a database table. Once a source string has been indexed, it is stored in the index file as a document. The results of the search service are likewise returned as a list of documents.

3. Field

A document can contain multiple fields of information. For example, an article can contain fields such as title, body, and last-modified time, which are stored in the document as Field objects.

A Field has two properties: store and index. The store property controls whether the field's value is stored, and the index property controls whether the field is indexed. This may seem redundant, but in practice the correct combination of the two properties is important.

Here is an example. An article requires full-text search of the title and body, so the index property of both fields is set to true. You also want to extract the article's title directly from the search results, so the store property of the title field is set to true. The body field is too large, however, so to keep the index file small its store property is set to false, and the body is read directly from the file when needed. Finally, you want to extract the last-modified time from the search results but do not need to search on it, so the store property of the last-modified-time field is set to true and its index property is set to false.

Setting both properties of a field to false is prohibited, because such a field would be meaningless to the index.
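The combinations described above can be written down as follows, a sketch using the Lucene 3.6 Field API (the field values here are invented for illustration):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        // title: searched and shown in results -> indexed and stored
        doc.add(new Field("title", "A sample article title",
                Field.Store.YES, Field.Index.ANALYZED));
        // body: searched, but too large to store -> indexed only
        doc.add(new Field("body", "The full text of the article ...",
                Field.Store.NO, Field.Index.ANALYZED));
        // last-modified time: shown in results, not searched -> stored only
        doc.add(new Field("mtime", "2013-01-01",
                Field.Store.YES, Field.Index.NO));

        System.out.println(doc.get("title")); // stored, so retrievable
        System.out.println(doc.get("body"));  // not stored, so null
    }
}
```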

4. Segment

When indexing, documents are not all immediately added to the same index file. They are first written to different small files, which are later merged into one large index file; each of these small files is a segment.

5. Term

A term represents a word in a document and is the smallest unit of search. A term consists of two parts: the text of the word and the field in which the word appears.

6. Token

A token is an occurrence of a term; it contains the term's text, the corresponding start and end offsets, and a type string. The same word can appear multiple times in a sentence; all occurrences are represented by the same term but by different tokens, and each token marks the place where the word appears.
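The term/token distinction can be made visible with a token stream. In this sketch (again assuming lucene-core 3.6.x; the class name TokenDemo is invented), the word "love" occurs twice, so it yields one term but two tokens with different offsets:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class TokenDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        TokenStream ts = analyzer.tokenStream("body",
                new StringReader("love me love my dog"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
        while (ts.incrementToken()) {
            // Each token reports the same term text but its own offsets
            System.out.println(term.toString() + " ["
                    + offset.startOffset() + "," + offset.endOffset() + ")");
        }
        ts.close();
    }
}
```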

IV. Composition and Structure of Lucene

The Lucene core consists of 8 packages: analysis, collation, document, index, queryParser, search, store, and util.

1. Analysis Package (text parsing library)

This package parses text into the corresponding term objects and is used by both the indexer and the searcher: during indexing it parses the files to be indexed, and during search it parses the query conditions.

The analysis package provides a variety of analyzers, such as WhitespaceAnalyzer, which tokenizes on whitespace; StopAnalyzer, which adds a stopword filter; SmartChineseAnalyzer, which supports Chinese word segmentation; and the most commonly used, StandardAnalyzer.

2. Collation package

This package contains two classes with identical functions, CollationKeyFilter and CollationKeyAnalyzer, which convert every token into a CollationKey. The CollationKey is encoded with IndexableBinaryStringTools and stored as a term.

3. Document package (document object model library)

The document package contains the data structures related to documents, such as the Document class and the Field class.

4. Index package (index library)

The index package contains the classes for reading and writing indexes. The most commonly used are the IndexWriter class, which writes, merges, and optimizes the segments of the index files, and the IndexReader class, which reads from and deletes from the index. IndexWriter is concerned only with how indexes are written into segments and how segments are merged and optimized; IndexReader focuses on how each document is organized in the index file. The package is primarily used for indexing documents, maintaining indexes (updating, optimizing, and so on), and providing an access interface for queries.

5. QueryParser package

The queryParser package contains the classes that parse query statements, most notably the QueryParser class, along with the Token class.

6. Search package (Query library)

The search package contains the various query classes (such as TermQuery and BooleanQuery) and the Hits class that holds the result set retrieved from the index. It provides the query interface to the user, scores the user's query results, and returns the most relevant matches.
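A minimal sketch of how these query classes compose (using the Lucene 3.6 API; the field name "body" matches the examples later in this article): two TermQuery objects are combined into a BooleanQuery, requiring "love" and preferring "give".

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class QueryDemo {
    public static void main(String[] args) {
        // Matching documents must contain "love" and score higher with "give"
        BooleanQuery bq = new BooleanQuery();
        bq.add(new TermQuery(new Term("body", "love")), BooleanClause.Occur.MUST);
        bq.add(new TermQuery(new Term("body", "give")), BooleanClause.Occur.SHOULD);
        // Query.toString() shows the standard query syntax: +body:love body:give
        System.out.println(bq.toString());
    }
}
```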

7. Store package (repository)

The store package contains the classes related to index storage. For example, the Directory class defines the storage structure of index files; FSDirectory stores the index in the file system (on disk); RAMDirectory stores the index in memory; and MMapDirectory stores the index using memory mapping. The package thus provides two ways of storing index content: disk storage and memory storage.
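To see the memory-backed storage in use, here is a compact end-to-end sketch (assuming lucene-core 3.6.x on the classpath; the class name RamDirectoryDemo and the sample text are invented) that indexes one document into a RAMDirectory and immediately searches it:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class RamDirectoryDemo {
    public static void main(String[] args) throws Exception {
        // The index lives entirely in memory; no files are created on disk
        Directory dir = new RAMDirectory();
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(dir, iwc);
        Document doc = new Document();
        doc.add(new Field("body", "give me your love",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexReader reader = IndexReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs results = searcher.search(new TermQuery(new Term("body", "love")), 10);
        System.out.println("hits: " + results.totalHits);
        searcher.close();
        reader.close();
    }
}
```

The same code works against FSDirectory.open(new File(...)) for a disk-backed index; only the Directory construction changes.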

8. Util Package (Tool library)

The util package contains common utility classes, such as conversion tools between times and strings.

V. Environment Construction and Testing

Introduce the two core Lucene jars into the project: lucene-core-3.6.2.jar and lucene-core-3.6.2-javadoc.jar.

On the D drive of a Windows system, create a folder named lucene; under it, a temp folder; and under temp, two folders named docs (for the text files) and index (for the index files). Create three txt files in the docs folder, named 1.txt, 2.txt, and 3.txt, and write anything into them.

1. Build the index:

package com.javaeye.lucene;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

@SuppressWarnings("deprecation")
public class LuceneTest {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // The folder to be indexed: D:\lucene\temp\docs
        File fileDir = new File("D:" + File.separator + "lucene"
                + File.separator + "temp" + File.separator + "docs");
        // The location of the index files: D:\lucene\temp\index
        File indexDir = new File("D:" + File.separator + "lucene"
                + File.separator + "temp" + File.separator + "index");

        Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_36);
        Directory dir = FSDirectory.open(indexDir);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, luceneAnalyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        IndexWriter indexWriter = new IndexWriter(dir, iwc);
        // When no index exists yet, an initial commit writes the index version
        // information; opening an IndexReader first would otherwise throw.
        // indexWriter.commit();

        File[] textFiles = fileDir.listFiles();
        long startTime = new Date().getTime();

        // Add each .txt document to the index
        for (int i = 0; i < textFiles.length; i++) {
            if (textFiles[i].isFile() && textFiles[i].getName().endsWith(".txt")) {
                System.out.println("File " + textFiles[i].getCanonicalPath()
                        + " is being indexed.");
                String temp = fileReaderAll(textFiles[i].getCanonicalPath(), "GBK");
                System.out.println(temp);
                Document document = new Document();
                // Store the path, but do not index it
                Field fieldPath = new Field("path", textFiles[i].getPath(),
                        Field.Store.YES, Field.Index.NO);
                // Store and analyze the body, keeping term vectors with positions/offsets
                Field fieldBody = new Field("body", temp, Field.Store.YES,
                        Field.Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS);
                document.add(fieldPath);
                document.add(fieldBody);
                indexWriter.addDocument(document);
            }
        }
        indexWriter.close();

        long endTime = new Date().getTime();
        System.out.println("This cost " + (endTime - startTime)
                + " milliseconds to add the documents to the index. " + fileDir.getPath());
    }

    public static String fileReaderAll(String fileName, String charset)
            throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), charset));
        StringBuilder temp = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            temp.append(line);
        }
        reader.close();
        return temp.toString();
    }
}

Operation Result:

File D:\lucene\temp\docs\1.txt is being indexed.
Give me your love.
File D:\lucene\temp\docs\2.txt is being indexed.
Love came so delicately-yeah!
File D:\lucene\temp\docs\3.txt is being indexed.
Because guardian of our love ...
This cost 419 milliseconds to add the documents to the index. D:\lucene\temp\docs
2. Querying after the index has been built

package com.javaeye.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class QueryTest {

    /**
     * @param args
     */
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException, ParseException {
        String index = "D:\\lucene\\temp\\index";
        File file = new File(index);
        System.out.println(file.getCanonicalPath());

        IndexReader reader = IndexReader.open(FSDirectory.open(file));
        IndexSearcher searcher = new IndexSearcher(reader);
        ScoreDoc[] hits = null;
        String queryStr = "Love"; // the search keyword
        Query query = null;
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        try {
            QueryParser qp = new QueryParser(Version.LUCENE_36, "body", analyzer);
            query = qp.parse(queryStr);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        if (searcher != null) {
            TopDocs results = searcher.search(query, 10);
            hits = results.scoreDocs;
            if (hits.length > 0) {
                System.out.println("Found " + hits.length + " results");
                for (int i = 0; i < hits.length; i++) {
                    // Fetch each stored document and print its body field
                    Document doc = searcher.doc(hits[i].doc);
                    System.out.println(doc.get("body"));
                }
            }
            searcher.close();
        }
        System.out.println("Successful completion of this test.....");
    }
}

Operation Result:

D:\lucene\temp\index
Found 3 results
Love came so delicately-yeah!
Because guardian of our love ...
Give me your love.
Successful completion of this test .....

Ps:

Lucene is really simple; it mainly does two things: indexing and searching.
Here is a quick look at some of the terminology used in Lucene. This is not intended as a detailed introduction, just a brief note on each term, because there is a good thing in this world called search.

IndexWriter: one of the most important classes in Lucene, primarily used to index documents and to control some of the parameters of the indexing process.

Analyzer: a parser, used primarily to analyze the various texts a search engine encounters. Commonly used analyzers include StandardAnalyzer, StopAnalyzer, WhitespaceAnalyzer, and so on.

Directory: where the index resides. Lucene provides two index locations, one on disk and one in memory; the index is generally placed on disk, and Lucene provides the FSDirectory and RAMDirectory classes accordingly.

Document: a document, equivalent to one unit to be indexed; anything to be indexed must first be converted into a Document object.

Field: a field of a document.

IndexSearcher: the most basic retrieval tool in Lucene; all retrieval goes through an IndexSearcher.

Query: a query. Lucene supports fuzzy queries, semantic queries, phrase queries, combined queries, and so on, via classes such as TermQuery, BooleanQuery, RangeQuery, and WildcardQuery.

QueryParser: a tool that parses user input and generates a Query object by scanning the string the user entered.

Hits: after the search completes, the results need to be returned and displayed to the user; only then is the search finished. In Lucene, the collection of search results is represented by an instance of the Hits class.




