Lucene 3.6 Getting Started Example

I. Introduction
What is Lucene? Lucene is a subproject of the Apache Software Foundation's Jakarta project: an open-source full-text search engine toolkit. It is not a complete full-text search engine but a full-text search engine architecture that provides a complete query engine and indexing engine, plus part of a text-analysis engine (for the two Western languages English and German). The purpose of Lucene is to give software developers an easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it.

Lucene is a Java-based full-text search library, not a complete search application: it is a code base and API that makes it easy to add search capability to an application. In essence, Lucene indexes a number of strings supplied by the developer and then provides a full-text search service: the user supplies a keyword, and the service reports which indexed strings contain it.


II. The Basic Process

Lucene consists of two parts: indexing and search. Indexing writes sources (essentially strings) into the index, or deletes them from it; search provides the full-text search service that lets the user locate sources by keyword.

1. The indexing process
Process the source string with an analyzer, which tokenizes it (splits it into words) and optionally removes stop words.
Add the useful information in the source to a document as separate fields, and index the document so that each valid field is recorded in the index.

Write the index to storage (memory or disk).
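The steps above can be sketched without Lucene at all. The following toy class is plain Java, not the Lucene API; all names in it are invented for illustration. It builds a minimal inverted index that maps each word (after an analyzer-style lower-casing and stop-word pass) to the ids of the documents containing it:

```java
import java.util.*;

public class ToyIndexer {
    // Stop words dropped during analysis, as described above.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("of", "the", "a"));

    // The inverted index: term -> ids of the documents that contain it.
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> inverted = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            // "Analyzer" step: lower-case, split on whitespace, drop stop words.
            for (String word : docs.get(id).toLowerCase().split("\\s+")) {
                if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
                inverted.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> idx = index(Arrays.asList(
                "the art of search", "search engines", "the engines"));
        System.out.println(idx.get("search"));  // documents 0 and 1
        System.out.println(idx.get("the"));     // stop word: never indexed
    }
}
```

Real Lucene segments, fields, and merge policies are far more involved, but the term-to-documents mapping is the core idea.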


2. The search process
The user supplies search keywords, which are processed by the analyzer.
The processed keywords are used to search the index and find the matching documents.
The user extracts the required fields from the documents found, as needed.
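Continuing the plain-Java toy sketch (again, not the Lucene API; the index contents here are made up): the keyword goes through the same analysis the documents did, and the resulting terms are looked up in the inverted index, intersecting the document lists when there are several terms:

```java
import java.util.*;

public class ToySearcher {
    // A tiny pre-built inverted index: term -> document ids.
    static final Map<String, Set<Integer>> INDEX = new HashMap<>();
    static {
        INDEX.put("lucene", new TreeSet<>(Arrays.asList(0, 2)));
        INDEX.put("search", new TreeSet<>(Arrays.asList(1, 2)));
    }

    // Analyze the keywords exactly as the documents were analyzed,
    // then intersect the document lists of all query terms.
    static Set<Integer> search(String keywords) {
        Set<Integer> result = null;
        for (String term : keywords.toLowerCase().split("\\s+")) {
            Set<Integer> docs = INDEX.getOrDefault(term, Collections.emptySet());
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? Collections.emptySet() : result;
    }

    public static void main(String[] args) {
        System.out.println(search("Lucene"));        // [0, 2]
        System.out.println(search("Lucene Search")); // only doc 2 matches both
    }
}
```

Lucene additionally scores and ranks the matches; this sketch only shows why the keyword must pass through the same analyzer as the indexed text.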

III. Basic Concepts
1. Analyzer
The analyzer's job is tokenization and the removal of invalid (stop) words.
Tokenization splits a string into words according to semantic rules. Tokenizing English is relatively easy, because English takes the word as its unit and words are already separated by spaces, whereas a Chinese sentence must be split into individual words by some segmentation method. Invalid words are words such as "of" and "the" in English, and their counterparts in Chinese, which appear in large numbers in articles but carry no key information. Removing them shrinks the index files and improves hit ratio and execution efficiency.

2. Document
A user-supplied source can be a text file, a string, a record in a database table, and so on. Once a source string is indexed, it is stored in the index file as a document. The results of the search service are also returned as a list of documents.

3. Field
A document can contain multiple pieces of information, such as "title", "body", and "last modified"; each piece of information is stored in the document as a field.
A field has two properties: stored and indexed. The stored property controls whether the field's value is saved in the index, and the indexed property controls whether the field is searchable. This may seem redundant, but in fact choosing the correct combination of these two properties is important.

Consider an article that needs full-text search over its title and body: the indexed property of both fields is set to true. To be able to extract the article's title directly from search results, the title field's stored property is also set to true. The body field, however, is too large; to keep the index file small, its stored property is set to false, and the body is read directly from the file when needed. The last-modified time should be extractable from search results but need not be searchable, so its stored property is set to true and its indexed property to false.
Setting both properties of a field to false is prohibited, because such a field is meaningless to the index.
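The decision above can be modeled as a tiny plain-Java sketch (mirroring, but not using, Lucene's Field.Store/Field.Index flags; the class and method names are invented). Note that the all-false combination is rejected outright, just as the text says:

```java
public class FieldProps {
    // A field configuration: whether the value is stored, whether it is searchable.
    static String describe(String name, boolean stored, boolean indexed) {
        if (!stored && !indexed) {
            // A field that is neither stored nor indexed is meaningless.
            throw new IllegalArgumentException(name + ": a field must be stored or indexed");
        }
        return name + " -> stored=" + stored + ", indexed=" + indexed;
    }

    public static void main(String[] args) {
        System.out.println(describe("title", true, true));          // searchable, shown in results
        System.out.println(describe("body", false, true));          // searchable, re-read from file
        System.out.println(describe("lastModified", true, false));  // shown in results, not searchable
    }
}
```

In Lucene 3.6 itself these choices are expressed with constants such as Field.Store.YES/NO and Field.Index.ANALYZED/NO, as the sample code later in this article shows.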

4. Segment

When building an index, not every document is added to the same index file immediately: documents are first written to separate small files, which are later merged into one large index file. Each of these small files is a segment.

5. Term
A term represents a word in a document and is the smallest unit of search. A term consists of two parts: the text of the word and the field in which it appears.

6. Token
A token is an occurrence of a term; it contains the term's text, its start and end offsets, and a type string. The same word can appear several times in one sentence: all occurrences share the same term, but each is a different token recording the position where that occurrence appears.
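To make the term/token distinction concrete, here is a plain-Java sketch (not the Lucene API; the Token class here is invented) that records each occurrence of a word together with its start and end offsets:

```java
import java.util.*;

public class TokenDemo {
    // A token: the term text plus the offsets of this particular occurrence.
    static final class Token {
        final String text; final int start; final int end;
        Token(String text, int start, int end) { this.text = text; this.start = start; this.end = end; }
        public String toString() { return text + "[" + start + "," + end + ")"; }
    }

    static List<Token> tokenize(String sentence) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        for (String word : sentence.split(" ")) {
            tokens.add(new Token(word.toLowerCase(), pos, pos + word.length()));
            pos += word.length() + 1; // skip the separating space
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "search" appears twice: one term, but two tokens with different offsets.
        for (Token t : tokenize("search and search again")) System.out.println(t);
    }
}
```

Running this prints the word "search" twice with offsets [0,6) and [11,17): a single term, two tokens.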

IV. The Structure of Lucene

Lucene consists of a core and a sandbox. The core is the heart of Lucene; the sandbox contains additional functionality such as the highlighter and various extra analyzers. The Lucene core contains eight packages: analysis, collation, document, index, queryParser, search, store, and util.

1. The analysis package
The analysis package provides a variety of analyzers, such as WhitespaceAnalyzer, which tokenizes on whitespace; StopAnalyzer, which adds a stop-word filter; SmartChineseAnalyzer, which supports Chinese tokenization; and the most commonly used, StandardAnalyzer.

2. The collation package
Contains CollationKeyFilter and CollationKeyAnalyzer, two classes with the same function: they convert all tokens into CollationKeys, which are encoded with IndexableBinaryStringTools so that they can be stored as terms.

3. The document package
The document package contains the data structures related to documents, such as the Document class and the Field class.

4. The index package
The index package contains the classes for reading and writing indexes. The commonly used ones are IndexWriter, which writes, merges, and optimizes the segments of index files, and IndexReader, which reads from and deletes from the index. IndexWriter only cares about writing segments to the index and merging them for optimization; IndexReader focuses on how the documents in the index file are organized.

5. The queryParser package
The queryParser package contains the classes for parsing query expressions, most commonly the QueryParser class, along with the Token class.

6. The search package
The search package contains the various query classes used to search the index (such as TermQuery and BooleanQuery) and the classes that represent search results.

7. The store package
The store package contains the classes related to index storage, such as Directory, which defines the storage structure of an index file; FSDirectory, which stores the index in the file system (that is, on disk); RAMDirectory, which stores the index in memory; and MMapDirectory, which stores the index using memory mapping.

8. The util package
The util package contains common utility classes, such as conversion tools between time values and strings.


V. Setting Up the Environment

This article uses Lucene 3.6: download lucene-3.6.0.zip from the official website and unpack it.
Jars to use:
\lucene-3.6.0\lucene-core-3.6.0.jar ------> Lucene core package
\lucene-3.6.0\contrib\analyzers\common\lucene-analyzers-3.6.0.jar ------> analyzers (tokenizers)
\lucene-3.6.0\contrib\highlighter\lucene-highlighter-3.6.0.jar ------> keyword highlighting
\lucene-3.6.0\contrib\memory\lucene-memory-3.6.0.jar ------> used by the highlighter



VI. Sample Code

package com.yulore.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.util.Version;
// import org.wltea.analyzer.lucene.IKAnalyzer;  // optional third-party Chinese analyzer

public class LuceneDemo02 {

    public static void main(String[] args) throws IOException {
        createIndex();
        queryLucene("Lucene");
    }

    /**
     * Search by keyword.
     * @param keyword the keyword to search for
     */
    public static void queryLucene(String keyword) {
        try {
            File path = new File("D:/luceneex");
            Directory mdDirectory = FSDirectory.open(path);

            // Must match the analyzer used when the index was created.
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);

            IndexReader reader = IndexReader.open(mdDirectory);
            IndexSearcher searcher = new IndexSearcher(reader);

            // Single-field alternative (needs org.apache.lucene.index.Term
            // and org.apache.lucene.search.TermQuery):
            // Query query = new TermQuery(new Term("title", keyword));

            // Search in multiple fields.
            String[] fields = {"title", "tag"};
            QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_36, fields, analyzer);
            Query query = queryParser.parse(keyword);

            long start = System.currentTimeMillis();
            ScoreDoc[] docs = searcher.search(query, null, 3).scoreDocs;
            for (int i = 0; docs != null && i < docs.length; i++) {
                Document doc = searcher.doc(docs[i].doc);
                int id = Integer.parseInt(doc.get("id"));
                String title = doc.get("title");
                String author = doc.get("author");
                String tag = doc.get("tag");
                String reputation = doc.get("reputation");
                System.out.println(id + " " + title + " " + author + " " + tag + " " + reputation);
            }
            searcher.close();
            reader.close();
            long end = System.currentTimeMillis();
            System.out.println("queryLucene time consuming: " + (end - start) + " ms");
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (ParseException e) {
            e.printStackTrace();
        }
    }

    /**
     * Create the index.
     */
    private static void createIndex() {
        try {
            File path = new File("D:/luceneex");
            Directory mdDirectory = FSDirectory.open(path);

            // Analyzer shipped with Lucene:
            Analyzer mAnalyzer = new StandardAnalyzer(Version.LUCENE_36);
            // Or a third-party Chinese analyzer instead:
            // Analyzer mAnalyzer = new IKAnalyzer();

            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, mAnalyzer);
            IndexWriter writer = new IndexWriter(mdDirectory, config);

            long start = System.currentTimeMillis();
            for (int i = 0; i < 10; i++) {
                Document doc = new Document();
                Field id = new Field("id", "10000" + i, Store.YES, Index.ANALYZED);
                Field title = new Field("title", "Getting started with Lucene " + i, Store.YES, Index.ANALYZED);
                Field author = new Field("author", "Yang " + i, Store.YES, Index.ANALYZED);
                Field tag = new Field("tag", "Lucene, full-text search " + i, Store.YES, Index.ANALYZED);
                Field reputation = new Field("reputation", "a good book " + i, Store.YES, Index.ANALYZED);
                doc.add(id);
                doc.add(title);
                doc.add(author);
                doc.add(tag);
                doc.add(reputation);
                writer.addDocument(doc);
            }
            writer.close();
            long end = System.currentTimeMillis();
            System.out.println("createIndex time consuming: " + (end - start) + " ms");
        } catch (CorruptIndexException e) {
            e.printStackTrace();
        } catch (LockObtainFailedException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}



Reference http://www.cnblogs.com/bluepoint2009/archive/2012/09/25/introduction-to-lucene36.html
