Lucene Getting Started example

Source: Internet
Author: User
Tags: commit, object model

I. Introduction

What is Lucene? Lucene is an open-source full-text search engine toolkit and a subproject of the Apache Software Foundation's Jakarta project group. It is not a complete full-text search engine but a full-text search engine architecture: it provides a complete query engine and indexing engine, plus part of a text analysis engine (for the two Western languages English and German). Lucene's purpose is to give software developers an easy-to-use toolkit for adding full-text retrieval to a target system, or for building a complete full-text search engine on top of it. Lucene is Java-based; it is not a finished search application but a code library and API that can easily add search capability to applications. In essence, Lucene indexes a number of strings supplied by developers and then provides a full-text search service over them: the user gives the service a keyword, and the service reports the various strings in which that keyword appears.

II. The basic process

Lucene consists of two parts: indexing and search. Indexing writes sources (essentially strings) into the index, or deletes them from it; search provides the user with a full-text search service, so the user can locate sources by keyword.
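The two processes can be sketched with a toy inverted index in plain Java. This is only an illustration of the idea, not Lucene's implementation: indexing maps each word to the documents containing it, and search looks a keyword up in that map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class TinyIndex {
    // word -> ids of the documents containing it
    private final Map<String, TreeSet<Integer>> index = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // "Indexing": split the source string into words and record each one
    public void addDocument(String text) {
        int docId = docs.size();
        docs.add(text);
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            index.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
        }
    }

    // "Search": locate the sources through the keyword
    public List<String> search(String keyword) {
        List<String> results = new ArrayList<>();
        TreeSet<Integer> ids = index.get(keyword.toLowerCase());
        if (ids != null) {
            for (int id : ids) results.add(docs.get(id));
        }
        return results;
    }

    public static void main(String[] args) {
        TinyIndex idx = new TinyIndex();
        idx.addDocument("Give me your love");
        idx.addDocument("Love comes so delicate");
        idx.addDocument("Because to protect our love");
        System.out.println(idx.search("love").size()); // prints 3
    }
}
```

Real Lucene adds analysis, scoring, and on-disk storage on top of this basic word-to-document mapping.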

1. Indexing process

Use an analyzer to process the source string; this includes tokenization (splitting the string into words) and, optionally, removal of stop words. Add the valid information from the source to a document as distinct fields, and index the document so that the valid fields are recorded in the index. Finally, write the index to storage (memory or disk).

2. Retrieval process

The user supplies search keywords, which are processed by the analyzer. The processed keywords are used to search the index and find the corresponding documents. The user then extracts the required fields from the found documents as needed.

III. Basic concepts

1. Analyzer

The analyzer's role is tokenization and the removal of invalid words from the string.

Tokenization divides a string into words according to certain semantic rules. Tokenizing English is relatively easy, because English already separates words with spaces; Chinese sentences, by contrast, must be split into individual words by some other means. Invalid words are stop words such as "of" and "the" in English, and their counterparts in Chinese: words that appear in large numbers in articles but carry no key information. Removing them shrinks the index files and improves hit ratio and execution efficiency.
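A minimal sketch of what an analyzer does, in plain Java: split on whitespace, lowercase, and drop stop words. The stop-word list here is a made-up sample; this is an illustration of the concept, not Lucene's Analyzer code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ToyAnalyzer {
    // Sample stop-word list (an assumption for illustration)
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "to"));

    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            // keep only non-empty words that are not stop words
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                tokens.add(word);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The purpose of the analyzer"));
        // prints [purpose, analyzer]
    }
}
```

Lucene's real analyzers (StandardAnalyzer, StopAnalyzer, and so on) do the same kind of work, plus language-specific handling.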

2. Document

A user-supplied source can be a text file, a string, a record in a database table, and so on. Once a source string is indexed, it is stored in the index file as a document. Search results are likewise returned as a list of documents.

3. Field

A document can contain multiple pieces of information, such as "title", "body", and "last modified"; each piece is stored in the document as a field.

A field has two properties: stored and indexed. The stored property controls whether the field's value is saved in the index, and the indexed property controls whether the field can be searched. This may seem superfluous, but in fact choosing the right combination of these two properties is important.

Here is an example. Suppose an article needs full-text search over its title and body, so the indexed property of both fields is set to true. To be able to show the article's title directly in search results, the stored property of the title field is also set to true. The body field, however, is large; to keep the index file small, its stored property is set to false, and the body is read directly from the original file when needed. Finally, to show the last-modified time in search results without making it searchable, the stored property of the last-modified field is set to true and its indexed property to false.

The combination in which both properties are false is forbidden, because such a field would be meaningless to the index.
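The stored/indexed distinction can be modeled with a toy document class in plain Java. The field values below are made up, and this is an illustration of the two properties, not Lucene's API: only indexed fields can match a search, and only stored fields can be read back from a hit.

```java
import java.util.ArrayList;
import java.util.List;

public class ToyDocument {
    static class ToyField {
        final String name, value;
        final boolean stored;  // value can be read back from a result
        final boolean indexed; // field participates in search
        ToyField(String name, String value, boolean stored, boolean indexed) {
            this.name = name; this.value = value;
            this.stored = stored; this.indexed = indexed;
        }
    }

    private final List<ToyField> fields = new ArrayList<>();

    void add(ToyField f) { fields.add(f); }

    // A hit can only occur on indexed fields
    boolean matches(String keyword) {
        for (ToyField f : fields)
            if (f.indexed && f.value.toLowerCase().contains(keyword.toLowerCase()))
                return true;
        return false;
    }

    // Only stored fields can be extracted from a search result
    String get(String name) {
        for (ToyField f : fields)
            if (f.name.equals(name) && f.stored) return f.value;
        return null;
    }

    public static void main(String[] args) {
        ToyDocument doc = new ToyDocument();
        doc.add(new ToyField("title", "Lucene Getting Started", true, true));
        doc.add(new ToyField("body", "full-text search with Lucene", false, true));
        doc.add(new ToyField("lastModified", "2013-01-01", true, false));
        System.out.println(doc.matches("lucene"));   // true: body and title are indexed
        System.out.println(doc.get("body"));         // null: body is not stored
        System.out.println(doc.get("lastModified")); // stored but not searchable
    }
}
```

This mirrors the article example: body is indexed but not stored, lastModified is stored but not indexed.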

4. Segment

When an index is built, documents are not immediately appended to a single index file; they are first written to separate small files, which are later merged into one large index file. Each of those small files is a segment.
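A segment merge can be sketched under the simplifying assumption that a segment is just a map from terms to the set of documents containing them; merging unions the posting sets into one larger map. Real Lucene segments are on-disk file structures, so this is only a conceptual illustration.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class SegmentMerge {
    // Union several small "segments" (term -> doc-id postings) into one
    @SafeVarargs
    static Map<String, TreeSet<Integer>> merge(Map<String, TreeSet<Integer>>... segments) {
        Map<String, TreeSet<Integer>> merged = new TreeMap<>();
        for (Map<String, TreeSet<Integer>> seg : segments) {
            for (Map.Entry<String, TreeSet<Integer>> e : seg.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new TreeSet<>())
                      .addAll(e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, TreeSet<Integer>> seg1 = new TreeMap<>();
        seg1.put("love", new TreeSet<>(Arrays.asList(0)));
        Map<String, TreeSet<Integer>> seg2 = new TreeMap<>();
        seg2.put("love", new TreeSet<>(Arrays.asList(1, 2)));
        seg2.put("delicate", new TreeSet<>(Arrays.asList(1)));
        System.out.println(merge(seg1, seg2)); // {delicate=[1], love=[0, 1, 2]}
    }
}
```

Merging keeps each small write fast while still ending up with one consolidated index, which is why Lucene writes segments first and merges later.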

5. Term

A term represents a word from a document and is the smallest unit of search. A term consists of two parts: the word's text and the field in which the word appears.

6. Token

A token is an occurrence of a term; it contains the term's text, its start and end offsets, and a type string. The same word can appear several times in one sentence: all occurrences share the same term, but each is a different token recording the position where that occurrence appears.
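The term/token distinction can be shown with a toy tokenizer that records start and end offsets for each occurrence. This is plain Java for illustration, not Lucene's Token class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokens {
    static class Token {
        final String text; final int start, end;
        Token(String text, int start, int end) {
            this.text = text; this.start = start; this.end = end;
        }
        @Override public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    // Record every word occurrence with its character offsets
    static List<Token> tokenize(String sentence) {
        List<Token> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\w+").matcher(sentence.toLowerCase());
        while (m.find()) tokens.add(new Token(m.group(), m.start(), m.end()));
        return tokens;
    }

    public static void main(String[] args) {
        // "love" occurs twice: same term text, two different tokens
        for (Token t : tokenize("love your love"))
            System.out.println(t);
        // love[0,4)  your[5,9)  love[10,14)
    }
}
```

Both occurrences of "love" map to the same term, but the two tokens differ in their offsets, which is exactly the distinction the article describes.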

IV. The structure of Lucene

The Lucene core contains eight packages: analysis, collation, document, index, queryParser, search, store, and util.

1. Analysis Package (text parsing library)

This package parses text into the corresponding term objects and is used by both the indexer and the searcher: during indexing it analyzes the files to be indexed, and during search it analyzes the query string.

The analysis package provides a variety of analyzers, such as WhitespaceAnalyzer, which tokenizes on whitespace characters; StopAnalyzer, which adds a stop-word filter; SmartChineseAnalyzer, which supports Chinese tokenization; and the most commonly used, StandardAnalyzer.

2. Collation Package

Contains CollationKeyFilter and CollationKeyAnalyzer, two classes with the same function: they convert every token into a CollationKey, which is encoded with IndexableBinaryStringTools and stored as a term.

3. Document package (the document object model)

The document package holds the document-related data structures, such as the Document and Field classes.

4. Index package (indexed library)

The index package contains the classes that read and write indexes: the IndexWriter class, which writes, merges, and optimizes the segments of index files, and the IndexReader class, which reads from and deletes from indexes. IndexWriter cares only about writing segments and merging them for optimization; IndexReader cares about how documents are organized in the index files. The package is primarily used to index documents, maintain indexes (update them, optimize them, and so on), and provide the access interface for queries.

5. Queryparser Package

The queryParser package contains the classes that parse query statements (most commonly the QueryParser class) and the Token class.

6. Search package (Query library)

The search package contains the various query classes (such as TermQuery and BooleanQuery) and the result-set classes returned when searching the index. It provides the user's query interface, scores the results of the user's query, and returns the most relevant matches to the user.

7. Store package (repository)

The store package contains the index-storage classes: the Directory class defines the storage structure of an index, FSDirectory stores the index in the file system (that is, on disk), RAMDirectory stores it in memory, and MMapDirectory uses memory-mapped files. The package thus provides two kinds of storage: disk and memory.

8. Util Package (Tool library)

The util package contains common utility classes, such as conversion tools between times and strings.

V. Environment setup and testing

Add Lucene's two jars to the project: lucene-core-3.6.2.jar and lucene-core-3.6.2-javadoc.jar.

On the D: drive of a Windows system, create a folder named lucene, and inside it a temp folder. Under temp create two folders: docs (for the text files) and index (for the index files). In the docs folder create three txt files, e.g. 1.txt, 2.txt, and 3.txt, and write some content into each of them.

1. Index creation:

package com.javaeye.lucene;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

@SuppressWarnings("deprecation")
public class LuceneTest {

    public static void main(String[] args) throws IOException {
        // The folder to index: D:\lucene\temp\docs
        File fileDir = new File("D:" + File.separator + "lucene"
                + File.separator + "temp" + File.separator + "docs");
        // The location of the index files: D:\lucene\temp\index
        File indexDir = new File("D:" + File.separator + "lucene"
                + File.separator + "temp" + File.separator + "index");

        Analyzer luceneAnalyzer = new StandardAnalyzer(Version.LUCENE_36);
        Directory dir = FSDirectory.open(indexDir);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, luceneAnalyzer);
        iwc.setOpenMode(OpenMode.CREATE);
        IndexWriter indexWriter = new IndexWriter(dir, iwc);
        // If the index library does not exist yet, commit first so that
        // opening an IndexReader before any documents are added does not
        // throw an exception.
        // indexWriter.commit();

        File[] textFiles = fileDir.listFiles();
        long startTime = new Date().getTime();
        // Add the documents to the index
        for (int i = 0; i < textFiles.length; i++) {
            if (textFiles[i].isFile() && textFiles[i].getName().endsWith(".txt")) {
                System.out.println("File " + textFiles[i].getCanonicalPath()
                        + " is being indexed.");
                String temp = fileReaderAll(textFiles[i].getCanonicalPath(), "GBK");
                System.out.println(temp);
                Document document = new Document();
                // Create the fields: path is stored but not indexed;
                // body is stored, analyzed, and indexed with term vectors
                Field fieldPath = new Field("path", textFiles[i].getPath(),
                        Field.Store.YES, Field.Index.NO);
                Field fieldBody = new Field("body", temp, Field.Store.YES,
                        Field.Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS);
                document.add(fieldPath);
                document.add(fieldBody);
                indexWriter.addDocument(document);
            }
        }
        indexWriter.close();
        long endTime = new Date().getTime();
        System.out.println("It took " + (endTime - startTime)
                + " milliseconds to add the documents to the index. "
                + fileDir.getPath());
    }

    public static String fileReaderAll(String fileName, String charset)
            throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(fileName), charset));
        StringBuilder temp = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            temp.append(line);
        }
        reader.close();
        return temp.toString();
    }
}

Run Result:

File D:\lucene\temp\docs\1.txt is being indexed.
Give me your love.
File D:\lucene\temp\docs\2.txt is being indexed.
Love Comes so delicate-yeah!
File D:\lucene\temp\docs\3.txt is being indexed.
Because to protect our love ...
It took 419 milliseconds to add the documents to the index. D:\lucene\temp\docs
2. Querying after the index has been built:

package com.javaeye.lucene;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class QueryTest {

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException, ParseException {
        String index = "D:\\lucene\\temp\\index";
        File file = new File(index);
        System.out.println(file.getCanonicalPath());

        IndexReader reader = IndexReader.open(FSDirectory.open(file));
        IndexSearcher searcher = new IndexSearcher(reader);
        ScoreDoc[] hits = null;
        String queryStr = "love"; // the search keyword
        Query query = null;
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        try {
            QueryParser qp = new QueryParser(Version.LUCENE_36, "body", analyzer);
            query = qp.parse(queryStr);
        } catch (ParseException e) {
            e.printStackTrace();
        }
        if (searcher != null) {
            TopDocs results = searcher.search(query, 10);
            hits = results.scoreDocs;
            if (hits.length > 0) {
                System.out.println(hits.length + " results found");
                // Print the stored body field of every hit
                for (int i = 0; i < hits.length; i++) {
                    System.out.println(searcher.doc(hits[i].doc).get("body"));
                }
            }
            searcher.close();
        }
        System.out.println("the test completed successfully .....");
    }
}

Run Result:

D:\lucene\temp\index
3 results found
Love Comes so delicate-yeah!
Because to protect our love ...
Give me your love.
the test completed successfully .....

Ps:

Lucene is actually very simple: it mainly does two things, indexing and search.
The terminology Lucene uses is listed below, briefly rather than in detail, because there is a good thing in this world called search.

IndexWriter: one of the most important classes in Lucene, used primarily to index documents and to control some of the parameters used during indexing.

Analyzer: the analyzer, used mainly to analyze the various texts a search engine encounters. Commonly used ones include StandardAnalyzer, StopAnalyzer, and WhitespaceAnalyzer.

Directory: the location where the index is stored. Lucene provides two kinds of locations, one on disk and one in memory. The index is usually placed on disk, and Lucene provides the FSDirectory and RAMDirectory classes for the two cases.

Document: the unit of indexing; any file you want to index must first be converted into a Document object.

Field: a field of a document.

IndexSearcher: the most basic search tool in Lucene; all searches use an IndexSearcher.

Query: Lucene supports fuzzy queries, semantic queries, phrase queries, combined queries, and so on, through classes such as TermQuery, BooleanQuery, RangeQuery, and WildcardQuery.

QueryParser: a tool for parsing user input; it scans the string the user entered and generates a Query object.

Hits: after a search completes, the results must be returned and displayed to the user; only then is the search complete. In Lucene, the collection of search results is represented by an instance of the Hits class.




