Lucene: the process of building an index

Source: Internet
Author: User

Original: CSDN Blog, http://blog.csdn.net/caohaicheng/article/details/35992149


According to the Lucene homepage (http://lucene.apache.org/), Lucene has currently reached version 4.9.0. The reference book I am studying is written for version 2.1, while the code examples here use version 3.0.2. Some methods are used differently between versions, but overall they are much the same.

Jar Package for source code (3.0.2 version)

Resources:

1. The company's internal training materials

2. "Lucene Search Engine Development: Authoritative Classic" by Yu Tian'en

Lucene is simple to use. Anyone who reads through patiently can learn it, and the source code is provided.

First, the basic way to create an index

The basic architecture and principles of all open source search engines are similar, and Lucene is no exception. Building a search engine with it means solving the same four basic problems: crawling data, parsing data, creating indexes, and performing searches.

1. Understanding the process of creating an index

The reference book offers a vivid metaphor for creating an index:

The process of creating an index can be likened to writing an anthology. The anthology contains many articles, and each article includes a title, content, the name of the work, the writing time, and other information.

We write the anthology as follows: first write the articles, then assemble them.

Start by giving each article a title, content, a writing time, and other information. That completes one article.

Then add each article to the book, and the anthology is written.

The structure of the anthology can be read in two directions. Going from left to right is reading the anthology: open the book, then flip through its articles. Going from right to left is writing the anthology.

The process for creating an index is as follows:

(1) Build the indexer, IndexWriter, which is equivalent to the framework of a book.

(2) Create the document object, Document, which is equivalent to an article.

(3) Create the information field objects, Field, which correspond to the different pieces of information in an article (title, body, etc.).

(4) Add the Fields to the Document.

(5) Add the Document to the IndexWriter.

(6) Close the IndexWriter.

The structure of an index can be read the same way: going from left to right is reading the index (that is, searching), and going from right to left is creating the index.

Following this structure, there are three basic steps to creating an index:

(1) Create Fields to wrap the different pieces of information in an article.

(2) Organize several Fields into a Document, which completes the packaging of one article.

(3) Organize multiple Documents into an IndexWriter, that is, assemble multiple articles, which finally forms an index.
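The three steps above can be sketched with a toy model in plain Java. This is only an illustration of the idea, not Lucene's implementation: the names ToyField, ToyDocument, and ToyIndexWriter are hypothetical stand-ins, the index is an in-memory map from term to document numbers, and splitting on whitespace stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyIndexDemo {

    /** One piece of information about an article, e.g. its title. */
    static final class ToyField {
        final String name;
        final String value;

        ToyField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** An article: a collection of fields. */
    static final class ToyDocument {
        final List<ToyField> fields = new ArrayList<>();

        void add(ToyField field) {
            fields.add(field);
        }
    }

    /** The "book": collects documents and maps each term to the documents containing it. */
    static final class ToyIndexWriter {
        private final Map<String, Set<Integer>> postings = new TreeMap<>();
        private int nextDocId = 0;

        void addDocument(ToyDocument doc) {
            int docId = nextDocId++;
            for (ToyField field : doc.fields) {
                // Lower-casing and whitespace splitting stand in for a real analyzer.
                for (String word : field.value.toLowerCase().split("\\s+")) {
                    if (word.isEmpty()) {
                        continue;
                    }
                    String term = field.name + ":" + word;
                    postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
                }
            }
        }

        /** Reading the index: look up the documents that contain a word in a field. */
        Set<Integer> search(String fieldName, String word) {
            return postings.getOrDefault(fieldName + ":" + word.toLowerCase(), new TreeSet<>());
        }
    }

    static ToyIndexWriter buildSampleIndex() {
        String[] titles = {"How Steel is Tempered", "Heroes and Sons", "Steel Warrior"};
        ToyIndexWriter writer = new ToyIndexWriter();
        for (String title : titles) {
            ToyDocument doc = new ToyDocument();            // step 2: one article
            doc.add(new ToyField("BookName", title));       // step 1: wrap a piece of information
            writer.addDocument(doc);                        // step 3: assemble into the index
        }
        return writer;
    }

    public static void main(String[] args) {
        ToyIndexWriter writer = buildSampleIndex();
        System.out.println(writer.search("BookName", "steel"));  // [0, 2]
        System.out.println(writer.search("BookName", "heroes")); // [1]
    }
}
```

Writing goes field, then document, then index; searching goes the other way, from the index back to the matching documents.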

The next three subsections walk through these basic steps to explain how to create an index.

2. Create a Field

There are many ways to create a Field. The following is the most common one.

Field field = new Field(field name, field content, storage method, index mode);

The meanings of these four parameters are as follows:

(1) The field name is the name of the Field, similar to a column name in a database table.

(2) The field content is the content of the Field, similar to a column value in a database table.

(3) There are three storage methods: no storage (Field.Store.NO), full storage (Field.Store.YES), and compressed storage (Field.Store.COMPRESS). Note that Field.Store.COMPRESS belongs to the 2.x API described in the reference book; it was removed in Lucene 3.0.

In general, you can use full storage if you are not concerned about the size of the index. For performance reasons, however, the index files should be kept as small as possible. So if a field has little content, use full storage; if it has a lot of content, such as the body, either do not store it or use compressed storage.

(4) There are four index modes:

do not index (Field.Index.NO), index without analyzing and without storing norms (Field.Index.NO_NORMS), index without tokenizing (Field.Index.UN_TOKENIZED), and tokenize then index (Field.Index.TOKENIZED). These are the 2.x names; in Lucene 3.0 the last three are called Field.Index.NOT_ANALYZED_NO_NORMS, Field.Index.NOT_ANALYZED, and Field.Index.ANALYZED.
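To make the storage and index options concrete, here is a toy model in plain Java for the fields of a single document. It is a sketch of the idea only, not Lucene's implementation: the enum constants mirror some of the option names above (compression and norms are left out), while ToyIndex, addField, search, stored, and the sample values are hypothetical, and analysis is reduced to lower-casing and splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FieldOptionsDemo {

    enum Store { YES, NO }

    enum Index { NO, UN_TOKENIZED, TOKENIZED }

    /** Holds the searchable terms and the retrievable values of one document. */
    static final class ToyIndex {
        private final Set<String> terms = new HashSet<>();
        private final Map<String, String> storedValues = new HashMap<>();

        void addField(String name, String value, Store store, Index index) {
            if (store == Store.YES) {
                storedValues.put(name, value); // the original value can be read back later
            }
            for (String term : termsFor(value, index)) {
                terms.add(name + ":" + term);  // the term can be searched
            }
        }

        boolean search(String name, String term) {
            return terms.contains(name + ":" + term);
        }

        /** Returns the stored value, or null when the field was not stored. */
        String stored(String name) {
            return storedValues.get(name);
        }
    }

    /** Produces the terms that end up in the index for each index mode. */
    static List<String> termsFor(String value, Index index) {
        List<String> result = new ArrayList<>();
        switch (index) {
            case NO:
                break;                 // nothing is searchable
            case UN_TOKENIZED:
                result.add(value);     // the whole value is one term
                break;
            case TOKENIZED:
                for (String word : value.toLowerCase().split("\\s+")) {
                    if (!word.isEmpty()) {
                        result.add(word);
                    }
                }
                break;
        }
        return result;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addField("title", "Steel Warrior", Store.YES, Index.TOKENIZED);
        index.addField("id", "book-001", Store.YES, Index.UN_TOKENIZED);
        index.addField("body", "a long body text", Store.NO, Index.TOKENIZED);
        index.addField("path", "E:\\books\\1.txt", Store.YES, Index.NO);

        System.out.println(index.search("title", "steel"));   // true: tokenized, so single words match
        System.out.println(index.search("id", "book-001"));   // true: only the exact value matches
        System.out.println(index.search("id", "book"));       // false
        System.out.println(index.stored("body"));             // null: searchable but not stored
        System.out.println(index.search("path", "E:\\books\\1.txt")); // false: stored but not indexed
    }
}
```

The point of the sketch is that storing and indexing are independent choices: a large body can be searchable without being stored, and a value such as a file path can be stored without being searchable.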

3. Create a Document

A Document is created as follows:

Document doc = new Document();

This creates an empty Document that does not contain any Field.

To add a Field to the Document, you only need the add method. For example: doc.add(field);

Calling it repeatedly adds multiple Fields to one Document.

4. Create an IndexWriter

There are many ways to create an IndexWriter. Here is a common one:

    File indexDir = new File("E:\\index");
    Directory dir = new SimpleFSDirectory(indexDir);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    IndexWriter indexWriter = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);

A brief introduction to the parameters:

(1) Directory (the directory type) is an abstract class.

Its direct subclasses differ from version to version. Taking 3.0.2 as the example, they include FSDirectory, RAMDirectory, FileSwitchDirectory, and CompoundFileReader. Each subclass represents a different directory type.

FSDirectory: a path in the file system; the index is written directly to disk.

RAMDirectory: an area in memory whose content disappears when the virtual machine exits, so the content of a RAMDirectory has to be transferred to an FSDirectory to be kept.

FileSwitchDirectory: a proxy class that can read files from two different directories at the same time.

CompoundFileReader: used to read compound files. It can only read files with the .cfs extension (files with the .cfs extension are written by CompoundFileWriter). CompoundFileReader is only referenced in SegmentReader.

FSDirectory itself comes in three kinds:

A. SimpleFSDirectory, for Windows

B. NIOFSDirectory, for Linux, which supports NIO

C. MMapDirectory, a memory-mapped directory

(2) Analyzer (the analyzer) is an abstract class.

It is responsible for analyzing the various input data sources, including filtering, tokenizing, and other functions.

The analyzer performs lexical analysis, and there are analyzers for English and for Chinese. Choose the appropriate analyzer based on the files to be indexed. Commonly used ones include StandardAnalyzer (the standard analyzer), CJKAnalyzer (bigram segmentation), ChineseAnalyzer (the Chinese analyzer), and so on.

You can also write your own analyzer to handle text in different languages (though I cannot do that myself, of course).
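The difference between character-by-character and bigram segmentation can be sketched in a few lines of plain Java. This is a simplified illustration under my own assumptions, not the code of any Lucene analyzer: the class BigramDemo and its methods are hypothetical, the sample string is an arbitrary four-character Chinese phrase, and real analyzers also deal with Latin text, punctuation, and stop words.

```java
import java.util.ArrayList;
import java.util.List;

public class BigramDemo {

    /** Splits text into single characters (one token per character). */
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    /** Splits text into overlapping two-character tokens (bigram segmentation). */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text); // a lone character is kept as is in this sketch
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String title = "钢铁战士";
        System.out.println(unigrams(title)); // [钢, 铁, 战, 士]
        System.out.println(bigrams(title));  // [钢铁, 铁战, 战士]
    }
}
```

With bigrams, a two-character query word matches only where both characters appear next to each other, which tends to give fewer false matches than single-character tokens at the cost of a larger term dictionary.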

Once the IndexWriter has been created, Document objects can be added to it with the addDocument() method: writer.addDocument(doc);

You can add more than one.

Finally, call the close() method to close the indexer: writer.close();

That completes the steps for creating an index. Let's look at an example:

Second, a simple example of creating an index:

    import java.io.File;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.SimpleFSDirectory;
    import org.apache.lucene.util.Version;

    public class LuceneMainProcess {

        public static void main(String[] args) {
            createLuceneIndex();
        }

        public static void createLuceneIndex() {
            try {
                File indexDir = new File("E:\\index");
                Directory dir = new SimpleFSDirectory(indexDir);
                Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
                // MaxFieldLength.LIMITED caps the number of terms indexed per field
                // of a document; the default limit is 10000.
                IndexWriter indexWriter = new IndexWriter(dir, analyzer, true,
                        IndexWriter.MaxFieldLength.LIMITED);

                // Create 8 documents, each with a single BookName field.
                String[] bookNames = {
                    "How Steel is Tempered",
                    "Heroes and Sons",
                    "Fence woman and Dog",
                    "Women are made of water",
                    "My brother and daughter",
                    "White-Hairy Woman",
                    "The World of Steel",
                    "Steel Warrior"
                };
                for (String bookName : bookNames) {
                    Document doc = new Document();
                    doc.add(new Field("BookName", bookName, Field.Store.YES,
                            Field.Index.ANALYZED));
                    indexWriter.addDocument(doc);
                }

                // Optimizing the index speeds up retrieval, but it costs memory,
                // disk space, and time, so optimize only when needed.
                indexWriter.optimize();
                // Close the indexer. Otherwise index data may stay in the cache
                // without being written to disk, and the lock on the directory
                // may not even be released.
                indexWriter.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

After running it, we find an index directory on the E: drive. It contains the index files we created.

With that, the index has been created.

Tomorrow we'll go through reading the index, which will let us see directly whether the index created above is correct.
