Lucene notes: creating an index

Source: Internet
Author: User

See the Lucene homepage (http://lucene.apache.org/). Lucene has now reached version 4.9.0. The reference book used for these notes covers version 2.1, and the code samples here are written against version 3.0.2. The version differences change how some methods are used, but the overall approach is the same.

The source code uses the Lucene jar package, version 3.0.2.

References:

1. Internal company training materials

2. "Lucene Search engine Development authoritative classic", Yutiann

Lucene is simple to use. Anyone who reads through patiently can learn it, and source code is provided.

First, the basic way to create an index

The basic architecture and principles of all open source search engines are similar, and Lucene is no exception. Building a search engine with it also means solving four basic problems: crawling data, parsing data, creating indexes, and running searches.

1. Understanding the process of creating an index

The reference book gives a vivid analogy for creating an index:

The process of creating an index can be compared to writing an anthology. An anthology contains many articles, and each article carries information such as its title, content, author name, and writing time.

We write the anthology in two stages: first write the articles, then assemble them.

To write an article, first add information such as the title, content, and writing time to it.

Then add each article to the book, and the anthology is complete.

The structure of the anthology can be read in two directions. From left to right is reading the anthology: open the book, then flip through the articles. From right to left is writing the anthology.


The steps for creating an index are as follows:

(1) Create the indexer, IndexWriter. This corresponds to the frame of a book.

(2) Create a Document object, which corresponds to an article.

(3) Create Field objects, which correspond to the different pieces of information in an article (title, body, and so on).

(4) Add the Fields to the Document.

(5) Add the Document to the IndexWriter.

(6) Close the IndexWriter.

The structure of an index mirrors that of the anthology: reading the index (that is, searching) goes from left to right, and creating the index goes from right to left.


Based on this structure, creating an index has three main steps:

(1) Create Fields to wrap the different pieces of information in an article.

(2) Organize multiple Fields into a Document. This completes the packaging of one article.

(3) Organize multiple Documents into an IndexWriter, that is, assemble the articles so that together they form an index.
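The three steps above can be sketched with a toy model that uses only the Java standard library. This is an illustration of the anthology analogy, not Lucene code: the names `ToyField`, `ToyDocument`, `ToyIndexWriter` and the placeholder body text are all assumptions made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: ToyField, ToyDocument and ToyIndexWriter are hypothetical
// stand-ins for the roles described above, not Lucene classes.
class AnthologySketch {

    /** One piece of information about an article, e.g. its title. */
    static final class ToyField {
        final String name;
        final String value;

        ToyField(String name, String value) {
            this.name = name;
            this.value = value;
        }
    }

    /** One article: a list of fields. */
    static final class ToyDocument {
        private final List<ToyField> fields = new ArrayList<>();

        void add(ToyField field) {
            fields.add(field);
        }

        /** Returns the value of the first field with this name, or null. */
        String get(String name) {
            for (ToyField f : fields) {
                if (f.name.equals(name)) {
                    return f.value;
                }
            }
            return null;
        }
    }

    /** The anthology: accepts documents until it is closed. */
    static final class ToyIndexWriter {
        private final List<ToyDocument> documents = new ArrayList<>();
        private boolean closed;

        void addDocument(ToyDocument doc) {
            if (closed) {
                throw new IllegalStateException("writer already closed");
            }
            documents.add(doc);
        }

        void close() {
            closed = true;
        }

        int numDocs() {
            return documents.size();
        }

        ToyDocument document(int i) {
            return documents.get(i);
        }
    }

    static ToyIndexWriter buildAnthology() {
        ToyIndexWriter writer = new ToyIndexWriter();
        String[] titles = {"How Steel is Tempered", "Heroic Children"};
        for (String title : titles) {
            // Step 1: wrap each piece of information in a field.
            ToyField titleField = new ToyField("title", title);
            ToyField bodyField = new ToyField("body", "placeholder body text");
            // Step 2: collect the fields into one document (one article).
            ToyDocument doc = new ToyDocument();
            doc.add(titleField);
            doc.add(bodyField);
            // Step 3: add the document to the writer (the anthology).
            writer.addDocument(doc);
        }
        writer.close();
        return writer;
    }

    public static void main(String[] args) {
        ToyIndexWriter writer = buildAnthology();
        System.out.println(writer.numDocs() + " documents, first title: "
                + writer.document(0).get("title"));
    }
}
```

The real Lucene classes do far more (analysis, segment files, locking), but the nesting of field, document, and writer is the same shape.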

The following three subsections explain each of these basic steps in detail.

2. Create a Field

There are many ways to create a Field. The most frequently used one is:

Field field = new Field(field name, field content, storage method, index mode);

The four parameters have the following meanings:

(1) The field name is the name of the field, similar to a column name in a database table.

(2) The field content is the content of the field, similar to the value stored in a column of a database table.

(3) There are three storage methods: no storage (Field.Store.NO), full storage (Field.Store.YES), and compressed storage (Field.Store.COMPRESS, available in the 2.x versions but removed in 3.0).

Usually, if you are not worried about the index becoming too large, every field can be fully stored.

For performance reasons, however, the smaller the index files, the better. So if a field has very little content, store it in full; if a field has a lot of content, such as the body, do not store it, or store it compressed.

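(4) There are four index modes:

Do not index (Field.Index.NO), index without tokenizing and without norms (Field.Index.NO_NORMS), index without tokenizing (Field.Index.UN_TOKENIZED), and tokenize then index (Field.Index.TOKENIZED). These are the Lucene 2.x names. In 3.0.2 the last three are called Field.Index.NOT_ANALYZED_NO_NORMS, Field.Index.NOT_ANALYZED, and Field.Index.ANALYZED, which is why the example code later in this article uses Field.Index.ANALYZED.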
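To make the difference between the index modes concrete, here is a minimal sketch of a toy inverted index in plain Java. It is not Lucene's implementation: the `Mode` enum and the `IndexModeSketch` class are hypothetical names, tokenizing is reduced to lowercasing and splitting on whitespace, and norms are not modelled at all.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: maps each term to the ids of the documents containing it.
// The Mode enum is a hypothetical simplification of the index modes above.
class IndexModeSketch {

    enum Mode { NO, NOT_TOKENIZED, TOKENIZED }

    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void index(int docId, String value, Mode mode) {
        switch (mode) {
            case NO:
                // Not indexed: the value can never be found by a search.
                break;
            case NOT_TOKENIZED:
                // Indexed as one single term: only an exact match finds it.
                addPosting(value, docId);
                break;
            case TOKENIZED:
                // Broken into words first: each word becomes a searchable term.
                for (String term : value.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
                    if (!term.isEmpty()) {
                        addPosting(term, docId);
                    }
                }
                break;
        }
    }

    private void addPosting(String term, int docId) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }

    Set<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    static IndexModeSketch demo() {
        IndexModeSketch idx = new IndexModeSketch();
        idx.index(1, "World of Steel", Mode.TOKENIZED);
        idx.index(2, "Steel Warrior", Mode.NOT_TOKENIZED);
        idx.index(3, "Heroic Children", Mode.NO);
        return idx;
    }

    public static void main(String[] args) {
        IndexModeSketch idx = demo();
        System.out.println(idx.search("steel"));         // matches document 1 only
        System.out.println(idx.search("Steel Warrior")); // matches document 2 only
        System.out.println(idx.search("heroic"));        // matches nothing
    }
}
```

Document 2 is not found by searching for "steel" because its whole value was indexed as a single term. That is typically what you want for identifiers and other values that must match exactly.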

3. Create a Document

A Document is created as follows:

Document doc = new Document();

This creates an empty Document that does not contain any Field.

To add a Field to the Document, just call the add method, for example doc.add(field);

Calling it repeatedly adds multiple Fields to the same Document.

4. Create an IndexWriter

There are many ways to create an IndexWriter. Here is one that is often used:

    File indexDir = new File("E:\\index");
    Directory dir = new SimpleFSDirectory(indexDir);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
    IndexWriter indexWriter = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);

An introduction to the parameters:

(1) Directory (the directory type). This is an abstract class.

Its direct subclasses differ between versions. Taking 3.0.2 as the example, they include FSDirectory, RAMDirectory, FileSwitchDirectory, and CompoundFileReader. The first two represent two different directory types.

FSDirectory: a path in the file system. The index is written directly to disk.

RAMDirectory: an area in memory. Its contents disappear when the virtual machine exits, so to keep them they must be transferred to an FSDirectory.

FileSwitchDirectory: a proxy class that can read files from two different directories at the same time.

CompoundFileReader: used to read compound files. It can only read files with the .cfs extension (such files are written with CompoundFileWriter). CompoundFileReader is only referenced in SegmentReader.

FSDirectory in turn has three subclasses:

A. SimpleFSDirectory, for Windows

B. NIOFSDirectory, which uses NIO, for Linux

C. MMapDirectory, which uses memory-mapped files


(2) Analyzer. This is also an abstract class.

It is responsible for analyzing the various input data sources, which includes filtering and word breaking.

The analyzer performs lexical analysis. There are analyzers for English and for Chinese, and you should choose one that suits the content you want to index. Commonly used ones include StandardAnalyzer (the standard analyzer), CJKAnalyzer (bigram segmentation), ChineseAnalyzer (a Chinese analyzer), and so on.

You can also write your own analyzer to handle text in other languages (though I have not learned how to do that yet).
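As a rough illustration of what "filtering and word breaking" means, the following sketch lowercases the text, splits it into words, and drops a few stop words. It is a toy written with the standard library only, not Lucene's Analyzer API, and the stop-word list is a small made-up sample rather than the list any real analyzer uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: this class is hypothetical and is not Lucene's Analyzer API.
class ToyAnalyzerSketch {

    // A tiny, made-up stop-word list used only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the"));

    /** Lowercases the text, breaks it into words, and filters out stop words. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        // Word breaking: split on any run of characters that are not letters or digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // Filtering: skip empty tokens and stop words.
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("How Steel is Tempered"));
        System.out.println(analyze("World of Steel"));
    }
}
```

The terms that survive analysis are what end up in the index, so the same analysis generally has to be applied to the query text at search time for the terms to line up.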

After the IndexWriter is created, a Document object can be added to it with the addDocument() method: writer.addDocument(doc);

Multiple Documents can be added this way.

Finally, call the close() method to close the indexer: writer.close();


That covers the steps for creating an index. Now let's look at an example:

Second, a simple example of creating an index:

    import java.io.File;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.SimpleFSDirectory;
    import org.apache.lucene.util.Version;

    public class LuceneMainProcess {

        public static void main(String[] args) {
            createLuceneIndex();
        }

        public static void createLuceneIndex() {
            try {
                File indexDir = new File("E:\\index");
                Directory dir = new SimpleFSDirectory(indexDir);
                Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
                // MaxFieldLength is the maximum number of terms taken from a field
                // when a document is indexed; the initial value is 10000.
                IndexWriter indexWriter = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);

                // Create 8 documents
                Document doc1 = new Document();
                Document doc2 = new Document();
                Document doc3 = new Document();
                Document doc4 = new Document();
                Document doc5 = new Document();
                Document doc6 = new Document();
                Document doc7 = new Document();
                Document doc8 = new Document();

                Field f1 = new Field("BookName", "How Steel is Tempered", Field.Store.YES, Field.Index.ANALYZED);
                Field f2 = new Field("BookName", "Heroic Children", Field.Store.YES, Field.Index.ANALYZED);
                Field f3 = new Field("BookName", "Fence Woman and Dog", Field.Store.YES, Field.Index.ANALYZED);
                Field f4 = new Field("BookName", "Woman is made of water", Field.Store.YES, Field.Index.ANALYZED);
                Field f5 = new Field("BookName", "My Brother and Daughter", Field.Store.YES, Field.Index.ANALYZED);
                Field f6 = new Field("BookName", "white hair female", Field.Store.YES, Field.Index.ANALYZED);
                Field f7 = new Field("BookName", "World of Steel", Field.Store.YES, Field.Index.ANALYZED);
                Field f8 = new Field("BookName", "Steel Warrior", Field.Store.YES, Field.Index.ANALYZED);

                doc1.add(f1);
                doc2.add(f2);
                doc3.add(f3);
                doc4.add(f4);
                doc5.add(f5);
                doc6.add(f6);
                doc7.add(f7);
                doc8.add(f8);

                indexWriter.addDocument(doc1);
                indexWriter.addDocument(doc2);
                indexWriter.addDocument(doc3);
                indexWriter.addDocument(doc4);
                indexWriter.addDocument(doc5);
                indexWriter.addDocument(doc6);
                indexWriter.addDocument(doc7);
                indexWriter.addDocument(doc8);

                // Optimizing the index keeps retrieval fast, but it costs memory,
                // disk space and time, so optimize when needed, not arbitrarily.
                indexWriter.optimize();
                // Close the indexer. Otherwise index data may stay in the cache without
                // being written to disk, and the lock on the directory may not be released.
                indexWriter.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

After running the program, an index folder appears on the E: drive. It contains the index files we created.

In tomorrow's notes on reading the index, we will be able to check directly whether the index created above is correct.





