Lucene notes: index creation

Last Update:2014-07-01 Source: Internet

Author: User


According to the Lucene home page (http://lucene.apache.org/), the current Lucene release is 4.9.0. The reference book I am learning from is written against version 2.1, and the code examples here use version 3.0.2. Because the versions differ, some methods are used differently, but the basics are the same.

JAR package used by the source code: Lucene version 3.0.2

References:

1. company internal training materials

2. Tian nun, an authoritative Lucene search engine development classic.

Lucene is easy to use; reading the material and the source code patiently is enough to learn it.

I. Basic Indexing Methods

The basic architecture and principles of open-source search engines are all similar, and Lucene is no exception. Building a search engine with it means solving four basic problems: capturing data, parsing data, creating indexes, and performing searches.

1. Understand the process of creating an index

The reference book offers a metaphor for creating an index:

Creating an index is like writing a collection of articles. The collection contains many articles, and each article has a title, a body, an author name, and a date of writing.

We write the collection as follows: write each article first, then assemble the articles.

First, write each article by filling in its title, body, and date of writing.

Then add each article to the book, and the collection is complete.

In the book's diagram of the collection's structure, the left-to-right direction corresponds to reading the collection (open the book, then read the articles in it), and the right-to-left direction corresponds to writing it.

The process of creating an index is as follows:

(1) Create an IndexWriter, which is equivalent to the framework of a book.

(2) Create a Document object, which is equivalent to an article.

(3) Create a Field object, which is equivalent to one piece of information in an article (title, body, etc.).

(4) Add the Field to the Document.

(5) Add the Document to the IndexWriter.

(6) Close the IndexWriter.

As with the collection, the index structure is read from left to right (that is, searching) and created from right to left.

Follow these steps to create an index:

(1) Create Fields to package the different pieces of information in an article.

(2) Organize multiple Fields into a Document to package an article.

(3) Add multiple Documents to one IndexWriter, that is, assemble multiple articles to form an index.
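To make the field, document, and index hierarchy concrete, here is a minimal sketch in plain Java. It does not use Lucene: the class `ToyIndex` and its methods are hypothetical names invented for illustration, and splitting on whitespace stands in for real analysis. It only shows how fields are grouped into a document, how documents are added to an index, and how a search walks the structure in the opposite direction.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model of the field -> document -> index hierarchy (NOT Lucene's classes).
final class ToyIndex {
    // Inverted index: "fieldName:term" -> ids of the documents containing that term.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();
    // Documents in insertion order; a document is a map of field name -> field content.
    private final List<Map<String, String>> documents = new ArrayList<>();

    // Writing direction: take a document (a group of fields) and index every field.
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(new LinkedHashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            for (String term : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!term.isEmpty()) {
                    postings.computeIfAbsent(field.getKey() + ":" + term, k -> new TreeSet<>())
                            .add(docId);
                }
            }
        }
        return docId;
    }

    // Reading direction: from a term to the documents, then to their fields.
    List<Map<String, String>> search(String fieldName, String term) {
        String key = fieldName + ":" + term.toLowerCase(Locale.ROOT);
        List<Map<String, String>> hits = new ArrayList<>();
        for (int docId : postings.getOrDefault(key, Collections.emptySet())) {
            hits.add(documents.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> article = new LinkedHashMap<>();
        article.put("title", "Steel world");            // one field per piece of information
        article.put("body", "An article about steel");
        index.addDocument(article);                      // the article joins the collection
        System.out.println(index.search("title", "steel"));
    }
}
```

In Lucene the same roles are played by Field, Document, and IndexWriter, which the following subsections cover.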

The next three subsections go through these steps one at a time.

2. Create a Field

There are many ways to create a Field. The most common is:

Field field = new Field(field name, field content, storage method, indexing method);

The four parameters have the following meanings:

(1) The field name, similar to a column name in a database table.

(2) The field content, similar to a column's value in a database table.

(3) The storage method. There are three: no storage (Field.Store.NO), full storage (Field.Store.YES), and compressed storage (Field.Store.COMPRESS).

In general, you can use full storage if you are not worried about the index growing too large. For performance, however, the smaller the index files the better. So a field with little content can be stored in full, while a field with a lot of content, such as body text, is better left unstored or stored compressed.

(4) The indexing method. There are four:

no index (Field.Index.NO); index without analysis and without storing norms (Field.Index.NO_NORMS); index without word segmentation (Field.Index.UN_TOKENIZED); and segment into words, then index (Field.Index.TOKENIZED).

These constant names come from the 2.x API described in the reference book. In the 3.0.2 API used by the code in this article, Field.Store.COMPRESS has been removed, and TOKENIZED and UN_TOKENIZED are named Field.Index.ANALYZED and Field.Index.NOT_ANALYZED.
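The difference between the storage and indexing options is easier to see in a small sketch. The code below is plain Java, not Lucene: `FieldModes`, its two enums, and both methods are hypothetical names for illustration, lower-casing and splitting on whitespace stands in for a real analyzer, and the compressed and no-norms variants are left out. It shows which terms become searchable under each indexing mode and whether the original value can be returned with a hit.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy illustration of the storage and indexing options (NOT Lucene's Field class).
final class FieldModes {
    enum Store { NO, YES }
    enum Index { NO, UN_TOKENIZED, TOKENIZED }

    // Terms that become searchable for a field value under each indexing mode.
    static List<String> searchableTerms(String value, Index mode) {
        switch (mode) {
            case NO:
                // Not indexed: nothing can be searched on this field.
                return Collections.emptyList();
            case UN_TOKENIZED:
                // Indexed as one term: only an exact match on the whole value hits.
                return Collections.singletonList(value);
            case TOKENIZED:
                // Segmented first: every word becomes its own searchable term.
                List<String> terms = new ArrayList<>();
                for (String term : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                    if (!term.isEmpty()) {
                        terms.add(term);
                    }
                }
                return terms;
            default:
                throw new IllegalArgumentException("unknown mode: " + mode);
        }
    }

    // Value a search hit can hand back to the caller; null when the field is not stored.
    static String retrievableValue(String value, Store mode) {
        return mode == Store.YES ? value : null;
    }

    public static void main(String[] args) {
        String title = "White Haired Girl";
        System.out.println(searchableTerms(title, Index.TOKENIZED));
        System.out.println(searchableTerms(title, Index.UN_TOKENIZED));
        System.out.println(retrievableValue(title, Store.NO));
    }
}
```

The two choices are independent: a long body field can be tokenized and indexed without being stored, so it is searchable but its text is not kept in the index.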

3. Create a Document

A Document is created as follows:

Document doc = new Document();

This creates an empty Document with no Fields.

To add a Field to the Document, call its add method, for example doc.add(field);

Calling add repeatedly adds multiple Fields to one Document.

4. Create an IndexWriter

There are also many ways to create an IndexWriter. A common one is:

```java
File indexDir = new File("E:\\Index");
Directory dir = new SimpleFSDirectory(indexDir);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
IndexWriter indexWriter = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);
```

The main parameters are as follows:

(1) Directory is an abstract class.

Its direct subclasses differ between versions. In version 3.0.2 they include FSDirectory, RAMDirectory, FileSwitchDirectory, and CompoundFileReader. Each subclass represents a different kind of directory.

FSDirectory: a path in the file system; the index is written directly to disk.

RAMDirectory: a region in memory. Its content disappears when the VM exits, so content in a RAMDirectory must be transferred to an FSDirectory if it is to be kept.

FileSwitchDirectory: a proxy class that can read files from two different directories at the same time.

CompoundFileReader: reads compound files, that is, only files with the .cfs extension (CompoundFileWriter writes them). CompoundFileReader is referenced only in SegmentReader.

FSDirectory has three subclasses:

A. SimpleFSDirectory, for Windows

B. NIOFSDirectory, which uses NIO, for Linux

C. MMapDirectory, which uses memory-mapped files

(2) Analyzer is an abstract class.

It analyzes input from various data sources, including filtering and word segmentation.

An Analyzer performs lexical analysis; there are analyzers for English and for Chinese. Choose one that suits the content to be indexed. Commonly used analyzers are StandardAnalyzer (the standard analyzer), CJKAnalyzer (binary segmentation), and ChineseAnalyzer (a Chinese analyzer).

You can also write your own Analyzer to handle other languages (though I don't know how to do that myself).
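"Binary segmentation" means cutting text into overlapping two-character tokens, which lets CJK text be indexed without a dictionary. The sketch below shows the idea in plain Java. It does not call Lucene, `BigramSegmenter` and `bigrams` are hypothetical names, and keeping a lone character as a single token is an assumption of this sketch. The real CJKAnalyzer also treats Latin text and punctuation differently, which is ignored here.

```java
import java.util.ArrayList;
import java.util.List;

// Overlapping two-character ("binary") segmentation, written in plain Java.
// This is a sketch of the idea only; it does not reproduce CJKAnalyzer's full behavior.
final class BigramSegmenter {
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        // Work on code points so that characters outside the BMP are not split in half.
        int[] codePoints = text.codePoints().toArray();
        if (codePoints.length == 1) {
            tokens.add(text); // assumption: a single character is kept as its own token
        }
        for (int i = 0; i + 1 < codePoints.length; i++) {
            tokens.add(new String(codePoints, i, 2)); // characters i and i+1
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four Chinese characters, written as Unicode escapes to keep the source ASCII-only.
        String text = "\u4e2d\u534e\u4eba\u6c11";
        // Yields three overlapping tokens: characters 1-2, 2-3, and 3-4.
        System.out.println(bigrams(text));
    }
}
```

Because every adjacent pair becomes a term, a two-character query matches wherever that pair occurs, at the cost of a larger index than true word segmentation would produce.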

After the IndexWriter is created, use addDocument() to add a Document to it: writer.addDocument(doc);

You can add multiple Documents this way.

Finally, call close() to close the IndexWriter, for example writer.close();

That completes the procedure for creating an index. Next, a complete example:

II. A Simple Index-Creation Example

```java
import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.SimpleFSDirectory;
import org.apache.lucene.util.Version;

public class LuceneMainProcess {

    public static void main(String[] args) {
        createLuceneIndex();
    }

    public static void createLuceneIndex() {
        try {
            File indexDir = new File("E:\\Index");
            Directory dir = new SimpleFSDirectory(indexDir);
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);
            // MaxFieldLength.LIMITED caps the number of terms indexed per field
            // when a document is indexed; the limit is 10,000.
            IndexWriter indexWriter = new IndexWriter(dir, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);

            // Create eight documents.
            Document doc1 = new Document();
            Document doc2 = new Document();
            Document doc3 = new Document();
            Document doc4 = new Document();
            Document doc5 = new Document();
            Document doc6 = new Document();
            Document doc7 = new Document();
            Document doc8 = new Document();

            // One stored, analyzed "bookname" field per document.
            Field f1 = new Field("bookname", "How steel is made", Field.Store.YES, Field.Index.ANALYZED);
            Field f2 = new Field("bookname", "Heroes", Field.Store.YES, Field.Index.ANALYZED);
            Field f3 = new Field("bookname", "Awesome women and dogs", Field.Store.YES, Field.Index.ANALYZED);
            Field f4 = new Field("bookname", "Women are made of water", Field.Store.YES, Field.Index.ANALYZED);
            Field f5 = new Field("bookname", "my brother and daughter", Field.Store.YES, Field.Index.ANALYZED);
            Field f6 = new Field("bookname", "White Haired Girl", Field.Store.YES, Field.Index.ANALYZED);
            Field f7 = new Field("bookname", "Steel world", Field.Store.YES, Field.Index.ANALYZED);
            Field f8 = new Field("bookname", "Iron warrior", Field.Store.YES, Field.Index.ANALYZED);

            doc1.add(f1);
            doc2.add(f2);
            doc3.add(f3);
            doc4.add(f4);
            doc5.add(f5);
            doc6.add(f6);
            doc7.add(f7);
            doc8.add(f8);

            indexWriter.addDocument(doc1);
            indexWriter.addDocument(doc2);
            indexWriter.addDocument(doc3);
            indexWriter.addDocument(doc4);
            indexWriter.addDocument(doc5);
            indexWriter.addDocument(doc6);
            indexWriter.addDocument(doc7);
            indexWriter.addDocument(doc8);

            // Optimizing the index speeds up retrieval, but it costs memory, disk space,
            // and time, so do not optimize at every opportunity.
            indexWriter.optimize();
            // Close the writer. Otherwise index data stays in the cache without being
            // written to disk, and the directory lock may not be released.
            indexWriter.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```

After running the program, an Index directory appears on the E: drive; it contains the index files we created.

At this point, the index has been created.

Tomorrow I will write up index reading, which will let us check directly whether the index created above is correct.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
