In day-to-day development we often use LIKE to match data, and we all know that LIKE usually triggers a full table scan. As the data volume grows, we get stuck with the database's snail-paced searches and have to look for another way. This is where Lucene comes into its own.
First, let's build a demo: insert 10 million news records into the database, 778 MB in total.
Next, search the news content for records containing the word "popular".
Result: the LIKE query against the database takes 78 s. At that speed the database is unusable for this kind of search.
Now let's see how Lucene handles it. Download Lucene.Net from: http://incubator.apache.org/lucene.net/download.html
The code is as follows:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Data;
using System.Diagnostics;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Search;
using Lucene.Net.QueryParsers;

namespace First
{
    class Program
    {
        static string path = @"D:\Sample";

        static void Main(string[] args)
        {
            // Create the index
            CreateIndex();

            var watch = Stopwatch.StartNew();

            // Open the index for searching
            IndexSearcher search = new IndexSearcher(path);

            // Query expression parser
            QueryParser query = new QueryParser(string.Empty, new StandardAnalyzer());

            // Parse injects the query condition
            var hits = search.Search(query.Parse("Content:popular"));

            for (int i = 0; i < hits.Length(); i++)
            {
                Console.WriteLine("current content: {0}",
                    hits.Doc(i).Get("Content").Substring(0, 20) + "...");
            }

            watch.Stop();
            Console.WriteLine("search time consumed: {0} ms", watch.ElapsedMilliseconds);
        }

        static void CreateIndex()
        {
            // Create the index library directory
            var directory = FSDirectory.GetDirectory(path, true);

            // Create an index writer; StandardAnalyzer splits the text into terms
            IndexWriter indexWriter = new IndexWriter(directory, new StandardAnalyzer());

            var reader = DbHelperSQL.ExecuteReader("select * from News");
            while (reader.Read())
            {
                // A document is a set of fields, similar to a table row
                Document doc = new Document();

                // The fields to be indexed
                doc.Add(new Field("ID", reader["ID"].ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("Title", reader["Title"].ToString(), Field.Store.NO, Field.Index.ANALYZED));
                doc.Add(new Field("Content", reader["Content"].ToString(), Field.Store.YES, Field.Index.ANALYZED));

                indexWriter.AddDocument(doc);
            }
            reader.Close();

            // Optimize and close the index
            indexWriter.Optimize();
            indexWriter.Close();
        }
    }
}
```
Lucene: 448 ms. I was stunned. Of course, this figure does not include the time spent "creating the index"; since the index is built ahead of time, that cost is effectively a one-time constant.
Since this is an introduction, let's briefly walk through how Lucene works. Lucene involves two steps: "indexing" and "searching".
I. Index:
Indexing should be a familiar concept: Lucene splits our content into many terms, uses each term as a key, builds an inverted index, and stores it in the index library.
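To make the idea concrete, here is a minimal, hypothetical sketch of an inverted index in plain C# (this is the concept only, not Lucene's actual data structure): each term maps to the list of document IDs that contain it, so a lookup is a dictionary hit instead of a full scan.

```csharp
using System;
using System.Collections.Generic;

class InvertedIndexDemo
{
    static void Main()
    {
        // Document ID -> content, standing in for rows of the News table
        var docs = new Dictionary<int, string>
        {
            { 1, "popular news today" },
            { 2, "old news archive" },
            { 3, "popular archive" }
        };

        // Term -> sorted list of document IDs containing that term
        var index = new Dictionary<string, List<int>>();
        foreach (var doc in docs)
        {
            foreach (var term in doc.Value.Split(' '))
            {
                if (!index.TryGetValue(term, out var postings))
                    index[term] = postings = new List<int>();
                postings.Add(doc.Key);
            }
        }

        // Lookup: which documents contain "popular"?
        var hits = index["popular"];
        hits.Sort();
        Console.WriteLine(string.Join(",", hits));
    }
}
```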
In the example we can see that the indexing process uses the IndexWriter, FSDirectory, StandardAnalyzer, Document, and Field classes. Here is a brief look at each.
1: IndexWriter
As we can see, this class has an AddDocument method; it is the class that performs the index write operations.
2: FSDirectory
This one is simpler: it specifies where the index library is stored, in this case D:\Sample. Some may ask whether the index can be kept in memory instead. With Lucene, of course it can: RAMDirectory does exactly that. If memory is large enough, holding the index library in RAM improves search efficiency.
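As a sketch (based on the same Lucene.Net 2.x API used in the listing above), swapping the on-disk directory for an in-memory one only changes how the directory is created; everything else stays the same:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

class RamIndexDemo
{
    static void BuildInMemoryIndex()
    {
        // The index lives entirely in RAM instead of on disk
        RAMDirectory directory = new RAMDirectory();
        IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer());

        // ... add documents exactly as in CreateIndex() above ...

        writer.Optimize();
        writer.Close();
    }
}
```

The trade-off is that the index disappears when the process exits, so RAMDirectory suits caches and tests rather than durable storage.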
3: StandardAnalyzer
This is the most critical step in the indexing process, and the part that deserves the most thought when using Lucene. The reason searches return in a flash is precisely that the input content has been split into terms; different splitting strategies correspond to different Analyzers, and StandardAnalyzer is an analyzer that splits text into individual words.
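To see what an analyzer actually produces, here is a hedged sketch assuming the Lucene.Net 2.x TokenStream API (where Next() returns the next Token, or null at the end); it prints the terms StandardAnalyzer would extract from a sample string:

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

class AnalyzerDemo
{
    static void Main()
    {
        var analyzer = new StandardAnalyzer();

        // Tokenize a sample string the same way the indexer does
        TokenStream stream = analyzer.TokenStream("Content", new StringReader("This is Popular News!"));

        Token token;
        while ((token = stream.Next()) != null)
        {
            // StandardAnalyzer lower-cases terms and drops punctuation and stop words
            Console.WriteLine(token.TermText());
        }
    }
}
```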
4: Document
From the example above we can see that a Document is a collection of Fields which is then added to the IndexWriter; it is analogous to a row in a table.
5: Field
A Field holds one piece of content to be indexed, in key/value form.
①: Field.Store.YES, Field.Index.NOT_ANALYZED means the field value is stored as is and is not split by the StandardAnalyzer.
②: Field.Store.NO, Field.Index.ANALYZED means the value is not stored, but is split by the StandardAnalyzer before being indexed.
II. Search
Searching is relatively easy: given the terms we type in, Lucene can quickly locate them in the index library. In the example we can see IndexSearcher, QueryParser, and Hits.
1: IndexSearcher
This can be understood as opening, in read-only mode, the index library created by IndexWriter; its Search method is the bridge through which QueryParser queries are executed.
2: QueryParser
This provides a Parse method that converts the words we want to search for into a query expression Lucene can understand.
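For example, a sketch using the same QueryParser as in the listing above (the field names Title and Content are the ones defined in CreateIndex()); the standard Lucene query syntax supports a field:term form as well as boolean operators:

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

class QueryDemo
{
    static void Show()
    {
        var parser = new QueryParser(string.Empty, new StandardAnalyzer());

        // Single-field term query, as used in the example above
        Query q1 = parser.Parse("Content:popular");

        // Boolean combination across fields
        Query q2 = parser.Parse("Title:lucene AND Content:search");
    }
}
```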
3: Hits
Hits is a cursor over the matching results. Much like lazy loading in C#, it fetches results on demand rather than all at once, which improves performance.