DotLucene search engine Demo: Create an index

Source: Internet
Author: User
DotLucene is a powerful search engine designed specifically for .NET. Its official website hosts an online demo that takes about 0.1 seconds to search 3.5 GB of text data, and you can try it there yourself.
One of my own websites, 99 favorites (Note 1), has an online help page that reads its text with a StreamReader, while the rest of the site reads from a database. In my experience, whether the data came from a database or from an XML document, and no matter how well the database was tuned or how powerful the machine was, reading was never as fast as reading plain text files. You can test it at http://www.99scj.com: click the online help and it will pop up.
This article is based on a demo from the official DotLucene website. A few notes first:
1. The demo in this article uses DotLucene 1.4.3, the latest stable version.
2. The development environment is VS2005.
3. I have divided the demo into two parts. The first is a console program that creates the index, and it is the subject of this article. The second is a web program that searches the index created in this part.
4. The source code can be downloaded in the next section, because the two sections share the same solution.
Now, how do we use DotLucene to create an index?
What is an index? I do not fully understand it myself, but this is how I see it: an index exists to speed up data queries. Think of the table of contents at the front of a childhood textbook: Lesson 1, its title, and then a page number. That is an index. Creating an index with DotLucene means indexing a set of files into a directory.
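The table-of-contents idea can be sketched in a few lines of C#. This is only an illustration of the general "word to documents" lookup that search indexes rely on; the class name TinyInvertedIndex and its methods are hypothetical and are not part of DotLucene, whose on-disk index is far more elaborate.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only: NOT how DotLucene stores its index.
public class TinyInvertedIndex
{
    // Maps each lower-cased word to the names of the documents that contain it.
    private readonly Dictionary<string, List<string>> postings =
        new Dictionary<string, List<string>>();

    public void Add(string docName, string text)
    {
        char[] separators = { ' ', ',', '.', '\t', '\r', '\n' };
        string[] words = text.ToLowerInvariant()
            .Split(separators, StringSplitOptions.RemoveEmptyEntries);
        foreach (string word in words)
        {
            List<string> docs;
            if (!postings.TryGetValue(word, out docs))
            {
                docs = new List<string>();
                postings[word] = docs;
            }
            if (!docs.Contains(docName))
                docs.Add(docName);
        }
    }

    // Looks the word up directly instead of scanning every document.
    public List<string> Search(string word)
    {
        List<string> docs;
        return postings.TryGetValue(word.ToLowerInvariant(), out docs)
            ? new List<string>(docs)
            : new List<string>();
    }
}
```

After adding a few documents, Search("index") returns the names of the documents containing that word without rereading any of them, which is the speed-up the textbook analogy describes.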
Run VS2005 and choose File > New Project. In the dialog that pops up, select Visual Studio Solutions, select Blank Solution on the right, enter the name SearchDemo, and choose D:\ as the location.
Right-click the SearchDemo solution and choose Add > New Solution Folder, and name the folder Indexer. Then, in d:\SearchDemo, create a directory named wwwroot; as the name suggests, this directory is for the web program. In IIS Manager, create a new virtual directory named SearchDemo that points to d:\SearchDemo\wwwroot.
Right-click the SearchDemo solution in VS again and choose Add > New Solution Folder, and name this folder web. These two solution folders are virtual: they exist only inside VS, not on disk. Right-click the first solution folder, Indexer, and choose Add > New Project. In the dialog, select Visual C# > Windows on the left and Console Application on the right, enter the name Indexer, and click OK. VS then adds an Indexer directory under d:\SearchDemo. Finally, right-click the newly added web solution folder and choose Add > Existing Web Site > SearchDemo.
We now have two projects: the Indexer console project and the SearchDemo web project. This part covers only the first one, which creates the index. Before creating an index we need to answer two questions. First, where should the index be built? I build it under the D:\SearchDemo\wwwroot directory. Second, which files will be indexed? For convenience I keep them under d:\SearchDemo\wwwroot as well, in a new documents directory, so every file under documents will be indexed. This demo searches the DotLucene help documentation, so copy all of the downloaded help files into d:\SearchDemo\wwwroot\documents. The index directory must also be granted write permission.
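Before running the indexer it can help to confirm that the documents directory actually contains files and that the index directory is writable. The following is a minimal sketch using only System.IO; the class name IndexPaths and both method names are hypothetical helpers and are not part of the demo or of DotLucene.

```csharp
using System;
using System.IO;

// Hypothetical pre-flight helpers; not part of the original demo.
public static class IndexPaths
{
    // Returns true if a file can be created in (and deleted from) the directory.
    public static bool IsWritable(string directory)
    {
        try
        {
            Directory.CreateDirectory(directory); // no-op if it already exists
            string probe = Path.Combine(directory, Path.GetRandomFileName());
            File.WriteAllText(probe, "probe");
            File.Delete(probe);
            return true;
        }
        catch (UnauthorizedAccessException) { return false; }
        catch (IOException) { return false; }
    }

    // Counts files matching the pattern in the directory and all subdirectories.
    public static int CountDocuments(string directory, string pattern)
    {
        return Directory.Exists(directory)
            ? Directory.GetFiles(directory, pattern, SearchOption.AllDirectories).Length
            : 0;
    }
}
```

For example, calling IndexPaths.CountDocuments(@"d:\SearchDemo\wwwroot\documents", "*.html") before indexing shows how many help pages will be picked up, and IndexPaths.IsWritable on the index directory reports a permission problem early rather than in the middle of indexing.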
Now add a reference to Lucene.Net.dll in the Indexer console project.
Then add a class to the Indexer console project: IntranetIndexer.cs.
First, a word about these three lines:
doc.Add(Field.UnStored("text", ParseHtml(html)));
doc.Add(Field.Keyword("path", relativePath));
doc.Add(Field.Text("title", GetTitle(html)));
The index is composed of Document objects, and each Document object is composed of Field objects.
The official website describes the Field.UnStored method as follows: "Constructs a String-valued Field that is tokenized and indexed, but that is not stored in the index. Term vector will not be stored for this Field." With my limited English, I read this as: it constructs a string field that is split into tokens and indexed, but whose value is not stored in the index, and no term vector is stored for the field. I have to admit, with some embarrassment, that I have never understood what the term vector of a field means.
The code is as follows:

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

namespace Indexer
{
    public class IntranetIndexer
    {
        // Index writer
        private IndexWriter writer;
        // Root directory of the files to be indexed
        private string docRootDirectory;
        // File name pattern to match
        private string pattern;
        /// <summary>
        /// Initializes the index writer. The third constructor argument, true, creates a new index and overwrites any index that already exists in the directory; false would open an existing index instead.
        /// </summary>
        /// <param name="directory">The directory in which the index is created. Note that this is a string value. If the directory does not exist, it is created automatically.</param>
        public IntranetIndexer(string directory)
        {
            writer = new IndexWriter(directory, new StandardAnalyzer(), true);
            writer.SetUseCompoundFile(true);
        }
        public void AddDirection(DirectoryInfo directory, string pattern)
        {
            this.pattern = pattern;
            this.docRootDirectory = directory.FullName;
            AddSubDirectory(directory);
        }
        private void AddSubDirectory(DirectoryInfo directory)
        {
            foreach (FileInfo fi in directory.GetFiles(pattern))
            {
                // Each matching file is first turned into a Document object and then added to the index, because the index is composed of Document objects.
                AddHtmlToDocument(fi.FullName);
            }
            foreach (DirectoryInfo di in directory.GetDirectories())
            {
                // Recurse level by level until every subdirectory and file has been visited
                AddSubDirectory(di);
            }
        }
        private void AddHtmlToDocument(string path)
        {
            Document doc = new Document();
            string html;
            using (StreamReader sr = new StreamReader(path, System.Text.Encoding.Default))
            {
                html = sr.ReadToEnd();
            }
            // Path of the file relative to the document root, stored in the "path" field
            int relativePathStartsAt = this.docRootDirectory.EndsWith("\\") ? this.docRootDirectory.Length : this.docRootDirectory.Length + 1;
            string relativePath = path.Substring(relativePathStartsAt);

            doc.Add(Field.UnStored("text", ParseHtml(html)));
            doc.Add(Field.Keyword("path", relativePath));
            doc.Add(Field.Text("title", GetTitle(html)));
            writer.AddDocument(doc);
        }
        /// <summary>
        /// Removes all HTML tags from the text that was read and replaces &nbsp; with a space
        /// </summary>
        /// <param name="html"></param>
        /// <returns></returns>
        private string ParseHtml(string html)
        {
            string temp = Regex.Replace(html, "<[^>]*>", "");
            return temp.Replace("&nbsp;", " ");
        }
        /// <summary>
        /// Gets the title of the HTML document that was read
        /// </summary>
        /// <param name="html"></param>
        /// <returns></returns>
        private string GetTitle(string html)
        {
            Match m = Regex.Match(html, "<title>(.*)</title>");
            if (m.Groups.Count == 2)
                return m.Groups[1].Value;
            return "the title of this document is unknown";
        }

        public void Close()
        {
            // Optimize the index for searching, then close the writer
            writer.Optimize();
            writer.Close();
        }
    }
}
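To see what actually ends up in the "text", "path", and "title" fields, the string handling from the listing above can be tried on its own, without Lucene.Net. The sketch below repeats the same regular expressions and the same relative-path arithmetic in a stand-alone class; FieldValueDemo, its method names, the sample HTML, and the sample file path are all hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical stand-alone copy of the listing's string handling; no Lucene.Net needed.
public static class FieldValueDemo
{
    // Same idea as ParseHtml: drop every tag, then turn &nbsp; into a space.
    public static string StripHtml(string html)
    {
        string temp = Regex.Replace(html, "<[^>]*>", "");
        return temp.Replace("&nbsp;", " ");
    }

    // Same idea as GetTitle, using Match.Success to detect a missing title.
    public static string ExtractTitle(string html)
    {
        Match m = Regex.Match(html, "<title>(.*)</title>");
        return m.Success ? m.Groups[1].Value : "(unknown)";
    }

    // Same idea as the relativePath computation in AddHtmlToDocument.
    public static string RelativePath(string root, string fullPath)
    {
        int start = root.EndsWith("\\") ? root.Length : root.Length + 1;
        return fullPath.Substring(start);
    }

    public static void Main()
    {
        string html = "<html><head><title>Hello</title></head>"
                    + "<body><p>Fast&nbsp;search</p></body></html>";
        Console.WriteLine(ExtractTitle(html));   // Hello
        Console.WriteLine(StripHtml(html));      // HelloFast search
        Console.WriteLine(RelativePath(
            @"d:\SearchDemo\wwwroot\documents",
            @"d:\SearchDemo\wwwroot\documents\api\index.html")); // api\index.html
    }
}
```

Note that stripping tags leaves the title text inside the body text ("HelloFast search"), so a word from the title can also be found through the "text" field. The pattern "<title>(.*)</title>" is case-sensitive and does not match across line breaks, so pages with an upper-case TITLE tag or a title split over several lines fall back to the unknown-title value.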
