Full-Text Search
After a few busy days I finally got a simple full-text search working, so here is a recap.
This article describes what Lucene.Net is, what it can do, and how to use it. It closes with an example of full-text search implemented with Lucene.Net.
1. What is Lucene.Net?
Lucene.Net started as an open-source project and later went commercial. Lucene.Net 2.0 has been released, but it costs money. Its fate is somewhat similar to that of FreeTextBox, which took the commercial route with version 2.0, after the 1.6.5 release. FreeTextBox 2.0 provides a free version of the DLL, while the source-code version requires purchasing a commercial license. The 1.6.5 source code is still available, so most of the internal details can be seen, but the Mozilla browser support added in 2.0 can only be inferred from the HTML and JavaScript it generates.
Lucene is a commonly used indexing API in the Java world. It provides a way to create indexes for text data and to search them. (Reference: NLucene and Lucene.NET.) NLucene was the first .NET port, a .NET-style version that used .NET naming conventions and class library design. However, for lack of time and energy, the NLucene project leader released only a 1.2 beta version. After the Lucene.Net project appeared, NLucene had no new plans.
Lucene.Net originally positioned itself as an up-to-date .NET port of Lucene. It adopted .NET's recommendations only in naming, and its main goal was compatibility with Java Lucene. First, the index format is compatible, so the two can work together. Second, the names are close (with only small differences, such as capitalization), which makes it easier for developers to reuse Java Lucene code and materials.
At some point the Lucene.Net project abandoned its open-source plans and went commercial. It also deleted the files it had already published on SourceForge. Around the same time the DotLucene project appeared on SourceForge. In protest against Lucene.Net, DotLucene took the Lucene.Net code almost unchanged as its starting point. (https://sourceforge.net/forum/forum.php?thread_id=1153933&forum_id=408004).
Put plainly, Lucene.Net is an information-retrieval library that you can use to add indexing and search to your application.
Lucene users do not need to study full-text search in depth. Learning how to use a few classes in the library and how to call their methods is enough to add full-text search to an application.
But do not expect Lucene to be a search engine like Google or Baidu. It is just a tool, a library. You can also think of it as a good, easy-to-use API that encapsulates indexing and search. With this API you can conveniently do many search-related things. For an application developer (as opposed to a professional search-engine developer) who needs simple full-text search in an application, it is enough.
2. What can Lucene.Net do?
Lucene can index and search any data, regardless of the format of the data source, as long as the data can be converted to text. That is, whether the source is MS Word, HTML, PDF, or some other kind of file, once you can extract its content as text, Lucene can index it and search it.
3. How do you use Lucene.Net?
It boils down to two steps: create an index, and use the index. The index stores the key information obtained by analyzing the data source that will be searched, leaving markers for later searches, much like a table of contents in Word (this is my personal understanding). Using the index means that, at search time, the information in the index is used to locate and extract what we need from the data source.
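To make the "create an index, then use it" idea concrete, here is a minimal sketch of a toy inverted index. It is only an illustration of the concept under simplifying assumptions (whitespace and punctuation splitting, exact word matches). The class and method names are hypothetical, and this is not how Lucene stores or searches its index.

```csharp
using System;
using System.Collections.Generic;

// Toy inverted index: maps each lowercase word to the ids of the documents containing it.
// Illustration only; the names here are hypothetical and unrelated to the Lucene.Net API.
public class ToyIndex
{
    private readonly Dictionary<string, SortedSet<int>> postings =
        new Dictionary<string, SortedSet<int>>();

    // "Create an index": split the text into words and record where each word occurs.
    public void Add(int docId, string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n', ',', '.', ';', ':', '!', '?' };
        foreach (string word in text.ToLowerInvariant()
                     .Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            SortedSet<int> ids;
            if (!postings.TryGetValue(word, out ids))
            {
                ids = new SortedSet<int>();
                postings[word] = ids;
            }
            ids.Add(docId);
        }
    }

    // "Use the index": look the word up directly instead of rescanning every document.
    public IEnumerable<int> Search(string word)
    {
        SortedSet<int> ids;
        return postings.TryGetValue(word.ToLowerInvariant(), out ids)
            ? (IEnumerable<int>)ids
            : new int[0];
    }
}

public static class ToyIndexDemo
{
    public static void Main()
    {
        ToyIndex index = new ToyIndex();
        index.Add(1, "Lucene is a search library");
        index.Add(2, "A library of books");
        Console.WriteLine(string.Join(",", index.Search("library"))); // prints 1,2
    }
}
```

The point of the sketch is the trade-off: the work of reading every document is done once, when the index is built, so that each later search is a lookup.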
Please take a look at the example:
The class that creates the index:
using System;
using System.IO;
using System.Text.RegularExpressions;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public class IntranetIndexer
{
    // Index writer
    private IndexWriter writer;
    // Root directory of the files to be written to the index
    private string docRootDirectory;
    // File name patterns to match
    private string[] pattern;

    /// <summary>
    /// Initializes the index writer. Passing true as the third argument creates a new index and overwrites any index that already exists in the directory.
    /// </summary>
    /// <param name="directory">The directory in which to create the index. Note that this is a string value; if the directory does not exist, it is created automatically.</param>
    public IntranetIndexer(string directory)
    {
        writer = new IndexWriter(directory, new StandardAnalyzer(), true);
        writer.SetUseCompoundFile(true);
    }
    public void AddDirectory(DirectoryInfo directory, string[] pattern)
    {
        this.docRootDirectory = directory.FullName;
        this.pattern = pattern;
        AddSubdirectory(directory);
    }

    // Index every matching file in this directory, then recurse into its subdirectories
    private void AddSubdirectory(DirectoryInfo directory)
    {
        for (int i = 0; i < pattern.Length; i++)
        {
            foreach (FileInfo fi in directory.GetFiles(pattern[i]))
            {
                AddHtmlDocument(fi.FullName);
            }
        }
        foreach (DirectoryInfo di in directory.GetDirectories())
        {
            AddSubdirectory(di);
        }
    }
    public void AddHtmlDocument(string path)
    {
        string exName = Path.GetExtension(path).ToLower();
        bool isHtml = exName == ".html" || exName == ".htm";
        Document doc = new Document();
        string html;
        // HTML and text files are read with the system default encoding, everything else as Unicode (UTF-16)
        if (isHtml || exName == ".txt")
        {
            using (StreamReader sr = new StreamReader(path, System.Text.Encoding.Default))
            {
                html = sr.ReadToEnd();
            }
        }
        else
        {
            using (StreamReader sr = new StreamReader(path, System.Text.Encoding.Unicode))
            {
                html = sr.ReadToEnd();
            }
        }
        // Path of the file relative to the document root
        int relativePathStartsAt = this.docRootDirectory.EndsWith("\\") ? this.docRootDirectory.Length : this.docRootDirectory.Length + 1;
        string relativePath = path.Substring(relativePathStartsAt);
        string title = Path.GetFileName(path);
        // Strip the tags only if the file is a web page
        if (isHtml)
        {
            doc.Add(Field.UnStored("text", ParseHtml(html)));
        }
        else
        {
            doc.Add(Field.UnStored("text", html));
        }
        doc.Add(Field.Keyword("path", relativePath));
        // Two "title" fields are added: the page title first, then the file name.
        // doc.Get("title") returns the first value added.
        doc.Add(Field.Text("title", GetTitle(html)));
        doc.Add(Field.Text("title", title));
        writer.AddDocument(doc);
    }
    /// <summary>
    /// Removes the tags from a web page
    /// </summary>
    /// <param name="html">The web page</param>
    /// <returns>The page text with the tags removed</returns>
    private string ParseHtml(string html)
    {
        string temp = Regex.Replace(html, "<[^>]*>", "");
        // Replace non-breaking-space entities with ordinary spaces
        return temp.Replace("&nbsp;", " ");
    }
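    /// <summary>
    /// Gets the page title
    /// </summary>
    /// <param name="html">The web page</param>
    /// <returns>The contents of the title element, or a placeholder if there is none</returns>
    private string GetTitle(string html)
    {
        Match m = Regex.Match(html, "<title>(.*?)</title>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Check Success rather than Groups.Count, which is 2 even when nothing matched
        if (m.Success)
            return m.Groups[1].Value;
        return "document title unknown";
    }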
    /// <summary>
    /// Optimizes the index and closes the writer
    /// </summary>
    public void Close()
    {
        writer.Optimize();
        writer.Close();
    }
}
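The tag-stripping and title-extraction helpers in the class above rely only on regular expressions, so they can be tried on their own. The following is a minimal standalone sketch. The class name HtmlText, the method name StripTags, and the fallback parameter are hypothetical names chosen for the illustration, and the `&nbsp;` replacement is an assumption about what the second step is meant to do.

```csharp
using System;
using System.Text.RegularExpressions;

// Standalone sketch of the two regex helpers; names are hypothetical.
public static class HtmlText
{
    // Remove anything that looks like a tag, then turn &nbsp; entities into spaces.
    public static string StripTags(string html)
    {
        string temp = Regex.Replace(html, "<[^>]*>", "");
        return temp.Replace("&nbsp;", " ");
    }

    // Return the contents of the first <title> element, or the fallback if there is none.
    public static string GetTitle(string html, string fallback)
    {
        Match m = Regex.Match(html, "<title>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return m.Success ? m.Groups[1].Value.Trim() : fallback;
    }

    public static void Main()
    {
        string page = "<html><head><TITLE>Home</TITLE></head>"
                    + "<body><p>Hello&nbsp;world</p></body></html>";
        Console.WriteLine(GetTitle(page, "(unknown)"));        // prints Home
        Console.WriteLine(GetTitle("plain text", "(unknown)")); // prints (unknown)
        Console.WriteLine(StripTags(page));                     // prints HomeHello world
    }
}
```

The last line of output shows a limitation of this approach: because tags are replaced with nothing, text from adjacent elements is glued together ("HomeHello"). Replacing tags with a space instead would keep words separate for the analyzer. A regular expression is also not a real HTML parser, so script blocks and comments would need extra handling.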
A Document object is created first, and then several Field objects are added to it. You can think of the Document object as a virtual file from which you will retrieve information later. Each Field can be regarded as metadata that describes this virtual file. There are four types of Field:
Keyword
This type of data is not analyzed, but it is indexed and stored in the index.
UnIndexed
This type of data is neither analyzed nor indexed, but it is stored in the index.
UnStored
The opposite of UnIndexed: the data is analyzed and indexed, but not stored.
Text
Similar to UnStored. If the value is a string, it is also stored. If the value is a reader, it is not stored, just as with UnStored.
Finally, each document is added to the index.
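The four field types differ only in three yes/no properties: whether the value is analyzed, whether it is indexed, and whether it is stored. The sketch below simply tabulates the description above. FieldKind is a hypothetical class written for this illustration and is not part of Lucene.Net.

```csharp
using System;

// Hypothetical model of the four field types described in the text.
public class FieldKind
{
    public readonly string Name;
    public readonly bool Analyzed;
    public readonly bool Indexed;
    public readonly bool Stored;

    public FieldKind(string name, bool analyzed, bool indexed, bool stored)
    {
        Name = name;
        Analyzed = analyzed;
        Indexed = indexed;
        Stored = stored;
    }

    public static readonly FieldKind[] All =
    {
        new FieldKind("Keyword",   false, true,  true),
        new FieldKind("UnIndexed", false, false, true),
        new FieldKind("UnStored",  true,  true,  false),
        new FieldKind("Text",      true,  true,  true),  // for a string value; a reader value is not stored
    };

    public static void Main()
    {
        Console.WriteLine("{0,-10}{1,-10}{2,-9}{3}", "Type", "Analyzed", "Indexed", "Stored");
        foreach (FieldKind k in All)
        {
            Console.WriteLine("{0,-10}{1,-10}{2,-9}{3}", k.Name, k.Analyzed, k.Indexed, k.Stored);
        }
    }
}
```

Read against the indexer class, this explains the choices made there: "path" is a Keyword because it must be retrievable exactly as written, "text" is UnStored because the full page body only needs to be searchable, and "title" is Text because it is both searched and displayed.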
The following code searches the index:
// Create a searcher over the index
IndexSearcher searcher = new IndexSearcher(indexDirectory);
// Parse the query string against the index's "text" field
Query query = QueryParser.Parse(this.Q, "text", new StandardAnalyzer());
// Run the search and put the results in hits
Hits hits = searcher.Search(query);
// Total number of records found
this.total = hits.Length();
// Set up highlighting of the matched terms
QueryHighlightExtractor highlighter = new QueryHighlightExtractor(query, new StandardAnalyzer(), "<font color=red>", "</font>");
The first step uses IndexSearcher to open the index for the searches that follow. Its parameter is the path to the index.
The second step uses QueryParser to convert a human-readable query, such as the single word lucene or a more advanced form such as lucene AND .net, into the Query object that Lucene uses internally.
The third step performs the search and returns the results in the Hits collection. Note that Lucene does not load all the results into Hits at once. To save memory, it fetches them a portion at a time.
The results of the search are then processed and displayed on the page:
for (int i = startAt; i < resultsCount; i++)
{
    Document doc = hits.Doc(i);
    string path = doc.Get("path");
    string location = Server.MapPath("documents") + "\\" + path;
    string exName = Path.GetExtension(path).ToLower();
    string plainText;
    string title = doc.Get("title");
    // Read the original file again so that a highlighted sample can be built from it
    if (exName == ".html" || exName == ".htm" || exName == ".txt")
    {
        using (StreamReader sr = new StreamReader(location, System.Text.Encoding.Default))
        {
            plainText = ParseHtml(sr.ReadToEnd());
        }
    }
    else
    {
        using (StreamReader sr = new StreamReader(location, System.Text.Encoding.Unicode))
        {
            plainText = sr.ReadToEnd();
        }
    }
    // Add a row to the DataTable
    DataRow row = this.Results.NewRow();
    row["title"] = title;
    string ip = Request.Url.Host; // Get the server host name
    // Request.Url.Port is available if the site does not run on the default port
    row["path"] = @"http://" + ip + "/webui/search/documents/" + path;
    row["sample"] = highlighter.GetBestFragments(plainText, 80, 2, "");
    this.Results.Rows.Add(row);
}
searcher.Close(); // Close the searcher
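The loop above starts at a start index and stops at an upper bound, but the excerpt does not show how those two values are computed. One plausible way, assuming a 1-based page number and a fixed page size, is sketched below. The Paging class and its method names are hypothetical and are not taken from the original page.

```csharp
using System;

// Hypothetical helper for computing the bounds of one page of search results.
public static class Paging
{
    // Index of the first result (0-based) for a 1-based page number.
    public static int StartAt(int page, int pageSize)
    {
        return Math.Max(0, (page - 1) * pageSize);
    }

    // Exclusive upper bound for the loop; never beyond the total number of hits.
    public static int EndAt(int page, int pageSize, int total)
    {
        return Math.Min(total, StartAt(page, pageSize) + pageSize);
    }

    public static void Main()
    {
        int total = 23;
        int pageSize = 10;
        for (int page = 1; page <= 3; page++)
        {
            // prints: page 1: [0, 10)   page 2: [10, 20)   page 3: [20, 23)  (one per line)
            Console.WriteLine("page {0}: [{1}, {2})", page,
                StartAt(page, pageSize), EndAt(page, pageSize, total));
        }
    }
}
```

Clamping the upper bound to the total number of hits matters because asking Hits for a document beyond its length fails. Fetching only one page per request also fits the way Hits loads results a portion at a time.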
For a more advanced, comprehensive, and deeper understanding of Lucene.Net, please refer to these websites:
http://www.alphatom.com/
http://blog.tianya.cn/blogger/view_blog.asp?BlogName=aftaft