Lucene.Net + Pangu Word Segmentation

Source: Internet
Author: User

Most websites offer a search function. There are three common ways to implement it.

The first is the worst and least recommended: a fuzzy database query such as SELECT * FROM table WHERE field LIKE '%XXX%'. Its disadvantages are obvious:

(1) It cannot match keywords that do not appear next to each other

(2) It forces a full table scan, which is slow
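
Drawback (1) can be reproduced without a database. The following is a minimal sketch, assuming that a LIKE '%...%' pattern behaves like a plain substring test; the class name LikeDemo and the sample strings are made up for illustration.

```csharp
using System;

public static class LikeDemo
{
    // Mimics field LIKE '%pattern%': true only if pattern occurs
    // as one contiguous substring of field.
    public static bool Like(string field, string pattern)
    {
        return field.IndexOf(pattern, StringComparison.Ordinal) >= 0;
    }

    public static void Main()
    {
        string msg = "Nanjing is the capital of Jiangsu";
        // Both keywords are present, but they are not adjacent.
        Console.WriteLine(Like(msg, "Jiangsu Nanjing")); // False
        Console.WriteLine(Like(msg, "Jiangsu"));         // True
    }
}
```

A row that mentions both keywords is missed simply because they are not written side by side, which is the case a segmenting search engine handles.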

The second: SQL Server full-text search

Example: SELECT * FROM table WHERE msg = 'Jiangsu Nanjing'

With full-text search this can be written as SELECT * FROM table WHERE CONTAINS(msg, 'Jiangsu AND Nanjing');

The results then include rows that contain both Jiangsu and Nanjing, matching is fast, and word segmentation is applied.

Disadvantages:

(1) Only a licensed copy of SQL Server supports this feature

(2) The database's word segmentation is not flexible: you cannot modify its dictionary

The third: Lucene.Net (the focus of this article)

Lucene.Net is only a full-text search library, not a complete search engine. Its job is simple: you hand data to Lucene.Net, and later you query that data back from Lucene.Net, so it can be thought of as a database with full-text retrieval. Lucene.Net indexes text only; anything that is not text must first be converted to text. Lucene segments the text into words at the time it is saved, which is why searching is fast.
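
Why segmenting at save time makes searching fast can be seen in a toy inverted index. This is a sketch only: the class name TinyIndex is hypothetical, splitting on spaces stands in for a real analyzer, and none of this is Lucene.Net's actual implementation.

```csharp
using System;
using System.Collections.Generic;

public static class TinyIndex
{
    // term -> ids of the documents that contain it
    private static readonly Dictionary<string, SortedSet<int>> postings =
        new Dictionary<string, SortedSet<int>>();

    // Segment once, at save time. Splitting on spaces stands in for an analyzer.
    public static void Add(int docId, string text)
    {
        foreach (string term in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            SortedSet<int> ids;
            if (!postings.TryGetValue(term, out ids))
            {
                ids = new SortedSet<int>();
                postings[term] = ids;
            }
            ids.Add(docId);
        }
    }

    // A query is a dictionary lookup, not a scan over every document.
    public static IEnumerable<int> Search(string term)
    {
        SortedSet<int> ids;
        return postings.TryGetValue(term, out ids) ? ids : new SortedSet<int>();
    }

    public static void Main()
    {
        Add(1, "lucene is a search library");
        Add(2, "pangu is a word segmenter");
        Console.WriteLine(string.Join(",", Search("a")));     // 1,2
        Console.WriteLine(string.Join(",", Search("pangu"))); // 2
    }
}
```

The cost of cutting the text into words is paid once per document when it is added; each search afterwards only looks up the term.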

Word segmentation is the key to good search results:

In Lucene, each segmentation algorithm is a separate class, and all of these classes inherit from the Analyzer class. Different algorithms have different strengths and weaknesses.

For example, the built-in StandardAnalyzer splits English on spaces, punctuation and so on, and splits Chinese into single characters, counting each Chinese character as one word (so-called unigram segmentation).

Bigram segmentation: every two adjacent Chinese characters count as one word. "欢迎你们大家" ("welcome, all of you") is split into "欢迎", "迎你", "你们", "们大" and "大家". A bigram analyzer can be downloaded from the Internet: CJKAnalyzer.
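
Unigram and bigram segmentation differ only in the window size, which a few lines of C# can show. This is a sketch under assumptions: the class name NGramDemo is hypothetical, the Chinese sample sentence is assumed to be the phrase behind the "welcome you all" example, and real analyzers such as StandardAnalyzer or CJKAnalyzer also deal with punctuation and Latin text, which this does not.

```csharp
using System;
using System.Collections.Generic;

public static class NGramDemo
{
    // Returns every run of n consecutive characters:
    // n = 1 gives unigrams, n = 2 gives overlapping bigrams.
    public static List<string> Split(string text, int n)
    {
        var tokens = new List<string>();
        for (int i = 0; i + n <= text.Length; i++)
        {
            tokens.Add(text.Substring(i, n));
        }
        return tokens;
    }

    public static void Main()
    {
        // Unigrams: 欢/迎/你/们/大/家
        Console.WriteLine(string.Join("/", Split("欢迎你们大家", 1)));
        // Bigrams: 欢迎/迎你/你们/们大/大家
        Console.WriteLine(string.Join("/", Split("欢迎你们大家", 2)));
    }
}
```

Note that the bigram output contains tokens such as "迎你" and "们大" that are not real words, which is the weakness dictionary-based segmentation addresses.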

Dictionary-based segmentation: words are cut according to a dictionary, which improves the accuracy of segmentation. Examples include Paoding and Pangu. It is slower than unigram and bigram segmentation but more accurate.
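
To give a feel for dictionary-based segmentation, here is a sketch of forward maximum matching, a common textbook technique. It is not a description of how Pangu works internally; the class name DictSegmenter, the three-word dictionary and the sample sentence are all assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class DictSegmenter
{
    // Forward maximum matching: at each position take the longest dictionary
    // word, falling back to a single character when nothing matches.
    public static List<string> Segment(string text, HashSet<string> dict, int maxLen)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int len = Math.Min(maxLen, text.Length - pos);
            while (len > 1 && !dict.Contains(text.Substring(pos, len)))
            {
                len--;
            }
            tokens.Add(text.Substring(pos, len));
            pos += len;
        }
        return tokens;
    }

    public static void Main()
    {
        var dict = new HashSet<string> { "欢迎", "你们", "大家" };
        // Expected tokens: 欢迎/你们/大家
        Console.WriteLine(string.Join("/", Segment("欢迎你们大家", dict, 4)));
    }
}
```

The quality of the result depends entirely on the dictionary, which is why being able to edit it matters.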

Note: Lucene.Net's built-in analyzers segment Chinese poorly, so a third-party segmenter is needed: the open-source Pangu segmenter (it can be downloaded from the Open Source China community, together with a detailed demo and the DLL files).

Try Pangu segmentation with the following steps and code:

Download the Pangu package for Lucene from the Open Source China community.

Step 1: Copy the Dictionaries folder from the Bin folder of the WebDemo to the project root, rename it to Dict, and set "Copy to Output Directory" to "Copy if newer" for the files inside it.

Step 2: Add references to Lucene.Net.dll and PanGu.Lucene.Analyzer.dll.

// Segment a sentence with Pangu and list each token
Analyzer analyzer = new Lucene.Net.Analysis.PanGu.PanGuAnalyzer();
TokenStream tokenStream = analyzer.TokenStream("", new System.IO.StringReader("Beijing, hi welcome you All"));
Lucene.Net.Analysis.Token token = null;
// Next() returns null once the stream is exhausted
while ((token = tokenStream.Next()) != null)
{
    listBox1.Items.Add(token.TermText());
}

Dictionary-based segmentation needs a dictionary, and the dictionary can be modified. Open it with PanGu.Lucene.ImportTool.exe from the same Bin folder and edit its entries; the segmenter then works with the updated words.

The overall flow can be understood as follows:

First create an index directory and open it, then use Lucene's IndexWriter class to write Document objects into the index. The terms of each document come from segmenting the article text with Pangu.

[Figure: basic idea of Lucene + Pangu]

The first listing below builds the Lucene index (it hands the data to Lucene, segmented with Pangu); the second listing implements the search.

// Namespaces used: System.IO, Lucene.Net.Store, Lucene.Net.Index,
// Lucene.Net.Documents, Lucene.Net.Analysis.PanGu
protected void Button4_Click(object sender, EventArgs e)
{
    // The case of the path must match the folder on disk, or an error occurs.
    // The index files are created in this directory.
    string indexPath = Server.MapPath(@"/demo/lucenedir");

    // Open the index directory. "FS" stands for file system.
    FSDirectory directory = FSDirectory.Open(new DirectoryInfo(indexPath), new NativeFSLockFactory());

    // IndexReader is the class that reads the index. IndexExists checks whether
    // the index folder and its signature files already exist.
    bool isUpdate = IndexReader.IndexExists(directory);
    if (isUpdate)
    {
        // Only one writer may write to the index at a time. Opening the
        // directory with IndexWriter locks the index files automatically.
        // If the directory is still locked (for example, the program exited
        // unexpectedly during indexing), unlock it first.
        // (Caveat: if another request is still writing and holds the lock,
        // this forces the lock open. This problem is addressed later.)
        if (IndexWriter.IsLocked(directory))
        {
            IndexWriter.Unlock(directory);
        }
    }

    // Write to the index. This is where the lock is taken.
    IndexWriter writer = new IndexWriter(directory, new PanGuAnalyzer(), !isUpdate, Lucene.Net.Index.IndexWriter.MaxFieldLength.UNLIMITED);
    for (int i = 1; i <= 10; i++)
    {
        // Pay attention to the encoding of the files here
        string txt = File.ReadAllText(Server.MapPath(@"/demo/test File/" + i + ".txt"), System.Text.Encoding.Default);
        Document document = new Document(); // represents one document

        // Field.Store.YES: store the original value. Only then can it be read
        // back with doc.Get("number"). Field.Index.NOT_ANALYZED: do not segment.
        document.Add(new Field("number", i.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));

        // Field.Index.ANALYZED: segment the value. Fields that need full-text
        // (fuzzy) search must be analyzed.
        // TermVector.WITH_POSITIONS_OFFSETS: store not only the terms but also
        // their positions and offsets (the distance between terms).
        document.Add(new Field("body", txt, Field.Store.YES, Field.Index.ANALYZED, Lucene.Net.Documents.Field.TermVector.WITH_POSITIONS_OFFSETS));
        writer.AddDocument(document);
    }
    writer.Close();    // releases the lock automatically
    directory.Close(); // do not forget to close, or the new index cannot be searched
}


// Namespaces used: System.IO, Lucene.Net.Store, Lucene.Net.Index,
// Lucene.Net.Search, Lucene.Net.Documents
protected void Button5_Click(object sender, EventArgs e)
{
    // The index built above is stored in this directory
    string indexPath = Server.MapPath(@"/demo/lucenedir");
    string kw = "C#";
    kw = kw.ToLower();

    FSDirectory directory = FSDirectory.Open(new DirectoryInfo(indexPath), new NoLockFactory());
    IndexReader reader = IndexReader.Open(directory, true); // true = read-only
    IndexSearcher searcher = new IndexSearcher(reader);

    // Search criteria
    PhraseQuery query = new PhraseQuery();
    // Alternative: split on spaces and let the user do the segmentation,
    // e.g. "computer professional".
    // foreach (string word in kw.Split(' '))
    // {
    //     query.Add(new Term("body", word));
    // }
    // More conditions can be added; they are combined with AND and their order does not matter.
    // query.Add(new Term("body", "language"));
    // query.Add(new Term("body", "university student"));
    query.Add(new Term("body", kw)); // articles whose body contains kw

    // Maximum distance between the terms of the query. Terms that are too far
    // apart in an article are not a meaningful match (for example "college
    // student" and "resume" separated by many words).
    query.SetSlop(100); // the value was lost in the source; 100 is an assumed example

    // TopScoreDocCollector is the container for the query results
    TopScoreDocCollector collector = TopScoreDocCollector.create(1000, true); // 1000 is likewise assumed
    searcher.Search(query, null, collector); // run the query and fill the collector

    // GetTotalHits() returns the total number of hits. TopDocs(300, 20) would
    // return 20 documents starting at hit 300 (hits 300 to 320), which can be
    // used to implement paging. Here all hits are fetched.
    ScoreDoc[] docs = collector.TopDocs(0, collector.GetTotalHits()).scoreDocs;
    this.ListBox1.Items.Clear();
    for (int i = 0; i < docs.Length; i++)
    {
        // ScoreDoc[] holds only document ids, so the full results are not
        // loaded into memory at once, which reduces memory pressure.
        // docs[i].doc is the id Lucene assigned to the document internally.
        int docId = docs[i].doc;

        // Load the full Document for that id
        Document doc = searcher.Doc(docId);

        // Read back the stored field values
        this.ListBox1.Items.Add(doc.Get("number") + "\n");
        this.ListBox1.Items.Add(doc.Get("body") + "\n");
        this.ListBox1.Items.Add("-----------------------\n");
    }
}
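
The paging remark in the listing's comments comes down to simple arithmetic on a start offset and a page size. A minimal sketch without Lucene, where the class name Paging and the one-based page numbering are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Paging
{
    // Zero-based offset of the first hit on a one-based page.
    public static int StartOf(int page, int pageSize)
    {
        return (page - 1) * pageSize;
    }

    // Returns the hits of one page. With Lucene.Net the same two numbers
    // would be passed as collector.TopDocs(start, pageSize).
    public static List<T> PageOf<T>(IEnumerable<T> hits, int page, int pageSize)
    {
        return hits.Skip(StartOf(page, pageSize)).Take(pageSize).ToList();
    }

    public static void Main()
    {
        Console.WriteLine(StartOf(16, 20)); // 300
        List<int> ids = PageOf(Enumerable.Range(0, 1000), 16, 20);
        Console.WriteLine(ids.First() + ".." + ids.Last()); // 300..319
    }
}
```

Fetching only one page of ids at a time keeps the number of Document objects loaded per request small.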


Note: The search is case-sensitive. To make it case-insensitive, convert all text to lowercase (or all to uppercase) when building the index, and apply the same conversion to the keyword at search time.
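
The idea in this note is to apply one normalization function on both sides. A small sketch, in which the class name CaseFolding and the in-memory term set are stand-ins for the real index:

```csharp
using System;
using System.Collections.Generic;

public static class CaseFolding
{
    // Stand-in for the index: the set of stored terms.
    private static readonly HashSet<string> terms = new HashSet<string>();

    // The single place where case is normalized.
    public static string Normalize(string s)
    {
        return s.ToLowerInvariant();
    }

    // Normalize at index time...
    public static void Index(string term)
    {
        terms.Add(Normalize(term));
    }

    // ...and apply the same normalization to the keyword at search time.
    public static bool Contains(string keyword)
    {
        return terms.Contains(Normalize(keyword));
    }

    public static void Main()
    {
        Index("C#");
        Console.WriteLine(Contains("c#")); // True
        Console.WriteLine(Contains("C#")); // True
    }
}
```

If only one side is normalized, a keyword such as "C#" never matches the stored "c#", which is why the search listing lowercases kw before building the query.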
