Contents
- 1. PanGu word segmentation references
- 2. Preparing to use PanGuAnalyzer
- 3. Create an index
- 4. Simple search
This article describes how to use eaglet's PanGu word segmentation to create indexes and perform simple searches in Lucene.Net. PanGu word segmentation is eaglet's signature work. If you have not tried it yet, I hope this article helps.
1. PanGu word segmentation references
http://www.cnblogs.com/eaglet/tag/%e5%88%86%e8%af%8d/
http://pangusegment.codeplex.com/
http://hubbledotnet.51aspx.com/
http://home.cnblogs.com/group/topic/31349-6.html
What experts contribute is not only the tools, class libraries, and open-source projects they selflessly provide, but also their erudition, their depth of understanding, and their modest, low-key attitude. That kind of personal charm is very attractive. A tribute to eaglet.
2. Preparing to use PanGuAnalyzer
(1) The PanGu.xml file
This file stores the common configuration of PanGu word segmentation. DictionaryPath specifies the directory where the dictionaries are located and can be a relative or an absolute path; MatchOptions holds the word segmentation options; Parameters holds the word segmentation parameters. This demo uses the default configuration. In the file's properties, set the Build Action of the XML file to "Embedded Resource".
(2) As you know, maintaining the segmentation dictionaries is also an important part of working with PanGu word segmentation. The Dictionaries folder in the bin directory of the QueryApp console application was downloaded from CodePlex (by default it is stored in the application's bin directory). This location matches the DictionaryPath node configured in PanGu.xml in (1).
(3) The application (in this article, the QueryApp console application) must reference Lucene.Net.dll, PanGu.dll, and PanGu.Lucene.Analyzer.dll together.
3. Create an index
Instantiating Lucene.Net's IndexWriter requires an Analyzer instance. Creating an index with PanGu word segmentation is simple: pass a PanGuAnalyzer instance to the IndexWriter constructor and build the index step by step:
    public static void PrepareIndex(bool isPanGu)
    {
        Analyzer analyzer = null;
        if (isPanGu)
        {
            analyzer = new PanGuAnalyzer(); // PanGu analyzer
        }
        else
        {
            analyzer = new StandardAnalyzer(Version.LUCENE_29);
        }

        DirectoryInfo dirInfo = Directory.CreateDirectory(Config.INDEX_STORE_PATH);
        LuceneIO.Directory directory = LuceneIO.FSDirectory.Open(dirInfo);
        // true: create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);

        CreateIndex(writer, "...", "... As far as I know, he is still a fat man and a piano enthusiast.");
        CreateIndex(writer, "Lucene test", "this is a test, concerning Lucene.Net, attention to Lao Zhao");
        CreateIndex(writer, "There are cattle in the blog", "Hello world. I know a master who has extensive knowledge and a geek attitude. He often comes to the garden to see");
        CreateIndex(writer, "Obama", "Is Obama the current U.S. president? Are you sure you are not Obaba Niu or Obaba Yang do not know ask old zhao");
        CreateIndex(writer, "Olympics", "the Olympic Games will come to the beautiful and enthusiastic South American country Brazil, that is, a place in the Amazon River basin");
        CreateIndex(writer, "write to yourself", "jeffwong of the blog Park, a new start, continue to work hard");

        writer.Optimize();
        writer.Close();
    }
4. Simple search
Searching is just as easy. Pass a PanGuAnalyzer instance as the analyzer parameter of QueryParser:
    public static void PanGuQueryTest(Analyzer analyzer, string field, string keyword)
    {
        QueryParser parser = new QueryParser(Version.LUCENE_29, field, analyzer);
        // Segment the keywords with PanGu before parsing
        string panguQueryWord = GetKeyWordsSplitBySpace(keyword, new PanGuTokenizer());
        Query query = parser.Parse(panguQueryWord);
        ShowQueryExpression(analyzer, query, keyword);
        SearchToShow(query);
        Console.WriteLine();
    }
The GetKeyWordsSplitBySpace function, which applies PanGu word segmentation to the keywords, comes directly from the example code provided by eaglet:
    public static string GetKeyWordsSplitBySpace(string keywords, PanGuTokenizer ktTokenizer)
    {
        StringBuilder result = new StringBuilder();
        ICollection<WordInfo> words = ktTokenizer.SegmentToWordInfos(keywords);

        foreach (WordInfo word in words)
        {
            if (word == null)
            {
                continue;
            }

            // Emit "word^boost.0 ", where the boost is 3 raised to the word's rank
            result.AppendFormat("{0}^{1}.0 ", word.Word, (int)Math.Pow(3, word.Rank));
        }

        return result.ToString().Trim();
    }
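To see what query string the function above hands to QueryParser, here is a minimal sketch that reproduces only its formatting logic, without the PanGu library. In Lucene query syntax, `term^n` boosts a term by the factor n, so each segmented word ends up weighted by 3 raised to its rank. The RankedWord class, the BoostSketch class, and the sample words and ranks are all hypothetical stand-ins for illustration; real ranks come from PanGu's WordInfo.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for PanGu's WordInfo: only the two members the function reads.
public class RankedWord
{
    public string Word;
    public int Rank;

    public RankedWord(string word, int rank)
    {
        Word = word;
        Rank = rank;
    }
}

public static class BoostSketch
{
    // Builds "word^boost.0 word^boost.0 ..." where boost = 3^rank.
    public static string BuildBoostedQuery(IEnumerable<RankedWord> words)
    {
        StringBuilder result = new StringBuilder();
        foreach (RankedWord word in words)
        {
            if (word == null)
            {
                continue; // skip empty entries, as the original function does
            }
            result.AppendFormat("{0}^{1}.0 ", word.Word, (int)Math.Pow(3, word.Rank));
        }
        return result.ToString().Trim();
    }

    public static void Main()
    {
        // Sample ranks are assumptions chosen for illustration.
        List<RankedWord> words = new List<RankedWord>
        {
            new RankedWord("lucene", 5), // 3^5 = 243
            new RankedWord("test", 1)    // 3^1 = 3
        };
        Console.WriteLine(BuildBoostedQuery(words)); // prints: lucene^243.0 test^3.0
    }
}
```

Because the boost grows exponentially with rank, a word that the segmenter ranks a few levels higher dominates the score of the lower-ranked words in the same query.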
I am still reading the source code of PanGuTokenizer's SegmentToWordInfos method and hope to understand it more thoroughly.
Finally, this article is intended for beginners. If you have experience or suggestions to share, please do not hesitate to offer them.
Download Demo: lucenenetapp