Lucene.Net: Using eaglet's PanGu word segmentation for indexing and search

Source: Internet
Author: User
Article directory
    • 1. PanGu word segmentation references
    • 2. Preparing to use PanGuAnalyzer
    • 3. Create an index
    • 4. Simple search

This article describes how to use eaglet's PanGu word segmentation to create indexes and perform simple searches in Lucene.Net. PanGu is eaglet's best-known work. If you have not tried it yet, I hope this article helps you get started.

1. PanGu word segmentation references

http://www.cnblogs.com/eaglet/tag/%e5%88%86%e8%af%8d/

http://pangusegment.codeplex.com/

http://hubbledotnet.51aspx.com/

http://home.cnblogs.com/group/topic/31349-6.html

What experts like this contribute is not only the tools, class libraries, and open-source projects they share selflessly, but also their erudition, their depth of understanding, and their modest, low-key attitude. That kind of character is very inspiring. My respects to eaglet.

 

2. Preparing to use PanGuAnalyzer

(1) The PanGu.xml file

This file stores the common configuration for PanGu word segmentation. DictionaryPath specifies the directory where the dictionary files are located; it can be a relative or an absolute path. MatchOptions holds the segmentation match options, and Parameters holds the segmentation parameters. This demo uses the default configuration. In the properties of the XML file, set its Build Action to "Embedded Resource".

(2) Maintaining the segmentation dictionary is an important part of working with PanGu. The Dictionaries folder in the bin directory of the QueryApp console application was downloaded from CodePlex (by default it is stored in the application's bin directory). Its location must match the DictionaryPath node configured in PanGu.xml in (1).

(3) The application (in this article, the QueryApp console application) must reference Lucene.Net.dll, PanGu.dll, and PanGu.Lucene.Analyzer.dll together.

 

3. Create an index

Instantiating Lucene.Net's IndexWriter requires an Analyzer instance. Creating an index with PanGu word segmentation is simple: pass a PanGuAnalyzer instance to the IndexWriter constructor, then build the index step by step:

    // LuceneIO appears to be a using-alias for Lucene.Net.Store.
    public static void PrepareIndex(bool isPanGu)
    {
        Analyzer analyzer = null;
        if (isPanGu)
        {
            analyzer = new PanGuAnalyzer(); // PanGu analyzer
        }
        else
        {
            analyzer = new StandardAnalyzer(Version.LUCENE_29);
        }

        DirectoryInfo dirInfo = Directory.CreateDirectory(Config.INDEX_STORE_PATH);
        LuceneIO.Directory directory = LuceneIO.FSDirectory.Open(dirInfo);
        // true = create a new index, overwriting any existing one
        IndexWriter writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.LIMITED);

        // The start of this first call is truncated in the source; the elided parts are left as "…".
        CreateIndex(writer, "…", "… As far as I know, he is still a fat man and a piano enthusiast.");
        CreateIndex(writer, "Lucene test", "this is a test, concerning Lucene.Net's attention to Lao Zhao");
        CreateIndex(writer, "There are cattle in the blog", "Hello world. I know a master who has extensive knowledge and a geek attitude. He often comes to the garden to see");
        CreateIndex(writer, "Obama", "Is Obama the current U.S. president? Are you sure you are not Obaba Niu or Obaba Yang do not know ask old zhao");
        CreateIndex(writer, "Olympics", "the Olympic Games will come to the beautiful and enthusiastic South American country Brazil, that is, a place in the Amazon River basin");
        CreateIndex(writer, "write to yourself", "jeffwong of the blog Park, a new start, continue to work hard");

        writer.Optimize();
        writer.Close();
    }
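The article calls a CreateIndex(writer, title, content) helper without showing its body. A minimal sketch of what such a helper might look like follows, assuming the Lucene.Net 2.9 API implied by Version.LUCENE_29. The class name IndexHelpers and the field names "title" and "content" are assumptions for illustration, not taken from the original demo.

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class IndexHelpers
{
    // Hypothetical reconstruction: the original helper body is not shown in the article.
    // The field names "title" and "content" are assumptions.
    public static void CreateIndex(IndexWriter writer, string title, string content)
    {
        Document doc = new Document();

        // Store.YES keeps the raw value so it can be displayed in search results;
        // Index.ANALYZED runs the text through the writer's analyzer
        // (PanGuAnalyzer or StandardAnalyzer, depending on how the writer was built).
        doc.Add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.Add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED));

        writer.AddDocument(doc);
    }
}
```

Because the analyzer is attached to the IndexWriter, a helper like this does not need to know whether PanGu or the standard analyzer is in use.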

 

4. Simple search

Searching is just as easy: pass a PanGuAnalyzer instance as the analyzer argument of QueryParser:

 
    public static void PanGuQueryTest(Analyzer analyzer, string field, string keyword)
    {
        QueryParser parser = new QueryParser(Version.LUCENE_29, field, analyzer);
        // Segment the keyword with PanGu before handing it to the query parser
        string panguQueryWord = GetKeywordsSplitBySpace(keyword, new PanGuTokenizer());
        Query query = parser.Parse(panguQueryWord);
        ShowQueryExpression(analyzer, query, keyword);
        SearchToShow(query);
        Console.WriteLine();
    }

The GetKeywordsSplitBySpace function, which segments the keyword with PanGu, comes directly from the sample code provided by eaglet:

    public static string GetKeywordsSplitBySpace(string keywords, PanGuTokenizer ktTokenizer)
    {
        StringBuilder result = new StringBuilder();
        ICollection<WordInfo> words = ktTokenizer.SegmentToWordInfos(keywords);

        foreach (WordInfo word in words)
        {
            if (word == null)
            {
                continue;
            }

            // Emit "word^boost.0 ", where boost = 3^rank
            result.AppendFormat("{0}^{1}.0 ", word.Word, (int)Math.Pow(3, word.Rank));
        }

        return result.ToString().Trim();
    }
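To make the output of this helper concrete, here is a small stand-alone sketch that applies the same formatting (boost = 3 raised to the word's rank) without depending on PanGu. The RankedWord class is a hypothetical stand-in for PanGu's WordInfo, and the sample words and ranks are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class BoostQueryDemo
{
    // Hypothetical stand-in for PanGu's WordInfo: only the two members the helper reads.
    public sealed class RankedWord
    {
        public string Word { get; set; }
        public int Rank { get; set; }
    }

    // Same formatting as GetKeywordsSplitBySpace: "word^boost.0" joined by spaces.
    public static string BuildBoostedQuery(IEnumerable<RankedWord> words)
    {
        var result = new StringBuilder();
        foreach (RankedWord word in words)
        {
            if (word == null)
            {
                continue;
            }
            result.AppendFormat("{0}^{1}.0 ", word.Word, (int)Math.Pow(3, word.Rank));
        }
        return result.ToString().Trim();
    }

    public static void Main()
    {
        var words = new List<RankedWord>
        {
            new RankedWord { Word = "lucene", Rank = 1 },
            new RankedWord { Word = "search", Rank = 2 },
        };

        // Prints: lucene^3.0 search^9.0
        Console.WriteLine(BuildBoostedQuery(words));
    }
}
```

In Lucene query syntax, the ^ suffix is a term boost, so higher-ranked words carry exponentially more weight when QueryParser parses the resulting string.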

I am still reading the source code of PanGuTokenizer's SegmentToWordInfos method and hope to understand it more thoroughly.

Finally, this article is intended for beginners. If you have experience or suggestions to share, please do not hesitate to offer them.

 

Download Demo: lucenenetapp
