"Turn" Jieba. NET and Lucene.Net integration

Source: Internet
Author: User

A note first: I am not very familiar with Lucene.Net, but search is an important application of word segmentation, so I will try to integrate the two here anyway; perhaps it can serve as a reference for you.

I looked at two projects that integrate Chinese word segmenters with Lucene.Net, Lucene.Net.Analysis.PanGu and Lucene.Net.Analysis.MMSeg, and referred to their code for the simplest possible integration: jiebaForLuceneNet. A brief introduction follows.

1. JiebaTokenizer

The main integration work is to write a custom subclass of Tokenizer, which requires implementing its abstract method IncrementToken. This method traverses the tokens produced from the text in the text stream, and it is where word segmentation comes into play.

public override bool IncrementToken()
{
    ClearAttributes();
    position++;
    if (position < tokens.Count)
    {
        var token = tokens[position];
        // Expose the current token's text, offsets and type to Lucene.
        termAtt.SetTermBuffer(token.Word);
        offsetAtt.SetOffset(token.StartIndex, token.EndIndex);
        typeAtt.Type = "Jieba";
        return true;
    }

    End();
    return false;
}

The two lines that set termAtt and offsetAtt need each token's word, its start index and its end index, and these three values are exactly what the JiebaSegmenter.Tokenize method provides. So JiebaTokenizer only needs to be initialized with:

tokens = segmenter.Tokenize(text, TokenizerMode.Search).ToList();

This gives all the tokens of the segmentation result. The TokenizerMode.Search argument makes the Tokenize method return more comprehensive segmentation results; for example, "语言学家" (linguist) yields four tokens: [语言 (language) (0, 2)], [学家 (scholar) (2, 4)], [语言学 (linguistics) (0, 3)] and [语言学家 (linguist) (0, 4)], which is helpful both when creating the index and when searching.
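To see these tokens concretely, here is a minimal console sketch (assuming the JiebaSegmenter and Token types from the jieba.NET package, namespace JiebaNet.Segmenter) that prints the search-mode result:

using System;
using JiebaNet.Segmenter;

public static class TokenizeDemo
{
    public static void Main()
    {
        // Sketch only: print every token that search mode produces for a query,
        // together with its start/end character offsets.
        var segmenter = new JiebaSegmenter();
        foreach (var token in segmenter.Tokenize("语言学家", TokenizerMode.Search))
        {
            Console.WriteLine("[{0} ({1}, {2})]", token.Word, token.StartIndex, token.EndIndex);
        }
    }
}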

2. JiebaAnalyzer

The Tokenizer class takes care of segmentation, while indexing and searching require an Analyzer. JiebaAnalyzer only needs to call JiebaTokenizer:

public override TokenStream TokenStream(string fieldName, TextReader reader)
{
    var seg = new JiebaSegmenter();
    TokenStream result = new JiebaTokenizer(seg, reader);
    // This filter is necessary, because the parser converts the queries to lower case.
    result = new LowerCaseFilter(result);
    result = new StopFilter(true, result, StopWords);
    return result;
}

Besides JiebaTokenizer, JiebaAnalyzer also uses LowerCaseFilter and StopFilter. The former normalizes the indexed and searched content so that case is ignored, while the latter filters out stop words. The stop word list used here merges NLTK's English stop words with HIT's (Harbin Institute of Technology) Chinese stop word list.
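The post does not show how the StopWords set passed to StopFilter is built. A minimal sketch, assuming the merged English/Chinese stop words ship as a one-word-per-line text file (the path Resources/stopwords.txt is a placeholder) and that System.Collections.Generic and System.IO are imported:

// Sketch only: load the merged NLTK + HIT stop word list into the set used by StopFilter.
private static readonly ISet<string> StopWords = LoadStopWords("Resources/stopwords.txt");

private static ISet<string> LoadStopWords(string path)
{
    var words = new HashSet<string>();
    foreach (var line in File.ReadAllLines(path))
    {
        var word = line.Trim();
        if (!string.IsNullOrEmpty(word))
        {
            // Lower-case the entries so they match the tokens coming out of LowerCaseFilter.
            words.Add(word.ToLowerInvariant());
        }
    }
    return words;
}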

3. Create indexes and search

When creating the index, IndexWriter uses an instance of JiebaAnalyzer:

var analyzer = new JiebaAnalyzer();
using (var writer = new IndexWriter(Directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // replaces older entry if any
    foreach (var sd in data)
    {
        AddToLuceneIndex(sd, writer);
    }
    analyzer.Close();
}
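The AddToLuceneIndex helper is not shown in the post. A sketch of what it might look like, assuming a News type with the same Id, Title and Content fields that SearchQuery below queries, following the common Lucene.Net 3.0 pattern:

// Sketch only: map one News object (hypothetical Id/Title/Content shape) to a Lucene document.
private static void AddToLuceneIndex(News news, IndexWriter writer)
{
    // Remove any previously indexed version of this entry so re-indexing replaces it.
    writer.DeleteDocuments(new TermQuery(new Term("Id", news.Id.ToString())));

    var doc = new Document();
    doc.Add(new Field("Id", news.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("Title", news.Title, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("Content", news.Content, Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}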

When searching, the user's input is segmented first:

private static string GetKeyWordsSplitBySpace(string keywords, JiebaTokenizer tokenizer)
{
    var result = new StringBuilder();
    var words = tokenizer.Tokenize(keywords);
    foreach (var word in words)
    {
        if (string.IsNullOrWhiteSpace(word.Word))
        {
            continue;
        }
        result.AppendFormat("{0} ", word.Word);
    }
    return result.ToString().Trim();
}

For example, if the user enters "语言学家" (linguist), the function returns the space-separated tokens "语言 学家 语言学 语言学家", which prepares the input for the subsequent search (in addition, we could append a * to each word so that results are found even on partial matches). The final search implementation is:

private static IEnumerable<News> SearchQuery(string searchQuery, string searchField = "")
{
    if (string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
    {
        return new List<News>();
    }

    using (var searcher = new IndexSearcher(Directory, false))
    {
        var hitsLimit = 1000;
        //var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var analyzer = GetAnalyzer();

        if (!string.IsNullOrEmpty(searchField))
        {
            var parser = new QueryParser(Version.LUCENE_30, searchField, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, hitsLimit).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Dispose();
            return results;
        }
        else
        {
            var parser = new MultiFieldQueryParser(Version.LUCENE_30,
                new[] { "Id", "Title", "Content" }, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, null, hitsLimit, Sort.RELEVANCE).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Close();
            return results;
        }
    }
}

The searchField parameter allows a specific field to be searched; if it is empty, all fields are searched. With this, the most basic integration is in place.
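The ParseQuery helper called inside SearchQuery is not shown either; a minimal sketch of the usual pattern, which tries the raw query first and falls back to an escaped version when the syntax is invalid:

private static Query ParseQuery(string searchQuery, QueryParser parser)
{
    Query query;
    try
    {
        // Parse the user's query as-is (wildcards, phrases, etc. are allowed).
        query = parser.Parse(searchQuery.Trim());
    }
    catch (ParseException)
    {
        // Invalid query syntax: escape the special characters and parse again.
        query = parser.Parse(QueryParser.Escape(searchQuery.Trim()));
    }
    return query;
}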

The implementations of JiebaTokenizer and JiebaAnalyzer, along with the sample code, can be found in jiebaForLuceneNet.
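To tie the pieces together, here is a hedged usage sketch; it assumes JiebaTokenizer can be constructed over an empty reader purely so that its Tokenize method can be reused for query-time segmentation:

// Sketch only: segment the raw user input, then run the segmented query against the index.
var tokenizer = new JiebaTokenizer(new JiebaSegmenter(), new StringReader(string.Empty));
var splitKeywords = GetKeyWordsSplitBySpace("语言学家", tokenizer);
var newsResults = SearchQuery(splitKeywords, "Content");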

4. Luke.Net

Luke.Net can be used to inspect the index content generated by Lucene.Net, which is especially helpful when developing and debugging Lucene applications.

Reference:

Lucene.Net Ultra Fast Search for MVC or WebForms site

Lucene.Net - Custom Synonym Analyzer

https://github.com/JimLiu/Lucene.Net.Analysis.PanGu

http://pangusegment.codeplex.com/wikipage?title=PanGu4Lucene

http://luke.codeplex.com/releases/view/82033

"Turn" Jieba. NET and Lucene.Net integration

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.