"Turn" Jieba. NET and Lucene.Net integration

Source: Internet
Author: User

A note first: I am not very familiar with Lucene.Net, but search is an important application of word segmentation, so I will try to integrate the two here anyway; perhaps it can serve as a reference for you.

I looked at two projects that integrate Chinese word segmenters with Lucene.Net, Lucene.Net.Analysis.PanGu and Lucene.Net.Analysis.MMSeg, and referred to their code for the simplest possible integration: jiebaForLuceneNet. A brief introduction follows.

1. JiebaTokenizer

The main integration work is to write a custom subclass of Tokenizer, which requires implementing its abstract method IncrementToken. This method traverses the tokens produced from the text in the text stream, and it is where word segmentation comes into play.

public override bool IncrementToken()
{
    ClearAttributes();
    position++;
    if (position < tokens.Count)
    {
        var token = tokens[position];
        // Expose the current token's text, offsets and type to Lucene.
        termAtt.SetTermBuffer(token.Word);
        offsetAtt.SetOffset(token.StartIndex, token.EndIndex);
        typeAtt.Type = "Jieba";
        return true;
    }

    End();
    return false;
}

The two lines that set termAtt and offsetAtt need each token's word, its start index and its end index, and these three values are exactly what the JiebaSegmenter.Tokenize method provides. So JiebaTokenizer only needs to be initialized with:

tokens = segmenter.Tokenize(text, TokenizerMode.Search).ToList();

This gives all the tokens of the segmentation result. The TokenizerMode.Search argument makes the Tokenize method return more comprehensive segmentation results; for example, "语言学家" (linguist) yields four tokens: [语言 (language) (0, 2)], [学家 (scholar) (2, 4)], [语言学 (linguistics) (0, 3)] and [语言学家 (linguist) (0, 4)], which is helpful both when creating the index and when searching.
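To see these tokens concretely, here is a minimal console sketch (assuming the JiebaSegmenter and Token types from the jieba.NET package, namespace JiebaNet.Segmenter) that prints the search-mode result:

using System;
using JiebaNet.Segmenter;

public static class TokenizeDemo
{
    public static void Main()
    {
        // Sketch only: print every token that search mode produces for a query,
        // together with its start/end character offsets.
        var segmenter = new JiebaSegmenter();
        foreach (var token in segmenter.Tokenize("语言学家", TokenizerMode.Search))
        {
            Console.WriteLine("[{0} ({1}, {2})]", token.Word, token.StartIndex, token.EndIndex);
        }
    }
}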

2. JiebaAnalyzer

The Tokenizer class takes care of segmentation, while indexing and searching require an Analyzer. JiebaAnalyzer only needs to call JiebaTokenizer:

public override TokenStream TokenStream(string fieldName, TextReader reader)
{
    var seg = new JiebaSegmenter();
    TokenStream result = new JiebaTokenizer(seg, reader);
    // This filter is necessary, because the parser converts the queries to lower case.
    result = new LowerCaseFilter(result);
    result = new StopFilter(true, result, StopWords);
    return result;
}

Besides JiebaTokenizer, JiebaAnalyzer also uses LowerCaseFilter and StopFilter. The former normalizes the indexed and searched content so that case is ignored, while the latter filters out stop words. The stop word list used here merges NLTK's English stop words with HIT's (Harbin Institute of Technology) Chinese stop word list.
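The post does not show how the StopWords set passed to StopFilter is built. A minimal sketch, assuming the merged English/Chinese stop words ship as a one-word-per-line text file (the path Resources/stopwords.txt is a placeholder) and that System.Collections.Generic and System.IO are imported:

// Sketch only: load the merged NLTK + HIT stop word list into the set used by StopFilter.
private static readonly ISet<string> StopWords = LoadStopWords("Resources/stopwords.txt");

private static ISet<string> LoadStopWords(string path)
{
    var words = new HashSet<string>();
    foreach (var line in File.ReadAllLines(path))
    {
        var word = line.Trim();
        if (!string.IsNullOrEmpty(word))
        {
            // Lower-case the entries so they match the tokens coming out of LowerCaseFilter.
            words.Add(word.ToLowerInvariant());
        }
    }
    return words;
}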

3. Create indexes and search

When creating the index, IndexWriter uses an instance of JiebaAnalyzer:

var analyzer = new JiebaAnalyzer();
using (var writer = new IndexWriter(Directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // replaces older entry if any
    foreach (var sd in data)
    {
        AddToLuceneIndex(sd, writer);
    }
    analyzer.Close();
}
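The AddToLuceneIndex helper is not shown in the post. A sketch of what it might look like, assuming a News type with the same Id, Title and Content fields that SearchQuery below queries, following the common Lucene.Net 3.0 pattern:

// Sketch only: map one News object (hypothetical Id/Title/Content shape) to a Lucene document.
private static void AddToLuceneIndex(News news, IndexWriter writer)
{
    // Remove any previously indexed version of this entry so re-indexing replaces it.
    writer.DeleteDocuments(new TermQuery(new Term("Id", news.Id.ToString())));

    var doc = new Document();
    doc.Add(new Field("Id", news.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("Title", news.Title, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("Content", news.Content, Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}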

When searching, the user's input is segmented first:

private static string GetKeyWordsSplitBySpace(string keywords, JiebaTokenizer tokenizer)
{
    var result = new StringBuilder();
    var words = tokenizer.Tokenize(keywords);
    foreach (var word in words)
    {
        if (string.IsNullOrWhiteSpace(word.Word))
        {
            continue;
        }
        result.AppendFormat("{0} ", word.Word);
    }
    return result.ToString().Trim();
}

For example, if the user enters "语言学家" (linguist), the function returns the space-separated tokens "语言 学家 语言学 语言学家", which prepares the input for the subsequent search (in addition, we could append a * to each word so that results are found even on partial matches). The final search implementation is:

private static IEnumerable<News> SearchQuery(string searchQuery, string searchField = "")
{
    if (string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
    {
        return new List<News>();
    }

    using (var searcher = new IndexSearcher(Directory, false))
    {
        var hitsLimit = 1000;
        //var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var analyzer = GetAnalyzer();

        if (!string.IsNullOrEmpty(searchField))
        {
            var parser = new QueryParser(Version.LUCENE_30, searchField, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, hitsLimit).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Dispose();
            return results;
        }
        else
        {
            var parser = new MultiFieldQueryParser(Version.LUCENE_30,
                new[] { "Id", "Title", "Content" }, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, null, hitsLimit, Sort.RELEVANCE).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Close();
            return results;
        }
    }
}

The searchField parameter allows a specific field to be searched; if it is empty, all fields are searched. With this, the most basic integration is in place.
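The ParseQuery helper called inside SearchQuery is not shown either; a minimal sketch of the usual pattern, which tries the raw query first and falls back to an escaped version when the syntax is invalid:

private static Query ParseQuery(string searchQuery, QueryParser parser)
{
    Query query;
    try
    {
        // Parse the user's query as-is (wildcards, phrases, etc. are allowed).
        query = parser.Parse(searchQuery.Trim());
    }
    catch (ParseException)
    {
        // Invalid query syntax: escape the special characters and parse again.
        query = parser.Parse(QueryParser.Escape(searchQuery.Trim()));
    }
    return query;
}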

The implementations of JiebaTokenizer and JiebaAnalyzer, along with the sample code, can be found in jiebaForLuceneNet.
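To tie the pieces together, here is a hedged usage sketch; it assumes JiebaTokenizer can be constructed over an empty reader purely so that its Tokenize method can be reused for query-time segmentation:

// Sketch only: segment the raw user input, then run the segmented query against the index.
var tokenizer = new JiebaTokenizer(new JiebaSegmenter(), new StringReader(string.Empty));
var splitKeywords = GetKeyWordsSplitBySpace("语言学家", tokenizer);
var newsResults = SearchQuery(splitKeywords, "Content");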

4. Luke.Net

Luke.Net can be used to inspect the index content generated by Lucene.Net, which is especially helpful when developing and debugging Lucene applications.

Reference:

Lucene.Net Ultra Fast Search for MVC or WebForms site

Lucene.Net - Custom Synonym Analyzer

https://github.com/JimLiu/Lucene.Net.Analysis.PanGu

http://pangusegment.codeplex.com/wikipage?title=PanGu4Lucene

http://luke.codeplex.com/releases/view/82033

"Turn" Jieba. NET and Lucene.Net integration

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.