First of all: I am not very familiar with Lucene.Net, but search is one of the most important applications of word segmentation, so it is still worth trying to integrate the two here; perhaps it will serve as a useful reference.
See two projects that integrate Chinese word segmenters with Lucene.Net: Lucene.Net.Analysis.PanGu and Lucene.Net.Analysis.MMSeg. For the simplest possible integration, refer to the code in jiebaForLuceneNet. A brief introduction follows.
1. JiebaTokenizer
The main integration point is to write a custom subclass of Tokenizer, which must implement the abstract method IncrementToken. This method is used to walk through the tokens that the text in the stream produces, and it is exactly where the segmenter comes into play.
public override bool IncrementToken()
{
    ClearAttributes();
    position++;
    if (position < tokens.Count)
    {
        var token = tokens[position];
        termAtt.SetTermBuffer(token.Word);
        offsetAtt.SetOffset(token.StartIndex, token.EndIndex);
        typeAtt.Type = "Jieba";
        return true;
    }

    End();
    return false;
}
The two lines for termAtt and offsetAtt need each token's word, start index, and end index, and these three values are exactly what the JiebaSegmenter.Tokenize method provides. So the JiebaTokenizer only needs to be initialized with:
tokens = segmenter.Tokenize(text, TokenizerMode.Search).ToList();
to obtain all the tokens produced by segmentation. The TokenizerMode.Search parameter makes the Tokenize method return a more comprehensive set of segments. For example, the word for "linguist" (语言学家) yields four tokens: [语言 (language), (0, 2)], [学家 (scholar), (2, 4)], [语言学 (linguistics), (0, 3)], and [语言学家 (linguist), (0, 4)], which is helpful both when building the index and when searching.
2. JiebaAnalyzer
The Tokenizer class implements the segmentation itself; indexing and searching additionally require an Analyzer. JiebaAnalyzer only needs to call JiebaTokenizer:
public override TokenStream TokenStream(string fieldName, TextReader reader)
{
    var seg = new JiebaSegmenter();
    TokenStream result = new JiebaTokenizer(seg, reader);
    // This filter is necessary, because the parser converts the queries to lower case.
    result = new LowerCaseFilter(result);
    result = new StopFilter(true, result, StopWords);
    return result;
}
Besides JiebaTokenizer, JiebaAnalyzer also uses LowerCaseFilter and StopFilter. The former normalizes the indexed and searched content so that case is ignored, while the latter filters out stop words. The stop word list used here merges NLTK's English stop words with HIT's Chinese stop words.
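As a minimal sketch of how such a merged stop word set might be built (the class name, file layout, and one-word-per-line format are assumptions; jiebaForLuceneNet may load its lists differently):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class StopWordLoader
{
    // Merges stop words (one per line) from several files into a single
    // case-insensitive set, skipping blank lines.
    public static ISet<string> Load(params string[] paths)
    {
        var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (var path in paths)
        {
            foreach (var line in File.ReadLines(path))
            {
                var word = line.Trim();
                if (word.Length > 0)
                {
                    set.Add(word);
                }
            }
        }
        return set;
    }
}
```

The resulting set can then be handed to StopFilter, e.g. `StopWordLoader.Load("english_stopwords.txt", "chinese_stopwords.txt")`.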
3. Creating the index and searching
When creating the index, the IndexWriter uses an instance of JiebaAnalyzer:
var analyzer = new JiebaAnalyzer();
using (var writer = new IndexWriter(Directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // replaces older entry if any
    foreach (var sd in data)
    {
        AddToLuceneIndex(sd, writer);
    }
    analyzer.Close();
}
When searching, the user's input is first segmented:
private static string GetKeywordsSplitBySpace(string keywords, JiebaTokenizer tokenizer)
{
    var result = new StringBuilder();
    var words = tokenizer.Tokenize(keywords);
    foreach (var word in words)
    {
        if (string.IsNullOrWhiteSpace(word.Word))
        {
            continue;
        }
        result.AppendFormat("{0} ", word.Word);
    }
    return result.ToString().Trim();
}
For example, if the user enters the word for "linguist", the function returns the segments for "linguist" and "linguistics" joined by spaces, which prepares the input for the subsequent search. (In addition, we can append a * to each word so that results are found even on a partial match.) The final search implementation is:
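The wildcard variant mentioned above can be sketched as follows; the helper name and the trailing-* convention are illustrative, not part of jiebaForLuceneNet, and the segmentation step is assumed to have already produced the word list:

```csharp
using System;
using System.Linq;

static class KeywordHelper
{
    // Joins the non-empty segments with spaces and appends a trailing *
    // to each, so the query parser treats every segment as a prefix query.
    public static string ToPrefixQuery(string[] segments)
    {
        var words = segments
            .Where(w => !string.IsNullOrWhiteSpace(w))
            .Select(w => w.Trim() + "*");
        return string.Join(" ", words);
    }
}
```

For instance, `KeywordHelper.ToPrefixQuery(new[] { "语言学", "语言学家" })` produces `"语言学* 语言学家*"`, which Lucene's QueryParser interprets as two prefix queries.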
private static IEnumerable<News> SearchQuery(string searchQuery, string searchField = "")
{
    if (string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
    {
        return new List<News>();
    }

    using (var searcher = new IndexSearcher(Directory, false))
    {
        var hitsLimit = 1000;
        //var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var analyzer = GetAnalyzer();

        if (!string.IsNullOrEmpty(searchField))
        {
            var parser = new QueryParser(Version.LUCENE_30, searchField, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, hitsLimit).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Dispose();
            return results;
        }
        else
        {
            var parser = new MultiFieldQueryParser(Version.LUCENE_30,
                new[] { "Id", "Title", "Content" }, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, null, hitsLimit, Sort.RELEVANCE).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Close();
            return results;
        }
    }
}
The searchField parameter lets you restrict the search to a specific field; if it is empty, all fields are searched. With this, the most basic integration is in place.
The implementations of JiebaTokenizer and JiebaAnalyzer, along with sample code, can be found in jiebaForLuceneNet.
4. Luke.Net
Luke.Net can inspect the index content generated by Lucene.Net, which is especially helpful when developing and debugging against Lucene.
Reference:
Lucene.Net ultra fast search for MVC or WebForms site
Lucene.Net – Custom Synonym Analyzer
https://github.com/JimLiu/Lucene.Net.Analysis.PanGu
http://pangusegment.codeplex.com/wikipage?title=PanGu4Lucene
http://luke.codeplex.com/releases/view/82033
(Reposted) jieba.NET and Lucene.Net integration