[Repost] Integrating jieba.NET with Lucene.Net
First, a disclaimer: I am not very familiar with Lucene.Net. Search is an important application of word segmentation, though, so here we try to integrate the two; it may serve as a reference for you.
There are already two projects that integrate Chinese word segmentation with Lucene.Net: Lucene.Net.Analysis.PanGu and Lucene.Net.Analysis.MMSeg. Referring to their code, we can implement the simplest possible integration: jiebaForLuceneNet. A brief introduction follows.
1. JiebaTokenizer
The main integration point is to subclass Tokenizer and implement its abstract method IncrementToken, which iterates over the tokens produced from the text stream. This is where the word segmentation component comes into play.
public override bool IncrementToken()
{
    ClearAttributes();
    position++;
    if (position < tokens.Count)
    {
        // Copy the current token's word and offsets into the stream attributes.
        var token = tokens[position];
        termAtt.SetTermBuffer(token.Word);
        offsetAtt.SetOffset(token.StartIndex, token.EndIndex);
        typeAtt.Type = "Jieba";
        return true;
    }

    // No tokens left: signal the end of the stream.
    End();
    return false;
}
The two lines involving termAtt and offsetAtt need each token's word, start index, and end index, and these three values are exactly what the JiebaSegmenter.Tokenize method returns. So when initializing JiebaTokenizer, you only need:
tokens = segmenter.Tokenize(text, TokenizerMode.Search).ToList();
This yields all the tokens produced by segmentation. The TokenizerMode.Search parameter makes the Tokenize method return more comprehensive segmentation results. For example, "语言学家" (linguist) yields four tokens, namely [语言, (0, 2)], [学家, (2, 4)], [语言学, (0, 3)], and [语言学家, (0, 4)], which is helpful for both index creation and searching.
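Putting the pieces together, a minimal sketch of the whole JiebaTokenizer might look like the following. The real implementation is in jiebaForLuceneNet; the attribute interfaces assume Lucene.Net 3.0.x, and the field names and the read-everything-up-front approach are illustrative.

// Sketch; requires JiebaNet.Segmenter, Lucene.Net.Analysis,
// Lucene.Net.Analysis.Tokenattributes, System.IO, System.Linq.
public class JiebaTokenizer : Tokenizer
{
    private readonly ITermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;
    private readonly ITypeAttribute typeAtt;

    private readonly List<JiebaNet.Segmenter.Token> tokens;
    private int position = -1;

    public JiebaTokenizer(JiebaSegmenter segmenter, TextReader input) : base(input)
    {
        termAtt = AddAttribute<ITermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
        typeAtt = AddAttribute<ITypeAttribute>();

        // Read the whole input and tokenize it up front; IncrementToken
        // then walks through this list.
        var text = input.ReadToEnd();
        tokens = segmenter.Tokenize(text, TokenizerMode.Search).ToList();
    }

    public override bool IncrementToken()
    {
        // As shown above.
        ClearAttributes();
        position++;
        if (position < tokens.Count)
        {
            var token = tokens[position];
            termAtt.SetTermBuffer(token.Word);
            offsetAtt.SetOffset(token.StartIndex, token.EndIndex);
            typeAtt.Type = "Jieba";
            return true;
        }
        End();
        return false;
    }
}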
2. JiebaAnalyzer
The Tokenizer class handles the word segmentation itself, but indexing and searching require an Analyzer. JiebaAnalyzer only needs to call JiebaTokenizer:
public override TokenStream TokenStream(string fieldName, TextReader reader)
{
    var seg = new JiebaSegmenter();
    TokenStream result = new JiebaTokenizer(seg, reader);
    // This filter is necessary, because the parser converts the queries to lower case.
    result = new LowerCaseFilter(result);
    result = new StopFilter(true, result, StopWords);
    return result;
}
In addition to JiebaTokenizer, JiebaAnalyzer also uses LowerCaseFilter and StopFilter. The former normalizes the indexed and searched content so that matching is case-insensitive; the latter filters out stopwords. The stopword list used here combines NLTK's English stopwords with the Harbin Institute of Technology (HIT) Chinese stopword list.
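The snippet above does not show how StopWords is populated. One simple possibility is to load a merged file with one stopword per line; this is purely an illustrative sketch, and the file name is an assumption rather than the project's actual resource:

// Illustrative only: loads a merged stopword file (one word per line,
// NLTK English + HIT Chinese). Requires System.Collections.Generic,
// System.IO and System.Linq.
public static class StopWordLoader
{
    public static ISet<string> Load(string path)
    {
        return new HashSet<string>(
            File.ReadAllLines(path)
                .Select(line => line.Trim())
                .Where(line => line.Length > 0));
    }
}

// Inside JiebaAnalyzer, for example:
// private static readonly ISet<string> StopWords = StopWordLoader.Load("Resources/stopwords.txt");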
3. Create indexes and search
When creating an index, the IndexWriter uses a JiebaAnalyzer instance:
var analyzer = new JiebaAnalyzer();
using (var writer = new IndexWriter(Directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // Replaces the older entry if one exists.
    foreach (var sd in data)
    {
        AddToLuceneIndex(sd, writer);
    }
    analyzer.Close();
}
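AddToLuceneIndex is not shown above. A possible sketch, assuming a News type with Id, Title, and Content properties (the field names match those used when searching below; modeled on the CodeProject article in the references):

// Hypothetical sketch; requires Lucene.Net.Documents, Lucene.Net.Index
// and Lucene.Net.Search. The News properties are assumptions.
private static void AddToLuceneIndex(News sd, IndexWriter writer)
{
    // Delete any existing document with the same Id, so that the new
    // version replaces the older entry.
    writer.DeleteDocuments(new TermQuery(new Term("Id", sd.Id.ToString())));

    var doc = new Document();
    doc.Add(new Field("Id", sd.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("Title", sd.Title, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("Content", sd.Content, Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}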
When searching, we first split the user's input into words:
private static string GetKeyWordsSplitBySpace(string keywords, JiebaTokenizer tokenizer)
{
    var result = new StringBuilder();
    var words = tokenizer.Tokenize(keywords);
    foreach (var word in words)
    {
        if (string.IsNullOrWhiteSpace(word.Word))
        {
            continue;
        }
        result.AppendFormat("{0} ", word.Word);
    }
    return result.ToString().Trim();
}
For example, if the user inputs "语言学家", this function returns "语言 学家 语言学 语言学家", which prepares for the subsequent search. (We could also append a * to each word so that documents match even on partial words; see the sketch below.)
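A hypothetical variant with wildcards might look like this (the method name is illustrative; Tokenize is the same helper used in GetKeyWordsSplitBySpace):

// Hypothetical variant: appends * to every token so that prefix
// matches are also found. Requires System.Linq.
private static string GetKeyWordsWithWildcard(string keywords, JiebaTokenizer tokenizer)
{
    var words = tokenizer.Tokenize(keywords)
        .Where(w => !string.IsNullOrWhiteSpace(w.Word))
        .Select(w => w.Word + "*");
    return string.Join(" ", words);
}

The final search implementation is: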
private static IEnumerable<News> SearchQuery(string searchQuery, string searchField = "")
{
    if (string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
    {
        return new List<News>();
    }

    using (var searcher = new IndexSearcher(Directory, false))
    {
        var hitsLimit = 1000;
        //var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var analyzer = GetAnalyzer();

        if (!string.IsNullOrEmpty(searchField))
        {
            var parser = new QueryParser(Version.LUCENE_30, searchField, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, hitsLimit).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Dispose();
            return results;
        }
        else
        {
            var parser = new MultiFieldQueryParser(Version.LUCENE_30, new[] { "Id", "Title", "Content" }, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, null, hitsLimit, Sort.RELEVANCE).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Close();
            return results;
        }
    }
}
The searchField parameter can specify a particular field to search; if it is empty, all fields are searched. With that, the most basic integration is in place.
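SearchQuery references two helpers that are not shown here. A sketch of what they might look like, modeled on the CodeProject article listed in the references (the News properties are assumptions):

private static Query ParseQuery(string searchQuery, QueryParser parser)
{
    try
    {
        return parser.Parse(searchQuery.Trim());
    }
    catch (ParseException)
    {
        // Fall back to a literal search when the raw input is not valid query syntax.
        return parser.Parse(QueryParser.Escape(searchQuery.Trim()));
    }
}

private static IEnumerable<News> MapLuceneToDataList(ScoreDoc[] hits, IndexSearcher searcher)
{
    // Convert each hit back into a News object by reading its stored fields.
    return hits.Select(hit =>
    {
        var doc = searcher.Doc(hit.Doc);
        return new News
        {
            Id = Convert.ToInt32(doc.Get("Id")),
            Title = doc.Get("Title"),
            Content = doc.Get("Content")
        };
    }).ToList();
}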
The implementation and sample code of JiebaTokenizer and JiebaAnalyzer can be found in jiebaForLuceneNet.
4. Luke.Net
Luke.Net lets you inspect the index contents generated by Lucene.Net, which is particularly helpful when developing and debugging with Lucene.
References:
Lucene.Net ultra fast search for MVC or WebForms site
Lucene.Net - Custom Synonym Analyzer
https://github.com/JimLiu/Lucene.Net.Analysis.PanGu
http://pangusegment.codeplex.com/wikipage?title=PanGu4Lucene
http://luke.codeplex.com/releases/view/82033