[Repost] Integrating jieba.NET with Lucene.Net
First, a disclaimer: I am not very familiar with Lucene.Net. Search is an important application of word segmentation, though, so here we try to integrate the two; it may serve as a reference for you.
There are already two projects that integrate Chinese word segmentation with Lucene.Net: Lucene.Net.Analysis.PanGu and Lucene.Net.Analysis.MMSeg. Referring to their code, we can implement the simplest possible integration: jiebaForLuceneNet. A brief introduction follows.
1. JiebaTokenizer
The main integration point is to subclass Tokenizer and implement its abstract method IncrementToken, which iterates over the tokens produced from the text stream. This is where the word segmentation component comes into play.
public override bool IncrementToken()
{
    ClearAttributes();
    position++;
    if (position < tokens.Count)
    {
        // Copy the current token's word and offsets into the stream attributes.
        var token = tokens[position];
        termAtt.SetTermBuffer(token.Word);
        offsetAtt.SetOffset(token.StartIndex, token.EndIndex);
        typeAtt.Type = "Jieba";
        return true;
    }

    // No tokens left: signal the end of the stream.
    End();
    return false;
}
The two lines involving termAtt and offsetAtt need each token's word, start index, and end index, and these three values are exactly what the JiebaSegmenter.Tokenize method returns. So when initializing JiebaTokenizer, you only need:
tokens = segmenter.Tokenize(text, TokenizerMode.Search).ToList();
This yields all the tokens produced by segmentation. The TokenizerMode.Search parameter makes the Tokenize method return more comprehensive segmentation results. For example, "语言学家" (linguist) yields four tokens, namely [语言, (0, 2)], [学家, (2, 4)], [语言学, (0, 3)], and [语言学家, (0, 4)], which is helpful for both index creation and searching.
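Putting the pieces together, a minimal sketch of the whole JiebaTokenizer might look like the following. The real implementation is in jiebaForLuceneNet; the attribute interfaces assume Lucene.Net 3.0.x, and the field names and the read-everything-up-front approach are illustrative.

// Sketch; requires JiebaNet.Segmenter, Lucene.Net.Analysis,
// Lucene.Net.Analysis.Tokenattributes, System.IO, System.Linq.
public class JiebaTokenizer : Tokenizer
{
    private readonly ITermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;
    private readonly ITypeAttribute typeAtt;

    private readonly List<JiebaNet.Segmenter.Token> tokens;
    private int position = -1;

    public JiebaTokenizer(JiebaSegmenter segmenter, TextReader input) : base(input)
    {
        termAtt = AddAttribute<ITermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
        typeAtt = AddAttribute<ITypeAttribute>();

        // Read the whole input and tokenize it up front; IncrementToken
        // then walks through this list.
        var text = input.ReadToEnd();
        tokens = segmenter.Tokenize(text, TokenizerMode.Search).ToList();
    }

    public override bool IncrementToken()
    {
        // As shown above.
        ClearAttributes();
        position++;
        if (position < tokens.Count)
        {
            var token = tokens[position];
            termAtt.SetTermBuffer(token.Word);
            offsetAtt.SetOffset(token.StartIndex, token.EndIndex);
            typeAtt.Type = "Jieba";
            return true;
        }
        End();
        return false;
    }
}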
2. JiebaAnalyzer
The Tokenizer class handles the word segmentation itself, but indexing and searching require an Analyzer. JiebaAnalyzer only needs to call JiebaTokenizer:
public override TokenStream TokenStream(string fieldName, TextReader reader)
{
    var seg = new JiebaSegmenter();
    TokenStream result = new JiebaTokenizer(seg, reader);
    // This filter is necessary, because the parser converts the queries to lower case.
    result = new LowerCaseFilter(result);
    result = new StopFilter(true, result, StopWords);
    return result;
}
In addition to JiebaTokenizer, JiebaAnalyzer also uses LowerCaseFilter and StopFilter. The former normalizes the indexed and searched content so that matching is case-insensitive; the latter filters out stopwords. The stopword list used here combines NLTK's English stopwords with the Harbin Institute of Technology (HIT) Chinese stopword list.
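The snippet above does not show how StopWords is populated. One simple possibility is to load a merged file with one stopword per line; this is purely an illustrative sketch, and the file name is an assumption rather than the project's actual resource:

// Illustrative only: loads a merged stopword file (one word per line,
// NLTK English + HIT Chinese). Requires System.Collections.Generic,
// System.IO and System.Linq.
public static class StopWordLoader
{
    public static ISet<string> Load(string path)
    {
        return new HashSet<string>(
            File.ReadAllLines(path)
                .Select(line => line.Trim())
                .Where(line => line.Length > 0));
    }
}

// Inside JiebaAnalyzer, for example:
// private static readonly ISet<string> StopWords = StopWordLoader.Load("Resources/stopwords.txt");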
3. Create indexes and search
When creating an index, the IndexWriter uses a JiebaAnalyzer instance:
var analyzer = new JiebaAnalyzer();
using (var writer = new IndexWriter(Directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // Replaces the older entry if one exists.
    foreach (var sd in data)
    {
        AddToLuceneIndex(sd, writer);
    }
    analyzer.Close();
}
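AddToLuceneIndex is not shown above. A possible sketch, assuming a News type with Id, Title, and Content properties (the field names match those used when searching below; modeled on the CodeProject article in the references):

// Hypothetical sketch; requires Lucene.Net.Documents, Lucene.Net.Index
// and Lucene.Net.Search. The News properties are assumptions.
private static void AddToLuceneIndex(News sd, IndexWriter writer)
{
    // Delete any existing document with the same Id, so that the new
    // version replaces the older entry.
    writer.DeleteDocuments(new TermQuery(new Term("Id", sd.Id.ToString())));

    var doc = new Document();
    doc.Add(new Field("Id", sd.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.Add(new Field("Title", sd.Title, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("Content", sd.Content, Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}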
When searching, we first split the user's input into words:
private static string GetKeyWordsSplitBySpace(string keywords, JiebaTokenizer tokenizer)
{
    var result = new StringBuilder();
    var words = tokenizer.Tokenize(keywords);
    foreach (var word in words)
    {
        if (string.IsNullOrWhiteSpace(word.Word))
        {
            continue;
        }
        result.AppendFormat("{0} ", word.Word);
    }
    return result.ToString().Trim();
}
For example, if the user inputs "语言学家", this function returns "语言 学家 语言学 语言学家", which prepares for the subsequent search. (We could also append a * to each word so that documents match even on partial words; see the sketch below.)
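A hypothetical variant with wildcards might look like this (the method name is illustrative; Tokenize is the same helper used in GetKeyWordsSplitBySpace):

// Hypothetical variant: appends * to every token so that prefix
// matches are also found. Requires System.Linq.
private static string GetKeyWordsWithWildcard(string keywords, JiebaTokenizer tokenizer)
{
    var words = tokenizer.Tokenize(keywords)
        .Where(w => !string.IsNullOrWhiteSpace(w.Word))
        .Select(w => w.Word + "*");
    return string.Join(" ", words);
}

The final search implementation is: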
private static IEnumerable<News> SearchQuery(string searchQuery, string searchField = "")
{
    if (string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
    {
        return new List<News>();
    }

    using (var searcher = new IndexSearcher(Directory, false))
    {
        var hitsLimit = 1000;
        //var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var analyzer = GetAnalyzer();

        if (!string.IsNullOrEmpty(searchField))
        {
            var parser = new QueryParser(Version.LUCENE_30, searchField, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, hitsLimit).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Dispose();
            return results;
        }
        else
        {
            var parser = new MultiFieldQueryParser(Version.LUCENE_30, new[] { "Id", "Title", "Content" }, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, null, hitsLimit, Sort.RELEVANCE).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Close();
            return results;
        }
    }
}
The searchField parameter can specify a particular field to search; if it is empty, all fields are searched. With that, the most basic integration is in place.
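SearchQuery references two helpers that are not shown here. A sketch of what they might look like, modeled on the CodeProject article listed in the references (the News properties are assumptions):

private static Query ParseQuery(string searchQuery, QueryParser parser)
{
    try
    {
        return parser.Parse(searchQuery.Trim());
    }
    catch (ParseException)
    {
        // Fall back to a literal search when the raw input is not valid query syntax.
        return parser.Parse(QueryParser.Escape(searchQuery.Trim()));
    }
}

private static IEnumerable<News> MapLuceneToDataList(ScoreDoc[] hits, IndexSearcher searcher)
{
    // Convert each hit back into a News object by reading its stored fields.
    return hits.Select(hit =>
    {
        var doc = searcher.Doc(hit.Doc);
        return new News
        {
            Id = Convert.ToInt32(doc.Get("Id")),
            Title = doc.Get("Title"),
            Content = doc.Get("Content")
        };
    }).ToList();
}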
The implementation and sample code of JiebaTokenizer and JiebaAnalyzer can be found in jiebaForLuceneNet.
4. Luke.Net
Luke.Net lets you inspect the index contents generated by Lucene.Net, which is particularly helpful when developing and debugging with Lucene.
References:
Lucene.Net ultra fast search for MVC or WebForms site
Lucene.Net - Custom Synonym Analyzer
https://github.com/JimLiu/Lucene.Net.Analysis.PanGu
http://pangusegment.codeplex.com/wikipage?title=PanGu4Lucene
http://luke.codeplex.com/releases/view/82033