Integrating jieba.NET with Lucene.Net


First, a disclaimer: I am not very familiar with Lucene.Net. But search is an important application of word segmentation, so here we try to integrate the two, which may serve as a reference.

There are two existing projects that integrate Chinese word segmentation with Lucene.Net: Lucene.Net.Analysis.PanGu and Lucene.Net.Analysis.MMSeg. Referring to their code, we can implement the simplest possible integration with jieba.NET: jiebaForLuceneNet. A brief introduction follows.

1. JiebaTokenizer

The main integration point is a custom Tokenizer. To subclass Tokenizer, you must implement its abstract method IncrementToken, which is called to traverse the tokens produced from the text stream; this is exactly where the word segmentation component comes into play.

public override bool IncrementToken()
{
    ClearAttributes();
    position++;
    if (position < tokens.Count)
    {
        var token = tokens[position];
        termAtt.SetTermBuffer(token.Word);
        offsetAtt.SetOffset(token.StartIndex, token.EndIndex);
        typeAtt.Type = "Jieba";
        return true;
    }
    End();
    return false;
}

The two lines involving termAtt and offsetAtt need each token's word, start index, and end index, and these three values are exactly what the JiebaSegmenter.Tokenize method returns. So you only need to call it when initializing the JiebaTokenizer:

tokens = segmenter.Tokenize(text, TokenizerMode.Search).ToList();

This gets all the tokens produced by segmentation. The TokenizerMode.Search argument makes the Tokenize method return more comprehensive segmentation results. For example, 语言学家 ("linguist") yields four tokens: [语言, (0, 2)], [学家, (2, 4)], [语言学, (0, 3)], [语言学家, (0, 4)], which is helpful both in index creation and in search.
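Putting the pieces of this section together, a JiebaTokenizer skeleton might look as follows. This is a sketch based on the Lucene.Net 3.0 Tokenizer and attribute API; the field names are my own assumptions, not necessarily those of the actual jiebaForLuceneNet source.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using JiebaNet.Segmenter;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Tokenattributes;

// Sketch of a JiebaTokenizer: segments the whole input up front with
// jieba.NET, then replays the tokens one by one in IncrementToken.
public class JiebaTokenizer : Tokenizer
{
    private readonly JiebaSegmenter segmenter;
    private readonly List<JiebaNet.Segmenter.Token> tokens;
    private int position = -1;

    private readonly ITermAttribute termAtt;
    private readonly IOffsetAttribute offsetAtt;
    private readonly ITypeAttribute typeAtt;

    public JiebaTokenizer(JiebaSegmenter seg, TextReader input) : base(input)
    {
        segmenter = seg;
        termAtt = AddAttribute<ITermAttribute>();
        offsetAtt = AddAttribute<IOffsetAttribute>();
        typeAtt = AddAttribute<ITypeAttribute>();

        // Read the whole field text and tokenize it in Search mode.
        var text = input.ReadToEnd();
        tokens = segmenter.Tokenize(text, TokenizerMode.Search).ToList();
    }

    // IncrementToken is implemented as shown above.
}
```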

2. JiebaAnalyzer

The Tokenizer class implements word segmentation, but indexing and searching require an Analyzer. JiebaAnalyzer only needs to call JiebaTokenizer:

public override TokenStream TokenStream(string fieldName, TextReader reader)
{
    var seg = new JiebaSegmenter();
    TokenStream result = new JiebaTokenizer(seg, reader);
    // This filter is necessary, because the parser converts the queries to lower case.
    result = new LowerCaseFilter(result);
    result = new StopFilter(true, result, StopWords);
    return result;
}

Besides JiebaTokenizer, JiebaAnalyzer also uses LowerCaseFilter and StopFilter. The former normalizes the indexed and searched content so that matching is case-insensitive; the latter filters out stop words. The stop-word list used here combines NLTK's English stop words with the Chinese stop-word list from Harbin Institute of Technology.
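For illustration, the StopWords set passed to StopFilter above could be loaded from a plain-text file with one word per line. This helper is a hypothetical sketch (jiebaForLuceneNet ships its own stop-word file); the point is only the shape of the data:

```csharp
using System.Collections.Generic;
using System.IO;

public static class StopWordLoader
{
    // Load a stop-word list: one word per line, blank lines ignored.
    // The merged NLTK + HIT list mentioned above could be stored this way.
    public static HashSet<string> LoadStopWords(string path)
    {
        var words = new HashSet<string>();
        foreach (var line in File.ReadAllLines(path))
        {
            var word = line.Trim();
            if (word.Length > 0)
            {
                words.Add(word);
            }
        }
        return words;
    }
}
```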

3. Create indexes and search

When creating the index, IndexWriter uses a JiebaAnalyzer instance:

var analyzer = new JiebaAnalyzer();
using (var writer = new IndexWriter(Directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
{
    // Replaces the older entry, if any.
    foreach (var sd in data)
    {
        AddToLuceneIndex(sd, writer);
    }
    analyzer.Close();
}
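AddToLuceneIndex is not shown in the original post. A typical implementation, based on the standard Lucene.Net 3.0 Document/Field API, deletes any stale entry with the same Id and then adds a fresh Document; the News type with Id/Title/Content properties is an assumption from context:

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;

// Sketch: index one News item, replacing any previous entry with the same Id.
private static void AddToLuceneIndex(News sd, IndexWriter writer)
{
    // Remove the older entry, if any, so the index keeps one document per Id.
    writer.DeleteDocuments(new Term("Id", sd.Id.ToString()));

    var doc = new Document();
    // Id is stored but not analyzed, so it can be matched exactly.
    doc.Add(new Field("Id", sd.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    // Title and Content are analyzed, i.e. run through JiebaAnalyzer on indexing.
    doc.Add(new Field("Title", sd.Title, Field.Store.YES, Field.Index.ANALYZED));
    doc.Add(new Field("Content", sd.Content, Field.Store.YES, Field.Index.ANALYZED));
    writer.AddDocument(doc);
}
```

The field names Id, Title, and Content match the MultiFieldQueryParser call in the search code below.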

When searching, we first segment the user's input into words:

private static string GetKeyWordsSplitBySpace(string keywords, JiebaTokenizer tokenizer)
{
    var result = new StringBuilder();
    var words = tokenizer.Tokenize(keywords);
    foreach (var word in words)
    {
        if (string.IsNullOrWhiteSpace(word.Word))
        {
            continue;
        }
        result.AppendFormat("{0} ", word.Word);
    }
    return result.ToString().Trim();
}

For example, if the user inputs 语言学 ("linguistics"), the return value of this function is "语言 语言学", which prepares the keywords for the subsequent search. (In addition, we could append a * to each word, so that a document can be found as long as part of the word matches.) The final search implementation is:

private static IEnumerable<News> SearchQuery(string searchQuery, string searchField = "")
{
    if (string.IsNullOrEmpty(searchQuery.Replace("*", "").Replace("?", "")))
    {
        return new List<News>();
    }

    using (var searcher = new IndexSearcher(Directory, false))
    {
        var hitsLimit = 1000;
        //var analyzer = new StandardAnalyzer(Version.LUCENE_30);
        var analyzer = GetAnalyzer();

        if (!string.IsNullOrEmpty(searchField))
        {
            var parser = new QueryParser(Version.LUCENE_30, searchField, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, hitsLimit).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Dispose();
            return results;
        }
        else
        {
            var parser = new MultiFieldQueryParser(Version.LUCENE_30, new[] { "Id", "Title", "Content" }, analyzer);
            var query = ParseQuery(searchQuery, parser);
            var hits = searcher.Search(query, null, hitsLimit, Sort.RELEVANCE).ScoreDocs;
            var results = MapLuceneToDataList(hits, searcher);
            analyzer.Close();
            return results;
        }
    }
}
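Returning to the wildcard idea mentioned earlier (appending a * to each word), it can be sketched as a small helper; the method name is hypothetical:

```csharp
using System;
using System.Linq;

public static class QueryHelper
{
    // Append a trailing '*' to every space-separated keyword so the
    // query parser performs prefix matching on each token.
    public static string AddWildcards(string spaceSeparatedKeywords)
    {
        var parts = spaceSeparatedKeywords
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => word + "*");
        return string.Join(" ", parts);
    }
}
```

Applied to the output of GetKeyWordsSplitBySpace, "语言 语言学" would become "语言* 语言学*" before being handed to the query parser.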

The searchField parameter can restrict the search to a specific field; if it is empty, all fields are searched. With that, the most basic integration is in place.

The implementations of JiebaTokenizer and JiebaAnalyzer, along with sample code, can be found in jiebaForLuceneNet.

4. Luke.Net

Luke.Net can be used to inspect the index content generated by Lucene.Net, which is particularly helpful when developing and debugging with Lucene.

References:

Lucene.Net ultra fast search for MVC or WebForms site

Lucene.Net - Custom Synonym Analyzer

https://github.com/JimLiu/Lucene.Net.Analysis.PanGu

http://pangusegment.codeplex.com/wikipage?title=PanGu4Lucene

http://luke.codeplex.com/releases/view/82033
