Lucene 4: Get Chinese word segmentation results and calculate a boost based on the text

Last Update:2018-07-24 Source: Internet

Author: User

1. Requirements

Environment:

Lucene 4.1 / IKAnalyzer FF version / mmseg4j 1.9

Features to implement:

1). Given an input text, get its Chinese word segmentation result;
2). Given an input text, score (weight) the text according to a rule, for example: the more often the text contains the specified keywords, the higher the score.

2. Implementation
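
Requirement 2 can be read as a frequency-based rule. The following is a minimal sketch of that reading in plain Java, with no Lucene dependency. The class name FrequencyBoostSketch, the base score of 1.0, the increment of 10 per occurrence, and the English token strings are assumptions for illustration only; in practice the token list would come from a Chinese analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of the article's code.
final class FrequencyBoostSketch {
    // Assumed constants: base score 1.0, plus 10 per keyword occurrence.
    static final float BASE = 1.0f;
    static final float PER_HIT = 10.0f;

    /** Scores a token list: every occurrence of a keyword adds PER_HIT. */
    static float boost(List<String> tokens, List<String> keywords) {
        float result = BASE;
        for (String keyword : keywords) {
            // Collections.frequency counts whole-token matches only.
            result += PER_HIT * Collections.frequency(tokens, keyword);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("indecent", "exempt");
        List<String> tokens = Arrays.asList("indecent", "video", "exempt", "indecent");
        // Three keyword occurrences in total: 1.0 + 10 * 3
        System.out.println(boost(tokens, keywords)); // prints 31.0
    }
}
```

Because the match is made on whole tokens, a keyword only counts if the analyzer emits it as a single token, which is why the keywords need to be present in the analyzer's dictionary.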

package com.clzhang.sample.lucene;

import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;
import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.analysis.SimpleAnalyzer;
import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;
import com.chenlb.mmseg4j.analysis.MaxWordAnalyzer;

/**
 * Environment: Lucene 4.1 / IKAnalyzer FF version / mmseg4j 1.9
 * 1. Given an input text, get its Chinese word segmentation result;
 * 2. Given an input text, score (weight) the text according to a rule;
 *    for example, the more often the text contains the specified keywords, the higher the score.
 * @author Administrator
 */
public class AnalyzerTool {
    // mmseg4j dictionary path
    private static final String MMSEG4J_DICT_PATH = "c:\\solr\\news\\conf";
    private static Dictionary dictionary = Dictionary.getInstance(MMSEG4J_DICT_PATH);

    // Negative keywords: if the text contains these words, its score will be higher.
    private static List<String> lstNegativeWord;

    static {
        lstNegativeWord = new ArrayList<String>();

        // The following words must be present in a dictionary: either the analyzer's
        // built-in dictionary or a custom one. Otherwise the computed weight is unreliable,
        // because the analyzer will not emit these keywords as tokens.
        // (The literals are English renderings of the Chinese keywords in the original article.)
        lstNegativeWord.add("indecent");
        lstNegativeWord.add("exempt");
        lstNegativeWord.add("candid");
    }

    /**
     * Tests how the various analyzers segment the same text.
     * @param content the text to segment
     * @throws Exception
     */
    public static void testAnalyzer(String content) throws Exception {
        Analyzer analyzer = new IKAnalyzer(); // same as new IKAnalyzer(false)
        System.out.println("new IKAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new IKAnalyzer(true);
        System.out.println("new IKAnalyzer(true) parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new SimpleAnalyzer(dictionary);
        System.out.println("new SimpleAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new ComplexAnalyzer(dictionary);
        System.out.println("new ComplexAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new MaxWordAnalyzer(dictionary);
        System.out.println("new MaxWordAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));
    }

    /**
     * Computes the weight. Rule: look for the keywords among the tokens of the input string;
     * each keyword that is present raises the weight.
     * @param str the text to score
     * @return the boost value
     * @throws Exception
     */
    public static float getBoost(String str) throws Exception {
        float result = 1.0F;

        // Default analyzer; it can be swapped for another one.
        Analyzer analyzer = new IKAnalyzer();
        // Analyzer analyzer = new SimpleAnalyzer(dictionary);
        List<String> list = getAnalyzedStr(analyzer, str);
        for (String word : lstNegativeWord) {
            if (list.contains(word)) {
                // Each negative keyword that occurs adds 10, no matter how many times it occurs.
                result += 10F;
            }
        }

        return result;
    }

    /**
     * Runs the analyzer over the input, collects every token in a list, and returns the list.
     * @param analyzer the analyzer to use
     * @param content the text to segment
     * @return the tokens in order
     * @throws Exception
     */
    public static List<String> getAnalyzedStr(Analyzer analyzer, String content) throws Exception {
        TokenStream stream = analyzer.tokenStream(null, new StringReader(content));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

        List<String> result = new ArrayList<String>();
        try {
            // The TokenStream workflow calls for reset() before the first incrementToken().
            stream.reset();
            while (stream.incrementToken()) {
                result.add(term.toString());
            }
            stream.end();
        } finally {
            stream.close();
        }

        return result;
    }

    public static void main(String[] args) throws Exception {
        // Note: the two words "Pavilion Lake New Area" and "Pavilion Lake" must exist in the
        // user-defined dictionaries of both IKAnalyzer and mmseg4j.
        String content = "Pavilion Lake New Area due to indecent difficult video was exempt official state-owned bosses list announced";
        System.out.println("Original: " + content);
        testAnalyzer(content);
        System.out.println("Default analyzer score result: " + getBoost(content));
    }
}

Output:

Original: Pavilion Lake New Area due to indecent difficult video was exempt official state-owned bosses list announced
Load extension dictionary: ext.dic
......
Load extension stop-word dictionary: stopword.dic
new IKAnalyzer() parse output: [Pavilion Lake area, Pavilion Lake, new area, because, indecent, sad, excessive, video, exempt, officer, official, SOE, Mister, List, announcement]
new IKAnalyzer(true) parse output: [Pavilion Lake area, because, indecent, difficult, excessive, video, exempt, official, SOE, Mister, List, announcement]
new SimpleAnalyzer() parse output: [Pavilion Lake area, because, indecent, sad, minute, video, be, exempt, official, SOE, Mister, List, announcement]
new ComplexAnalyzer() parse output: [Pavilion Lake area, because, indecent, sad, minute, video, be, exempt, official, SOE, Mister, List, announcement]
new MaxWordAnalyzer() parse output: [Pavilion Lake, new area, because, indecent, sad, minute, video, be, exempt, official, SOE, Mister, List, announcement]
Default analyzer score result: 21.0
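
The score of 21.0 follows from the rule in the boost method: start at 1.0 and add 10 for each keyword that appears at least once among the tokens. Two of the three keywords appear, so the result is 1.0 + 10 + 10. The sketch below replays that arithmetic in plain Java using the translated tokens from the first IKAnalyzer line of the output. The class name BoostCheck and the English keyword strings are stand-ins assumed for illustration; the real run matches Chinese tokens produced by the analyzer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone check; it mirrors the scoring rule without Lucene.
final class BoostCheck {
    // Translated tokens as printed for the default IKAnalyzer in the output above.
    static final List<String> IK_TOKENS = Arrays.asList(
            "Pavilion Lake area", "Pavilion Lake", "new area", "because",
            "indecent", "sad", "excessive", "video", "exempt", "officer",
            "official", "SOE", "Mister", "List", "announcement");

    // English stand-ins for the three negative keywords (assumed spellings).
    static final List<String> KEYWORDS = Arrays.asList("indecent", "exempt", "candid");

    /** Base 1.0, plus 10 for each keyword present at least once. */
    static float boost(List<String> tokens, List<String> keywords) {
        float result = 1.0f;
        for (String keyword : keywords) {
            if (tokens.contains(keyword)) {
                result += 10f;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "indecent" and "exempt" are present, "candid" is not: 1.0 + 10 + 10
        System.out.println(boost(IK_TOKENS, KEYWORDS)); // prints 21.0
    }
}
```

Since presence is tested rather than frequency, a text that repeats one keyword many times scores the same as a text that contains it once.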

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.