1. Requirements
Environment:
Lucene 4.1 / IKAnalyzer FF / mmseg4j 1.9
Features to implement:
1). Given an input text, get its Chinese word segmentation result;
2). Given an input text, compute a weight (boost) for it according to certain rules; for example, the more of the specified keywords the text contains, the higher the score.
2. Implementation
package com.clzhang.sample.lucene;

import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.analysis.SimpleAnalyzer;
import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;
import com.chenlb.mmseg4j.analysis.MaxWordAnalyzer;

import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * Environment: Lucene 4.1 / IKAnalyzer FF / mmseg4j 1.9
 * 1. Given an input text, get its Chinese word segmentation result;
 * 2. Given an input text, compute a weight (boost) for it according to certain rules;
 *    for example, the more of the specified keywords the text contains, the higher the score.
 * @author Administrator
 */
public class AnalyzerTool {
    // mmseg4j dictionary path
    private static final String MMSEG4J_DICT_PATH = "c:\\solr\\news\\conf";
    private static Dictionary dictionary = Dictionary.getInstance(MMSEG4J_DICT_PATH);

    // Negative keywords: if the text contains these words, its score will be higher.
    private static List<String> lstNegativeWord;

    static {
        lstNegativeWord = new ArrayList<String>();

        // The following words must be present in a dictionary: either the one shipped
        // with the tokenizer or a custom one. Otherwise the computed weight is wrong,
        // because the tokenizer never emits the keyword as a token.
        lstNegativeWord.add("indecent");
        lstNegativeWord.add("be exempt");
        lstNegativeWord.add("candid");
    }

    /**
     * Compares the output of the various analyzers on the same text.
     * @param content the text to analyze
     * @throws Exception
     */
    public static void testAnalyzer(String content) throws Exception {
        Analyzer analyzer = new IKAnalyzer(); // same as new IKAnalyzer(false)
        System.out.println("new IKAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new IKAnalyzer(true);
        System.out.println("new IKAnalyzer(true) parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new SimpleAnalyzer(dictionary);
        System.out.println("new SimpleAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new ComplexAnalyzer(dictionary);
        System.out.println("new ComplexAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new MaxWordAnalyzer(dictionary);
        System.out.println("new MaxWordAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));
    }

    /**
     * Computes the weight. Rule: look for the keywords in the input string;
     * the more keywords it contains, the higher the weight.
     * @param str the text to score
     * @return the boost value
     * @throws Exception
     */
    public static float getBoost(String str) throws Exception {
        float result = 1.0F;

        // The default analyzer; it can be swapped for another one.
        Analyzer analyzer = new IKAnalyzer();
        // Analyzer analyzer = new SimpleAnalyzer(dictionary);
        List<String> list = getAnalyzedStr(analyzer, str);
        for (String word : lstNegativeWord) {
            if (list.contains(word)) {
                // Each negative keyword found adds 10 to the score,
                // no matter how many times it occurs.
                result += 10F;
            }
        }

        return result;
    }

    /**
     * Runs the analyzer over the input, adds each token to a list and returns the list.
     * @param analyzer the analyzer to use
     * @param content the text to analyze
     * @return the list of tokens
     * @throws Exception
     */
    public static List<String> getAnalyzedStr(Analyzer analyzer, String content) throws Exception {
        TokenStream stream = analyzer.tokenStream(null, new StringReader(content));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

        List<String> result = new ArrayList<String>();
        try {
            stream.reset(); // the TokenStream contract requires reset() before incrementToken()
            while (stream.incrementToken()) {
                result.add(term.toString());
            }
            stream.end();
        } finally {
            stream.close();
        }

        return result;
    }

    public static void main(String[] args) throws Exception {
        // Note: the words "Pavilion Lake New Area" and "Pavilion Lake" must both exist
        // in the user-defined dictionaries of IKAnalyzer and mmseg4j.
        String content = "Pavilion Lake New Area due to indecent difficult video was exempt official state-owned bosses list announced";

        System.out.println("Original: " + content);
        testAnalyzer(content);
        System.out.println("Default parser score result: " + getBoost(content));
    }
}
Output:
Original: Pavilion Lake New Area due to indecent difficult video was exempt official state-owned bosses list announced
Load extension dictionary: ext.dic
......
Load extension stop dictionary: stopword.dic
new IKAnalyzer() parse output: [Pavilion Lake New Area, Pavilion Lake, new area, because, indecent, sad, excessive, video, exempt, officer, official, SOE, Mister, List, announcement]
new IKAnalyzer(true) parse output: [Pavilion Lake New Area, because, indecent, difficult, excessive, video, exempt, official, SOE, Mister, List, announcement]
new SimpleAnalyzer() parse output: [Pavilion Lake New Area, because, indecent, sad, minute, video, be, exempt, official, SOE, Mister, List, announcement]
new ComplexAnalyzer() parse output: [Pavilion Lake New Area, because, indecent, sad, minute, video, be, exempt, official, SOE, Mister, List, announcement]
new MaxWordAnalyzer() parse output: [Pavilion Lake, new area, because, indecent, sad, minute, video, be, exempt, official, SOE, Mister, List, announcement]
Default parser score result: 21.0
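
The score of 21.0 follows directly from the rule in the listing: the base value 1.0 plus 10 for each of the two negative keywords that the default analyzer emitted as tokens (1.0 + 2 × 10). Note that this rule only checks whether a keyword is present, not how often it occurs. If you want repeated occurrences to count as well, one possible variant is sketched below. It works on a plain token list, so it needs no Lucene classes; the class name BoostSketch, both method names and the sample tokens are made up for illustration and are not part of the code above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class BoostSketch {

    // Presence rule, as in the listing above: +10 for each keyword
    // that appears at least once, regardless of how often.
    public static float presenceBoost(List<String> tokens, List<String> keywords) {
        float result = 1.0F;
        for (String word : keywords) {
            if (tokens.contains(word)) {
                result += 10F;
            }
        }
        return result;
    }

    // Hypothetical frequency rule: +10 for every single occurrence of a keyword.
    public static float frequencyBoost(List<String> tokens, List<String> keywords) {
        float result = 1.0F;
        for (String word : keywords) {
            result += 10F * Collections.frequency(tokens, word);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative data only: "indecent" occurs twice, "exempt" once, "candid" never.
        List<String> keywords = Arrays.asList("indecent", "exempt", "candid");
        List<String> tokens = Arrays.asList(
                "because", "indecent", "video", "exempt", "official", "indecent");

        System.out.println(presenceBoost(tokens, keywords));  // prints 21.0
        System.out.println(frequencyBoost(tokens, keywords)); // prints 31.0
    }
}
```

In the real tool, the token list would come from the analyzer, so the caveat from the listing still applies: a keyword that the tokenizer does not emit as a single token contributes nothing under either rule.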