Lucene 4: Get Chinese word segmentation results and calculate a boost based on text

Source: Internet
Author: User

1. Requirements

Environment:

Lucene 4.1 / IKAnalyzer FF release / mmseg4j 1.9

Features to implement:

1). Given input text, get the Chinese word segmentation result;
2). Given input text, score the text according to certain rules, for example: the more of the specified keywords the text contains, the higher the score.

2. Implementing the Code
package com.clzhang.sample.lucene;

import java.io.*;
import java.util.*;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.chenlb.mmseg4j.Dictionary;
import com.chenlb.mmseg4j.analysis.SimpleAnalyzer;
import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;
import com.chenlb.mmseg4j.analysis.MaxWordAnalyzer;
import org.wltea.analyzer.lucene.IKAnalyzer;

/**
 * Environment: Lucene 4.1 / IKAnalyzer FF release / mmseg4j 1.9
 * 1. Given input text, get the Chinese word segmentation result;
 * 2. Given input text, score the text according to certain rules;
 *    for example, the more of the specified keywords it contains, the higher the score.
 * @author Administrator
 */
public class AnalyzerTool {
    // mmseg4j dictionary path
    private static final String MMSEG4J_DICT_PATH = "c:\\solr\\news\\conf";
    private static Dictionary dictionary = Dictionary.getInstance(MMSEG4J_DICT_PATH);

    // Negative keywords: if the text contains these words, its score will be higher.
    private static List<String> lstNegativeWord;

    static {
        lstNegativeWord = new ArrayList<String>();

        // The following words must be present in a dictionary: either the one shipped
        // with the segmenter or a custom one. Otherwise the computed weight is inaccurate,
        // because the segmenter will not emit the keyword as a single token.
        lstNegativeWord.add("indecent");
        lstNegativeWord.add("be exempt");
        lstNegativeWord.add("candid");
    }

    /**
     * Tests the results of the various analyzers on the same text.
     * @param content the text to segment
     * @throws Exception
     */
    public static void testAnalyzer(String content) throws Exception {
        Analyzer analyzer = new IKAnalyzer(); // same as new IKAnalyzer(false)
        System.out.println("new IKAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new IKAnalyzer(true);
        System.out.println("new IKAnalyzer(true) parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new SimpleAnalyzer(dictionary);
        System.out.println("new SimpleAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new ComplexAnalyzer(dictionary);
        System.out.println("new ComplexAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));

        analyzer = new MaxWordAnalyzer(dictionary);
        System.out.println("new MaxWordAnalyzer() parse output: " + getAnalyzedStr(analyzer, content));
    }

    /**
     * Computes the weight. Rule: look for the keywords in the input string;
     * the more keywords are found, the higher the weight.
     * @param str the text to score
     * @return the boost value
     * @throws Exception
     */
    public static float getBoost(String str) throws Exception {
        float result = 1.0F;

        // The default analyzer; it can be swapped for another one.
        Analyzer analyzer = new IKAnalyzer();
        // Analyzer analyzer = new SimpleAnalyzer(dictionary);
        List<String> list = getAnalyzedStr(analyzer, str);
        for (String word : lstNegativeWord) {
            if (list.contains(word)) {
                // Each negative keyword found adds 10 to the score,
                // no matter how many times it occurs.
                result += 10F;
            }
        }

        return result;
    }

    /**
     * Runs the analyzer on the input, adds each token to a list, and returns the list.
     * @param analyzer the analyzer to use
     * @param content the text to segment
     * @return the list of tokens
     * @throws Exception
     */
    public static List<String> getAnalyzedStr(Analyzer analyzer, String content) throws Exception {
        TokenStream stream = analyzer.tokenStream(null, new StringReader(content));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

        List<String> result = new ArrayList<String>();
        try {
            // The TokenStream contract requires reset() before the first incrementToken().
            stream.reset();
            while (stream.incrementToken()) {
                result.add(term.toString());
            }
            stream.end();
        } finally {
            stream.close();
        }

        return result;
    }

    public static void main(String[] args) throws Exception {
        // Note: the two words "Pavilion Lake New Area" and "Pavilion Lake" must exist
        // in the user-defined dictionaries of both IKAnalyzer and mmseg4j.
        String content = "Pavilion Lake New area due to indecent difficult video is exempt from official state-owned enterprise bosses list";
        System.out.println("Original: " + content);
        testAnalyzer(content);
        System.out.println("Default parser score result: " + getBoost(content));
    }
}
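The comment in the static block says a keyword only counts if the segmenter emits it as a single token, which in turn requires the word to be in a dictionary. The following is a minimal sketch of why that is so, using a toy forward-maximum-matching segmenter. It is not the algorithm used by IKAnalyzer or mmseg4j; the class name MaxMatchSketch and the sample dictionary are hypothetical, and a Latin string without spaces stands in for unsegmented Chinese text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy segmenter for illustration only (hypothetical, not part of AnalyzerTool).
class MaxMatchSketch {

    // Forward maximum matching: at each position take the longest dictionary word,
    // falling back to a single character when no dictionary word matches.
    static List<String> segment(String text, Set<String> dictionary) {
        int maxLen = 1;
        for (String word : dictionary) {
            maxLen = Math.max(maxLen, word.length());
        }
        List<String> tokens = new ArrayList<String>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxLen);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "indecentvideo";
        Set<String> withoutKeyword = new HashSet<String>(Arrays.asList("video"));
        Set<String> withKeyword = new HashSet<String>(Arrays.asList("video", "indecent"));

        // Keyword missing from the dictionary: it is broken into single characters.
        System.out.println(segment(text, withoutKeyword)); // [i, n, d, e, c, e, n, t, video]
        // Keyword present in the dictionary: it comes out as one token.
        System.out.println(segment(text, withKeyword));    // [indecent, video]
    }
}
```

In the first case a check such as list.contains("indecent") fails even though the text plainly contains the word, so the keyword would contribute nothing to the score.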

Output:

Original: Pavilion Lake New Area due to indecent difficult video was exempt official state-owned bosses list announced
Load extension dictionary: ext.dic
......
Load extension stop dictionary: stopword.dic
new IKAnalyzer() parse output: [Pavilion Lake area, Pavilion Lake, new area, because, indecent, sad, excessive, video, exempt, officer, official, SOE, Mister, List, announcement]
new IKAnalyzer(true) parse output: [Pavilion Lake area, because, indecent, difficult, excessive, video, exempt, official, SOE, Mister, List, announcement]
new SimpleAnalyzer() parse output: [Pavilion Lake area, because, indecent, sad, minute, video, be, exempt, official, SOE, Mister, List, announcement]
new ComplexAnalyzer() parse output: [Pavilion Lake area, because, indecent, sad, minute, video, be, exempt, official, SOE, Mister, List, announcement]
new MaxWordAnalyzer() parse output: [Pavilion Lake, Xin Qu, because, indecent, sad, minute, video, be, exempt, official, SOE, Mister, List, announcement]
Default parser score result: 21.0
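The score of 21.0 is consistent with the rule in the code: a base of 1.0 plus 10 for each of two matched keywords. Because the check is presence-based, repeated occurrences of a keyword do not raise the score further. The sketch below contrasts that rule with a frequency-based variant that adds 10 per occurrence. The variant is an assumption added for illustration and is not in the original code; the class name BoostSketch and the sample tokens are likewise hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; works on an already segmented token list.
class BoostSketch {

    // Presence-based rule: 1.0 plus 10 for each distinct keyword found among the tokens.
    static float presenceBoost(List<String> tokens, List<String> keywords) {
        float result = 1.0f;
        for (String keyword : keywords) {
            if (tokens.contains(keyword)) {
                result += 10f;
            }
        }
        return result;
    }

    // Frequency-based variant (assumption): 1.0 plus 10 for every occurrence of a keyword.
    static float frequencyBoost(List<String> tokens, List<String> keywords) {
        float result = 1.0f;
        for (String keyword : keywords) {
            result += 10f * Collections.frequency(tokens, keyword);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("indecent", "exempt", "candid");
        // Sample tokens in which "indecent" occurs twice and "exempt" once.
        List<String> tokens = Arrays.asList("indecent", "video", "exempt", "official", "indecent");

        System.out.println(presenceBoost(tokens, keywords));  // 21.0
        System.out.println(frequencyBoost(tokens, keywords)); // 31.0
    }
}
```

Which rule fits better depends on whether a text that repeats one keyword many times should outrank a text that mentions several different keywords once each.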
