Integrating the ICTCLAS word segmenter with Lucene 4.9

Source: Internet
Author: User

I have always liked the search field. I can no longer work in it, but I still keep some of that enthusiasm. I remember that summer, that lab, and that group of people; all of it is gone with the wind now. I have set out on a new journey and am no longer who I was. Facing a work environment where technology and business are mixed, I chose to settle down and build up experience. Society is a big machine and each of us is just a small screw in it, with no room for the slightest carelessness.

What is a product of its era will eventually be left behind by the times.

Back to the point. To add your own word segmenter to Lucene, you extend the Analyzer class and implement its createComponents method. You also define a Tokenizer class that records the words to be indexed and their positions in the text. Here it extends SegmentingTokenizerBase, which requires two methods: setNextSentence and incrementWord. setNextSentence sets the next sentence to process; when several fields (Field) are indexed, it receives the content of the next field, which is obtained with new String(buffer, sentenceStart, sentenceEnd - sentenceStart). incrementWord records each word and its position. One thing to note: call clearAttributes() before setting the attributes of a word, otherwise you may see the error "first position increment must be > 0 ...". Taking the ICTCLAS segmenter as an example, my own code is posted below. I hope it helps, and corrections are welcome.
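As a rough illustration of the sentence step: SegmentingTokenizerBase walks the input with a java.text.BreakIterator and hands each sentence boundary pair to setNextSentence. The sketch below reproduces only that slicing with the JDK, without Lucene. The class name SentenceSplitSketch and the method splitSentences are made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitSketch {

    /** Slices the buffer into sentences, as a setNextSentence callback would see them. */
    static List<String> splitSentences(char[] buffer) {
        BreakIterator sentences = BreakIterator.getSentenceInstance(Locale.ROOT);
        sentences.setText(new String(buffer));

        List<String> result = new ArrayList<String>();
        int sentenceStart = sentences.first();
        int sentenceEnd = sentences.next();
        while (sentenceEnd != BreakIterator.DONE) {
            // Same expression as in the tokenizer: copy the range [sentenceStart, sentenceEnd)
            result.add(new String(buffer, sentenceStart, sentenceEnd - sentenceStart));
            sentenceStart = sentenceEnd;
            sentenceEnd = sentences.next();
        }
        return result;
    }

    public static void main(String[] args) {
        char[] buffer = "First sentence. Second sentence.".toCharArray();
        for (String sentence : splitSentences(buffer)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Each slice keeps its trailing whitespace, so concatenating the slices gives back the original buffer.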


import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

/**
 * Analyzer backed by the ICTCLAS (Chinese Academy of Sciences) word segmenter.
 * Extends Analyzer and implements its createComponents method.
 *
 * @author ckm
 */
public class IctclasAnalyzer extends Analyzer {

    /**
     * Converts the field content into the TokenStream that Lucene needs
     * to build the index.
     *
     * @param fieldName field name
     * @param reader    input stream of the field content
     */
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        try {
            System.out.println(fieldName);
            final Tokenizer tokenizer = new IctclasTokenzier(reader);
            // Lowercase before stemming: PorterStemFilter expects lowercase input,
            // and one stemming pass is enough.
            TokenStream stream = new LowerCaseFilter(Version.LUCENE_4_9, tokenizer);
            stream = new PorterStemFilter(stream);
            return new TokenStreamComponents(tokenizer, stream);
        } catch (IOException e) {
            // Fail loudly instead of returning null, which would only surface
            // later as a NullPointerException inside Lucene.
            throw new RuntimeException("Failed to create IctclasTokenzier", e);
        }
    }

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IctclasAnalyzer();
        // Sample input; ICTCLAS segments Chinese text, so use a Chinese string in practice.
        String str = "Hacker technology";
        TokenStream ts = analyzer.tokenStream("field", new StringReader(str));
        CharTermAttribute c = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(c.toString());
        }
        ts.end();
        ts.close();
    }
}


import java.io.IOException;
import java.io.Reader;
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Iterator;
import java.util.Locale;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.util.SegmentingTokenizerBase;
import org.apache.lucene.util.AttributeFactory;

/**
 * Extends Lucene's SegmentingTokenizerBase and overrides setNextSentence and
 * incrementWord to record the words to be indexed and their positions in the text.
 *
 * @author ckm
 */
public class IctclasTokenzier extends SegmentingTokenizerBase {
    private static final BreakIterator sentenceProto = BreakIterator.getSentenceInstance(Locale.ROOT);

    private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class); // the word to index
    private final OffsetAttribute offAttr = addAttribute(OffsetAttribute.class); // position of the word in the text
    private IctclasDelegate ictclas; // delegate object of the segmentation system
    private Iterator<String> words; // words produced by segmentation
    // End position of the previous word, relative to the current sentence.
    // Note that this field hides the protected "offset" field of the base class.
    private int offset = 0;

    /**
     * Constructor.
     *
     * @param reader input to tokenize
     * @throws IOException
     */
    protected IctclasTokenzier(Reader reader) throws IOException {
        this(DEFAULT_TOKEN_ATTRIBUTE_FACTORY, reader);
    }

    protected IctclasTokenzier(AttributeFactory factory, Reader reader) throws IOException {
        // A BreakIterator must not be shared between token streams, so pass a clone.
        super(factory, reader, (BreakIterator) sentenceProto.clone());
        ictclas = IctclasDelegate.getDelegate();
    }

    @Override
    protected void setNextSentence(int sentenceStart, int sentenceEnd) {
        String sentence = new String(buffer, sentenceStart, sentenceEnd - sentenceStart);
        String result = ictclas.process(sentence);
        // ICTCLAS returns the words separated by whitespace.
        words = Arrays.asList(result.split("\\s")).iterator();
        offset = 0;
    }

    @Override
    protected boolean incrementWord() {
        if (words == null) {
            return false;
        }
        while (words.hasNext()) {
            String t = words.next();
            // Skip whitespace and stop words. StopWordFilter is the custom
            // stop-word filter class defined below.
            if (t.isEmpty()) {
                offset++;
                continue;
            }
            if (StopWordFilter.filter(t)) {
                offset += t.length();
                continue;
            }
            // Clear the attributes first, otherwise Lucene may report
            // "first position increment must be > 0".
            clearAttributes();
            termAttr.copyBuffer(t.toCharArray(), 0, t.length());
            int start = offset;
            offset += t.length();
            offAttr.setOffset(correctOffset(start), correctOffset(offset));
            return true;
        }
        // No words left in this sentence. Checking hasNext() in the loop avoids
        // the NoSuchElementException thrown when a sentence ends in stop words.
        return false;
    }

    /**
     * Reset.
     */
    @Override
    public void reset() throws IOException {
        super.reset();
        offset = 0;
    }

    public static void main(String[] args) throws IOException {
        // Sample input; use a Chinese sentence in practice.
        String content = "Bao Jianfeng from sharpening out, plum blossom fragrance from bitter cold!";
        String seg = IctclasDelegate.getDelegate().process(content);
        // IctclasTokenzier test = new IctclasTokenzier(seg);
        // while (test.incrementToken());
    }
}

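Tracking offsets by adding up word lengths assumes that the segmenter output lines up character by character with its input. One alternative, given here only as a sketch and not as the approach used in this post, is to look each word up in the original sentence, starting from the end of the previous word. OffsetSketch, Span and locate are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetSketch {

    /** A word together with its [start, end) character range in the sentence. */
    static final class Span {
        final String term;
        final int start;
        final int end;

        Span(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Finds each whitespace-separated word of the segmenter output in the
     * original sentence, searching forward from the end of the previous word.
     */
    static List<Span> locate(String sentence, String segmented) {
        List<Span> spans = new ArrayList<Span>();
        int cursor = 0;
        for (String word : segmented.split("\\s+")) {
            if (word.isEmpty()) {
                continue; // leading whitespace produces one empty element
            }
            int start = sentence.indexOf(word, cursor);
            if (start < 0) {
                continue; // the segmenter changed the text; skip this word
            }
            int end = start + word.length();
            spans.add(new Span(word, start, end));
            cursor = end;
        }
        return spans;
    }

    public static void main(String[] args) {
        for (Span s : locate("ab cd,ef", "ab cd , ef")) {
            System.out.println(s.term + " [" + s.start + ", " + s.end + ")");
        }
    }
}
```

This keeps the offsets correct when the input contains whitespace that the segmenter drops, at the cost of one indexOf call per word.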



import java.io.File;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

import ICTCLAS.I3S.AC.ICTCLAS50;

/**
 * Proxy class for the ICTCLAS (Chinese Academy of Sciences) segmentation system.
 *
 * @author ckm
 */
public class IctclasDelegate {
    private static final String userDict = "userDict.txt"; // user dictionary
    private static final Charset charset = Charset.forName("gb2312"); // default encoding
    private static String ictclasPath = System.getProperty("user.dir");
    private static String dirConfigurate = "ictclasconf"; // name of the configuration folder
    private static String configurate = ictclasPath + File.separator + dirConfigurate; // absolute path of the configuration folder
    private static int wordLabel = 2; // part-of-speech tag set (PKU second-level tag set)
    private static ICTCLAS50 ictclas; // JNI interface object of the segmentation system
    private static IctclasDelegate instance = null;

    private IctclasDelegate() {
    }

    /**
     * Initializes the ICTCLAS50 object.
     *
     * @return whether the ICTCLAS50 object was initialized successfully
     */
    public boolean init() {
        ictclas = new ICTCLAS50();
        boolean ok = ictclas.ICTCLAS_Init(configurate.getBytes(charset));
        if (!ok) {
            System.out.println("Init Fail!");
            return false;
        }
        // Select the part-of-speech tag set: 0 ICT second-level, 1 ICT first-level,
        // 2 PKU second-level, 3 PKU first-level.
        ictclas.ICTCLAS_SetPOSmap(wordLabel);
        importUserDictFile(configurate + File.separator + userDict); // import the user dictionary
        ictclas.ICTCLAS_SaveTheUsrDic(); // save the user dictionary
        return true;
    }

    /**
     * Converts a charset into the encoding code that the segmenter recognizes.
     *
     * @param charset the encoding
     * @return the number that corresponds to the encoding
     */
    public static int getECode(Charset charset) {
        String name = charset.name();
        if (name.equalsIgnoreCase("ascii"))
            return 1;
        if (name.equalsIgnoreCase("gb2312"))
            return 2;
        if (name.equalsIgnoreCase("gbk"))
            return 2;
        if (name.equalsIgnoreCase("utf8"))
            return 3;
        if (name.equalsIgnoreCase("utf-8"))
            return 3;
        if (name.equalsIgnoreCase("big5"))
            return 4;
        return 0;
    }

    /**
     * Imports the user dictionary.
     *
     * @param path absolute path of the user dictionary
     * @return the number of words imported from the dictionary
     */
    public int importUserDictFile(String path) {
        System.out.println("Import user dictionary");
        return ictclas.ICTCLAS_ImportUserDictFile(path.getBytes(charset), getECode(charset));
    }

    /**
     * Segments a string into words.
     *
     * @param source the text to segment
     * @return the segmentation result
     */
    public String process(String source) {
        return process(source.getBytes(charset));
    }

    public String process(char[] chars) {
        ByteBuffer bb = charset.encode(CharBuffer.wrap(chars));
        // Copy only the encoded bytes: bb.array() may be longer than the content.
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        return process(bytes);
    }

    public String process(byte[] bytes) {
        if (bytes == null || bytes.length < 1)
            return null;
        // 2 = GB encoding of the input, 0 = no part-of-speech tags in the output
        byte nativeBytes[] = ictclas.ICTCLAS_ParagraphProcess(bytes, 2, 0);
        // Drop the last byte of the native result
        String nativeStr = new String(nativeBytes, 0, nativeBytes.length - 1, charset);
        return nativeStr;
    }

    /**
     * Returns the segmenter proxy object.
     *
     * @return the segmenter proxy object
     */
    public static synchronized IctclasDelegate getDelegate() {
        // synchronized so that only one thread creates and initializes the instance
        if (instance == null) {
            IctclasDelegate delegate = new IctclasDelegate();
            delegate.init();
            instance = delegate;
        }
        return instance;
    }

    /**
     * Shuts down the segmenter.
     *
     * @return whether the operation succeeded
     */
    public boolean exit() {
        return ictclas.ICTCLAS_Exit();
    }

    public static void main(String[] args) {
        // Sample input; use a Chinese sentence in practice.
        String str = "married monk Not married";
        IctclasDelegate id = IctclasDelegate.getDelegate();
        String result = id.process(str.toCharArray());
        System.out.println(result.replaceAll(" ", "-"));
    }
}
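One detail worth care when turning a char[] into bytes for a native call: the ByteBuffer returned by Charset.encode can have a backing array that is larger than the encoded content, so only the remaining() bytes should be copied out. A minimal sketch follows. It uses UTF-8 instead of GB2312 so that it runs on any JVM, and the names EncodeSketch and encodeExactly are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodeSketch {

    /** Encodes the characters and returns exactly the encoded bytes, with no padding. */
    static byte[] encodeExactly(char[] chars, Charset charset) {
        ByteBuffer bb = charset.encode(CharBuffer.wrap(chars));
        // remaining() is the number of encoded bytes; the backing array
        // returned by bb.array() may have extra capacity after them.
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        char[] chars = "\u5206\u8bcd test".toCharArray();
        byte[] bytes = encodeExactly(chars, StandardCharsets.UTF_8);
        System.out.println("encoded length: " + bytes.length);
        System.out.println("round trip: " + new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The result is byte-for-byte the same as new String(chars).getBytes(charset), which is a simpler alternative when no intermediate buffer is needed.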


import java.util.Set;
import java.util.regex.Pattern;

/**
 * Stop-word filter.
 *
 * @author ckm
 */
public class StopWordFilter {
    // Compiled once instead of on every call to filter()
    private static final Pattern CHINESE = Pattern.compile("^[\u4e00-\u9fa5]+$"); // Chinese
    private static final Pattern ENGLISH = Pattern.compile("^[a-zA-Z]+$"); // English
    private static final Pattern CHINESE_DIGIT = Pattern.compile("^[\u4e00-\u9fa50-9]+$"); // Chinese and digits
    private static final Pattern ENGLISH_DIGIT = Pattern.compile("^[a-zA-Z0-9]+$"); // English and digits
    private static final Pattern ENGLISH_CHINESE = Pattern.compile("^[a-zA-Z\u4e00-\u9fa5]+$"); // English and Chinese

    private static Set<String> chineseStopWords = null; // Chinese stop-word set
    private static Set<String> englishStopWords = null; // English stop-word set

    static {
        init();
    }

    /**
     * Initializes the Chinese and English stop-word sets.
     */
    public static void init() {
        LoadStopWords lsw = new LoadStopWords();
        chineseStopWords = lsw.getChineseStopWords();
        englishStopWords = lsw.getEnglishStopWords();
    }

    /**
     * Determines the type of the keyword and whether it is a stop word.
     * Note: for now only five types are considered: Chinese, English, mixed
     * Chinese and English, mixed Chinese and digits, mixed English and digits.
     * The three mixed types have no dedicated stop-word list or grammar rule yet.
     *
     * @param word keyword
     * @return true if the keyword is a stop word
     */
    public static boolean filter(String word) {
        if (CHINESE.matcher(word).find())
            return chineseFilter(word);
        if (ENGLISH.matcher(word).find())
            return englishFilter(word);
        if (CHINESE_DIGIT.matcher(word).find())
            return chineseDigitFilter(word);
        if (ENGLISH_DIGIT.matcher(word).find())
            return englishDigitFilter(word);
        // The mixed English and Chinese check must come after the pure
        // Chinese and pure English checks.
        if (ENGLISH_CHINESE.matcher(word).find())
            return englishChineseFilter(word);
        // Anything else (punctuation and so on) is treated as a stop word.
        return true;
    }

    /**
     * Determines whether the keyword is a Chinese stop word.
     *
     * @param word keyword
     * @return true if the keyword is a stop word
     */
    public static boolean chineseFilter(String word) {
        if (chineseStopWords == null || chineseStopWords.size() == 0)
            return false;
        return chineseStopWords.contains(word);
    }

    /**
     * Determines whether the keyword is an English stop word.
     *
     * @param word keyword
     * @return true if the keyword is a stop word
     */
    public static boolean englishFilter(String word) {
        if (word.length() <= 2)
            return true;
        if (englishStopWords == null || englishStopWords.size() == 0)
            return false;
        // The stop words are stored in lowercase, so compare in lowercase.
        return englishStopWords.contains(word.toLowerCase());
    }

    /**
     * Determines whether the keyword is a mixed English and digit stop word.
     *
     * @param word keyword
     * @return true if the keyword is a stop word
     */
    public static boolean englishDigitFilter(String word) {
        return false;
    }

    /**
     * Determines whether the keyword is a mixed Chinese and digit stop word.
     *
     * @param word keyword
     * @return true if the keyword is a stop word
     */
    public static boolean chineseDigitFilter(String word) {
        return false;
    }

    /**
     * Determines whether the keyword is a mixed English and Chinese stop word.
     *
     * @param word keyword
     * @return true if the keyword is a stop word
     */
    public static boolean englishChineseFilter(String word) {
        return false;
    }

    public static void main(String[] args) {
        /*
         * Iterator<String> iterator = StopWordFilter.chineseStopWords.iterator();
         * int n = 0;
         * while (iterator.hasNext()) { System.out.println(iterator.next()); n++; }
         * System.out.println("Total word count: " + n);
         */
        boolean bool = StopWordFilter.filter("sword");
        System.out.println(bool);
    }
}
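The order of the regular-expression checks in filter matters, because the mixed character classes also match the pure ones. The sketch below isolates just the classification step so the effect of the ordering can be tried out. WordTypeSketch, WordType and classify are hypothetical names for this illustration; the patterns are the ones used in the filter.

```java
import java.util.regex.Pattern;

public class WordTypeSketch {

    enum WordType { CHINESE, ENGLISH, CHINESE_DIGIT, ENGLISH_DIGIT, ENGLISH_CHINESE, OTHER }

    private static final Pattern CHINESE = Pattern.compile("^[\u4e00-\u9fa5]+$");
    private static final Pattern ENGLISH = Pattern.compile("^[a-zA-Z]+$");
    private static final Pattern CHINESE_DIGIT = Pattern.compile("^[\u4e00-\u9fa50-9]+$");
    private static final Pattern ENGLISH_DIGIT = Pattern.compile("^[a-zA-Z0-9]+$");
    private static final Pattern ENGLISH_CHINESE = Pattern.compile("^[a-zA-Z\u4e00-\u9fa5]+$");

    /** Returns the first type whose pattern matches, checking the pure types first. */
    static WordType classify(String word) {
        if (CHINESE.matcher(word).find()) return WordType.CHINESE;
        if (ENGLISH.matcher(word).find()) return WordType.ENGLISH;
        if (CHINESE_DIGIT.matcher(word).find()) return WordType.CHINESE_DIGIT;
        if (ENGLISH_DIGIT.matcher(word).find()) return WordType.ENGLISH_DIGIT;
        if (ENGLISH_CHINESE.matcher(word).find()) return WordType.ENGLISH_CHINESE;
        return WordType.OTHER;
    }

    public static void main(String[] args) {
        String[] samples = { "lucene", "lucene49", "\u4e2d\u6587", "2014", "?!" };
        for (String sample : samples) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```

A side effect of this ordering is that a purely numeric token such as "2014" falls into the Chinese-and-digit branch, since that pattern is tested before the English-and-digit one.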

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

/**
 * Loads the stop-word files.
 *
 * @author ckm
 */
public class LoadStopWords {
    private Set<String> chineseStopWords = null; // Chinese stop-word set
    private Set<String> englishStopWords = null; // English stop-word set

    /**
     * Returns the Chinese stop-word set.
     *
     * @return the Chinese stop-word set, of type Set<String>
     */
    public Set<String> getChineseStopWords() {
        return chineseStopWords;
    }

    /**
     * Sets the Chinese stop-word set.
     *
     * @param chineseStopWords the Chinese stop-word set, of type Set<String>
     */
    public void setChineseStopWords(Set<String> chineseStopWords) {
        this.chineseStopWords = chineseStopWords;
    }

    /**
     * Returns the English stop-word set.
     *
     * @return the English stop-word set, of type Set<String>
     */
    public Set<String> getEnglishStopWords() {
        return englishStopWords;
    }

    /**
     * Sets the English stop-word set.
     *
     * @param englishStopWords the English stop-word set, of type Set<String>
     */
    public void setEnglishStopWords(Set<String> englishStopWords) {
        this.englishStopWords = englishStopWords;
    }

    /**
     * Loads the stop-word lists.
     */
    public LoadStopWords() {
        chineseStopWords = loadStopWords(this.getClass().getResourceAsStream("ChineseStopWords.txt"));
        englishStopWords = loadStopWords(this.getClass().getResourceAsStream("EnglishStopWords.txt"));
    }

    /**
     * Loads stop words from a stop-word file. The file is an ordinary
     * GBK-encoded text file with one stop word per line. Comments start
     * with "//". The stop words include Chinese punctuation marks, the
     * Chinese space, and words that are too frequent to be useful in an index.
     *
     * @param input stream of the stop-word file
     * @return the stop words as a HashSet
     */
    public static Set<String> loadStopWords(InputStream input) {
        String line;
        Set<String> stopWords = new HashSet<String>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new InputStreamReader(input, "GBK"))) {
            while ((line = br.readLine()) != null) {
                if (line.indexOf("//") != -1) {
                    line = line.substring(0, line.indexOf("//"));
                }
                line = line.trim();
                if (line.length() != 0)
                    stopWords.add(line.toLowerCase());
            }
        } catch (IOException e) {
            System.err.println("Cannot open the stop-word list!");
        }
        return stopWords;
    }

    public static void main(String[] args) {
        LoadStopWords lsw = new LoadStopWords();
        Iterator<String> iterator = lsw.getEnglishStopWords().iterator();
        int n = 0;
        while (iterator.hasNext()) {
            System.out.println(iterator.next());
            n++;
        }
        System.out.println("Total word count: " + n);
    }
}

The code needs two files, ChineseStopWords.txt and EnglishStopWords.txt, which store the Chinese and the English stop words respectively. I do not know how to upload them here, and the same goes for the basic ICTCLAS files.
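Since the stop-word files themselves are not shown, here is a small sketch of the format the loader expects: one stop word per line, with "//" starting a comment. It parses an in-memory string rather than a GBK file so that it runs on its own. StopWordFileSketch, parse and parseText are names invented for this example, and the sample words are placeholders, not the content of the real files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class StopWordFileSketch {

    /** Reads one stop word per line, dropping "//" comments and blank lines. */
    static Set<String> parse(Reader source) throws IOException {
        Set<String> words = new HashSet<String>();
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                int comment = line.indexOf("//");
                if (comment != -1) {
                    line = line.substring(0, comment);
                }
                line = line.trim();
                if (!line.isEmpty()) {
                    words.add(line.toLowerCase(Locale.ROOT));
                }
            }
        }
        return words;
    }

    /** Convenience wrapper for in-memory text. */
    static Set<String> parseText(String content) {
        try {
            return parse(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder content in the expected format
        String sample = "// English stop words\nThe // article\n\nand\n";
        System.out.println(parseText(sample));
    }
}
```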

Download the complete project: http://download.csdn.net/detail/km1218/7754907



