Introduction to Lucene 3.6.2 (4): Chinese Word Breakers

Source: Internet
Author: User
Tags: filter, lowercase, zip
package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

import org.apache.lucene.util.Version;

import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;
import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

/**
 * Lucene 3.6.2 introductory series, part 04: Chinese word breakers
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see Lucene 3.5 recommends four analyzers: SimpleAnalyzer, StopAnalyzer, WhitespaceAnalyzer and StandardAnalyzer
 * @see All four share a common abstract parent class, which declares the method public final TokenStream tokenStream(), i.e. a stream of tokens
 * @see Given a text such as "how are you thank you", what is actually handed to the analyzer is a java.io.Reader
 * @see After the analyzer processes it, the whole text is turned into a TokenStream, and this TokenStream holds all of the tokenization information
 * @see TokenStream has two subclasses: Tokenizer and TokenFilter
 * @see Tokenizer ----> splits a character stream into individual tokens (i.e. single words)
 * @see TokenFilter --> filters those tokens
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see The analysis process
 * @see 1) A java.io.Reader feeds the raw data stream to a Tokenizer, which splits it into individual tokens
 * @see 2) A chain of TokenFilters then filters the split tokens, yielding the final TokenStream
 * @see 3) The index is built from that TokenStream
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see Some subclasses of Tokenizer
 * @see KeywordTokenizer ----- no splitting at all: whatever comes in is indexed as-is
 * @see StandardTokenizer ---- standard tokenization with some smarter rules, e.g. it keeps "yeah.net" in "jadyer@yeah.net" as a single token
 * @see CharTokenizer -------- character-based control; it has two subclasses, WhitespaceTokenizer and LetterTokenizer
 * @see WhitespaceTokenizer -- splits on whitespace, e.g. "thank you,I am jadyer" becomes 4 tokens
 * @see LetterTokenizer ------ splits on non-letter characters such as punctuation, e.g. "thank you,I am jadyer" becomes 5 tokens
 * @see LowerCaseTokenizer --- a subclass of LetterTokenizer that additionally lowercases the tokens
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see Some subclasses of TokenFilter
 * @see StopFilter -------- removes stop words
 * @see LowerCaseFilter --- lowercases tokens
 * @see StandardFilter ---- applies some post-processing to the standard token stream
 * @see PorterStemFilter -- stems tokens, e.g. it reduces "coming" to "come" and "countries" to "country"
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see E.g. "how are you thank you" is tokenized into 5 tokens: "how", "are", "you", "thank", "you"
 * @see What must be saved so that the original data can be restored correctly later? Essentially three things, listed below
 * @see CharTermAttribute (called TermAttribute before Lucene 3.5), OffsetAttribute, PositionIncrementAttribute
 * @see 1) CharTermAttribute ----------- stores the token text, here "how", "are", "you", "thank", "you"
 * @see 2) OffsetAttribute ------------- stores the character offsets of each token, e.g. "how" starts at 0 and ends at 3, "are" at 4 and 7, and "thank" starts at 12
 * @see 3) PositionIncrementAttribute -- stores the position increment between consecutive tokens, e.g. it is 1 between "how" and "are", between "are" and "you", and between "you" and "thank"
 * @see But if "are" is removed as a stop word (the effect of StopFilter), the position increment between "how" and "you" becomes 2
 * @see When looking up a term, Lucene walks positions via these increments; so what happens if two tokens end up at the same position?
 * @see Suppose a token "this" has the same position as "how": a search for "this" would then also match "how are you thank you"
 * @see This is an effective way to implement synonym search; the currently popular WordNet can be plugged in this way for synonym lookup
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see Chinese word breakers
 * @see The analyzers Lucene provides by default are completely unsuitable for Chinese
 * @see 1) paoding - the "paoding jie niu" word breaker, official site http://code.google.com/p/paoding (apparently now hosted at http://git.oschina.net/zhzhenqin/paoding-analysis)
 * @see 2) mmseg4j - said to use Sogou's lexicon, official site https://code.google.com/p/mmseg4j (see also https://code.google.com/p/jcseg)
 * @see 3) IK ------ https://code.google.com/p/ik-analyzer/
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see Using mmseg4j
 * @see 1) Download mmseg4j-1.8.5.zip and put mmseg4j-all-1.8.5-with-dic.jar on the classpath
 * @see 2) Write new MMSegAnalyzer() wherever an analyzer needs to be specified
 * @see Note 1) mmseg4j-all-1.8.5-with-dic.jar bundles its own dictionary, so new MMSegAnalyzer() works directly
 * @see Note 2) If you use mmseg4j-all-1.8.5.jar instead, you must point at a dictionary directory, e.g. new MMSegAnalyzer("D:\\Develop\\mmseg4j-1.8.5\\data")
 * @see         To still use the no-arg new MMSegAnalyzer(), copy the data directory shipped in mmseg4j-1.8.5.zip onto the classpath
 * @see Summary: just use mmseg4j-all-1.8.5-with-dic.jar
 * @see ---------------------------------------------------------------------------------------------------------------
 * @create Aug 2, 2013 5:30:45 PM
 * @author Xuan Yu
 */
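The class body that should follow this Javadoc is not included in the page. As a minimal sketch of the kind of token-printing method this series uses (the class and method names `AnalyzerDemo`/`display` are my own choices, assuming Lucene 3.6 on the classpath), the four analyzers can be compared like this:

```java
package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    /** Prints every token the given analyzer produces for the given text. */
    public static void display(String text, Analyzer analyzer) throws IOException {
        TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + term + "]");
        }
        System.out.println("  <-- " + analyzer.getClass().getSimpleName());
    }

    public static void main(String[] args) throws IOException {
        String text = "how are you thank you";
        // StandardAnalyzer and StopAnalyzer drop "are" (it is in the default English stop set)
        display(text, new StandardAnalyzer(Version.LUCENE_36));
        display(text, new StopAnalyzer(Version.LUCENE_36));
        display(text, new SimpleAnalyzer(Version.LUCENE_36));     // lowercases, splits on non-letters
        display(text, new WhitespaceAnalyzer(Version.LUCENE_36)); // splits on whitespace only
    }
}
```

Running this makes the differences between the four analyzers visible at a glance.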

  
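The three attributes described in the Javadoc can be read directly off the stream. A sketch, again assuming Lucene 3.6 and a class name of my own choosing, using StopAnalyzer so that the increased position increment after a removed stop word is visible:

```java
package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

public class AttributeDemo {
    public static void main(String[] args) throws IOException {
        TokenStream stream = new StopAnalyzer(Version.LUCENE_36)
                .tokenStream("content", new StringReader("how are you thank you"));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
        PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            // "how" arrives with increment 1 and offsets 0-3; because "are" is
            // removed as a stop word, "you" arrives with increment 2
            System.out.println(posIncr.getPositionIncrement() + ": [" + term + "] "
                    + offset.startOffset() + "-" + offset.endOffset());
        }
    }
}
```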
  

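For Chinese text, following steps 1) and 2) of the mmseg4j instructions above, swapping in the analyzer is a one-line change. A sketch (the sample sentence and class name are my own, assuming mmseg4j-all-1.8.5-with-dic.jar and Lucene 3.6 on the classpath):

```java
package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;
import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

public class ChineseAnalyzerDemo {
    public static void main(String[] args) throws IOException {
        // mmseg4j-all-1.8.5-with-dic.jar bundles its dictionary, so no path is needed;
        // with the plain jar you would write new MMSegAnalyzer("D:\\Develop\\mmseg4j-1.8.5\\data")
        Analyzer[] analyzers = {new MMSegAnalyzer(), new ComplexAnalyzer()};
        for (Analyzer analyzer : analyzers) {
            TokenStream stream = analyzer.tokenStream("content", new StringReader("我是中国人"));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print("[" + term + "]");
            }
            System.out.println();
        }
    }
}
```

ComplexAnalyzer uses mmseg4j's "complex" segmentation rules; comparing its output against the default MMSegAnalyzer shows how the segmentation mode affects the tokens.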