Introduction to Lucene 3.6.2 (4): Chinese Word Breakers

Source: Internet
Author: User
Tags: filter, lowercase, zip
package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

import org.apache.lucene.util.Version;

import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;
import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

/**
 * Lucene 3.6.2 introductory series, part 04: Chinese word breakers
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see Lucene 3.5 recommends four analyzers: SimpleAnalyzer, StopAnalyzer, WhitespaceAnalyzer and StandardAnalyzer
 * @see All four share a common abstract parent class, which declares the method public final TokenStream tokenStream(), i.e. a stream of tokens
 * @see Given a text such as "how are you thank you", what is actually handed to the analyzer is a java.io.Reader
 * @see After the analyzer processes it, the whole text is turned into a TokenStream, and this TokenStream holds all of the tokenization information
 * @see TokenStream has two subclasses: Tokenizer and TokenFilter
 * @see Tokenizer ----> splits a character stream into individual tokens (i.e. single words)
 * @see TokenFilter --> filters those tokens
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see The analysis process
 * @see 1) A java.io.Reader feeds the raw data stream to a Tokenizer, which splits it into individual tokens
 * @see 2) A chain of TokenFilters then filters the split tokens, yielding the final TokenStream
 * @see 3) The index is built from that TokenStream
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see Some subclasses of Tokenizer
 * @see KeywordTokenizer ----- no splitting at all: whatever comes in is indexed as-is
 * @see StandardTokenizer ---- standard tokenization with some smarter rules, e.g. it keeps "yeah.net" in "jadyer@yeah.net" as a single token
 * @see CharTokenizer -------- character-based control; it has two subclasses, WhitespaceTokenizer and LetterTokenizer
 * @see WhitespaceTokenizer -- splits on whitespace, e.g. "thank you,I am jadyer" becomes 4 tokens
 * @see LetterTokenizer ------ splits on non-letter characters such as punctuation, e.g. "thank you,I am jadyer" becomes 5 tokens
 * @see LowerCaseTokenizer --- a subclass of LetterTokenizer that additionally lowercases the tokens
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see Some subclasses of TokenFilter
 * @see StopFilter -------- removes stop words
 * @see LowerCaseFilter --- lowercases tokens
 * @see StandardFilter ---- applies some post-processing to the standard token stream
 * @see PorterStemFilter -- stems tokens, e.g. it reduces "coming" to "come" and "countries" to "country"
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see E.g. "how are you thank you" is tokenized into 5 tokens: "how", "are", "you", "thank", "you"
 * @see What must be saved so that the original data can be restored correctly later? Essentially three things, listed below
 * @see CharTermAttribute (called TermAttribute before Lucene 3.5), OffsetAttribute, PositionIncrementAttribute
 * @see 1) CharTermAttribute ----------- stores the token text, here "how", "are", "you", "thank", "you"
 * @see 2) OffsetAttribute ------------- stores the character offsets of each token, e.g. "how" starts at 0 and ends at 3, "are" at 4 and 7, and "thank" starts at 12
 * @see 3) PositionIncrementAttribute -- stores the position increment between consecutive tokens, e.g. it is 1 between "how" and "are", between "are" and "you", and between "you" and "thank"
 * @see But if "are" is removed as a stop word (the effect of StopFilter), the position increment between "how" and "you" becomes 2
 * @see When looking up a term, Lucene walks positions via these increments; so what happens if two tokens end up at the same position?
 * @see Suppose a token "this" has the same position as "how": a search for "this" would then also match "how are you thank you"
 * @see This is an effective way to implement synonym search; the currently popular WordNet can be plugged in this way for synonym lookup
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see Chinese word breakers
 * @see The analyzers Lucene provides by default are completely unsuitable for Chinese
 * @see 1) paoding - the "paoding jie niu" word breaker, official site http://code.google.com/p/paoding (apparently now hosted at http://git.oschina.net/zhzhenqin/paoding-analysis)
 * @see 2) mmseg4j - said to use Sogou's lexicon, official site https://code.google.com/p/mmseg4j (see also https://code.google.com/p/jcseg)
 * @see 3) IK ------ https://code.google.com/p/ik-analyzer/
 * @see ---------------------------------------------------------------------------------------------------------------
 * @see Using mmseg4j
 * @see 1) Download mmseg4j-1.8.5.zip and put mmseg4j-all-1.8.5-with-dic.jar on the classpath
 * @see 2) Write new MMSegAnalyzer() wherever an analyzer needs to be specified
 * @see Note 1) mmseg4j-all-1.8.5-with-dic.jar bundles its own dictionary, so new MMSegAnalyzer() works directly
 * @see Note 2) If you use mmseg4j-all-1.8.5.jar instead, you must point at a dictionary directory, e.g. new MMSegAnalyzer("D:\\Develop\\mmseg4j-1.8.5\\data")
 * @see         To still use the no-arg new MMSegAnalyzer(), copy the data directory shipped in mmseg4j-1.8.5.zip onto the classpath
 * @see Summary: just use mmseg4j-all-1.8.5-with-dic.jar
 * @see ---------------------------------------------------------------------------------------------------------------
 * @create Aug 2, 2013 5:30:45 PM
 * @author Xuan Yu
 */
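The class body that should follow this Javadoc is not included in the page. As a minimal sketch of the kind of token-printing method this series uses (the class and method names `AnalyzerDemo`/`display` are my own choices, assuming Lucene 3.6 on the classpath), the four analyzers can be compared like this:

```java
package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    /** Prints every token the given analyzer produces for the given text. */
    public static void display(String text, Analyzer analyzer) throws IOException {
        TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + term + "]");
        }
        System.out.println("  <-- " + analyzer.getClass().getSimpleName());
    }

    public static void main(String[] args) throws IOException {
        String text = "how are you thank you";
        // StandardAnalyzer and StopAnalyzer drop "are" (it is in the default English stop set)
        display(text, new StandardAnalyzer(Version.LUCENE_36));
        display(text, new StopAnalyzer(Version.LUCENE_36));
        display(text, new SimpleAnalyzer(Version.LUCENE_36));     // lowercases, splits on non-letters
        display(text, new WhitespaceAnalyzer(Version.LUCENE_36)); // splits on whitespace only
    }
}
```

Running this makes the differences between the four analyzers visible at a glance.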

  
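The three attributes described in the Javadoc can be read directly off the stream. A sketch, again assuming Lucene 3.6 and a class name of my own choosing, using StopAnalyzer so that the increased position increment after a removed stop word is visible:

```java
package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

public class AttributeDemo {
    public static void main(String[] args) throws IOException {
        TokenStream stream = new StopAnalyzer(Version.LUCENE_36)
                .tokenStream("content", new StringReader("how are you thank you"));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
        PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            // "how" arrives with increment 1 and offsets 0-3; because "are" is
            // removed as a stop word, "you" arrives with increment 2
            System.out.println(posIncr.getPositionIncrement() + ": [" + term + "] "
                    + offset.startOffset() + "-" + offset.endOffset());
        }
    }
}
```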
  

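For Chinese text, following steps 1) and 2) of the mmseg4j instructions above, swapping in the analyzer is a one-line change. A sketch (the sample sentence and class name are my own, assuming mmseg4j-all-1.8.5-with-dic.jar and Lucene 3.6 on the classpath):

```java
package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;
import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

public class ChineseAnalyzerDemo {
    public static void main(String[] args) throws IOException {
        // mmseg4j-all-1.8.5-with-dic.jar bundles its dictionary, so no path is needed;
        // with the plain jar you would write new MMSegAnalyzer("D:\\Develop\\mmseg4j-1.8.5\\data")
        Analyzer[] analyzers = {new MMSegAnalyzer(), new ComplexAnalyzer()};
        for (Analyzer analyzer : analyzers) {
            TokenStream stream = analyzer.tokenStream("content", new StringReader("我是中国人"));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.print("[" + term + "]");
            }
            System.out.println();
        }
    }
}
```

ComplexAnalyzer uses mmseg4j's "complex" segmentation rules; comparing its output against the default MMSegAnalyzer shows how the segmentation mode affects the tokens.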