First is HelloCustomAnalyzer.java, which is used to display tokenization information:
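Before the listing itself, a warm-up that is not part of HelloCustomAnalyzer.java: the Javadoc in the listing below says that each token mainly carries three things, namely its text, its offsets, and its position increment. The following is a minimal plain-Java sketch of that bookkeeping, assuming simple whitespace splitting and a single stop word. The names `TokenSketch`, `Token`, and `tokenize` are hypothetical and are not Lucene APIs; Lucene's own tokenizers and filters are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java illustration only; none of these names are Lucene APIs. */
final class TokenSketch {

    /** One token with the three pieces of information described in the Javadoc. */
    static final class Token {
        final String term;           // the role CharTermAttribute plays
        final int startOffset;       // the role OffsetAttribute plays (start, inclusive)
        final int endOffset;         // the role OffsetAttribute plays (end, exclusive)
        final int positionIncrement; // the role PositionIncrementAttribute plays

        Token(String term, int startOffset, int endOffset, int positionIncrement) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + " [" + startOffset + "," + endOffset + ") +" + positionIncrement;
        }
    }

    /** Splits on whitespace, drops stop words, and records offsets and increments. */
    static List<Token> tokenize(String text, Set<String> stopWords) {
        List<Token> tokens = new ArrayList<Token>();
        int skipped = 0; // stop words dropped since the last emitted token
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace between words.
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i);
            if (stopWords.contains(term)) {
                skipped++; // leave a gap so the next token's increment grows
            } else {
                tokens.add(new Token(term, start, i, skipped + 1));
                skipped = 0;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("are"));
        for (Token t : tokenize("how are you thank you", stopWords)) {
            System.out.println(t);
        }
    }
}
```

With 'are' treated as a stop word, the 'you' that follows it is emitted with a position increment of 2, which is the case the Javadoc describes. The full HelloCustomAnalyzer.java listing follows.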
package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/**
 * "Lucene 3.6.2 Introductory Series", Part 05: Custom Analyzer
 * @see ----------------------------------------------------------------------------------------------------
 * @see Lucene 3.5 recommends four main analyzers: SimpleAnalyzer, StopAnalyzer, WhitespaceAnalyzer, StandardAnalyzer
 * @see These four analyzers share a common abstract parent class, which has the method public final TokenStream tokenStream(), that is, the token stream
 * @see Given a text such as "how are you thank you", it actually enters the analyzer as a java.io.Reader
 * @see After the analyzer has processed it, the whole text is converted into a TokenStream, which holds all of the token information
 * @see TokenStream has two subclasses: Tokenizer and TokenFilter
 * @see Tokenizer-----> splits a stream of data into separate token units (individual words)
 * @see TokenFilter---> filters the token units
 * @see ----------------------------------------------------------------------------------------------------
 * @see Tokenization process
 * @see 1) A java.io.Reader hands the data stream to a Tokenizer, which converts the data into individual token units
 * @see 2) Any number of TokenFilters then filter the tokenized data, producing the final TokenStream
 * @see 3) The index is stored from that TokenStream
 * @see ----------------------------------------------------------------------------------------------------
 * @see Some subclasses of Tokenizer
 * @see KeywordTokenizer-------no tokenization; whatever is passed in is indexed as is
 * @see StandardTokenizer------standard tokenization with some smarter handling, e.g. it treats the 'yeah.net' in 'jadyer@yeah.net' as a single token
 * @see CharTokenizer----------works character by character; it has two subclasses, WhitespaceTokenizer and LetterTokenizer
 * @see WhitespaceTokenizer----splits on whitespace, e.g. 'Thank you,I am jadyer' is split into 4 tokens
 * @see LetterTokenizer--------splits on letters, so it also breaks at punctuation, e.g. 'Thank you,I am jadyer' is split into 5 tokens
 * @see LowerCaseTokenizer-----a subclass of LetterTokenizer that lowercases the data and tokenizes it
 * @see ----------------------------------------------------------------------------------------------------
 * @see Some subclasses of TokenFilter
 * @see StopFilter--------removes certain token units (stop words)
 * @see LowerCaseFilter---converts the data to lowercase
 * @see StandardFilter----applies some control to the standard output stream
 * @see PorterStemFilter--reduces words to their stems, e.g. 'coming' to 'come' and 'countries' to 'country'
 * @see ----------------------------------------------------------------------------------------------------
 * @see e.g. 'how are you thank you' is tokenized into 'how', 'are', 'you', 'thank', 'you', 5 token units in total
 * @see What, then, has to be saved so that the data can be restored correctly later? Mainly the three things below
 * @see CharTermAttribute (called TermAttribute before Lucene 3.5), OffsetAttribute, PositionIncrementAttribute
 * @see 1) CharTermAttribute-----------saves the term text, here 'how', 'are', 'you', 'thank', 'you'
 * @see 2) OffsetAttribute-------------saves each term's offsets in the text, e.g. 'how' has start and end offsets 0 and 3, 'are' has 4 and 7, and 'thank' starts at 12
 * @see 3) PositionIncrementAttribute--saves the position increment between terms, e.g. the increment between 'how' and 'are' is 1, between 'are' and 'you' is 1, and between 'you' and 'thank' is 1
 * @see    But suppose 'are' is a stop word (removed by StopFilter); the position increment between 'how' and 'you' then becomes 2
 * @see When we search for a term, Lucene first fetches it by its position increment, so what happens if two terms share the same position?
 * @see Suppose there is also a term 'this' that occupies the same position as 'how'; then searching for 'this' in the UI
 * @see will also find 'how are you thank you'. This is an effective way to implement synonyms; WordNet, which is currently very popular, can be used for synonym search
 * @see ----------------------------------------------------------------------------------------------------
 * @create Aug 4, 2013 5:48:25 PM
 * @author Xuan Yu