Introduction to Lucene 3.6.2 (5): Custom Stop-Word Analyzer and Synonym Analyzer

Last Update:2017-02-27 Source: Internet

Author: User

First comes HelloCustomAnalyzer.java, which is used to display token information:

package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

/**
 * [Lucene 3.6.2 Introductory Series] Section 05: Custom analyzers
 * @see ------------------------------------------------------------------------------------------
 * @see Lucene 3.5 recommends four main analyzers: SimpleAnalyzer, StopAnalyzer, WhitespaceAnalyzer, StandardAnalyzer
 * @see These four analyzers share an abstract parent class, which has a method public final TokenStream tokenStream(), i.e. the token stream
 * @see Given a text such as "how are you thank you", what actually enters the analyzer is a java.io.Reader
 * @see After the Lucene analyzer has processed it, the whole text is converted into a TokenStream, which holds all the token information
 * @see TokenStream has two subclasses, Tokenizer and TokenFilter
 * @see Tokenizer ----> splits a set of data into separate terms (that is, word by word)
 * @see TokenFilter --> filters the token units
 * @see ------------------------------------------------------------------------------------------
 * @see Tokenization process
 * @see 1) A java.io.Reader hands a data stream to the Tokenizer, which converts the data into individual token units
 * @see 2) Any number of TokenFilters then filter the tokenized data, producing the final TokenStream
 * @see 3) The index is stored from the TokenStream
 * @see ------------------------------------------------------------------------------------------
 * @see Some subclasses of Tokenizer
 * @see KeywordTokenizer -----no tokenization; whatever is passed in is indexed as-is
 * @see StandardTokenizer ----standard tokenization with some smarter handling, such as treating the 'yeah.net' in 'jadyer@yeah.net' as a single token
 * @see CharTokenizer --------works at the character level; it has two subclasses, WhitespaceTokenizer and LetterTokenizer
 * @see WhitespaceTokenizer --splits on whitespace, so 'Thank you,I am jadyer' becomes 4 tokens
 * @see LetterTokenizer ------splits on runs of letters, so punctuation also separates tokens and 'Thank you,I am jadyer' becomes 5 tokens
 * @see LowerCaseTokenizer ---a subclass of LetterTokenizer that converts the data to lowercase while tokenizing
 * @see ------------------------------------------------------------------------------------------
 * @see Some subclasses of TokenFilter
 * @see StopFilter --------removes certain token units (stop words)
 * @see LowerCaseFilter ---converts the data to lowercase
 * @see StandardFilter ----performs some processing on the standard token stream
 * @see PorterStemFilter --reduces words to their stems, such as coming to come and countries to country
 * @see ------------------------------------------------------------------------------------------
 * @see e.g. 'how are you thank you' is tokenized into 'how', 'are', 'you', 'thank', 'you', 5 token units in total
 * @see So what has to be stored so that the data can be restored correctly later?
 * @see Mainly three things, as shown below
 * @see CharTermAttribute (called TermAttribute in earlier Lucene versions), OffsetAttribute, PositionIncrementAttribute
 * @see 1) CharTermAttribute -----------stores the term itself, here 'how', 'are', 'you', 'thank', 'you'
 * @see 2) OffsetAttribute -------------stores the offsets of each term in the text, in order: the start and end offsets of 'how' are 0 and 3, those of 'are' are 4 and 7, and 'thank' starts at 12
 * @see 3) PositionIncrementAttribute --stores the position increment between terms: the increment from 'how' to 'are' is 1, from 'are' to 'you' is also 1, and from 'you' to 'thank' is also 1
 * @see    But if 'are' is a stop word (removed by StopFilter), the position increment between 'how' and 'you' becomes 2
 * @see    When we search for a term, Lucene locates it by its position. What happens if two terms have the same position?
 * @see    Suppose there is another term 'this' at the same position as 'how'. Then searching for 'this'
 * @see    will also find 'how are you thank you'. This is how synonyms can be implemented; the popular WordNet can be used for synonym search
 * @see ------------------------------------------------------------------------------------------
 * @create Aug 4, 2013 5:48:25 PM
 * @author Xuan Yu
 */
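
The three stored attributes and the effect of stop words and synonyms on the position increment can be sketched in plain Java. This is only an illustration and does not use the Lucene API: the class TokenPipelineSketch, its Token type, and the methods tokenize, removeStopWords, addSynonym and positions are hypothetical names made up for this example, and the whitespace splitting is a simplification of what a real Tokenizer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java sketch of a tokenizer -> filter pipeline. Hypothetical code, not Lucene. */
class TokenPipelineSketch {

    /** One token unit carrying the three pieces of information described in the comment above. */
    static final class Token {
        final String term;           // the role CharTermAttribute plays
        final int startOffset;       // the role OffsetAttribute plays (start, inclusive)
        final int endOffset;         // end offset, exclusive
        final int positionIncrement; // the role PositionIncrementAttribute plays

        Token(String term, int startOffset, int endOffset, int positionIncrement) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + " offsets=[" + startOffset + "," + endOffset + "] increment=" + positionIncrement;
        }
    }

    /** Tokenizer step: split on whitespace and record offsets; every increment starts as 1. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) tokens.add(new Token(text.substring(start, i), start, i, 1));
        }
        return tokens;
    }

    /** Filter step: drop stop words and carry their increments over to the next kept token. */
    static List<Token> removeStopWords(List<Token> in, Set<String> stopWords) {
        List<Token> out = new ArrayList<>();
        int skipped = 0;
        for (Token t : in) {
            if (stopWords.contains(t.term)) {
                skipped += t.positionIncrement;
            } else {
                out.add(new Token(t.term, t.startOffset, t.endOffset, t.positionIncrement + skipped));
                skipped = 0;
            }
        }
        return out;
    }

    /** Filter step: after each occurrence of term, emit synonym at the same position (increment 0). */
    static List<Token> addSynonym(List<Token> in, String term, String synonym) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            out.add(t);
            if (t.term.equals(term)) {
                out.add(new Token(synonym, t.startOffset, t.endOffset, 0));
            }
        }
        return out;
    }

    /** Absolute positions: the running sum of increments, with the first token at position 0. */
    static List<Integer> positions(List<Token> tokens) {
        List<Integer> result = new ArrayList<>();
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            result.add(position);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("how are you thank you");
        tokens = removeStopWords(tokens, new HashSet<>(Arrays.asList("are")));
        tokens = addSynonym(tokens, "how", "this");
        for (Token t : tokens) System.out.println(t);
        System.out.println("positions=" + positions(tokens));
    }
}
```

In this sketch, dropping 'are' leaves the first 'you' with an increment of 2, and the added 'this' has an increment of 0, so it sits at the same position as 'how'. That shared position is what would let a search for 'this' match text that only contains 'how'.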

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
