[Lucene 3.6.2 Getting Started Series] Section 04: Chinese Analyzers

Source: Internet
Author: User
package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;
import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

/**
 * [Lucene 3.6.2 Getting Started Series] Section 04: Chinese Analyzers
 * @see Lucene 3.5 provides four main analyzers: SimpleAnalyzer, StopAnalyzer, WhitespaceAnalyzer, StandardAnalyzer
 * @see These four analyzers share a common abstract parent class, which has a public final TokenStream tokenStream() method, i.e. the stream of tokens
 * @see Suppose we have the text "how are you thank you"; it is actually passed to the analyzer as a java.io.Reader
 * @see After Lucene has analyzed it, the whole result is a TokenStream, which holds all the token information
 * @see TokenStream has two subclasses, Tokenizer and TokenFilter
 * @see Tokenizer ----> splits a block of data into independent tokens (that is, one word at a time)
 * @see TokenFilter --> filters the tokens
 * @see -------------------------------------------------------------------------------------------
 * @see Analysis process
 * @see 1) The data stream in a java.io.Reader is handed to a Tokenizer, which converts the data into tokens
 * @see 2) A chain of TokenFilters then filters the tokenized data and produces the TokenStream
 * @see 3) The TokenStream is used to store the index
 * @see -------------------------------------------------------------------------------------------
 * @see Some Tokenizer subclasses
 * @see KeywordTokenizer ----- no segmentation; whatever is passed in is indexed as a single token
 * @see StandardTokenizer ---- standard segmentation with some smarter rules, e.g. 'yeah.net' is kept as one token
 * @see CharTokenizer -------- character-based control; it has two subclasses, WhitespaceTokenizer and LetterTokenizer
 * @see WhitespaceTokenizer -- splits on whitespace, e.g. 'Thank you,I am jadyer' is split into 4 tokens
 * @see LetterTokenizer ------ splits on letters, so punctuation also separates tokens, e.g. 'Thank you,I am jadyer' is split into 5 tokens
 * @see LowerCaseTokenizer --- a subclass of LetterTokenizer; it converts the data to lowercase and tokenizes it
 * @see -------------------------------------------------------------------------------------------
 * @see Some TokenFilter subclasses
 * @see StopFilter -------- removes (stops) certain tokens
 * @see LowerCaseFilter --- converts the data to lowercase
 * @see StandardFilter ---- applies some control to the standard token stream
 * @see PorterStemFilter -- reduces words to their stem, e.g. 'coming' to 'come' and 'countries' to 'country'
 * @see -------------------------------------------------------------------------------------------
 * @see E.g. 'how are you thank you' is split into 'how', 'are', 'you', 'thank', 'you', 5 tokens in total
 * @see So what has to be stored so that the data can be restored correctly later?
 * @see It mainly stores three things, as follows
 * @see CharTermAttribute (called TermAttribute in earlier Lucene versions), OffsetAttribute, PositionIncrementAttribute
 * @see 1) CharTermAttribute ----------- stores the token text itself; here that is 'how', 'are', 'you', 'thank', 'you'
 * @see 2) OffsetAttribute ------------- stores the offsets of each token (generally in order), e.g. the start and end offsets of 'how' are 0 and 3,
 * @see                                  those of 'are' are 4 and 7, and those of 'thank' are 12 and 17
 * @see 3) PositionIncrementAttribute -- stores the position increment between one token and the next, e.g. the increment between 'how' and 'are' is 1,
 * @see                                  between 'are' and 'you' it is also 1, and between 'you' and 'thank' it is also 1
 * @see                                  But if 'are' is a stop word (removed by StopFilter), the position increment between 'how' and 'you' becomes 2
 * @see When we search for a term, Lucene first locates it through its position increment. But what happens if two tokens sit at the same position?
 * @see Suppose there is another token 'this' that occupies the same position as 'how'. Then a search for 'this'
 * @see will also find "how are you thank you". This is an effective way to implement synonyms. WordNet is a popular resource for looking up synonyms
 * @see -------------------------------------------------------------------------------------------
 * @see Chinese analyzers
 * @see Most of the analyzers that Lucene provides by default are not suitable for Chinese text
 * @see 1) Paoding -- the Paoding analyzer, official site http://code.google.com/p/paoding (it now appears to be hosted at http://git.oschina.net/zhzhenqin/paoding-analysis)
 * @see 2) mmseg4j -- said to use the Sogou dictionary, official site https://code.google.com/p/mmseg4j (there is also https://code.google.com/p/jcseg)
 * @see 3) IK ------- https://code.google.com/p/ik-analyzer/
 * @see -------------------------------------------------------------------------------------------
 * @see Using mmseg4j
 * @see 1) Download mmseg4j-1.8.5.zip and add mmseg4j-all-1.8.5-with-dic.jar to the project
 * @see 2) Write new MMSegAnalyzer() wherever an analyzer has to be specified
 * @see Note 1) mmseg4j-all-1.8.5-with-dic.jar has the dictionary built in, so a plain new MMSegAnalyzer() is enough
 * @see Note 2) If mmseg4j-all-1.8.5.jar is used instead, the dictionary directory must be specified, e.g. new MMSegAnalyzer("D:\develop\mmseg4j-1.8.5\data")
 * @see         If you still want to use the no-argument new MMSegAnalyzer(), copy the data directory shipped in mmseg4j-1.8.5.zip onto the classpath
 * @see Summary: simply use mmseg4j-all-1.8.5-with-dic.jar
 * @see -------------------------------------------------------------------------------------------
 * @create Aug 2, 2013 5:30:45 PM
 * @author Xuan Yu
 */
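The three stored attributes described in the comment above can be illustrated without Lucene on the classpath. The following is a minimal sketch in plain Java, not Lucene code: `TokenAttributeSketch`, its nested `Token` class, and the `analyze` method are hypothetical names invented for this illustration. It assumes whitespace tokenization, lowercasing, and a caller-supplied stop-word set, and records for each surviving token the text (the role of CharTermAttribute), the start and end offsets (OffsetAttribute), and the position increment (PositionIncrementAttribute).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; it mimics the attribute bookkeeping only
// and does not use or reproduce any Lucene API.
final class TokenAttributeSketch {

    /** One token plus the three pieces of information discussed above. */
    static final class Token {
        final String term;            // plays the role of CharTermAttribute
        final int startOffset;        // OffsetAttribute: index of the first character
        final int endOffset;          // OffsetAttribute: index just past the last character
        final int positionIncrement;  // PositionIncrementAttribute

        Token(String term, int startOffset, int endOffset, int positionIncrement) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + " [" + startOffset + "," + endOffset + ") +" + positionIncrement;
        }
    }

    /** Splits on whitespace, lowercases each word, and drops stop words. */
    static List<Token> analyze(String text, Set<String> stopWords) {
        List<Token> tokens = new ArrayList<Token>();
        int skipped = 0; // stop words dropped since the last emitted token
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip the whitespace between words.
            while (i < n && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= n) {
                break;
            }
            int start = i;
            // Consume one word.
            while (i < n && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            if (stopWords.contains(term)) {
                skipped++; // the gap widens the next token's increment
                continue;
            }
            tokens.add(new Token(term, start, i, skipped + 1));
            skipped = 0;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<String>(Arrays.asList("are"));
        for (Token token : analyze("how are you thank you", stopWords)) {
            System.out.println(token);
        }
    }
}
```

With 'are' treated as a stop word, the sketch emits 'how' at offsets 0 and 3 with increment 1, then 'you' at offsets 8 and 11 with increment 2, matching the example in the comment. In real Lucene code the same values would be read from the attributes of a TokenStream rather than computed by hand.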
