[Lucene 3.6.2 Introductory Series] Section 04: Chinese Analyzers

Last Update:2018-12-03 Source: Internet

Author: User

package com.jadyer.lucene;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;
import com.chenlb.mmseg4j.analysis.MMSegAnalyzer;

/**
 * [Lucene 3.6.2 Introductory Series] Section 04: Chinese Analyzers
 * @see The four built-in analyzers as of Lucene 3.5: SimpleAnalyzer, StopAnalyzer, WhitespaceAnalyzer, StandardAnalyzer
 * @see All four share a common abstract parent class, which has a public final TokenStream tokenStream() method, i.e. the token stream
 * @see Given a text such as "how are you thank you", Lucene actually receives it as a java.io.Reader
 * @see After analysis, the whole text is converted into a TokenStream, which holds all of the token information
 * @see TokenStream has two implementation classes: Tokenizer and TokenFilter
 * @see Tokenizer   ----> splits a body of data into independent lexical units (individual words)
 * @see TokenFilter ----> filters the lexical units
 * @see ------------------------------------------------------------------------------------------
 * @see The analysis process
 * @see 1) The data stream in a java.io.Reader is handed to a Tokenizer, which converts it into lexical units
 * @see 2) A chain of TokenFilters filters the tokenized data and produces the TokenStream
 * @see 3) The TokenStream is used to store the index
 * @see ------------------------------------------------------------------------------------------
 * @see Tokenizer subclasses
 * @see KeywordTokenizer    : no segmentation; whatever is passed in is indexed as is
 * @see StandardTokenizer   : standard segmentation with some intelligent handling, such as treating 'yeah.net' as a single token
 * @see CharTokenizer       : character-level control; it has two subclasses, WhitespaceTokenizer and LetterTokenizer
 * @see WhitespaceTokenizer : splits on whitespace, e.g. 'Thank you,I am jadyer' yields four tokens ('Thank', 'you,I', 'am', 'jadyer')
 * @see LetterTokenizer     : splits on non-letter characters such as punctuation, e.g. 'Thank you,I am jadyer' yields five tokens
 * @see LowerCaseTokenizer  : a subclass of LetterTokenizer that also converts the data to lowercase while tokenizing
 * @see ------------------------------------------------------------------------------------------
 * @see TokenFilter subclasses
 * @see StopFilter       : removes certain lexical units (stop words)
 * @see LowerCaseFilter  : converts the data to lowercase
 * @see StandardFilter   : applies some control to the output stream of the standard tokenizer
 * @see PorterStemFilter : reduces words to their stem, e.g. 'coming' to 'come', and 'countries' to the same stem as 'country'
 * @see ------------------------------------------------------------------------------------------
 * @see e.g. 'how are you thank you' is split into 'how', 'are', 'you', 'thank', 'you', five lexical units in total
 * @see What has to be stored so that the data can be restored correctly later?
 * @see In fact, three things are stored, as listed below
 * @see CharTermAttribute (called TermAttribute before Lucene 3.5), OffsetAttribute, PositionIncrementAttribute
 * @see 1) CharTermAttribute          : stores the token text, here 'how', 'are', 'you', 'thank', 'you'
 * @see 2) OffsetAttribute            : stores the character offsets of each token (normally in order), e.g. the start and end offsets
 * @see                                 of 'how' are 0 and 3, those of 'are' are 4 and 7, and those of 'thank' are 12 and 17
 * @see 3) PositionIncrementAttribute : stores the position increment between one token and the next, e.g. the increment between
 * @see                                 'how' and 'are' is 1, between 'are' and 'you' is 1, and between 'you' and 'thank' is also 1
 * @see                                 But if 'are' is a stop word (removed by StopFilter), the increment between 'how' and 'you' becomes 2
 * @see When a term is searched for, Lucene locates it through its position increment. What happens if two tokens have the same position?
 * @see Suppose another word, 'this', has the same position as 'how'. Then a search for 'this'
 * @see will also find 'how are you thank you', which gives an effective way to implement synonyms
 * @see WordNet is currently a popular resource for looking up synonyms
 * @see ------------------------------------------------------------------------------------------
 * @see Chinese analyzers
 * @see Many of the analyzers that Lucene provides by default are not suitable for Chinese text
 * @see 1) Paoding : the Paoding analyzer, official site http://code.google.com/p/paoding (apparently now hosted at http://git.oschina.net/zhzhenqin/paoding-analysis)
 * @see 2) mmseg4j : said to use the Sogou dictionary, official site https://code.google.com/p/mmseg4j (there is also https://code.google.com/p/jcseg)
 * @see 3) IK      : https://code.google.com/p/ik-analyzer/
 * @see ------------------------------------------------------------------------------------------
 * @see Using mmseg4j
 * @see 1) Download mmseg4j-1.8.5.zip and add mmseg4j-all-1.8.5-with-dic.jar to the project
 * @see 2) Wherever an analyzer has to be specified, write new MMSegAnalyzer()
 * @see Note 1) mmseg4j-all-1.8.5-with-dic.jar has a built-in dictionary, so new MMSegAnalyzer() works directly
 * @see Note 2) If mmseg4j-all-1.8.5.jar is used instead, the dictionary directory must be specified, e.g. new MMSegAnalyzer("D:\develop\mmseg4j-1.8.5\data")
 * @see         If you still want to call the no-argument new MMSegAnalyzer() in that case, copy the data directory bundled with mmseg4j-1.8.5.zip onto the classpath
 * @see Summary: simply use mmseg4j-all-1.8.5-with-dic.jar
 * @see ------------------------------------------------------------------------------------------
 * @create Aug 2, 2013 5:30:45 PM
 * @author Xuan Yu
 */
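
The three stored attributes described in the notes above (token text, offsets, position increment) can be illustrated without Lucene on the classpath. What follows is a minimal plain-Java sketch, not Lucene's implementation: `TokenSketch`, `Token`, `tokenize` and `removeStopWords` are hypothetical names invented for this illustration, and the whitespace split and stop-word removal only approximate what a Tokenizer and a StopFilter do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; these are not Lucene classes.
public class TokenSketch {

    /** One lexical unit: its text, character offsets, and position increment. */
    static final class Token {
        final String term;
        final int startOffset;
        final int endOffset;          // exclusive, so "how" spans 0 and 3
        final int positionIncrement;  // distance from the previously kept token

        Token(String term, int startOffset, int endOffset, int positionIncrement) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + " offsets=" + startOffset + "," + endOffset
                    + " increment=" + positionIncrement;
        }
    }

    /** Tokenizer step: split on whitespace and record offsets; each increment is 1. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<Token>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(new Token(text.substring(start, i), start, i, 1));
            }
        }
        return tokens;
    }

    /** Filter step: drop stop words and carry the gap over to the next kept token. */
    static List<Token> removeStopWords(List<Token> input, Set<String> stopWords) {
        List<Token> output = new ArrayList<Token>();
        int skipped = 0;
        for (Token t : input) {
            if (stopWords.contains(t.term)) {
                skipped += t.positionIncrement;
            } else {
                output.add(new Token(t.term, t.startOffset, t.endOffset,
                        t.positionIncrement + skipped));
                skipped = 0;
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("how are you thank you");
        Set<String> stopWords = new HashSet<String>(Arrays.asList("are"));
        for (Token t : removeStopWords(tokens, stopWords)) {
            System.out.println(t);
        }
    }
}
```

With 'are' treated as a stop word, the first 'you' keeps its offsets 8 and 11 but carries a position increment of 2, which matches the StopFilter case described in the notes.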

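The synonym idea from the notes (a second word such as 'this' sharing the position of 'how') can be sketched the same way. This is an assumption-laden illustration rather than Lucene code: `SynonymPositionSketch` and `index` are hypothetical names, and the map of term to positions is only a simplified stand-in for a positional index. A token emitted with a position increment of 0 lands on the same position as the token before it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; a simplified stand-in for a positional index.
public class SynonymPositionSketch {

    /** Maps each term to the absolute positions at which it occurs in one text. */
    static Map<String, List<Integer>> index(String[] terms, int[] positionIncrements) {
        Map<String, List<Integer>> postings = new HashMap<String, List<Integer>>();
        int position = -1; // the first increment of 1 puts the first token at position 0
        for (int i = 0; i < terms.length; i++) {
            position += positionIncrements[i];
            List<Integer> positions = postings.get(terms[i]);
            if (positions == null) {
                positions = new ArrayList<Integer>();
                postings.put(terms[i], positions);
            }
            positions.add(position);
        }
        return postings;
    }

    public static void main(String[] args) {
        // "this" follows "how" with increment 0, so both occupy position 0.
        String[] terms = {"how", "this", "are", "you", "thank", "you"};
        int[] increments = {1, 0, 1, 1, 1, 1};
        Map<String, List<Integer>> postings = index(terms, increments);
        System.out.println("how  -> " + postings.get("how"));
        System.out.println("this -> " + postings.get("this"));
        System.out.println("you  -> " + postings.get("you"));
    }
}
```

Because 'how' and 'this' resolve to the same position, a lookup of either term reaches the same place in the text, which is why a search for 'this' would also match 'how are you thank you'.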