When reprinting, please cite the source: http://blog.csdn.net/xiaojimanman/article/details/42916755
When building an index in Lucene, processing the input text is a crucial step, and the core of that step is the topic of this post: the analyzer (word segmenter). The simple demo below covers seven commonly used analyzers: CJKAnalyzer, KeywordAnalyzer, SimpleAnalyzer, StopAnalyzer, WhitespaceAnalyzer, StandardAnalyzer, and IKAnalyzer. Each one can be tried by commenting the corresponding lines in or out. The source program is as follows:
Analyzer Demo
/**
 * @Description: Analyzer (word segmentation) demo
 */
package com.lulei.lucene.study;

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class AnalyzerStudy {
    public static void main(String[] args) throws Exception {
        // Test string to be processed
        String str = "This is a word breaker test program , I hope you will continue to pay attention to my Personal series blog: Based on Lucene case development, here add a little space with the label LUCENE java word breaker ";
        Analyzer analyzer = null;
        // Standard analyzer; on Chinese text it has the same effect as ChineseAnalyzer,
        // which may be one reason ChineseAnalyzer was deprecated in later versions
        analyzer = new StandardAnalyzer(Version.LUCENE_43);
        // Third-party Chinese analyzer; it has the following two constructors
        // (no-argument and boolean)
        analyzer = new IKAnalyzer();
        analyzer = new IKAnalyzer(false);
        analyzer = new IKAnalyzer(true);
        // Whitespace analyzer: splits the string on whitespace only
        analyzer = new WhitespaceAnalyzer(Version.LUCENE_43);
        // Simple analyzer: each run of text between separators becomes one token
        analyzer = new SimpleAnalyzer(Version.LUCENE_43);
        // Bigram analyzer: every character is paired once with its left neighbor and once
        // with its right neighbor, so each appears twice except the first and the last
        analyzer = new CJKAnalyzer(Version.LUCENE_43);
        // Keyword analyzer: treats the whole string as a single token
        analyzer = new KeywordAnalyzer();
        // Stop-word analyzer
        analyzer = new StopAnalyzer(Version.LUCENE_43);

        // Process the test string with the selected analyzer
        StringReader reader = new StringReader(str);
        TokenStream tokenStream = analyzer.tokenStream("", reader);
        tokenStream.reset();
        CharTermAttribute term = tokenStream.getAttribute(CharTermAttribute.class);
        int l = 0;
        // Print the analyzer class and the resulting tokens
        System.out.println(analyzer.getClass());
        while (tokenStream.incrementToken()) {
            System.out.print(term.toString() + "|");
            l += term.toString().length();
            // Start a new line once more than 30 characters have been printed on this one
            if (l > 30) {
                System.out.println();
                l = 0;
            }
        }
        // Finish and release the token stream
        tokenStream.end();
        tokenStream.close();
    }
}
Note: the program above assigns analyzer nine times. To see the output of a particular analyzer, comment out every assignment except the one you want to test.
Analyzer Introduction
The following briefly introduces each of these analyzers and how it behaves in the program above:
StandardAnalyzer
StandardAnalyzer is the standard analyzer. On Chinese text it has the same effect as ChineseAnalyzer, which may be one reason ChineseAnalyzer was deprecated in later versions. StandardAnalyzer works well on English, but it splits Chinese into single characters, with no regard for semantics or part of speech. If no other analyzer is available, it is still usable for Chinese. Running the example above with StandardAnalyzer shows this behavior.
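The per-character splitting of Chinese can be imitated in plain Java, without Lucene. The following is only a rough sketch under simplifying assumptions, not StandardAnalyzer's actual implementation: the class PerCharacterSketch and its tokenize method are hypothetical names, Han characters are emitted one by one, and every other run of letters or digits is lowercased and kept as one token.

```java
import java.util.ArrayList;
import java.util.List;

class PerCharacterSketch {
    // Hypothetical helper: Han characters become single-character tokens,
    // other letter/digit runs become one lowercased token each.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean han = Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
            if (!han && Character.isLetterOrDigit(c)) {
                word.append(Character.toLowerCase(c));
                continue;
            }
            // Any Han character or separator ends the current non-Han word.
            if (word.length() > 0) {
                out.add(word.toString());
                word.setLength(0);
            }
            if (han) {
                out.add(String.valueOf(c));
            }
        }
        if (word.length() > 0) {
            out.add(word.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Each Chinese character ends up as its own token.
        System.out.println(tokenize("Lucene 中文分词"));
    }
}
```

The sketch makes the limitation visible: the Chinese part of the input carries no word boundaries at all in the output, only individual characters.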
IKAnalyzer
IKAnalyzer is a third-party Chinese word segmentation library built on Lucene; it segments text using an existing Chinese dictionary. There are two ways to construct the analyzer object, and new IKAnalyzer() is equivalent to new IKAnalyzer(false). Running the example above first with false and then with true shows the difference between the two settings: with false the segmentation is finer-grained, so when a word contains shorter words, those are also emitted as tokens. IKAnalyzer is a widely used Chinese analyzer, but its quality depends heavily on the dictionary, so getting good results requires continually updating your own dictionary.
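To make the dictionary dependence and the difference in granularity concrete, here is a small Lucene-free sketch. It is not IKAnalyzer's actual algorithm: the class DictionarySegmentSketch, both method names, the toy dictionary, and the choice of greedy forward maximum matching for the coarse mode are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DictionarySegmentSketch {
    // Fine-grained: emit every dictionary word found at every start position,
    // so shorter words contained in longer ones are emitted too.
    static List<String> fineGrained(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int len = 1; len <= maxLen && i + len <= text.length(); len++) {
                String candidate = text.substring(i, i + len);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    // Coarse: greedy forward maximum matching; a character that starts no
    // dictionary word becomes a single-character token.
    static List<String> coarse(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int matched = 1;
            for (int len = Math.min(maxLen, text.length() - i); len > 1; len--) {
                if (dict.contains(text.substring(i, i + len))) {
                    matched = len;
                    break;
                }
            }
            out.add(text.substring(i, i + matched));
            i += matched;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical toy dictionary; a real one holds far more entries.
        Set<String> dict = new HashSet<>(Arrays.asList("分词", "分词器", "测试", "程序", "测试程序"));
        System.out.println(fineGrained("分词器测试程序", dict, 4)); // [分词, 分词器, 测试, 测试程序, 程序]
        System.out.println(coarse("分词器测试程序", dict, 4));      // [分词器, 测试程序]
    }
}
```

Removing an entry from the toy dictionary immediately changes both results, which is the same sensitivity that makes dictionary maintenance so important for a dictionary-based analyzer.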
WhitespaceAnalyzer
WhitespaceAnalyzer is the whitespace analyzer. It simply splits the string on whitespace and does nothing else to the resulting substrings, so its output is similar to that of String.split(" "). Running the example above with WhitespaceAnalyzer shows this.
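The comparison with splitting on spaces can be written out in plain Java. This is a minimal sketch of the idea rather than Lucene code; the class name WhitespaceSplitSketch and the method tokenize are hypothetical, and the sketch splits on any run of whitespace instead of a single space.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class WhitespaceSplitSketch {
    // Split on runs of whitespace; case and punctuation are left untouched.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // prints [Lucene-based, case, development:, LUCENE, java]
        System.out.println(tokenize("Lucene-based case development:  LUCENE java "));
    }
}
```

Note that "development:" keeps its colon and "LUCENE" keeps its capitals: nothing but the whitespace is removed.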
You may think this analyzer is of little use: it barely processes the input string, and it does a poor job on ordinary sentences. If that is your view, consider the following question. This post is tagged Lucene, Java, and word segmentation; how should these three tags be stored in the index, and which analyzer should be used for them? No answer is given here, so think it over yourself. A later post in this series, on the novel case study, will propose a concrete solution for the tag field.
SimpleAnalyzer
SimpleAnalyzer is the simple analyzer. Rather than segmenting a passage into words, it is closer to treating each clause as one token: whenever it meets punctuation, whitespace, or any other non-letter character, the text before that character becomes a token. Tokens are also lowercased. Running the example above with SimpleAnalyzer shows this.
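The rule "cut at every non-letter character" is easy to sketch without Lucene. The code below is an illustration only, with the hypothetical names LetterSplitSketch and tokenize; it works on Java char values and lowercases each letter, which is a simplification of what a real analyzer does.

```java
import java.util.ArrayList;
import java.util.List;

class LetterSplitSketch {
    // Every maximal run of letters becomes one lowercased token;
    // digits, punctuation, and whitespace all act as separators.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [lucene, java, api]
        System.out.println(tokenize("Lucene 4.3, Java-API!"));
        // Chinese characters count as letters, so a whole clause stays together:
        // prints [这是测试, 请关注]
        System.out.println(tokenize("这是测试，请关注。"));
    }
}
```

The second call shows why the analyzer behaves as "one clause, one token" on Chinese: without spaces between words, only punctuation ends a token.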
CJKAnalyzer
CJKAnalyzer is a bigram analyzer. It segments text by pairing adjacent characters: each character is combined once with its left neighbor and once with its right neighbor, so every character appears twice, except the first and the last. In other words, any two adjacent Chinese characters are treated as one token. This approach produces many useless tokens. Running the example above with CJKAnalyzer shows this.
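The bigram rule can be sketched in a few lines of plain Java. This is only an illustration of overlapping two-character windows, not CJKAnalyzer's implementation; BigramSketch and bigrams are hypothetical names, and emitting a lone character as its own token is a choice made for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

class BigramSketch {
    // Emit every pair of adjacent characters in a run of text.
    static List<String> bigrams(String run) {
        List<String> out = new ArrayList<>();
        if (run.length() == 1) {
            // Sketch choice: a single character has no neighbor, keep it as is.
            out.add(run);
            return out;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            out.add(run.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [中文, 文分, 分词, 词器]
        System.out.println(bigrams("中文分词器"));
    }
}
```

In the output, 文分 is not a real word, which illustrates the useless tokens mentioned above: a run of n characters yields n - 1 tokens whether or not they mean anything.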
KeywordAnalyzer
KeywordAnalyzer is the keyword analyzer: it treats the whole input string as a single token. It may have been more useful in earlier Lucene versions, but recent versions have subdivided the field types, so its role in indexing is now limited. In Luke, however, it is still quite important. Running the example above with KeywordAnalyzer shows this.
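StopAnalyzer
StopAnalyzer is the stop-word analyzer. Stop words are strings that are discarded from the segmentation result. The text is split at punctuation, whitespace, and other non-letter characters, which are dropped, and any token that appears in the stop-word list (by default a set of common English words) is removed as well. Running the example above with StopAnalyzer shows this.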
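Stop-word removal is a filtering step applied after the text has been split into tokens. The sketch below shows that step in plain Java; it is not Lucene's code, the names StopFilterSketch, STOP_WORDS, and tokenize are hypothetical, and the stop list is a small sample chosen for illustration rather than the full default list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopFilterSketch {
    // Hypothetical sample stop list; a real list is longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "the", "this", "to"));

    // Lowercase, split at non-letter characters, then drop stop words.
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [lucene, analyzer, test, point]
        System.out.println(tokenize("This is a Lucene analyzer test, to the point!"));
    }
}
```

Because the sample list contains only English words, Chinese text passes through this filter unchanged apart from the splitting at punctuation.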
All seven analyzers above can process Chinese text. For other (non-English) languages, Lucene provides analyzers such as the following:
BrazilianAnalyzer: Brazilian Portuguese analyzer
CzechAnalyzer: Czech analyzer
DutchAnalyzer: Dutch analyzer
FrenchAnalyzer: French analyzer
GermanAnalyzer: German analyzer
GreekAnalyzer: Greek analyzer
RussianAnalyzer: Russian analyzer
ThaiAnalyzer: Thai analyzer
Lucene-based case development: an introduction to analyzers