By the time I am writing this post I have read as far as chapter 6, which covers analyzers. When I was writing code earlier, analyzers caught my interest.
//===========================================================================================//
The following four analyzers are designed for English text, not for Chinese
//===========================================================================================//
1. WhitespaceAnalyzer
Splits the text on whitespace only. It does not lowercase the tokens and does not support Chinese;
it performs no other normalization on the tokens it produces.
2. SimpleAnalyzer
More capable than WhitespaceAnalyzer. It first splits the text at non-letter characters and then converts every token to lowercase. Because digits are not letters, this analyzer drops numeric characters.
3. StopAnalyzer
Goes one step beyond SimpleAnalyzer: on top of what SimpleAnalyzer does, it removes common English words (stop words such as "the" and "a"). You can also supply your own stop-word set as needed.
4. StandardAnalyzer
Its handling of English is comparable to StopAnalyzer's. Its support for Chinese works by single-character segmentation, so each Chinese character becomes its own token. It converts tokens to lowercase and removes stop words and punctuation.
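
To make the differences between the first three analyzers concrete, here is a minimal sketch in plain Java. It does not use Lucene and is not Lucene's implementation: the class name AnalyzerSketch, its method names, and the three-word stop list are made up for illustration, and it only approximates the behavior described above (whitespace splitting, then letter-only lowercase tokens, then stop-word removal).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; these are not Lucene classes.
public class AnalyzerSketch {

    // WhitespaceAnalyzer-like: split on whitespace, keep case and punctuation.
    public static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // SimpleAnalyzer-like: tokens are maximal runs of letters, lowercased.
    // Digits and punctuation act as separators and are dropped.
    public static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // StopAnalyzer-like: the simple() tokens minus a caller-supplied stop-word set.
    public static List<String> stop(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String token : simple(text)) {
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The Quick brown fox has 2 friends";
        // Assumed stop list for the example; Lucene ships its own default set.
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "a", "an"));
        System.out.println(whitespace(text)); // [The, Quick, brown, fox, has, 2, friends]
        System.out.println(simple(text));     // [the, quick, brown, fox, has, friends]
        System.out.println(stop(text, stopWords)); // [quick, brown, fox, has, friends]
    }
}
```

Each step keeps less of the input: the whitespace version preserves case and the digit, the simple version lowercases and loses the "2", and the stop version additionally loses "the".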
//=============================================================================================//
The following two analyzers are available for Chinese
//==============================================================================================//
5. CJKAnalyzer
The Chinese-Japanese-Korean analyzer. It can tokenize Chinese, Japanese, and Korean text, but the results are mediocre, so it is generally not used.
6. SmartChineseAnalyzer
Somewhat better support for Chinese, but poor extensibility: extending the dictionary, maintaining a stop-word list, and similar customizations are hard to do.
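
For Chinese text, the practical difference between StandardAnalyzer's single-character segmentation and CJKAnalyzer comes down to token size: as far as I know, the CJKAnalyzer of this Lucene generation emits overlapping two-character tokens (bigrams). The sketch below shows both schemes in plain Java. It is an illustration under assumptions, not Lucene code: the class CjkSketch and its methods are hypothetical, and the input is assumed to be a run of Chinese characters with no spaces, punctuation, or Latin text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these are not Lucene classes.
public class CjkSketch {

    // Single-character segmentation: every character is its own token.
    public static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Overlapping bigrams: every pair of adjacent characters is a token.
    // A lone character is kept as a single-character token.
    public static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text);
            return tokens;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four Chinese characters meaning "I love China", written as
        // Unicode escapes so the source compiles under any file encoding.
        String text = "\u6211\u7231\u4e2d\u56fd";
        System.out.println(unigrams(text)); // 4 tokens, one per character
        System.out.println(bigrams(text));  // 3 tokens, each 2 characters long
    }
}
```

For the input 我爱中国 the unigram scheme yields 我, 爱, 中, 国 and the bigram scheme yields 我爱, 爱中, 中国. Neither scheme knows where real words begin and end, which is the gap a dictionary-based analyzer such as SmartChineseAnalyzer tries to close.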
//=========================================================================//
A simple test, using code found online:
//=========================================================================//
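import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
    /** Tokenizes msg with WhitespaceAnalyzer. */
    public void whitespaceAnalyzer(String msg) {
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    /** Tokenizes msg with SimpleAnalyzer. */
    public void simpleAnalyzer(String msg) {
        SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    /** Tokenizes msg with StopAnalyzer. */
    public void stopAnalyzer(String msg) {
        StopAnalyzer analyzer = new StopAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    /** Tokenizes msg with StandardAnalyzer. */
    public void standardAnalyzer(String msg) {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    // Builds a token stream for the "content" field and prints its tokens.
    private void getTokens(Analyzer analyzer, String msg) {
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(msg));
        this.printTokens(analyzer.getClass().getSimpleName(), tokenStream);
    }

    // Prints every token as [term], comma-separated, prefixed by the analyzer name.
    private void printTokens(String analyzerType, TokenStream tokenStream) {
        CharTermAttribute ta = tokenStream.addAttribute(CharTermAttribute.class);
        StringBuilder result = new StringBuilder();
        try {
            tokenStream.reset(); // prepare the stream before consuming it
            while (tokenStream.incrementToken()) {
                if (result.length() > 0) {
                    result.append(",");
                }
                result.append("[" + ta.toString() + "]");
            }
            tokenStream.end();
            tokenStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(analyzerType + "-" + result.toString());
    }
}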
The JUnit test class that drives it:
import org.junit.Before;
import org.junit.Test;

public class AnalyzerDemoTest {
    private AnalyzerDemo demo = null;
    // Alternative input; only one msg field can be active at a time.
    // private String msg = "I like you, my motherland! China";
    private String msg = "I love you, China! Consumer";

    @Before
    public void setUp() throws Exception {
        demo = new AnalyzerDemo();
    }

    @Test
    public void testWhitespaceAnalyzer() {
        demo.whitespaceAnalyzer(msg);
    }

    @Test
    public void testSimpleAnalyzer() {
        demo.simpleAnalyzer(msg);
    }

    @Test
    public void testStopAnalyzer() {
        demo.stopAnalyzer(msg);
    }

    @Test
    public void testStandardAnalyzer() {
        demo.standardAnalyzer(msg);
    }
}
Lucene built-in analyzer word breaker