Lucene's built-in analyzers

Last Update:2015-04-04 Source: Internet

Author: User


I am writing this post after finishing chapter six, which covers analyzers. Even before I had written any code, analyzers had caught my interest.

//===========================================================================================//

The following four analyzers are suited to English, not Chinese

//===========================================================================================//

1, WhitespaceAnalyzer

Splits the text on whitespace only. It does not lowercase the tokens and does not support Chinese.

It performs no other normalization on the tokens it produces.
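The behavior described for WhitespaceAnalyzer can be sketched in plain Java. This is only an approximation for illustration, not Lucene's implementation; the class name WhitespaceSketch and its tokenize method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java approximation of whitespace tokenization; not Lucene's implementation.
class WhitespaceSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on runs of whitespace; case and punctuation are left untouched.
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [I, Love, you,, China!]
        System.out.println(tokenize("I Love you, China!"));
    }
}
```

Note that the comma and the exclamation mark stay attached to their words, and the capital letters are preserved.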

2, SimpleAnalyzer

More capable than WhitespaceAnalyzer. It first splits the text at non-letter characters and then lowercases every token. Because digits are not letters, numeric characters are discarded.
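A minimal sketch of the SimpleAnalyzer behavior described here, written in plain Java as an approximation rather than Lucene's own code. The class name SimpleSketch and its tokenize method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java approximation: split at non-letters, then lowercase. Not Lucene's implementation.
class SimpleSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Letters extend the current token, lowercased as they arrive.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any non-letter (space, punctuation, digit) ends the token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [i, love, you, china, b, c]
        System.out.println(tokenize("I Love you, China! B2C"));
    }
}
```

The digit in "B2C" acts as a separator, which is why the input yields the two tokens "b" and "c" and no number.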

3, StopAnalyzer

Goes one step beyond SimpleAnalyzer: on top of SimpleAnalyzer's behavior it removes common English words (stop words such as "the" and "a"). You can also supply your own set of stop words as needed.
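The stop word step can be sketched as follows, again in plain Java and only as an approximation. The class StopSketch, its tokenize method, and the SAMPLE_STOP_WORDS set are hypothetical; the set is a small illustrative subset, not Lucene's default list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java approximation of stop word filtering; not Lucene's implementation.
class StopSketch {
    // Illustrative subset only; Lucene ships its own default English stop word set.
    static final Set<String> SAMPLE_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        // Lowercase, split at non-letters, then drop empty pieces and stop words.
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!part.isEmpty() && !stopWords.contains(part)) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [quick, fox, dog]
        System.out.println(tokenize("The quick fox and a dog", SAMPLE_STOP_WORDS));
        // A caller-supplied set works the same way; prints [the, fox]
        System.out.println(tokenize("The quick fox", new HashSet<>(Arrays.asList("quick"))));
    }
}
```

Passing the stop word set as a parameter mirrors the point that the list of common words can be customized.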

4, StandardAnalyzer

Its handling of English is on par with StopAnalyzer. Its support for Chinese works by splitting the text into single characters. It converts tokens to lowercase and removes stop words and punctuation.
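To make the single-character handling of Chinese concrete, here is a rough plain-Java sketch. It is an approximation under simplifying assumptions, not Lucene's tokenizer: the class StandardSketch and its methods are hypothetical, ideographs are detected with Character.isIdeographic, and the stop word set is supplied by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-Java approximation: words are lowercased, ideographs become one token each,
// punctuation is dropped. Not Lucene's implementation.
class StandardSketch {
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                // Each ideograph is emitted as its own token.
                flush(word, tokens, stopWords);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
            } else {
                // Whitespace and punctuation end the current word.
                flush(word, tokens, stopWords);
            }
        }
        flush(word, tokens, stopWords);
        return tokens;
    }

    // Adds the pending word unless it is empty or a stop word, then clears the buffer.
    private static void flush(StringBuilder word, List<String> tokens, Set<String> stopWords) {
        if (word.length() > 0) {
            String w = word.toString();
            if (!stopWords.contains(w)) {
                tokens.add(w);
            }
            word.setLength(0);
        }
    }
}
```

With a stop word set containing "the", the input "I love the " followed by two Chinese characters yields "i", "love", and then one token per Chinese character.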

//=============================================================================================//

The following two analyzers are intended for Chinese

//==============================================================================================//

5, CJKAnalyzer

The Chinese-Japanese-Korean analyzer. It can analyze Chinese, Japanese, and Korean text, but the results are mediocre and it is generally not used.
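One reason the results are mediocre is that CJKAnalyzer in Lucene 3.x does not look words up in a dictionary; it emits overlapping two-character tokens (bigrams). The sketch below illustrates that idea in plain Java. It is an approximation, the class CjkBigramSketch and its bigrams method are hypothetical, and the input is assumed to be a single run of CJK characters from the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java approximation of overlapping bigram tokenization; not Lucene's implementation.
class CjkBigramSketch {
    // Assumes 'run' contains only CJK characters, each a single Java char.
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        if (run.length() == 1) {
            // A lone character is kept as a single-character token.
            tokens.add(run);
            return tokens;
        }
        // Slide a two-character window over the run, one character at a time.
        for (int i = 0; i + 1 < run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
        return tokens;
    }
}
```

A run of four characters C1C2C3C4 therefore produces the three tokens C1C2, C2C3, and C3C4, regardless of where the real word boundaries are.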

6, SmartChineseAnalyzer

Somewhat better support for Chinese, but poor extensibility: extending the dictionary, customizing the stop word list, and similar changes are hard to do.


//=========================================================================//

A simple test:

Code found online:

//=========================================================================//

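import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {

    /** Tokenizes msg with WhitespaceAnalyzer and prints the tokens. */
    public void whitespaceAnalyzer(String msg) {
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    /** Tokenizes msg with SimpleAnalyzer and prints the tokens. */
    public void simpleAnalyzer(String msg) {
        SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    /** Tokenizes msg with StopAnalyzer and prints the tokens. */
    public void stopAnalyzer(String msg) {
        StopAnalyzer analyzer = new StopAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    /** Tokenizes msg with StandardAnalyzer and prints the tokens. */
    public void standardAnalyzer(String msg) {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    // Builds a token stream for the "content" field and hands it to printTokens.
    private void getTokens(Analyzer analyzer, String msg) {
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(msg));
        this.printTokens(analyzer.getClass().getSimpleName(), tokenStream);
    }

    // Prints every token as [term], comma-separated, prefixed by the analyzer name.
    private void printTokens(String analyzerType, TokenStream tokenStream) {
        CharTermAttribute ta = tokenStream.addAttribute(CharTermAttribute.class);
        StringBuilder result = new StringBuilder();
        try {
            tokenStream.reset(); // prepare the stream before consuming it
            while (tokenStream.incrementToken()) {
                if (result.length() > 0) {
                    result.append(",");
                }
                result.append("[").append(ta.toString()).append("]");
            }
            tokenStream.end();
            tokenStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        System.out.println(analyzerType + "-" + result.toString());
    }
}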

Test class:

import org.junit.Before;
import org.junit.Test;

// Class name is an assumption; the original listing omitted the declaration.
public class AnalyzerDemoTest {

    private AnalyzerDemo demo = null;

    private String msg = "I like you, my motherland! China";
    // Alternative input (only one msg field can be declared at a time):
    // private String msg = "I love you, China! Consumer";

    @Before
    public void setUp() throws Exception {
        demo = new AnalyzerDemo();
    }

    @Test
    public void testWhitespaceAnalyzer() {
        demo.whitespaceAnalyzer(msg);
    }

    @Test
    public void testSimpleAnalyzer() {
        demo.simpleAnalyzer(msg);
    }

    @Test
    public void testStopAnalyzer() {
        demo.stopAnalyzer(msg);
    }

    @Test
    public void testStandardAnalyzer() {
        demo.standardAnalyzer(msg);
    }
}


This article is an English version of an article originally published in Chinese on aliyun.com.
