Lucene's built-in analyzers (tokenizers)

Source: Internet
Author: User

I wrote this post when I reached Chapter 6, which covers analyzers. Even before writing any code, analyzers had caught my interest.


//===========================================================================================//

The following four analyzers work for English text but not for Chinese

//===========================================================================================//

1. WhitespaceAnalyzer

Splits text on whitespace only. It does not lowercase characters and does not support Chinese.

It applies no other normalization to the tokens it produces.
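
To make the behavior above concrete, here is a minimal plain-Java approximation of whitespace tokenization. It is not Lucene code: the class name WhitespaceSketch and the sample sentence are made up for illustration, and the regular expression only covers the whitespace characters matched by `\s`.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java approximation of whitespace tokenization; not Lucene's implementation.
// The class name and sample text are hypothetical.
class WhitespaceSketch {

    // Splits on runs of whitespace and leaves each token untouched:
    // case and punctuation are preserved.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [I, Love, you,, china!, B2C]
        System.out.println(tokenize("I Love you, china! B2C"));
    }
}
```

Note that "you," and "china!" keep their punctuation and "Love" keeps its capital letter, which is what "no other normalization" means in practice.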


2. SimpleAnalyzer

More capable than WhitespaceAnalyzer. It first splits the text at non-letter characters, then converts each token to lowercase. Because digits are not letters, this analyzer discards numeric characters.
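
The following is a rough plain-Java sketch of the two steps just described (split at non-letters, then lowercase). It is an approximation for illustration only, not Lucene's implementation; the class name LetterLowercaseSketch and the sample sentence are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java approximation of "split at non-letters, then lowercase".
// Not Lucene code; the class name and sample text are hypothetical.
class LetterLowercaseSketch {

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                // Letters extend the current token, lowercased as they arrive.
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // Any non-letter (space, punctuation, digit) ends the token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [i, love, you, china, b, c]
        System.out.println(tokenize("I Love you, china! B2C"));
    }
}
```

The digit in "B2C" acts as a separator and is dropped, so the input yields the two tokens "b" and "c".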


3. StopAnalyzer

StopAnalyzer goes beyond SimpleAnalyzer: on top of SimpleAnalyzer's behavior, it removes common English words (stop words such as "the" and "a"). You can also supply your own set of stop words to suit your needs.
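
As a minimal sketch of the stop-word step, the plain-Java code below filters a list of already lowercased tokens against a stop-word set. It is not Lucene code: the class name StopWordSketch is made up, and SAMPLE_STOP_WORDS is a small illustrative set, not Lucene's default list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java approximation of stop-word removal; not Lucene's implementation.
class StopWordSketch {

    // Illustrative stop words only; this is NOT Lucene's default stop-word list.
    static final Set<String> SAMPLE_STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "and", "of", "to"));

    // Keeps every token that is not in the given stop-word set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "fox", "and", "a", "dog");

        // prints [quick, fox, dog]
        System.out.println(removeStopWords(tokens, SAMPLE_STOP_WORDS));

        // A custom set: the sample words plus one word of our own.
        Set<String> custom = new HashSet<String>(SAMPLE_STOP_WORDS);
        custom.add("quick");
        // prints [fox, dog]
        System.out.println(removeStopWords(tokens, custom));
    }
}
```

Passing the set as a parameter mirrors the point made above: the list of words to drop is configurable rather than fixed.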


4. StandardAnalyzer

For English, the result is broadly similar to StopAnalyzer: it converts tokens to lowercase and removes stop words and punctuation. Unlike StopAnalyzer, it keeps numbers. Its Chinese support works by single-character segmentation, so each Chinese character becomes its own token.




//=============================================================================================//

The following two analyzers are intended for Chinese

//==============================================================================================//



5. CJKAnalyzer

The Chinese-Japanese-Korean analyzer can analyze Chinese, Japanese, and Korean text, but the quality of its results is mediocre, so it is generally not used.
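
CJKAnalyzer tokenizes runs of CJK characters into overlapping two-character pairs (bigrams), which helps explain the mediocre results: many of the pairs are not real words. The plain-Java sketch below illustrates the idea only. It is not Lucene code, the class name CjkBigramSketch is made up, and it assumes the input is a single run of CJK characters from the Basic Multilingual Plane. The sample input is the four characters U+6211 U+7231 U+4E2D U+56FD, written as Unicode escapes to keep the source ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of overlapping bigrams; not Lucene's implementation.
// Assumes the input is one run of CJK characters, each a single Java char.
class CjkBigramSketch {

    static List<String> bigrams(String cjkRun) {
        List<String> tokens = new ArrayList<String>();
        if (cjkRun.length() == 1) {
            // A lone character has no neighbor, so it is emitted by itself.
            tokens.add(cjkRun);
            return tokens;
        }
        // Slide a two-character window one position at a time.
        for (int i = 0; i + 1 < cjkRun.length(); i++) {
            tokens.add(cjkRun.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four characters in, three overlapping pairs out:
        // chars 1-2, chars 2-3, chars 3-4.
        System.out.println(bigrams("\u6211\u7231\u4e2d\u56fd"));
    }
}
```

Of the three pairs produced for this input, only the last (U+4E2D U+56FD, "China") is a dictionary word. The middle pair straddles a word boundary, which is the kind of noise a bigram approach introduces into an index.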




6. SmartChineseAnalyzer

Somewhat better support for Chinese, but poor extensibility: extending the dictionary, maintaining a stop-word dictionary, and similar customizations are difficult to handle.





//=========================================================================//

A simple test, using code found online:

//=========================================================================//



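import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {

    /** Tokenizes msg with WhitespaceAnalyzer and prints the tokens. */
    public void whitespaceAnalyzer(String msg) {
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    /** Tokenizes msg with SimpleAnalyzer and prints the tokens. */
    public void simpleAnalyzer(String msg) {
        SimpleAnalyzer analyzer = new SimpleAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    /** Tokenizes msg with StopAnalyzer and prints the tokens. */
    public void stopAnalyzer(String msg) {
        StopAnalyzer analyzer = new StopAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    /** Tokenizes msg with StandardAnalyzer and prints the tokens. */
    public void standardAnalyzer(String msg) {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        this.getTokens(analyzer, msg);
    }

    // Builds a token stream over msg for the "content" field and prints it.
    private void getTokens(Analyzer analyzer, String msg) {
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(msg));
        this.printTokens(analyzer.getClass().getSimpleName(), tokenStream);
    }

    // Prints all tokens of the stream as "AnalyzerName-[t1],[t2],...".
    private void printTokens(String analyzerType, TokenStream tokenStream) {
        CharTermAttribute ta = tokenStream.addAttribute(CharTermAttribute.class);
        StringBuilder result = new StringBuilder();
        try {
            // reset() before the first incrementToken() is part of the TokenStream contract.
            tokenStream.reset();
            while (tokenStream.incrementToken()) {
                if (result.length() > 0) {
                    result.append(",");
                }
                result.append("[" + ta.toString() + "]");
            }
            tokenStream.end();
            tokenStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        System.out.println(analyzerType + "-" + result.toString());
    }
}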



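The JUnit test class:


import org.junit.Before;
import org.junit.Test;

// The original omits the class declaration; the name AnalyzerDemoTest is assumed.
public class AnalyzerDemoTest {

    // The original declares this field as TokenizerDemo; AnalyzerDemo matches the demo class above.
    private AnalyzerDemo demo = null;

    // The original declares msg twice, which does not compile; keep only one active.
    // private String msg = "I like you, my motherland! China";
    private String msg = "I Love you, china! Consumer";

    @Before
    public void setUp() throws Exception {
        demo = new AnalyzerDemo();
    }

    @Test
    public void testWhitespaceAnalyzer() {
        demo.whitespaceAnalyzer(msg);
    }

    @Test
    public void testSimpleAnalyzer() {
        demo.simpleAnalyzer(msg);
    }

    @Test
    public void testStopAnalyzer() {
        demo.stopAnalyzer(msg);
    }

    @Test
    public void testStandardAnalyzer() {
        demo.standardAnalyzer(msg);
    }
}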
