Several analyzers commonly used in Lucene 4.4.0


First, WhitespaceAnalyzer

Uses whitespace as the token boundary and applies no other normalization to the tokens. It is obviously best suited to English, where words are already separated by spaces.

package bond.lucene.analyzer;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class WhitespaceAnalyzerTest {
	public static void main(String[] args) {
		try {
			// A Lucene analyzer chains a tokenizer and filters into a "pipe";
			// text flowing through the pipe is broken into the smallest units
			// that can enter the index. A standard analyzer therefore has two
			// parts: the Tokenizer, which splits the text into indexable units
			// according to its rules, and the TokenFilter, which further
			// processes the resulting tokens (stop-word removal, lowercasing,
			// singular/plural handling, and so on).
			// Analyzer.tokenStream() first creates a Tokenizer to process the
			// character stream, then filters its output through the TokenFilters.

			// Text to be processed
			String text = "The Lucene PMC is pleased to announce the release of the Apache Solr Reference Guide for Solr 4.4.";

			// Whitespace analyzer: splits on whitespace only and applies no
			// other normalization to the tokens
			WhitespaceAnalyzer wsa = new WhitespaceAnalyzer(Version.LUCENE_44);
			TokenStream ts = wsa.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}
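
The "pipe" described in the comment above is easy to see in code. Below is a minimal sketch (not from the original article) of a custom Lucene 4.4 analyzer that wires a whitespace tokenizer to a lowercase filter by overriding Analyzer.createComponents(); every built-in analyzer covered here is a variation on this pattern with a different tokenizer and filter chain.

package bond.lucene.analyzer;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class PipeAnalyzer extends Analyzer {
	@Override
	protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
		// Tokenizer: splits the incoming character stream into raw tokens
		Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_44, reader);
		// TokenFilter: further processes the tokens (here, lowercasing)
		TokenStream filter = new LowerCaseFilter(Version.LUCENE_44, source);
		return new TokenStreamComponents(source, filter);
	}
}

A stream obtained from new PipeAnalyzer().tokenStream("field", text) can then be consumed exactly like the WhitespaceAnalyzer stream above.
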
Second, SimpleAnalyzer

Splits text at non-letter characters and lowercases every token; numeric characters are discarded in the process. It is clearly unsuitable for Chinese.

package bond.lucene.analyzer;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class SimpleAnalyzerTest {
	public static void main(String[] args) {
		try {
			// Text to be processed
			String text = "The Lucene PMC is pleased to announce the release of the Apache Solr Reference Guide for Solr 4.4.";

			// Simple analyzer: splits at non-letter characters, lowercases
			// each token, and discards numeric characters
			SimpleAnalyzer sa = new SimpleAnalyzer(Version.LUCENE_44);
			TokenStream ts = sa.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

Third, StopAnalyzer

The stop-word analyzer removes very common words such as "a", "the", and "an", and also accepts a custom stop-word list. Not suitable for Chinese.

package bond.lucene.analyzer;

import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class StopAnalyzerTest {
	public static void main(String[] args) {
		try {
			// Text to be processed
			String text = "The Lucene PMC is pleased to announce the release of the Apache Solr Reference Guide for Solr 4.4.";

			// Custom stop words
			String[] selfStopWords = { "Analysis", "release", "Apache" };
			CharArraySet cas = new CharArraySet(Version.LUCENE_44, 0, true);
			for (int i = 0; i < selfStopWords.length; i++) {
				cas.add(selfStopWords[i]);
			}

			// Add the system's default English stop words
			Iterator<Object> itor = StopAnalyzer.ENGLISH_STOP_WORDS_SET.iterator();
			while (itor.hasNext()) {
				cas.add(itor.next());
			}

			// Stop-word analyzer: removes common words such as "a", "the",
			// "an"; custom stop words can be added as above
			StopAnalyzer sa = new StopAnalyzer(Version.LUCENE_44, cas);
			TokenStream ts = sa.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}
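
If the custom stop words live in a file rather than in code, Lucene's WordlistLoader can read them into a CharArraySet. A minimal sketch, assuming a plain UTF-8 file with one word per line; the file name stopwords.txt is hypothetical:

package bond.lucene.analyzer;

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.WordlistLoader;
import org.apache.lucene.util.Version;

public class FileStopWordsTest {
	public static void main(String[] args) throws Exception {
		// Read one stop word per line from stopwords.txt (hypothetical file)
		Reader reader = new InputStreamReader(
				new FileInputStream("stopwords.txt"), StandardCharsets.UTF_8);
		CharArraySet cas = WordlistLoader.getWordSet(reader, Version.LUCENE_44);
		reader.close();

		// Merge in Lucene's default English stop words
		cas.addAll(StopAnalyzer.ENGLISH_STOP_WORDS_SET);

		StopAnalyzer sa = new StopAnalyzer(Version.LUCENE_44, cas);
		// ... consume sa.tokenStream(...) exactly as in the example above
	}
}
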

Fourth, StandardAnalyzer

The standard analyzer is Lucene's built-in default: it converts tokens to lowercase and removes stop words and punctuation. It is likewise not well suited to Chinese.

package bond.lucene.analyzer;

import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class StandardAnalyzerTest {
	public static void main(String[] args) {
		try {
			// Text to be processed
			String text = "The Lucene PMC is pleased to announce the release of the Apache Solr Reference Guide for Solr 4.4.";

			// Custom stop words
			String[] selfStopWords = { "Lucene", "release", "Apache" };
			CharArraySet cas = new CharArraySet(Version.LUCENE_44, 0, true);
			for (int i = 0; i < selfStopWords.length; i++) {
				cas.add(selfStopWords[i]);
			}

			// Add the system's default stop words
			Iterator<Object> itor = StandardAnalyzer.STOP_WORDS_SET.iterator();
			while (itor.hasNext()) {
				cas.add(itor.next());
			}

			// Standard analyzer: Lucene's built-in analyzer; lowercases tokens
			// and removes stop words and punctuation
			StandardAnalyzer sa = new StandardAnalyzer(Version.LUCENE_44, cas);
			TokenStream ts = sa.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

Fifth, CJKAnalyzer

A China-Japan-Korea analyzer that can tokenize Chinese, Japanese, and Korean text. Its Chinese support is only mediocre, however, and it is generally not used for Chinese segmentation.

package bond.lucene.analyzer;

import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class CJKAnalyzerTest {
	public static void main(String[] args) {
		try {
			// Text to be processed. The original post fed the Chinese version
			// of the pipeline explanation through the analyzer; the exact
			// string was lost in translation, so a close back-translation of
			// its first sentence is used here.
			String text = "Lucene分析器由分词器和过滤器构成一个管道，文本流经管道后成为可以进入索引的最小单元。";

			// Custom stop words (使用 = use, 一个 = one, 管道 = pipe)
			String[] selfStopWords = { "使用", "一个", "管道" };
			CharArraySet cas = new CharArraySet(Version.LUCENE_44, 0, true);
			for (int i = 0; i < selfStopWords.length; i++) {
				cas.add(selfStopWords[i]);
			}

			// Add the analyzer's default stop words
			Iterator<Object> itor = CJKAnalyzer.getDefaultStopSet().iterator();
			while (itor.hasNext()) {
				cas.add(itor.next());
			}

			// CJK analyzer (C: China, J: Japan, K: Korea): can tokenize
			// Chinese, Japanese, and Korean, but its Chinese support is only
			// mediocre and it is rarely used for Chinese segmentation
			CJKAnalyzer sa = new CJKAnalyzer(Version.LUCENE_44, cas);
			TokenStream ts = sa.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

Sixth, SmartChineseAnalyzer

Good support for Chinese, but poor extensibility: extending its dictionary, stop-word list, synonym list, and so on is awkward.

package bond.lucene.analyzer;

import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class SmartChineseAnalyzerTest {
	public static void main(String[] args) {
		try {
			// Text to be processed (a back-translation of the original
			// Chinese sample; see the note in the CJKAnalyzer example)
			String text = "Lucene分析器由分词器和过滤器构成一个管道，文本流经管道后成为可以进入索引的最小单元。";

			// Custom stop words. The original list was garbled in translation
			// and appeared to be mostly punctuation plus 是 and 流, so a
			// reconstructed approximation is used here.
			String[] selfStopWords = { "，", "。", "：", "是", "流" };
			CharArraySet cas = new CharArraySet(Version.LUCENE_44, 0, true);
			for (int i = 0; i < selfStopWords.length; i++) {
				cas.add(selfStopWords[i]);
			}

			// Add the analyzer's default stop words
			Iterator<Object> itor = SmartChineseAnalyzer.getDefaultStopSet().iterator();
			while (itor.hasNext()) {
				cas.add(itor.next());
			}

			// Mixed Chinese/English analyzer (the other analyzers above
			// handle Chinese poorly)
			SmartChineseAnalyzer sca = new SmartChineseAnalyzer(Version.LUCENE_44, cas);
			TokenStream ts = sca.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

For Chinese word segmentation, Lucene's built-in analyzers are, on the whole, not very good. A classmate recommended an open-source library whose segmentation quality is good and which is also easy to extend:

http://nlp.stanford.edu/software/segmenter.shtml

I have not studied it in depth yet, but I ran its demo and the segmentation results were very good.
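
For reference, here is a minimal sketch of driving the Stanford segmenter from Java, modeled on the SegDemo class shipped with the download; the data directory and model file paths below are assumptions that depend on where the distribution is unpacked:

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class StanfordSegmenterDemo {
	public static void main(String[] args) throws Exception {
		// NOTE: hypothetical path; point this at the unpacked distribution
		String base = "stanford-segmenter/data";
		Properties props = new Properties();
		props.setProperty("sighanCorporaDict", base);
		props.setProperty("serDictionary", base + "/dict-chris6.ser.gz");
		props.setProperty("inputEncoding", "UTF-8");
		props.setProperty("sighanPostProcessing", "true");

		// Load the CRF model (here the Chinese Treebank model, ctb.gz)
		CRFClassifier<CoreLabel> segmenter = new CRFClassifier<CoreLabel>(props);
		segmenter.loadClassifierNoExceptions(base + "/ctb.gz", props);

		// Segment a Chinese sentence into a list of words
		List<String> words = segmenter.segmentString("这是一个中文分词的例子。");
		System.out.println(words);
	}
}
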


