Several analyzers commonly used in Lucene 4.4.0


First, WhitespaceAnalyzer

Uses whitespace as the token boundary and applies no other normalization to the tokens. It is obviously best suited to English, where words are already separated by spaces.

package bond.lucene.analyzer;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class WhitespaceAnalyzerTest {
	public static void main(String[] args) {
		try {
			// A Lucene analyzer chains a tokenizer and filters into a "pipe";
			// text flowing through the pipe is broken into the smallest units
			// that can enter the index. A standard analyzer therefore has two
			// parts: the Tokenizer, which splits the text into indexable units
			// according to its rules, and the TokenFilter, which further
			// processes the resulting tokens (stop-word removal, lowercasing,
			// singular/plural handling, and so on).
			// Analyzer.tokenStream() first creates a Tokenizer to process the
			// character stream, then filters its output through the TokenFilters.

			// Text to be processed
			String text = "The Lucene PMC is pleased to announce the release of the Apache Solr Reference Guide for Solr 4.4.";

			// Whitespace analyzer: splits on whitespace only and applies no
			// other normalization to the tokens
			WhitespaceAnalyzer wsa = new WhitespaceAnalyzer(Version.LUCENE_44);
			TokenStream ts = wsa.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}
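
The "pipe" described in the comment above is easy to see in code. Below is a minimal sketch (not from the original article) of a custom Lucene 4.4 analyzer that wires a whitespace tokenizer to a lowercase filter by overriding Analyzer.createComponents(); every built-in analyzer covered here is a variation on this pattern with a different tokenizer and filter chain.

package bond.lucene.analyzer;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class PipeAnalyzer extends Analyzer {
	@Override
	protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
		// Tokenizer: splits the incoming character stream into raw tokens
		Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_44, reader);
		// TokenFilter: further processes the tokens (here, lowercasing)
		TokenStream filter = new LowerCaseFilter(Version.LUCENE_44, source);
		return new TokenStreamComponents(source, filter);
	}
}

A stream obtained from new PipeAnalyzer().tokenStream("field", text) can then be consumed exactly like the WhitespaceAnalyzer stream above.
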
Second, SimpleAnalyzer

Splits text at non-letter characters and lowercases every token; numeric characters are discarded in the process. It is clearly unsuitable for Chinese.

package bond.lucene.analyzer;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class SimpleAnalyzerTest {
	public static void main(String[] args) {
		try {
			// Text to be processed
			String text = "The Lucene PMC is pleased to announce the release of the Apache Solr Reference Guide for Solr 4.4.";

			// Simple analyzer: splits at non-letter characters, lowercases
			// each token, and discards numeric characters
			SimpleAnalyzer sa = new SimpleAnalyzer(Version.LUCENE_44);
			TokenStream ts = sa.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

Third, StopAnalyzer

The stop-word analyzer removes very common words such as "a", "the", and "an", and also accepts a custom stop-word list. Not suitable for Chinese.

package bond.lucene.analyzer;

import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class StopAnalyzerTest {
	public static void main(String[] args) {
		try {
			// Text to be processed
			String text = "The Lucene PMC is pleased to announce the release of the Apache Solr Reference Guide for Solr 4.4.";

			// Custom stop words
			String[] selfStopWords = { "Analysis", "release", "Apache" };
			CharArraySet cas = new CharArraySet(Version.LUCENE_44, 0, true);
			for (int i = 0; i < selfStopWords.length; i++) {
				cas.add(selfStopWords[i]);
			}

			// Add the system's default English stop words
			Iterator<Object> itor = StopAnalyzer.ENGLISH_STOP_WORDS_SET.iterator();
			while (itor.hasNext()) {
				cas.add(itor.next());
			}

			// Stop-word analyzer: removes common words such as "a", "the",
			// "an"; custom stop words can be added as above
			StopAnalyzer sa = new StopAnalyzer(Version.LUCENE_44, cas);
			TokenStream ts = sa.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}
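
If the custom stop words live in a file rather than in code, Lucene's WordlistLoader can read them into a CharArraySet. A minimal sketch, assuming a plain UTF-8 file with one word per line; the file name stopwords.txt is hypothetical:

package bond.lucene.analyzer;

import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.WordlistLoader;
import org.apache.lucene.util.Version;

public class FileStopWordsTest {
	public static void main(String[] args) throws Exception {
		// Read one stop word per line from stopwords.txt (hypothetical file)
		Reader reader = new InputStreamReader(
				new FileInputStream("stopwords.txt"), StandardCharsets.UTF_8);
		CharArraySet cas = WordlistLoader.getWordSet(reader, Version.LUCENE_44);
		reader.close();

		// Merge in Lucene's default English stop words
		cas.addAll(StopAnalyzer.ENGLISH_STOP_WORDS_SET);

		StopAnalyzer sa = new StopAnalyzer(Version.LUCENE_44, cas);
		// ... consume sa.tokenStream(...) exactly as in the example above
	}
}
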

Fourth, StandardAnalyzer

The standard analyzer is Lucene's built-in default: it converts tokens to lowercase and removes stop words and punctuation. It is likewise not well suited to Chinese.

package bond.lucene.analyzer;

import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class StandardAnalyzerTest {
	public static void main(String[] args) {
		try {
			// Text to be processed
			String text = "The Lucene PMC is pleased to announce the release of the Apache Solr Reference Guide for Solr 4.4.";

			// Custom stop words
			String[] selfStopWords = { "Lucene", "release", "Apache" };
			CharArraySet cas = new CharArraySet(Version.LUCENE_44, 0, true);
			for (int i = 0; i < selfStopWords.length; i++) {
				cas.add(selfStopWords[i]);
			}

			// Add the system's default stop words
			Iterator<Object> itor = StandardAnalyzer.STOP_WORDS_SET.iterator();
			while (itor.hasNext()) {
				cas.add(itor.next());
			}

			// Standard analyzer: Lucene's built-in analyzer; lowercases tokens
			// and removes stop words and punctuation
			StandardAnalyzer sa = new StandardAnalyzer(Version.LUCENE_44, cas);
			TokenStream ts = sa.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

Fifth, CJKAnalyzer

A China-Japan-Korea analyzer that can tokenize Chinese, Japanese, and Korean text. Its Chinese support is only mediocre, however, and it is generally not used for Chinese segmentation.

package bond.lucene.analyzer;

import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class CJKAnalyzerTest {
	public static void main(String[] args) {
		try {
			// Text to be processed. The original post fed the Chinese version
			// of the pipeline explanation through the analyzer; the exact
			// string was lost in translation, so a close back-translation of
			// its first sentence is used here.
			String text = "Lucene分析器由分词器和过滤器构成一个管道，文本流经管道后成为可以进入索引的最小单元。";

			// Custom stop words (使用 = use, 一个 = one, 管道 = pipe)
			String[] selfStopWords = { "使用", "一个", "管道" };
			CharArraySet cas = new CharArraySet(Version.LUCENE_44, 0, true);
			for (int i = 0; i < selfStopWords.length; i++) {
				cas.add(selfStopWords[i]);
			}

			// Add the analyzer's default stop words
			Iterator<Object> itor = CJKAnalyzer.getDefaultStopSet().iterator();
			while (itor.hasNext()) {
				cas.add(itor.next());
			}

			// CJK analyzer (C: China, J: Japan, K: Korea): can tokenize
			// Chinese, Japanese, and Korean, but its Chinese support is only
			// mediocre and it is rarely used for Chinese segmentation
			CJKAnalyzer sa = new CJKAnalyzer(Version.LUCENE_44, cas);
			TokenStream ts = sa.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

Sixth, SmartChineseAnalyzer

Good support for Chinese, but poor extensibility: extending its dictionary, stop-word list, synonym list, and so on is awkward.

package bond.lucene.analyzer;

import java.util.Iterator;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class SmartChineseAnalyzerTest {
	public static void main(String[] args) {
		try {
			// Text to be processed (a back-translation of the original
			// Chinese sample; see the note in the CJKAnalyzer example)
			String text = "Lucene分析器由分词器和过滤器构成一个管道，文本流经管道后成为可以进入索引的最小单元。";

			// Custom stop words. The original list was garbled in translation
			// and appeared to be mostly punctuation plus 是 and 流, so a
			// reconstructed approximation is used here.
			String[] selfStopWords = { "，", "。", "：", "是", "流" };
			CharArraySet cas = new CharArraySet(Version.LUCENE_44, 0, true);
			for (int i = 0; i < selfStopWords.length; i++) {
				cas.add(selfStopWords[i]);
			}

			// Add the analyzer's default stop words
			Iterator<Object> itor = SmartChineseAnalyzer.getDefaultStopSet().iterator();
			while (itor.hasNext()) {
				cas.add(itor.next());
			}

			// Mixed Chinese/English analyzer (the other analyzers above
			// handle Chinese poorly)
			SmartChineseAnalyzer sca = new SmartChineseAnalyzer(Version.LUCENE_44, cas);
			TokenStream ts = sca.tokenStream("field", text);

			CharTermAttribute ch = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			while (ts.incrementToken()) {
				System.out.println(ch.toString());
			}
			ts.end();
			ts.close();
		} catch (Exception ex) {
			ex.printStackTrace();
		}
	}
}

For Chinese word segmentation, Lucene's built-in analyzers are, on the whole, not very good. A classmate recommended an open-source library whose segmentation quality is good and which is also easy to extend:

http://nlp.stanford.edu/software/segmenter.shtml

I have not studied it in depth yet, but I ran its demo and the segmentation results were very good.
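
For reference, here is a minimal sketch of driving the Stanford segmenter from Java, modeled on the SegDemo class shipped with the download; the data directory and model file paths below are assumptions that depend on where the distribution is unpacked:

import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class StanfordSegmenterDemo {
	public static void main(String[] args) throws Exception {
		// NOTE: hypothetical path; point this at the unpacked distribution
		String base = "stanford-segmenter/data";
		Properties props = new Properties();
		props.setProperty("sighanCorporaDict", base);
		props.setProperty("serDictionary", base + "/dict-chris6.ser.gz");
		props.setProperty("inputEncoding", "UTF-8");
		props.setProperty("sighanPostProcessing", "true");

		// Load the CRF model (here the Chinese Treebank model, ctb.gz)
		CRFClassifier<CoreLabel> segmenter = new CRFClassifier<CoreLabel>(props);
		segmenter.loadClassifierNoExceptions(base + "/ctb.gz", props);

		// Segment a Chinese sentence into a list of words
		List<String> words = segmenter.segmentString("这是一个中文分词的例子。");
		System.out.println(words);
	}
}
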


