Lucene Series: Analyzers

Analyzer Introduction

Search is built on the analysis of textual information. Lucene's analysis tools live in the org.apache.lucene.analysis package. An analyzer tokenizes text and applies linguistic processing to produce terms. Both indexing and searching need an analyzer, and the two should use the same one; otherwise query terms will not match the indexed terms well.
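As a minimal sketch of keeping the two sides consistent (assuming the Lucene 5+ constructors that no longer take a Version argument; the index path and field name are placeholders, and exception handling is omitted):

import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Analyzer analyzer = new StandardAnalyzer();
Directory directory = FSDirectory.open(Paths.get("/tmp/index")); // placeholder path

// Indexing: the writer analyzes document fields with this analyzer.
IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer));

// Searching: parse queries with the same analyzer so query terms
// line up with the terms that were indexed.
QueryParser parser = new QueryParser("content", analyzer);
Query query = parser.parse("test case");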

A Lucene analyzer usually consists of a tokenizer and several token filters (TokenFilter). The filters post-process the tokens produced by the tokenizer, for example removing stop words, normalizing case, or folding singular and plural forms. The tokenStream method therefore typically applies a tokenizer first, followed by multiple TokenFilters. Lucene's StandardAnalyzer, for instance, combines StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter. The abstract base class structure diagram is as follows.


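As an illustration of this tokenizer-plus-filters chain, here is a minimal custom analyzer sketch. It assumes the Lucene 8.x package layout (LowerCaseFilter and StopFilter directly under org.apache.lucene.analysis, the English stop-word set on EnglishAnalyzer); older versions keep these classes elsewhere:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();                           // cut on whitespace/punctuation
        TokenStream sink = new LowerCaseFilter(source);                       // normalize case
        sink = new StopFilter(sink, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);  // drop stop words
        return new TokenStreamComponents(source, sink);
    }
};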
Note: stop words are words that occur very frequently but carry little meaning on their own; they are ignored when building the index and when searching. Examples are English articles, prepositions and conjunctions (an/this/and) and Chinese function words such as 的/也/为.
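If the default list doesn't fit, most analyzers accept a custom stop-word set. A minimal sketch, assuming the Lucene 5+ CharArraySet constructor without a Version argument (the package of CharArraySet varies across versions):

import java.util.Arrays;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// true = match stop words case-insensitively
CharArraySet stopWords = new CharArraySet(Arrays.asList("an", "this", "and", "的", "也", "为"), true);
StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);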

Common Analyzers

The examples below use the text "Hello, this is a test case. 你好，这是一个测试的实例。 created on 20140707" (a sketch that reproduces these token listings follows the list).

StandardAnalyzer: splits on whitespace and punctuation, splits Chinese text into single characters, and removes stop words.

[hello] [test] [case] [你] [好] [这] [是] [一] [个] [测] [试] [的] [实] [例] [created] [20140707]

StopAnalyzer: splits English on whitespace and punctuation, removes stop words, and discards numbers; contiguous Chinese text stays as one token.

[hello] [test] [case] [你好] [这是一个测试的实例] [created]

SimpleAnalyzer: splits English on whitespace and punctuation and discards numbers, but keeps stop words.

[hello] [this] [is] [a] [test] [case] [你好] [这是一个测试的实例] [created] [on]

WhitespaceAnalyzer: splits on whitespace only.

[Hello,] [this] [is] [a] [test] [case.] [你好，这是一个测试的实例。] [created] [on] [20140707]
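The following sketch produces listings like the above. It assumes the Lucene 5.x/6.x API and package layout (the core analyzers live in the lucene-analyzers-common module; in Lucene 8+ StopAnalyzer no longer has a no-argument constructor), so treat it as illustrative rather than version-exact:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {

    // Print every token the analyzer produces, wrapped in brackets.
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream stream = analyzer.tokenStream("", text);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset(); // mandatory before the first incrementToken()
        StringBuilder out = new StringBuilder();
        while (stream.incrementToken()) {
            out.append('[').append(term).append("] ");
        }
        stream.end();
        stream.close();
        System.out.println(out.toString().trim());
    }

    public static void main(String[] args) throws IOException {
        String text = "Hello, this is a test case. 你好，这是一个测试的实例。 created on 20140707";
        printTokens(new StandardAnalyzer(), text);
        printTokens(new StopAnalyzer(), text);
        printTokens(new SimpleAnalyzer(), text);
        printTokens(new WhitespaceAnalyzer(), text);
    }
}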
Chinese Analyzers

Single-character segmentation: each Chinese character becomes its own token, as in StandardAnalyzer above.

Bigram segmentation: every two adjacent characters form a token, which shortens the position information stored for each term, as in CJKAnalyzer:

[hello] [test] [case] [你好] [这是] [是一] [一个] [个测] [测试] [试的] [的实] [实例] [created] [20140707]

Dictionary-based segmentation: segments against a dictionary of common words, as in mmseg4j's MaxWordAnalyzer (a sketch comparing both follows below):

[hello] [this] [is] [a] [test] [case] [你好] [这是] [一个] [测试] [的] [实例] [created] [on] [20140707]
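The Chinese analyzers can be compared the same way, reusing the printTokens helper and text from the sketch above. CJKAnalyzer ships with lucene-analyzers-common (org.apache.lucene.analysis.cjk); MaxWordAnalyzer comes from mmseg4j (com.chenlb.mmseg4j.analysis):

import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import com.chenlb.mmseg4j.analysis.MaxWordAnalyzer;

printTokens(new CJKAnalyzer(), text);     // bigram segmentation
printTokens(new MaxWordAnalyzer(), text); // dictionary-based segmentation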
Some Tricks: Getting the Tokens After Segmentation

Before Lucene 3.5:

Analyzer analyzer = new MaxWordAnalyzer();
TokenStream stream = analyzer.tokenStream("", new StringReader(
        "Hello, this is a test case. 你好，这是一个测试的实例。 created on 20140707"));
String out = "";
while (stream.incrementToken()) {
    out += "[" + stream.getAttribute(TermAttribute.class).term() + "]";
}
System.out.println(out);
After Lucene 3.5 (TermAttribute has been replaced by CharTermAttribute, and since Lucene 4 you must call reset() before iterating):

Analyzer analyzer = new StandardAnalyzer();
TokenStream stream = analyzer.tokenStream("", new StringReader(
        "Hello, this is a test case. 你好，这是一个测试的实例。 created on 20140707"));
stream.reset(); // required since Lucene 4
String out = "";
while (stream.incrementToken()) {
    out += "[" + stream.getAttribute(CharTermAttribute.class).toString() + "]";
}
System.out.println(out);


mmseg4j Dictionary

The dictionary files must be UTF-8 encoded. You can specify the dictionary path when instantiating the analyzer, or set the system property mmseg.dic.path. The author says mmseg4j reads the dictionary files from the data directory under the current working directory by default, but I could not reproduce that in my tests. If no path is specified, the dictionaries are loaded from the data directory inside the mmseg4j jar.
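Both ways of pointing mmseg4j at a dictionary directory, as a sketch (the constructor overload that takes a path is an assumption; verify it against your mmseg4j version):

// Option 1: system property, set before the first dictionary is loaded.
System.setProperty("mmseg.dic.path", "/path/to/my/dic");

// Option 2: pass the directory to the analyzer constructor (assumed overload).
Analyzer analyzer = new MaxWordAnalyzer("/path/to/my/dic");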

chars.dic: each line contains a single character and its frequency, separated by a space. You generally don't need to touch it. My guess is that characters with very low frequency can be treated as garbled text and discarded rather than indexed.

units.dic: each line contains a unit word (measure words such as 分 and 亩), so that units are segmented separately.

words.dic: the core lexicon, one word per line. You can download the Sogou lexicon from http://www.sogou.com/labs/dl/w.html.

wordsXXX.dic: custom user dictionary files.

Three jars are needed to use mmseg4j: mmseg4j-core.jar contains the lexicon files; mmseg4j-analysis.jar contains the analyzers (such as MaxWordAnalyzer); mmseg4j-solr.jar contains the Solr integration.
