Lucene Series: Analyzers

Analyzer Introduction

Search is built on the analysis of textual information. Lucene's analysis tools live in the org.apache.lucene.analysis package. An analyzer tokenizes text and applies linguistic processing to produce terms. Both indexing and searching need an analyzer, and the two should use the same one; otherwise query terms will not match the indexed terms well.
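As a minimal sketch of keeping the two sides consistent (assuming the Lucene 5+ constructors that no longer take a Version argument; the index path and field name are placeholders, and exception handling is omitted):

import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Analyzer analyzer = new StandardAnalyzer();
Directory directory = FSDirectory.open(Paths.get("/tmp/index")); // placeholder path

// Indexing: the writer analyzes document fields with this analyzer.
IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer));

// Searching: parse queries with the same analyzer so query terms
// line up with the terms that were indexed.
QueryParser parser = new QueryParser("content", analyzer);
Query query = parser.parse("test case");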

A Lucene analyzer usually consists of a tokenizer and several token filters (TokenFilter). The filters post-process the tokens produced by the tokenizer, for example removing stop words, normalizing case, or folding singular and plural forms. The tokenStream method therefore typically applies a tokenizer first, followed by multiple TokenFilters. Lucene's StandardAnalyzer, for instance, combines StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter. The abstract base class structure diagram is as follows.


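As an illustration of this tokenizer-plus-filters chain, here is a minimal custom analyzer sketch. It assumes the Lucene 8.x package layout (LowerCaseFilter and StopFilter directly under org.apache.lucene.analysis, the English stop-word set on EnglishAnalyzer); older versions keep these classes elsewhere:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer source = new StandardTokenizer();                           // cut on whitespace/punctuation
        TokenStream sink = new LowerCaseFilter(source);                       // normalize case
        sink = new StopFilter(sink, EnglishAnalyzer.ENGLISH_STOP_WORDS_SET);  // drop stop words
        return new TokenStreamComponents(source, sink);
    }
};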
Note: stop words are words that occur very frequently but carry little meaning on their own; they are ignored when building the index and when searching. Examples are English articles, prepositions and conjunctions (an/this/and) and Chinese function words such as 的/也/为.
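If the default list doesn't fit, most analyzers accept a custom stop-word set. A minimal sketch, assuming the Lucene 5+ CharArraySet constructor without a Version argument (the package of CharArraySet varies across versions):

import java.util.Arrays;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// true = match stop words case-insensitively
CharArraySet stopWords = new CharArraySet(Arrays.asList("an", "this", "and", "的", "也", "为"), true);
StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);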

Common Analyzers

The examples below use the text "Hello, this is a test case. 你好，这是一个测试的实例。 created on 20140707" (a sketch that reproduces these token listings follows the list).

StandardAnalyzer: splits on whitespace and punctuation, splits Chinese text into single characters, and removes stop words.

[hello] [test] [case] [你] [好] [这] [是] [一] [个] [测] [试] [的] [实] [例] [created] [20140707]

StopAnalyzer: splits English on whitespace and punctuation, removes stop words, and discards numbers; contiguous Chinese text stays as one token.

[hello] [test] [case] [你好] [这是一个测试的实例] [created]

SimpleAnalyzer: splits English on whitespace and punctuation and discards numbers, but keeps stop words.

[hello] [this] [is] [a] [test] [case] [你好] [这是一个测试的实例] [created] [on]

WhitespaceAnalyzer: splits on whitespace only.

[Hello,] [this] [is] [a] [test] [case.] [你好，这是一个测试的实例。] [created] [on] [20140707]
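The following sketch produces listings like the above. It assumes the Lucene 5.x/6.x API and package layout (the core analyzers live in the lucene-analyzers-common module; in Lucene 8+ StopAnalyzer no longer has a no-argument constructor), so treat it as illustrative rather than version-exact:

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.analysis.core.StopAnalyzer;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {

    // Print every token the analyzer produces, wrapped in brackets.
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        TokenStream stream = analyzer.tokenStream("", text);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset(); // mandatory before the first incrementToken()
        StringBuilder out = new StringBuilder();
        while (stream.incrementToken()) {
            out.append('[').append(term).append("] ");
        }
        stream.end();
        stream.close();
        System.out.println(out.toString().trim());
    }

    public static void main(String[] args) throws IOException {
        String text = "Hello, this is a test case. 你好，这是一个测试的实例。 created on 20140707";
        printTokens(new StandardAnalyzer(), text);
        printTokens(new StopAnalyzer(), text);
        printTokens(new SimpleAnalyzer(), text);
        printTokens(new WhitespaceAnalyzer(), text);
    }
}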
Chinese Analyzers

Single-character segmentation: each Chinese character becomes its own token, as in StandardAnalyzer above.

Bigram segmentation: every two adjacent characters form a token, which shortens the position information stored for each term, as in CJKAnalyzer:

[hello] [test] [case] [你好] [这是] [是一] [一个] [个测] [测试] [试的] [的实] [实例] [created] [20140707]

Dictionary-based segmentation: segments against a dictionary of common words, as in mmseg4j's MaxWordAnalyzer (a sketch comparing both follows below):

[hello] [this] [is] [a] [test] [case] [你好] [这是] [一个] [测试] [的] [实例] [created] [on] [20140707]
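The Chinese analyzers can be compared the same way, reusing the printTokens helper and text from the sketch above. CJKAnalyzer ships with lucene-analyzers-common (org.apache.lucene.analysis.cjk); MaxWordAnalyzer comes from mmseg4j (com.chenlb.mmseg4j.analysis):

import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import com.chenlb.mmseg4j.analysis.MaxWordAnalyzer;

printTokens(new CJKAnalyzer(), text);     // bigram segmentation
printTokens(new MaxWordAnalyzer(), text); // dictionary-based segmentation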
Some Tricks: Getting the Tokens After Segmentation

Before Lucene 3.5:

Analyzer analyzer = new MaxWordAnalyzer();
TokenStream stream = analyzer.tokenStream("", new StringReader(
        "Hello, this is a test case. 你好，这是一个测试的实例。 created on 20140707"));
String out = "";
while (stream.incrementToken()) {
    out += "[" + stream.getAttribute(TermAttribute.class).term() + "]";
}
System.out.println(out);
After Lucene 3.5 (TermAttribute has been replaced by CharTermAttribute, and since Lucene 4 you must call reset() before iterating):

Analyzer analyzer = new StandardAnalyzer();
TokenStream stream = analyzer.tokenStream("", new StringReader(
        "Hello, this is a test case. 你好，这是一个测试的实例。 created on 20140707"));
stream.reset(); // required since Lucene 4
String out = "";
while (stream.incrementToken()) {
    out += "[" + stream.getAttribute(CharTermAttribute.class).toString() + "]";
}
System.out.println(out);


mmseg4j Dictionary

The dictionary files must be UTF-8 encoded. You can specify the dictionary path when instantiating the analyzer, or set the system property mmseg.dic.path. The author says mmseg4j reads the dictionary files from the data directory under the current working directory by default, but I could not reproduce that in my tests. If no path is specified, the dictionaries are loaded from the data directory inside the mmseg4j jar.
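Both ways of pointing mmseg4j at a dictionary directory, as a sketch (the constructor overload that takes a path is an assumption; verify it against your mmseg4j version):

// Option 1: system property, set before the first dictionary is loaded.
System.setProperty("mmseg.dic.path", "/path/to/my/dic");

// Option 2: pass the directory to the analyzer constructor (assumed overload).
Analyzer analyzer = new MaxWordAnalyzer("/path/to/my/dic");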

chars.dic: each line contains a single character and its frequency, separated by a space. You generally don't need to touch it. My guess is that characters with very low frequency can be treated as garbled text and discarded rather than indexed.

units.dic: each line contains a unit word (measure words such as 分 and 亩), so that units are segmented separately.

words.dic: the core lexicon, one word per line. You can download the Sogou lexicon from http://www.sogou.com/labs/dl/w.html.

wordsXXX.dic: custom user dictionary files.

Three jars are needed to use mmseg4j: mmseg4j-core.jar contains the lexicon files; mmseg4j-analysis.jar contains the analyzers (such as MaxWordAnalyzer); mmseg4j-solr.jar contains the Solr integration.
