Several Apache Lucene Word Segmentation (Analyzer) Systems

Source: Internet
Author: User
Tags: deprecated, solr

1. StopAnalyzer

StopAnalyzer filters out specific strings and words supplied in a stop-word list, and converts uppercase characters to lowercase.
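A minimal sketch of this behavior, assuming a Lucene 7/8-era API in which StopAnalyzer takes an explicit CharArraySet of stop words (the demo class, field name, and sample text are illustrative):

    import java.util.Arrays;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.StopAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StopAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            // Words to filter out; the analyzer also lowercases every token.
            CharArraySet stopWords = new CharArraySet(Arrays.asList("the", "and", "of"), true);
            try (StopAnalyzer analyzer = new StopAnalyzer(stopWords);
                 TokenStream ts = analyzer.tokenStream("body", "The Art AND Science of Search")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString()); // art, science, search
                }
                ts.end();
            }
        }
    }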

2. StandardAnalyzer

StandardAnalyzer segments text on whitespace and punctuation, and can also handle numbers, letters, e-mail addresses, IP addresses, and Chinese characters. It supports a stop-word list as well, so it can be used in place of StopAnalyzer to achieve the same filtering.
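A minimal sketch, assuming a recent Lucene StandardAnalyzer with the optional stop-word constructor (demo class, field name, and sample text are illustrative; exact handling of e-mails and IPs differs between the classic and UAX#29-based versions of the tokenizer):

    import java.util.Arrays;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StandardAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            // An optional stop-word set gives StandardAnalyzer the same filtering role as StopAnalyzer.
            CharArraySet stopWords = new CharArraySet(Arrays.asList("the", "at"), true);
            try (StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);
                 TokenStream ts = analyzer.tokenStream("body",
                         "Contact the admin at admin@example.com or 192.168.0.1, room 42")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Prints each token; whether the e-mail and IP stay whole depends on the Lucene version.
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }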

3. SimpleAnalyzer

SimpleAnalyzer is a tokenizer for basic Western-language lexical analysis; when processing lexical units it treats every non-letter character as a segmentation boundary. It does no stop-word filtering, only lexical analysis and segmentation. The output tokens are converted to lowercase, and punctuation and other delimiters are removed.

In full-text retrieval development it is normally used for Western-language text rather than Chinese. Because it performs no stop-word filtering, it needs no stop-word list, and its segmentation strategy is simple: non-letter characters act as separators, so no dictionary is required either.
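A minimal sketch of the splitting behavior, assuming the no-argument SimpleAnalyzer constructor from Lucene 5+ (demo class, field name, and sample text are illustrative):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.SimpleAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class SimpleAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            try (SimpleAnalyzer analyzer = new SimpleAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "XY&Z Corp - release 2.0, info@xyz.com")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Splits at every non-letter (digits, '&', '@', '.', '-') and lowercases:
                    // xy, z, corp, release, info, xyz, com
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }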

4. WhitespaceAnalyzer

WhitespaceAnalyzer is a tokenizer that splits text using whitespace as the delimiter; when processing lexical units it uses the space character as the segmentation boundary. It performs no stop-word filtering and no lowercase conversion.

In practice it can be used for Western-language text in specific environments. Because it does neither stop-word filtering nor lowercase conversion, it needs no stop-word list; its segmentation strategy simply uses whitespace as the separator and requires no dictionary.
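For contrast with SimpleAnalyzer, a minimal sketch on the same sample text, assuming the no-argument WhitespaceAnalyzer constructor from Lucene 5+:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class WhitespaceAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            try (WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "XY&Z Corp - release 2.0, info@xyz.com")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Splits only at whitespace, keeping case and punctuation:
                    // XY&Z, Corp, -, release, 2.0,, info@xyz.com
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }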

5. KeywordAnalyzer

KeywordAnalyzer treats the entire input as a single lexical unit, which makes it convenient for indexing and retrieving special types of text. For text such as postal codes and addresses, using this keyword tokenizer to build index entries is very convenient.
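A minimal sketch; the entire input comes back as one token (demo class, field name, and sample text are illustrative):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class KeywordAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            try (KeywordAnalyzer analyzer = new KeywordAnalyzer();
                 TokenStream ts = analyzer.tokenStream("zip", "100080 Haidian District, Beijing")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Exactly one token: the whole input string, unchanged.
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }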

6. CJKAnalyzer

CJKAnalyzer internally calls the CJKTokenizer to segment Chinese text into overlapping two-character tokens, and uses a StopFilter for stop-word removal, so it provides both bigram segmentation of Chinese and stop-word filtering. Deprecated in Lucene 3.0.
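A minimal sketch, assuming the CJKAnalyzer shipped in lucene-analyzers-common, which in current releases emits overlapping two-character tokens for CJK text:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class CJKAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            try (CJKAnalyzer analyzer = new CJKAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "中华人民共和国")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Overlapping bigrams: 中华, 华人, 人民, 民共, 共和, 和国
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }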

7. ChineseAnalyzer

ChineseAnalyzer behaves essentially the same as StandardAnalyzer when processing Chinese: the text is split into individual Chinese characters. Deprecated in Lucene 3.0.

8. PerFieldAnalyzerWrapper

PerFieldAnalyzerWrapper is mainly used when different fields require different analyzers. For example, a file-name field may need KeywordAnalyzer, while StandardAnalyzer is enough for the file contents. Analyzers are registered per field via addAnalyzer().
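A minimal sketch of the file-name/contents example. addAnalyzer() belongs to older Lucene releases; in Lucene 4 and later the per-field mapping is passed to the constructor instead, as assumed here (demo class, field names, and sample text are illustrative):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PerFieldDemo {
        public static void main(String[] args) throws Exception {
            Map<String, Analyzer> perField = new HashMap<>();
            perField.put("filename", new KeywordAnalyzer()); // keep file names as one token
            // Fields without an explicit mapping fall back to the default analyzer.
            PerFieldAnalyzerWrapper wrapper =
                    new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

            printTokens(wrapper, "filename", "Annual Report 2023.pdf"); // single token
            printTokens(wrapper, "content", "Annual Report 2023.pdf");  // standard tokens
        }

        static void printTokens(Analyzer analyzer, String field, String text) throws Exception {
            try (TokenStream ts = analyzer.tokenStream(field, text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(field + ": " + term);
                }
                ts.end();
            }
        }
    }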

9. IKAnalyzer

IKAnalyzer is a third-party tokenizer implementation that extends Lucene's Analyzer class and is aimed at Chinese text processing. It implements dictionary-based full segmentation in both forward and backward directions, as well as forward and backward maximum-matching segmentation.
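A minimal sketch, assuming the third-party IK Analyzer distribution whose org.wltea.analyzer.lucene.IKAnalyzer class takes a boolean switching between smart and fine-grained segmentation (class name and constructor come from that project, not from Lucene, and may differ between IK versions):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.wltea.analyzer.lucene.IKAnalyzer; // third-party class; package assumed from the IK distribution

    public class IKAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            // true = "smart" (coarser) segmentation; false = fine-grained full segmentation.
            try (Analyzer analyzer = new IKAnalyzer(true);
                 TokenStream ts = analyzer.tokenStream("body", "中华人民共和国国歌")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }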

10. JE-Analysis

JE-Analysis is a Chinese tokenizer for Lucene and must be downloaded separately.

11. ictclas4j

The ictclas4j Chinese word segmentation system is an open-source Java tokenizer that sinboy developed on the basis of FreeICTCLAS, which was created by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences. It simplifies the complexity of the original segmentation program and is intended to give the many fans of Chinese word segmentation a better opportunity to learn.

12. imdict-chinese-analyzer

imdict-chinese-analyzer is the intelligent Chinese word segmentation module of the imdict intelligent dictionary. It is based on a Hidden Markov Model (HMM) and is a Java re-implementation of the ICTCLAS Chinese segmenter; it can directly provide Simplified Chinese word segmentation support for the Lucene search engine.
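This code was later contributed to Lucene as the smartcn module; a minimal sketch, assuming the SmartChineseAnalyzer from lucene-analyzers-smartcn (demo class, field name, and sample text are illustrative):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class SmartcnDemo {
        public static void main(String[] args) throws Exception {
            try (SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "我是中国人")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // HMM-based segmentation into whole words, e.g. 我 / 是 / 中国 / 人
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }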

13. Paoding Analysis

Paoding Analysis is a Chinese word segmenter with very high efficiency and good extensibility. It introduces the notion of "metaphor", adopts a fully object-oriented design, and is built on forward-looking ideas. Its efficiency is high: on a Pentium III personal machine with 1 GB of memory it can accurately segment one million Chinese characters in one second. It segments text effectively against an unlimited number of dictionary files, so word categories can be defined, and it handles unknown words reasonably.

14. mmseg4j

mmseg4j uses Chih-Hao Tsai's MMSEG algorithm (http://technology.chtsai.org/mmseg/) to implement a Chinese tokenizer, and it provides implementations of Lucene's Analyzer and Solr's TokenizerFactory for easy use in Lucene and Solr. The MMSEG algorithm offers two segmentation methods, Simple and Complex, both based on forward maximum matching; Complex adds four extra disambiguation rules. According to the author, the word recognition accuracy reaches 98.41%. mmseg4j implements both of these segmentation methods.
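A minimal sketch, assuming the mmseg4j distribution's ComplexAnalyzer class (package and class names come from that third-party project and may vary between mmseg4j versions):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import com.chenlb.mmseg4j.analysis.ComplexAnalyzer; // third-party class; name assumed from the mmseg4j project

    public class Mmseg4jDemo {
        public static void main(String[] args) throws Exception {
            // "Complex" mode: forward maximum matching plus the four extra disambiguation rules.
            try (Analyzer analyzer = new ComplexAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "研究生命起源")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }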
