Several Apache Lucene Word Segmentation (Analyzer) Systems

Source: Internet
Author: User
Tags: deprecated, solr

1. StopAnalyzer

StopAnalyzer filters out specific strings and words supplied in a stop-word list, and converts uppercase characters to lowercase.
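A minimal sketch of this behavior, assuming a Lucene 7/8-era API in which StopAnalyzer takes an explicit CharArraySet of stop words (the demo class, field name, and sample text are illustrative):

    import java.util.Arrays;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.StopAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StopAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            // Words to filter out; the analyzer also lowercases every token.
            CharArraySet stopWords = new CharArraySet(Arrays.asList("the", "and", "of"), true);
            try (StopAnalyzer analyzer = new StopAnalyzer(stopWords);
                 TokenStream ts = analyzer.tokenStream("body", "The Art AND Science of Search")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString()); // art, science, search
                }
                ts.end();
            }
        }
    }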

2. StandardAnalyzer

StandardAnalyzer segments text on whitespace and punctuation, and can also handle numbers, letters, e-mail addresses, IP addresses, and Chinese characters. It supports a stop-word list as well, so it can be used in place of StopAnalyzer to achieve the same filtering.
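A minimal sketch, assuming a recent Lucene StandardAnalyzer with the optional stop-word constructor (demo class, field name, and sample text are illustrative; exact handling of e-mails and IPs differs between the classic and UAX#29-based versions of the tokenizer):

    import java.util.Arrays;
    import org.apache.lucene.analysis.CharArraySet;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class StandardAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            // An optional stop-word set gives StandardAnalyzer the same filtering role as StopAnalyzer.
            CharArraySet stopWords = new CharArraySet(Arrays.asList("the", "at"), true);
            try (StandardAnalyzer analyzer = new StandardAnalyzer(stopWords);
                 TokenStream ts = analyzer.tokenStream("body",
                         "Contact the admin at admin@example.com or 192.168.0.1, room 42")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Prints each token; whether the e-mail and IP stay whole depends on the Lucene version.
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }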

3. SimpleAnalyzer

SimpleAnalyzer is a tokenizer for basic Western-language lexical analysis; when processing lexical units it treats every non-letter character as a segmentation boundary. It does no stop-word filtering, only lexical analysis and segmentation. The output tokens are converted to lowercase, and punctuation and other delimiters are removed.

In full-text retrieval development it is normally used for Western-language text rather than Chinese. Because it performs no stop-word filtering, it needs no stop-word list, and its segmentation strategy is simple: non-letter characters act as separators, so no dictionary is required either.
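A minimal sketch of the splitting behavior, assuming the no-argument SimpleAnalyzer constructor from Lucene 5+ (demo class, field name, and sample text are illustrative):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.SimpleAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class SimpleAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            try (SimpleAnalyzer analyzer = new SimpleAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "XY&Z Corp - release 2.0, info@xyz.com")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Splits at every non-letter (digits, '&', '@', '.', '-') and lowercases:
                    // xy, z, corp, release, info, xyz, com
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }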

4. WhitespaceAnalyzer

WhitespaceAnalyzer is a tokenizer that splits text using whitespace as the delimiter; when processing lexical units it uses the space character as the segmentation boundary. It performs no stop-word filtering and no lowercase conversion.

In practice it can be used for Western-language text in specific environments. Because it does neither stop-word filtering nor lowercase conversion, it needs no stop-word list; its segmentation strategy simply uses whitespace as the separator and requires no dictionary.
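For contrast with SimpleAnalyzer, a minimal sketch on the same sample text, assuming the no-argument WhitespaceAnalyzer constructor from Lucene 5+:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class WhitespaceAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            try (WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "XY&Z Corp - release 2.0, info@xyz.com")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Splits only at whitespace, keeping case and punctuation:
                    // XY&Z, Corp, -, release, 2.0,, info@xyz.com
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }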

5. KeywordAnalyzer

KeywordAnalyzer treats the entire input as a single lexical unit, which makes it convenient for indexing and retrieving special types of text. For text such as postal codes and addresses, using this keyword tokenizer to build index entries is very convenient.
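A minimal sketch; the entire input comes back as one token (demo class, field name, and sample text are illustrative):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class KeywordAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            try (KeywordAnalyzer analyzer = new KeywordAnalyzer();
                 TokenStream ts = analyzer.tokenStream("zip", "100080 Haidian District, Beijing")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Exactly one token: the whole input string, unchanged.
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }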

6. CJKAnalyzer

CJKAnalyzer internally calls the CJKTokenizer to segment Chinese text into overlapping two-character tokens, and uses a StopFilter for stop-word removal, so it provides both bigram segmentation of Chinese and stop-word filtering. Deprecated in Lucene 3.0.
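A minimal sketch, assuming the CJKAnalyzer shipped in lucene-analyzers-common, which in current releases emits overlapping two-character tokens for CJK text:

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cjk.CJKAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class CJKAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            try (CJKAnalyzer analyzer = new CJKAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "中华人民共和国")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Overlapping bigrams: 中华, 华人, 人民, 民共, 共和, 和国
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }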

7. ChineseAnalyzer

ChineseAnalyzer behaves essentially the same as StandardAnalyzer when processing Chinese: the text is split into individual Chinese characters. Deprecated in Lucene 3.0.

8. PerFieldAnalyzerWrapper

PerFieldAnalyzerWrapper is mainly used when different fields require different analyzers. For example, a file-name field may need KeywordAnalyzer, while StandardAnalyzer is enough for the file contents. Analyzers are registered per field via addAnalyzer().
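A minimal sketch of the file-name/contents example. addAnalyzer() belongs to older Lucene releases; in Lucene 4 and later the per-field mapping is passed to the constructor instead, as assumed here (demo class, field names, and sample text are illustrative):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PerFieldDemo {
        public static void main(String[] args) throws Exception {
            Map<String, Analyzer> perField = new HashMap<>();
            perField.put("filename", new KeywordAnalyzer()); // keep file names as one token
            // Fields without an explicit mapping fall back to the default analyzer.
            PerFieldAnalyzerWrapper wrapper =
                    new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);

            printTokens(wrapper, "filename", "Annual Report 2023.pdf"); // single token
            printTokens(wrapper, "content", "Annual Report 2023.pdf");  // standard tokens
        }

        static void printTokens(Analyzer analyzer, String field, String text) throws Exception {
            try (TokenStream ts = analyzer.tokenStream(field, text)) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(field + ": " + term);
                }
                ts.end();
            }
        }
    }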

9. IKAnalyzer

IKAnalyzer is a third-party tokenizer implementation that extends Lucene's Analyzer class and is aimed at Chinese text processing. It implements dictionary-based full segmentation in both forward and backward directions, as well as forward and backward maximum-matching segmentation.
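A minimal sketch, assuming the third-party IK Analyzer distribution whose org.wltea.analyzer.lucene.IKAnalyzer class takes a boolean switching between smart and fine-grained segmentation (class name and constructor come from that project, not from Lucene, and may differ between IK versions):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.wltea.analyzer.lucene.IKAnalyzer; // third-party class; package assumed from the IK distribution

    public class IKAnalyzerDemo {
        public static void main(String[] args) throws Exception {
            // true = "smart" (coarser) segmentation; false = fine-grained full segmentation.
            try (Analyzer analyzer = new IKAnalyzer(true);
                 TokenStream ts = analyzer.tokenStream("body", "中华人民共和国国歌")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }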

10. JE-Analysis

JE-Analysis is a Chinese tokenizer for Lucene and must be downloaded separately.

11. ictclas4j

The ictclas4j Chinese word segmentation system is an open-source Java tokenizer that sinboy developed on the basis of FreeICTCLAS, which was created by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences. It simplifies the complexity of the original segmentation program and is intended to give the many fans of Chinese word segmentation a better opportunity to learn.

12. imdict-chinese-analyzer

imdict-chinese-analyzer is the intelligent Chinese word segmentation module of the imdict intelligent dictionary. It is based on a Hidden Markov Model (HMM) and is a Java re-implementation of the ICTCLAS Chinese segmenter; it can directly provide Simplified Chinese word segmentation support for the Lucene search engine.
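This code was later contributed to Lucene as the smartcn module; a minimal sketch, assuming the SmartChineseAnalyzer from lucene-analyzers-smartcn (demo class, field name, and sample text are illustrative):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class SmartcnDemo {
        public static void main(String[] args) throws Exception {
            try (SmartChineseAnalyzer analyzer = new SmartChineseAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "我是中国人")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // HMM-based segmentation into whole words, e.g. 我 / 是 / 中国 / 人
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }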

13. Paoding Analysis

Paoding Analysis is a Chinese word segmenter with very high efficiency and good extensibility. It introduces the notion of "metaphor", adopts a fully object-oriented design, and is built on forward-looking ideas. Its efficiency is high: on a Pentium III personal machine with 1 GB of memory it can accurately segment one million Chinese characters in one second. It segments text effectively against an unlimited number of dictionary files, so word categories can be defined, and it handles unknown words reasonably.

14. mmseg4j

mmseg4j uses Chih-Hao Tsai's MMSEG algorithm (http://technology.chtsai.org/mmseg/) to implement a Chinese tokenizer, and it provides implementations of Lucene's Analyzer and Solr's TokenizerFactory for easy use in Lucene and Solr. The MMSEG algorithm offers two segmentation methods, Simple and Complex, both based on forward maximum matching; Complex adds four extra disambiguation rules. According to the author, the word recognition accuracy reaches 98.41%. mmseg4j implements both of these segmentation methods.
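A minimal sketch, assuming the mmseg4j distribution's ComplexAnalyzer class (package and class names come from that third-party project and may vary between mmseg4j versions):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import com.chenlb.mmseg4j.analysis.ComplexAnalyzer; // third-party class; name assumed from the mmseg4j project

    public class Mmseg4jDemo {
        public static void main(String[] args) throws Exception {
            // "Complex" mode: forward maximum matching plus the four extra disambiguation rules.
            try (Analyzer analyzer = new ComplexAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body", "研究生命起源")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }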
