1. StopAnalyzer
StopAnalyzer filters specific words (stop words) out of the token stream and converts all tokens to lowercase.
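The behavior can be modeled in a few lines of Python. This is an illustrative sketch, not the Lucene API: split on non-letter characters, lowercase each token, then drop stop words (the stop-word set here is a small sample, not Lucene's actual default list).

```python
# Toy model of StopAnalyzer: tokenize on non-letters, lowercase, filter stop words.
STOP_WORDS = {"a", "an", "and", "the", "is", "of"}  # sample set, not Lucene's exact list

def stop_analyze(text):
    tokens, word = [], []
    for ch in text + " ":          # trailing space flushes the final token
        if ch.isalpha():
            word.append(ch.lower())
        elif word:
            tokens.append("".join(word))
            word = []
    return [t for t in tokens if t not in STOP_WORDS]

print(stop_analyze("The Quick Brown Fox is fast"))
# -> ['quick', 'brown', 'fox', 'fast']
```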
2. StandardAnalyzer
StandardAnalyzer splits text on whitespace and punctuation, and can also correctly handle numbers, letters, e-mail addresses, IP addresses, and Chinese characters. It supports stop-word filtering as well, so it can replace StopAnalyzer wherever that filtering function is needed.
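A rough Python approximation of this behavior (Lucene's real tokenizer uses a full grammar; the regex below is a simplification for illustration only): keep e-mail addresses and alphanumeric runs as whole tokens, lowercase them, and filter a sample stop-word set.

```python
import re

# Crude approximation of StandardAnalyzer: e-mails and alphanumeric runs
# become single tokens; tokens are lowercased; stop words are dropped.
STOP_WORDS = {"the", "is", "at"}  # sample set, not Lucene's exact list

def standard_analyze(text):
    pattern = r"[\w.+-]+@[\w-]+\.[\w.]+|\w+"   # e-mail address OR word/number run
    return [t.lower() for t in re.findall(pattern, text)
            if t.lower() not in STOP_WORDS]

print(standard_analyze("Contact admin@example.com at port 8080"))
# -> ['contact', 'admin@example.com', 'port', '8080']
```

Note how the e-mail address survives as one token, which a naive punctuation split would break apart.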
3. SimpleAnalyzer
SimpleAnalyzer is a basic analyzer for Western text. When producing tokens it treats every non-letter character as a separator, lowercases each token, and discards punctuation and other delimiters. It performs no stop-word filtering.
In full-text retrieval development it is typically used for Western text only, not Chinese. Because it does no word filtering, it needs no stop-word list; and because its segmentation strategy is simply to split on non-letter characters, it needs no dictionary support either.
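An illustrative one-line model of this rule (not the Lucene API): extract runs of letters and lowercase them, so digits, underscores, and punctuation all act as separators.

```python
import re

# Toy model of SimpleAnalyzer: any non-letter is a separator; tokens are lowercased.
def simple_analyze(text):
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(simple_analyze("XY&Z Corp - file_2024.txt"))
# -> ['xy', 'z', 'corp', 'file', 'txt']
```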
4. WhitespaceAnalyzer
WhitespaceAnalyzer splits text using whitespace as the only delimiter. It performs no stop-word filtering and no lowercase conversion.
In practice it can be used in environments that process Western text where case and punctuation must be preserved exactly. Because it does no filtering and no case conversion, it needs neither a stop-word list nor dictionary support.
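The contrast with SimpleAnalyzer is easy to see in a sketch (illustrative, not the Lucene API): splitting only on whitespace keeps case and punctuation intact inside each token.

```python
# Toy model of WhitespaceAnalyzer: whitespace is the only delimiter;
# case and punctuation are preserved unchanged.
def whitespace_analyze(text):
    return text.split()

print(whitespace_analyze("XY&Z Corp - ID:42"))
# -> ['XY&Z', 'Corp', '-', 'ID:42']
```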
5. KeywordAnalyzer
KeywordAnalyzer treats the entire input as a single token, which makes it convenient to index and retrieve special types of text. For information such as postal codes and addresses, using this keyword analyzer to build index entries is very convenient.
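A minimal model of this (not the Lucene API): the whole value becomes one token, so a postal code or ID matches only as an exact whole, never by its parts.

```python
# Toy model of KeywordAnalyzer: the entire input is emitted as a single token.
def keyword_analyze(text):
    return [text]

print(keyword_analyze("100-0001 Tokyo"))
# -> ['100-0001 Tokyo']
```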
6. CJKAnalyzer
CJKAnalyzer internally calls CJKTokenizer to segment Chinese text and uses StopFilter to perform stop-word removal, so it provides overlapping two-character (bigram) segmentation of Chinese together with stop-word filtering. It is deprecated as of Lucene 3.0.
7. ChineseAnalyzer
ChineseAnalyzer behaves essentially the same as StandardAnalyzer when processing Chinese: text is split into individual double-byte Chinese characters. It is deprecated as of Lucene 3.0.
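The difference between the two Chinese analyzers above can be sketched in Python (an illustrative model of the assumed behavior, not the Lucene API): ChineseAnalyzer emits one token per character (unigrams), while CJKAnalyzer emits overlapping two-character tokens (bigrams).

```python
# Unigram segmentation: one token per CJK character (ChineseAnalyzer-style).
def unigrams(text):
    return list(text)

# Overlapping bigram segmentation (CJKAnalyzer-style).
def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

s = "中华人民"
print(unigrams(s))  # -> ['中', '华', '人', '民']
print(bigrams(s))   # -> ['中华', '华人', '人民']
```

Bigrams cut down false matches compared with unigrams, at the cost of a larger index.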
8. PerFieldAnalyzerWrapper
PerFieldAnalyzerWrapper is mainly used when different fields of a document require different analyzers. For example, a file-name field may need KeywordAnalyzer while the file-contents field uses StandardAnalyzer. Per-field analyzers can be registered with addAnalyzer().
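The dispatch idea behind the wrapper can be modeled like this (an illustrative sketch, not the Lucene API; the two analyzers here are crude stand-ins): a default analyzer plus a map of per-field overrides.

```python
# Toy model of PerFieldAnalyzerWrapper: route each field to its own analyzer,
# falling back to a default analyzer for fields without an override.
def make_per_field(default, overrides):
    def analyze(field, text):
        return overrides.get(field, default)(text)
    return analyze

keyword = lambda text: [text]                 # whole value as one token
standard = lambda text: text.lower().split()  # crude stand-in for StandardAnalyzer

analyze = make_per_field(standard, {"filename": keyword})
print(analyze("filename", "Report 2024.pdf"))   # -> ['Report 2024.pdf']
print(analyze("contents", "Quarterly Report"))  # -> ['quarterly', 'report']
```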
9. IKAnalyzer
IKAnalyzer is a third-party analyzer for Chinese text that extends Lucene's Analyzer class. It implements dictionary-based full segmentation in both the forward and backward directions, as well as forward and backward maximum-matching methods.
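Forward maximum matching, the core dictionary-based technique mentioned above, greedily takes the longest dictionary word starting at the current position. A sketch with a toy dictionary (the dictionary and maximum word length here are illustrative assumptions, not IKAnalyzer's actual data):

```python
# Dictionary-based forward maximum matching with a toy dictionary.
DICT = {"研究", "研究生", "生命", "起源"}
MAX_LEN = 3  # longest dictionary entry, in characters

def forward_max_match(text):
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in DICT:
                tokens.append(piece)
                i += size
                break
    return tokens

print(forward_max_match("研究生命起源"))
# -> ['研究生', '命', '起源']
```

This example also shows the classic weakness of the purely forward method: the greedy match "研究生" leaves "命" stranded, which is why backward matching and disambiguation rules are used alongside it.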
10. JE-Analysis
JE-Analysis is a Chinese analyzer for Lucene; it must be downloaded separately.
11. ictclas4j
The ictclas4j Chinese word-segmentation system is a Java open-source segmenter completed by sinboy on the basis of FreeICTCLAS, which was developed by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences. It simplifies the complexity of the original segmentation program and is intended to give the many enthusiasts of Chinese word segmentation a better learning resource.
12. imdict-chinese-analyzer
imdict-chinese-analyzer is the intelligent Chinese word-segmentation module of the imdict intelligent dictionary. Based on a Hidden Markov Model (HMM), it is a Java re-implementation of the ICTCLAS Chinese segmenter and can directly provide Simplified Chinese word-segmentation support for the Lucene search engine.
13. Paoding Analysis
Paoding Analysis offers very high Chinese word-segmentation efficiency and good extensibility, with a fully object-oriented design. Its efficiency is high: on a PIII personal machine with 1 GB of memory, it can accurately segment one million Chinese characters per second. It segments text against an unlimited number of dictionary files, allowing word categories to be defined, and it handles unknown words reasonably.
14. mmseg4j
mmseg4j implements a Chinese word breaker using Chih-Hao Tsai's MMSEG algorithm (http://technology.chtsai.org/mmseg/), and provides Lucene Analyzer and Solr TokenizerFactory implementations for easy use in Lucene and Solr. The MMSEG algorithm has two segmentation modes, simple and complex, both based on forward maximum matching; complex adds four disambiguation rules. The author reports a correct word-recognition rate of 98.41%. mmseg4j implements both of these segmentation modes.
The above summarizes several word-segmentation systems (analyzers) available for Apache Lucene.