1. StopAnalyzer
StopAnalyzer filters out stop words (common function words such as "a", "the", or "of") and converts tokens to lowercase.
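The effect can be sketched in plain Java. This is not the Lucene implementation (StopAnalyzer actually chains a letter tokenizer, a lowercase filter, and a stop filter); the stop-word set here is a made-up example, since Lucene ships its own default list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StopWordSketch {
    // Hypothetical stop-word set; Lucene ships its own default list.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "is", "of");

    // Lowercase the input, split on whitespace, and drop stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stop words "The"/"and"/"the" are removed, the rest lowercased.
        System.out.println(tokenize("The Quick Fox and the Hound"));
    }
}
```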
2. StandardAnalyzer
StandardAnalyzer splits text on spaces and punctuation. It can also recognize numbers, letters, e-mail addresses, IP addresses, and Chinese characters, and it supports a stop-word list, replacing the filtering function of StopAnalyzer.
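A very rough sketch of one distinctive behavior, keeping e-mail addresses as single tokens while otherwise splitting on punctuation. The regex is an illustration only; the real StandardAnalyzer uses a full grammar-based tokenizer, not this pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StandardLikeSketch {
    // Illustrative pattern: match an e-mail address as a whole,
    // otherwise match runs of letters and digits.
    static final Pattern TOKEN =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+|[A-Za-z0-9]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The address survives as one token; other words are split.
        System.out.println(tokenize("Contact Bob@Example.com, room 42!"));
    }
}
```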
3. SimpleAnalyzer
SimpleAnalyzer provides a tokenizer for basic Western-language text. It treats any non-letter character as a delimiter, converts each token to lowercase, and discards punctuation and other separators. It does not filter stop words.
In full-text search systems it is typically used for Western-language text; it is not suitable for Chinese, because its strategy is simply to split on non-letter characters and it performs no dictionary-based word segmentation.
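A minimal sketch of this behavior: split on non-letters and lowercase. (The real SimpleAnalyzer classifies letters with Unicode rules; this ASCII-only version is just for illustration.)

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleSketch {
    // Split on any non-letter character and lowercase each token,
    // roughly what SimpleAnalyzer does. ASCII letters only, for brevity.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^A-Za-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Digits and punctuation act as delimiters and are dropped.
        System.out.println(tokenize("XY&Z Corp - xyz@example.com"));
    }
}
```

Note how the e-mail address is broken apart here; this is exactly where StandardAnalyzer behaves differently.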
4. WhitespaceAnalyzer
WhitespaceAnalyzer uses whitespace as the only delimiter. It neither filters stop words nor converts tokens to lowercase.
In practice it is useful for Western-language text in environments where tokens must be preserved exactly as written. Like SimpleAnalyzer, it performs no stop-word filtering and no dictionary-based segmentation, so it is not suitable for Chinese text.
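A sketch of the behavior, again not the Lucene implementation: only whitespace separates tokens, and case and punctuation pass through untouched.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSketch {
    // Split on runs of whitespace only; case and punctuation
    // are preserved, matching WhitespaceAnalyzer's behavior.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "be," and "to-be." stay intact; "NOT" keeps its case.
        System.out.println(tokenize("To be,  or NOT to-be."));
    }
}
```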
5. KeywordAnalyzer
KeywordAnalyzer treats the entire input as a single token, which makes it easy to index and retrieve special types of text. The keyword tokenizer is very convenient for building index entries for ZIP codes, addresses, and similar identifiers.
6. CJKAnalyzer
CJKAnalyzer internally calls the CJKTokenizer to segment Chinese text and uses StopFilter to remove stop words. It has been deprecated since Lucene 3.0.
7. ChineseAnalyzer
ChineseAnalyzer handles Chinese text much like StandardAnalyzer: it splits the text into individual Chinese characters. It has been deprecated since Lucene 3.0.
8. PerFieldAnalyzerWrapper
PerFieldAnalyzerWrapper is used to apply a different Analyzer to each field. For example, a file-name field might need KeywordAnalyzer while the file-content field only needs StandardAnalyzer. Analyzers are registered per field with addAnalyzer().
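The idea can be sketched without Lucene as a map from field name to tokenizing function; PerFieldAnalyzerWrapper works the same way, mapping field names to Analyzer instances with a default for unmapped fields. The field names and the two toy tokenizers here are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PerFieldSketch {
    // Keyword-style: the whole value is one token (like KeywordAnalyzer).
    static final Function<String, List<String>> KEYWORD = List::of;
    // Standard-style: lowercase and split on non-word characters.
    static final Function<String, List<String>> STANDARD =
            s -> Arrays.asList(s.toLowerCase().split("\\W+"));

    // Hypothetical field mapping: file names stay whole, contents are split.
    static final Map<String, Function<String, List<String>>> PER_FIELD =
            Map.of("filename", KEYWORD, "contents", STANDARD);

    // Unmapped fields fall back to the standard-style tokenizer.
    static List<String> analyze(String field, String text) {
        return PER_FIELD.getOrDefault(field, STANDARD).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("filename", "Report-2020.pdf"));
        System.out.println(analyze("contents", "Annual Report"));
    }
}
```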
9. IKAnalyzer
IKAnalyzer is a third-party Chinese tokenizer that extends Lucene's Analyzer class. It implements dictionary-based full splitting and maximum matching splitting, in both the forward and reverse directions.
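Forward maximum matching, one of the techniques named above, can be sketched as follows: at each position, take the longest dictionary word that matches, falling back to a single character. The tiny dictionary is a toy example; a real tokenizer such as IKAnalyzer ships a large dictionary and also matches in the reverse direction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ForwardMaxMatch {
    // Toy dictionary for illustration only.
    static final Set<String> DICT = Set.of("中华", "中华人民", "共和国", "人民", "万岁");

    // Forward maximum matching: greedily take the longest dictionary
    // word starting at each position; unknown characters pass through
    // one at a time.
    static List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int best = 1; // fall back to a single character
            for (int j = text.length(); j > i; j--) {
                if (DICT.contains(text.substring(i, j))) {
                    best = j - i;
                    break;
                }
            }
            words.add(text.substring(i, i + best));
            i += best;
        }
        return words;
    }

    public static void main(String[] args) {
        // "中华人民" wins over the shorter "中华" because matching is greedy.
        System.out.println(segment("中华人民共和国万岁"));
    }
}
```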
10. JE-Analysis
JE-Analysis is a Chinese word segmentation component for Lucene; it must be downloaded separately.
11. ICTCLAS4J
ictclas4j is an open-source Java Chinese word segmentation project by sinboy, based on FreeICTCLAS developed by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences. It simplifies the original segmentation program with the aim of giving Chinese word segmentation enthusiasts a better opportunity to study it.
12. Imdict-Chinese-Analyzer
imdict-chinese-analyzer is the intelligent Chinese word segmentation module of the imdict intelligent dictionary. Its algorithm is based on the Hidden Markov Model (HMM), and it is a Java re-implementation of the ICTCLAS Chinese word segmentation program from the Institute of Computing Technology, Chinese Academy of Sciences. It provides simplified-Chinese word segmentation support directly to Lucene search engines.
13. Paoding Analysis
Paoding Analysis is an efficient, scalable Chinese word segmentation component with a fully object-oriented design. Its performance is high: on a PIII personal machine with 1 GB of memory, it can accurately segment one million Chinese characters per second. It supports dictionary files with an unlimited number of entries, lets you define word categories, and can reasonably parse unknown words.
14. MMSeg4J
mmseg4j is a Chinese tokenizer implementing Chih-Hao Tsai's MMSeg algorithm (http://technology.chtsai.org/mmseg/). It provides a Lucene Analyzer and a Solr TokenizerFactory for easy use in both. MMSeg offers two segmentation methods, Simple and Complex, both based on forward maximum matching; Complex adds four disambiguation rules. According to the author, the correct word recognition rate reaches 98.41%. mmseg4j implements both methods.