Several recommended Apache Lucene word segmentation systems


1. StopAnalyzer

StopAnalyzer filters specific stop words out of the token stream and converts the remaining tokens to lowercase letters.
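Its behavior is easy to see by printing the token stream directly. Below is a minimal sketch, assuming the Lucene 3.0-era API that matches this article's version references (Version-based constructors and TermAttribute; Lucene 3.1+ uses CharTermAttribute instead):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class StopAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        StopAnalyzer analyzer = new StopAnalyzer(Version.LUCENE_30);
        TokenStream ts = analyzer.tokenStream("content",
                new StringReader("The Quick Brown FOX and the lazy dog"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset(); // a no-op in 3.0, mandatory in later versions
        while (ts.incrementToken()) {
            System.out.print("[" + term.term() + "] ");
        }
        // Stop words ("the", "and") are dropped and everything is
        // lowercased: [quick] [brown] [fox] [lazy] [dog]
    }
}
```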

2. StandardAnalyzer

StandardAnalyzer segments text on whitespace and punctuation. It also recognizes numbers, letters, e-mail addresses, IP addresses, and Chinese characters as tokens, and it can filter against a stop-word list, replacing the filtering function implemented by StopAnalyzer.
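The same token-printing pattern shows what StandardAnalyzer keeps intact; note that the e-mail address and the IP address each survive as a single token (Lucene 3.0-era API assumed, as above):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class StandardAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new StandardAnalyzer(Version.LUCENE_30).tokenStream(
                "content",
                new StringReader("Mail admin@example.com to host 192.168.0.1"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term.term() + "] ");
        }
        // Expected output, roughly (the stop word "to" is removed):
        // [mail] [admin@example.com] [host] [192.168.0.1]
    }
}
```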

3. SimpleAnalyzer

SimpleAnalyzer provides a tokenizer for basic Western-language (Latin-script) text. When forming tokens it treats non-letter characters as delimiters, performs no stop-word filtering, converts output tokens to lowercase, and discards punctuation marks and other delimiters.

In full-text search development, it is typically used to process Western-language text; it does not support Chinese. Because it performs no stop-word filtering, it suits scenarios that do not need it, and its segmentation policy is simple: non-letter characters serve as separators, with no dictionary-based word segmentation required.
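A minimal sketch of the splitting behavior (Lucene 3.0-era API assumed; SimpleAnalyzer took no constructor arguments in that release):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class SimpleAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new SimpleAnalyzer().tokenStream("content",
                new StringReader("The price is USD42.50, OK?"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term.term() + "] ");
        }
        // Digits and punctuation are treated as delimiters and dropped,
        // and tokens are lowercased: [the] [price] [is] [usd] [ok]
    }
}
```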

4. WhitespaceAnalyzer

WhitespaceAnalyzer is a tokenizer that uses whitespace as its only delimiter. When producing tokens it splits on space characters, and it performs neither stop-word filtering nor lowercase conversion.

In practice, it can be used to process Western-language text in specific environments. Because it performs neither stop-word filtering nor lowercase conversion, it suits scenarios that need neither; its segmentation policy simply splits on whitespace, with no dictionary-based word segmentation required.
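Running the same input through WhitespaceAnalyzer makes the contrast with SimpleAnalyzer clear (Lucene 3.0-era API assumed):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class WhitespaceAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new WhitespaceAnalyzer().tokenStream("content",
                new StringReader("The price is USD42.50, OK?"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term.term() + "] ");
        }
        // Only whitespace delimits; case and punctuation pass through:
        // [The] [price] [is] [USD42.50,] [OK?]
    }
}
```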

5. KeywordAnalyzer

KeywordAnalyzer treats the entire input as a single token, which makes it easy to index and retrieve special types of text. The keyword tokenizer is very convenient for building index entries for ZIP codes, addresses, and similar text.
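A minimal sketch (Lucene 3.0-era API assumed); the "zip" field name is only for illustration:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class KeywordAnalyzerDemo {
    public static void main(String[] args) throws Exception {
        TokenStream ts = new KeywordAnalyzer().tokenStream("zip",
                new StringReader("100080-0021"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term.term() + "] ");
        }
        // The entire input is one token: [100080-0021]
    }
}
```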

6. CJKAnalyzer

CJKAnalyzer internally calls the CJKTokenizer to segment Chinese text and uses a StopFilter to remove stop words. It was deprecated in Lucene 3.0; a sketch contrasting it with ChineseAnalyzer follows the next section.

7. ChineseAnalyzer

ChineseAnalyzer handles Chinese essentially the same way as StandardAnalyzer: text is split into individual double-byte Chinese characters. It was deprecated in Lucene 3.0.
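The sketch below contrasts the two deprecated analyzers (Lucene 3.0-era contrib API assumed; constructor signatures vary slightly across 3.x releases):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class CjkVsChineseDemo {
    static void print(Analyzer analyzer, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term.term() + "] ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String text = "中华人民共和国";
        // ChineseAnalyzer: one token per character, roughly
        // [中] [华] [人] [民] [共] [和] [国]
        print(new ChineseAnalyzer(), text);
        // CJKAnalyzer: overlapping character bigrams, roughly
        // [中华] [华人] [人民] [民共] [共和] [和国]
        print(new CJKAnalyzer(Version.LUCENE_30), text);
    }
}
```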

8. PerFieldAnalyzerWrapper

PerFieldAnalyzerWrapper is used to apply a different Analyzer to each field. For example, a file-name field may need KeywordAnalyzer, while the file-content field only needs StandardAnalyzer. Field-specific analyzers are registered with addAnalyzer().
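A minimal sketch (Lucene 3.0-era API assumed, where field analyzers are registered via addAnalyzer(); later versions pass a Map to the constructor instead). The field names are hypothetical:

```java
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class PerFieldDemo {
    public static void main(String[] args) {
        // StandardAnalyzer handles any field not registered below.
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(Version.LUCENE_30));
        // File names should match exactly, so index each as one token.
        wrapper.addAnalyzer("filename", new KeywordAnalyzer());
        // Pass 'wrapper' wherever an Analyzer is expected (IndexWriter,
        // QueryParser); each field then gets its own analyzer.
    }
}
```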

9. IKAnalyzer

IKAnalyzer is a third-party Chinese tokenizer that extends Lucene's Analyzer class. It implements dictionary-based full segmentation in the forward and reverse directions, as well as forward and reverse maximum-matching segmentation.
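A sketch of typical usage. IKAnalyzer is a third-party jar, so the package name and the useSmart flag below are assumptions based on common IK releases (false selects the finest-grained full split):

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class IkDemo {
    public static void main(String[] args) throws Exception {
        IKAnalyzer analyzer = new IKAnalyzer(false); // full-split mode (assumed flag)
        TokenStream ts = analyzer.tokenStream("content",
                new StringReader("中华人民共和国国歌"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term.term() + "] ");
        }
    }
}
```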

10. JE-Analysis

JE-Analysis is a Chinese word segmentation component for Lucene; it is distributed separately and must be downloaded on its own.

11. ICTCLAS4J

The ictclas4j Chinese word segmentation system is an open-source Java segmentation project by sinboy, based on FreeICTCLAS, which was developed by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences. It simplifies the complexity of the original segmentation program and aims to provide a better learning opportunity for Chinese word segmentation enthusiasts.

12. Imdict-Chinese-Analyzer

Imdict-chinese-analyzer is the intelligent Chinese word segmentation module of the imdict intelligent dictionary. Its algorithm is based on the Hidden Markov Model (HMM), and it is a Java re-implementation of the ICTCLAS Chinese word segmentation program from the Institute of Computing Technology, Chinese Academy of Sciences. It can directly provide simplified-Chinese word segmentation support for Lucene search engines.

13. Paoding Analysis

Paoding Analysis is a highly efficient and scalable Chinese word segmenter. It introduces an apt metaphor (the name alludes to the cook Pao Ding, famed for effortlessly carving up an ox), adopts a fully object-oriented design, and is conceptually advanced. Its efficiency is high: on a personal machine with a PIII CPU and 1 GB of memory, it can accurately segment one million Chinese characters in one second. It accepts dictionary files with no limit on the number of entries, segments articles effectively so that word classifications can be defined, and can reasonably parse unknown words.

14. MMSeg4J

mmseg4j is a Chinese tokenizer implementing the MMSeg algorithm (http://technology.chtsai.org/mmseg/) by Chih-Hao Tsai, and it provides a Lucene analyzer and a Solr TokenizerFactory so that it can be used conveniently in Lucene and Solr. The MMSeg algorithm has two segmentation modes, Simple and Complex, both based on forward maximum matching; Complex adds four disambiguation rules. According to the author, the correct word identification rate reaches 98.41%. mmseg4j implements both segmentation modes.
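A sketch of using the Complex mode from Lucene. mmseg4j is a third-party jar; the ComplexAnalyzer class name below is an assumption based on common mmseg4j releases:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

import com.chenlb.mmseg4j.analysis.ComplexAnalyzer;

public class MmsegDemo {
    public static void main(String[] args) throws Exception {
        // Complex mode: forward maximum matching plus four extra rules.
        ComplexAnalyzer analyzer = new ComplexAnalyzer();
        TokenStream ts = analyzer.tokenStream("content",
                new StringReader("研究生命起源"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.print("[" + term.term() + "] ");
        }
    }
}
```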
