1. StopAnalyzer
StopAnalyzer filters out stop words (common function words such as "a", "the", or "of") and converts tokens to lowercase.
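The effect can be sketched in plain Java. This is not the Lucene implementation (StopAnalyzer actually chains a letter tokenizer, a lowercase filter, and a stop filter); the stop-word set here is a made-up example, since Lucene ships its own default list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class StopWordSketch {
    // Hypothetical stop-word set; Lucene ships its own default list.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "is", "of");

    // Lowercase the input, split on whitespace, and drop stop words.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Stop words "The"/"and"/"the" are removed, the rest lowercased.
        System.out.println(tokenize("The Quick Fox and the Hound"));
    }
}
```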
2. StandardAnalyzer
StandardAnalyzer splits text on spaces and punctuation. It can also recognize numbers, letters, e-mail addresses, IP addresses, and Chinese characters, and it supports a stop-word list, replacing the filtering function of StopAnalyzer.
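A very rough sketch of one distinctive behavior, keeping e-mail addresses as single tokens while otherwise splitting on punctuation. The regex is an illustration only; the real StandardAnalyzer uses a full grammar-based tokenizer, not this pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StandardLikeSketch {
    // Illustrative pattern: match an e-mail address as a whole,
    // otherwise match runs of letters and digits.
    static final Pattern TOKEN =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+|[A-Za-z0-9]+");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group().toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The address survives as one token; other words are split.
        System.out.println(tokenize("Contact Bob@Example.com, room 42!"));
    }
}
```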
3. SimpleAnalyzer
SimpleAnalyzer provides a tokenizer for basic Western-language text. It treats any non-letter character as a delimiter, converts each token to lowercase, and discards punctuation and other separators. It does not filter stop words.
In full-text search systems it is typically used for Western-language text; it is not suitable for Chinese, because its strategy is simply to split on non-letter characters and it performs no dictionary-based word segmentation.
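A minimal sketch of this behavior: split on non-letters and lowercase. (The real SimpleAnalyzer classifies letters with Unicode rules; this ASCII-only version is just for illustration.)

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleSketch {
    // Split on any non-letter character and lowercase each token,
    // roughly what SimpleAnalyzer does. ASCII letters only, for brevity.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^A-Za-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Digits and punctuation act as delimiters and are dropped.
        System.out.println(tokenize("XY&Z Corp - xyz@example.com"));
    }
}
```

Note how the e-mail address is broken apart here; this is exactly where StandardAnalyzer behaves differently.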
4. WhitespaceAnalyzer
WhitespaceAnalyzer uses whitespace as the only delimiter. It neither filters stop words nor converts tokens to lowercase.
In practice it is useful for Western-language text in environments where tokens must be preserved exactly as written. Like SimpleAnalyzer, it performs no stop-word filtering and no dictionary-based segmentation, so it is not suitable for Chinese text.
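A sketch of the behavior, again not the Lucene implementation: only whitespace separates tokens, and case and punctuation pass through untouched.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSketch {
    // Split on runs of whitespace only; case and punctuation
    // are preserved, matching WhitespaceAnalyzer's behavior.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "be," and "to-be." stay intact; "NOT" keeps its case.
        System.out.println(tokenize("To be,  or NOT to-be."));
    }
}
```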
5. KeywordAnalyzer
KeywordAnalyzer treats the entire input as a single token, which makes it easy to index and retrieve special types of text. The keyword tokenizer is very convenient for building index entries for ZIP codes, addresses, and similar identifiers.
6. CJKAnalyzer
CJKAnalyzer internally calls the CJKTokenizer to segment Chinese text and uses StopFilter to remove stop words. It has been deprecated since Lucene 3.0.
7. ChineseAnalyzer
ChineseAnalyzer handles Chinese text much like StandardAnalyzer: it splits the text into individual Chinese characters. It has been deprecated since Lucene 3.0.
8. PerFieldAnalyzerWrapper
PerFieldAnalyzerWrapper is used to apply a different Analyzer to each field. For example, a file-name field might need KeywordAnalyzer while the file-content field only needs StandardAnalyzer. Analyzers are registered per field with addAnalyzer().
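The idea can be sketched without Lucene as a map from field name to tokenizing function; PerFieldAnalyzerWrapper works the same way, mapping field names to Analyzer instances with a default for unmapped fields. The field names and the two toy tokenizers here are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class PerFieldSketch {
    // Keyword-style: the whole value is one token (like KeywordAnalyzer).
    static final Function<String, List<String>> KEYWORD = List::of;
    // Standard-style: lowercase and split on non-word characters.
    static final Function<String, List<String>> STANDARD =
            s -> Arrays.asList(s.toLowerCase().split("\\W+"));

    // Hypothetical field mapping: file names stay whole, contents are split.
    static final Map<String, Function<String, List<String>>> PER_FIELD =
            Map.of("filename", KEYWORD, "contents", STANDARD);

    // Unmapped fields fall back to the standard-style tokenizer.
    static List<String> analyze(String field, String text) {
        return PER_FIELD.getOrDefault(field, STANDARD).apply(text);
    }

    public static void main(String[] args) {
        System.out.println(analyze("filename", "Report-2020.pdf"));
        System.out.println(analyze("contents", "Annual Report"));
    }
}
```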
9. IKAnalyzer
IKAnalyzer is a third-party Chinese tokenizer that extends Lucene's Analyzer class. It implements dictionary-based full splitting and maximum matching splitting, in both the forward and reverse directions.
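Forward maximum matching, one of the techniques named above, can be sketched as follows: at each position, take the longest dictionary word that matches, falling back to a single character. The tiny dictionary is a toy example; a real tokenizer such as IKAnalyzer ships a large dictionary and also matches in the reverse direction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ForwardMaxMatch {
    // Toy dictionary for illustration only.
    static final Set<String> DICT = Set.of("中华", "中华人民", "共和国", "人民", "万岁");

    // Forward maximum matching: greedily take the longest dictionary
    // word starting at each position; unknown characters pass through
    // one at a time.
    static List<String> segment(String text) {
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int best = 1; // fall back to a single character
            for (int j = text.length(); j > i; j--) {
                if (DICT.contains(text.substring(i, j))) {
                    best = j - i;
                    break;
                }
            }
            words.add(text.substring(i, i + best));
            i += best;
        }
        return words;
    }

    public static void main(String[] args) {
        // "中华人民" wins over the shorter "中华" because matching is greedy.
        System.out.println(segment("中华人民共和国万岁"));
    }
}
```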
10. JE-Analysis
JE-Analysis is a Chinese word segmentation component for Lucene; it must be downloaded separately.
11. ICTCLAS4J
ictclas4j is an open-source Java Chinese word segmentation project by sinboy, based on FreeICTCLAS developed by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences. It simplifies the original segmentation program with the aim of giving Chinese word segmentation enthusiasts a better opportunity to study it.
12. Imdict-Chinese-Analyzer
imdict-chinese-analyzer is the intelligent Chinese word segmentation module of the imdict intelligent dictionary. Its algorithm is based on the Hidden Markov Model (HMM), and it is a Java re-implementation of the ICTCLAS Chinese word segmentation program from the Institute of Computing Technology, Chinese Academy of Sciences. It provides simplified-Chinese word segmentation support directly to Lucene search engines.
13. Paoding Analysis
Paoding Analysis is an efficient, scalable Chinese word segmentation component with a fully object-oriented design. Its performance is high: on a PIII personal machine with 1 GB of memory, it can accurately segment one million Chinese characters per second. It supports dictionary files with an unlimited number of entries, lets you define word categories, and can reasonably parse unknown words.
14. MMSeg4J
mmseg4j is a Chinese tokenizer implementing Chih-Hao Tsai's MMSeg algorithm (http://technology.chtsai.org/mmseg/). It provides a Lucene Analyzer and a Solr TokenizerFactory for easy use in both. MMSeg offers two segmentation methods, Simple and Complex, both based on forward maximum matching; Complex adds four disambiguation rules. According to the author, the correct word recognition rate reaches 98.41%. mmseg4j implements both methods.