Comparison of the Main Chinese Word Segmenters for Lucene

Last Update: 2018-12-07 Source: Internet

Author: User


1. Basic Introduction:

Paoding: "Paoding Jie Niu" (Paoding Analysis), a Chinese word segmenter for Lucene
Imdict: the intelligent Chinese word segmenter used by the imdict intelligent dictionary program
Mmseg4j: a Chinese word segmenter implemented with Chih-Hao Tsai's MMSEG algorithm
Ik: uses its own "forward iteration, finest-granularity segmentation" algorithm and a multi-sub-processor analysis mode
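
To make the idea of dictionary-based segmentation concrete, here is a minimal sketch of forward maximum matching, the basic technique that the "simple" mode of MMSEG builds on. It is an illustration only, not code from mmseg4j or any other library listed above; the class name `ForwardMaxMatch` and the in-memory dictionary are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy forward maximum matching over an in-memory dictionary (illustrative only).
class ForwardMaxMatch {
    private final Set<String> dict;
    private final int maxLen; // length of the longest dictionary word

    ForwardMaxMatch(Set<String> dict) {
        this.dict = dict;
        int longest = 1;
        for (String word : dict) {
            longest = Math.max(longest, word.length());
        }
        this.maxLen = longest;
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Shrink the candidate until it is a dictionary word
            // or only a single character is left.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            tokens.add(text.substring(i, end));
            i = end;
        }
        return tokens;
    }
}
```

With a dictionary containing 研究, 研究生, 生命 and 起源, segmenting 研究生命起源 this way yields 研究生 / 命 / 起源 rather than the intended 研究 / 生命 / 起源. Greedy matching cannot resolve this kind of ambiguity, which is why MMSEG's "complex" mode compares chunks of up to three words under additional rules.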

2. Developer and development activity:

Paoding: qieqie.wang; last commit on Google Code: 2008-06-12, SVN revision 132
Imdict: xiaopinggao; now part of Lucene contrib, with its latest commits under contrib/analyzers/smartcn/ in the Lucene trunk
Mmseg4j: chenlb2008, on Google Code, SVN revision 57; commit log: created the mmseg4j-1.7 branch
Ik: linliangyi2005, on Google Code, SVN revision 41

3. User-defined dictionary:

Paoding: supports an unlimited number of user-defined dictionaries, in plain text with one word per line. A background thread detects dictionary updates, automatically compiles the updated dictionary into a binary version, and loads it.
Imdict: does not currently support a user-defined dictionary, although the original ICTCLAS does. User-defined stop words are supported.
Mmseg4j: ships with a Sogou dictionary. It supports user-defined dictionaries named wordsxxx.dic, in UTF-8 text with one word per line. Automatic update detection is not supported. The dictionary path is set with -Dmmseg.dic.path.
Ik: supports loading a user dictionary through the API, or through a dictionary file named in the configuration. The file must be UTF-8 without a BOM, with entries separated by \r\n. Automatic update detection is not supported.
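
All four formats above come down to a text file with one word per line. The following sketch shows one way to read such a file in Java, tolerating both \n and \r\n line endings and a stray UTF-8 BOM. The class `PlainTextDictionary` is hypothetical and is not the loader of any of the analyzers above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: reads a one-word-per-line UTF-8 dictionary file.
class PlainTextDictionary {
    static Set<String> load(Path file) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            boolean firstLine = true;
            // readLine() strips both "\n" and "\r\n" terminators.
            while ((line = reader.readLine()) != null) {
                if (firstLine && line.startsWith("\uFEFF")) {
                    line = line.substring(1); // drop a leading byte order mark
                }
                firstLine = false;
                line = line.trim();
                if (!line.isEmpty()) {
                    words.add(line); // the set removes duplicate entries
                }
            }
        }
        return words;
    }
}
```

Stripping the BOM is a defensive choice here: a segmenter that requires BOM-free files would otherwise end up with an invisible character glued to its first word.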

4. Speed (figures from the official descriptions, not measured by me)

Paoding: on a personal machine with a PIII and 1 GB of memory, it accurately segments 1 million Chinese characters per second
Imdict: 483.64 (bytes/second), 259517 (Chinese characters/second)
Mmseg4j: the complex mode runs at about [figure missing] kb/s, and the simple mode at about 1900 kb/s
Ik: high-speed processing of 500,000 characters per second
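
Because these figures are self-reported and use different units, measuring on your own text is more reliable. The sketch below is a rough timing harness, not a rigorous benchmark; the class `Throughput` and its parameters are assumptions made for illustration, and the segmenter is passed in as a plain `Consumer<String>` so that any library can be wrapped.

```java
import java.util.function.Consumer;

// Rough throughput measurement for any segmenter wrapped as a Consumer<String>.
class Throughput {
    // Returns characters (UTF-16 code units) processed per second.
    static double charsPerSecond(Consumer<String> segmenter, String text, int rounds) {
        // Warm-up passes so the JIT has a chance to compile the hot path first.
        for (int i = 0; i < rounds; i++) {
            segmenter.accept(text);
        }
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            segmenter.accept(text);
        }
        long elapsedNanos = Math.max(1L, System.nanoTime() - start); // avoid dividing by zero
        return (double) text.length() * rounds * 1e9 / elapsedNanos;
    }
}
```

Run the same text through each candidate on the same machine, and convert to bytes per second using the encoding you actually index if you want to compare against the kb/s figures.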

5. Algorithm and code complexity

Paoding: the SVN src directory totals 1.3 MB, with 6 properties files, 48 Java files, and 6895 lines. It uses different "knives" to cut different types of streams, which is not very complicated.
Imdict: the dictionary is 6.7 MB (and is required), the src directory is 152 KB, with 20 Java files and 2399 lines. It uses the ICTCLAS HHMM hidden Markov model: "a large training corpus is used to compute word frequencies and transition probabilities for Chinese words, and the likelihood of a segmentation of the whole Chinese sentence is then computed from these statistics".
Mmseg4j: the SVN src directory is 132 KB, with 23 Java files and 2089 lines. The MMSEG algorithm is a bit complicated.
Ik: the SVN src directory is 6.6 MB (including the dictionary files), with 22 Java files and 4217 lines. Its multi-sub-processor analysis is similar to Paoding's; its ambiguity analysis algorithm is not yet clear to me.

6. Documentation

Paoding: almost none. There are some comments in the code, but the implementation is complicated, so the code is still hard to read.
Imdict: almost none. ICTCLAS itself has no detailed documentation, and the HHMM hidden Markov model is heavily mathematical and hard to follow.
Mmseg4j: the description of the MMSEG algorithm is in English, but the principle is relatively simple and the implementation is clear.
Ik: a PDF user manual with examples and configuration instructions.

7. Others

Paoding: the metaphor it introduces makes for a reasonable design. It is used in search 1.0. Its main advantage is native support for dictionary update detection; its main disadvantage is that the author no longer updates or even maintains it.
Imdict: it has entered the Lucene trunk. The original ICTCLAS performs well in various evaluations and has a solid theoretical foundation; it is not a one-person workshop project. The disadvantage is that user dictionaries are not currently supported.
Mmseg4j: maximum-word segmentation (max-word) is implemented on top of the complex mode, but it is not yet mature and still has much room for improvement.
Ik: provides IKQueryParser, a query analyzer optimized for Lucene full-text search.

8. Conclusion

In my opinion, either mmseg4j or paoding is a reasonable choice. For a comparison of their segmentation results, see:

http://blog.chenlb.com/2009/04/mmseg4j-max-word-segment-compare-with-paoding-in-effect.html

Alternatively, you can wrap things yourself: implement Paoding-style dictionary update detection as a separate module. You can then switch seamlessly between any of the dictionary-based segmentation algorithms.
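
One possible shape for such a module is sketched below. It assumes a single one-word-per-line UTF-8 dictionary file and detects updates by comparing the file's modification time. The names `DictionarySegmenter` and `ReloadingDictionary` are invented for this example and do not come from Paoding or any other library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Any dictionary-based algorithm can be plugged in through this interface (hypothetical).
interface DictionarySegmenter {
    List<String> segment(Set<String> dictionary, String text);
}

// Hypothetical wrapper: owns the dictionary and notices when its file changes.
class ReloadingDictionary {
    private final Path file;
    private volatile Set<String> words = new HashSet<>();
    private FileTime lastSeen; // modification time of the version currently loaded

    ReloadingDictionary(Path file) throws IOException {
        this.file = file;
        refreshIfChanged();
    }

    // Reloads the word list if the file's modification time has changed.
    // Returns true when a reload happened.
    synchronized boolean refreshIfChanged() throws IOException {
        FileTime current = Files.getLastModifiedTime(file);
        if (current.equals(lastSeen)) {
            return false;
        }
        Set<String> fresh = new HashSet<>(Files.readAllLines(file, StandardCharsets.UTF_8));
        fresh.remove(""); // ignore blank lines
        words = fresh;    // swap in one step so readers never see a half-built set
        lastSeen = current;
        return true;
    }

    // Runs whichever algorithm the caller supplies against the current dictionary.
    List<String> segment(DictionarySegmenter algorithm, String text) {
        return algorithm.segment(words, text);
    }
}
```

A background thread, for example one scheduled with `ScheduledExecutorService.scheduleWithFixedDelay`, can call `refreshIfChanged()` periodically. Because the algorithm is passed in separately, swapping one segmentation strategy for another does not touch the update-detection code.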

PS: Use different word segmenters for different fields. For example, the tag field should use the simplest segmenter, one that just splits on spaces.
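
In Lucene itself, per-field analysis is what `PerFieldAnalyzerWrapper` is for, with `WhitespaceAnalyzer` as the space-splitting analyzer. The sketch below shows only the dispatch idea in plain Java so that it runs without Lucene on the classpath. The class `PerFieldSegmenter` is hypothetical, and `perCharacter` is merely a stand-in for a real Chinese segmenter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher: picks a segmenter by field name, with a fallback.
class PerFieldSegmenter {
    private final Function<String, List<String>> fallback;
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();

    PerFieldSegmenter(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    PerFieldSegmenter register(String field, Function<String, List<String>> segmenter) {
        byField.put(field, segmenter);
        return this;
    }

    List<String> segment(String field, String text) {
        return byField.getOrDefault(field, fallback).apply(text);
    }

    // Simplest possible segmenter: split on runs of whitespace.
    static List<String> splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Stand-in for a real Chinese segmenter: one token per non-whitespace character.
    static List<String> perCharacter(String text) {
        List<String> tokens = new ArrayList<>();
        text.codePoints()
            .filter(cp -> !Character.isWhitespace(cp))
            .forEach(cp -> tokens.add(new String(Character.toChars(cp))));
        return tokens;
    }
}
```

A typical setup would register `splitOnWhitespace` for the tag field and leave every other field on the fallback, so a tag such as "lucene search" is indexed as two tags while body text goes through the full segmenter.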

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
