Comparison of the Main Chinese Word Segmenters for Lucene

Source: Internet
Author: User

1. Basic introduction:

Paoding: the "Paoding Jieniu" Chinese word segmentation analyzer for Lucene.
Imdict: the intelligent Chinese word segmenter used by the imdict intelligent dictionary program.
Mmseg4j: a Chinese word segmenter that implements Chih-Hao Tsai's MMSEG algorithm.
Ik: uses its own "forward-iteration finest-granularity segmentation" algorithm, with a multi-sub-processor analysis mode.

2. Developers and development activity:

Paoding: qieqie.wang; hosted on Google Code, last commit 2008-06-12, SVN revision 132.
Imdict: xiaopinggao; now part of Lucene contrib, with the latest commits under contrib/analyzers/smartcn/ in Lucene trunk.
Mmseg4j: chenlb2008; hosted on Google Code, SVN revision 57, log message: created the mmseg4j-1.7 branch.
Ik: linliangyi2005; hosted on Google Code, SVN revision 41.

3. User-defined dictionaries:

Paoding: supports an unlimited number of user-defined dictionaries in plain-text format, one word per line. A background thread detects dictionary updates, automatically compiles the updated dictionary into a binary version, and loads it.
Imdict: user-defined dictionaries are not currently supported, although the original ICTCLAS supports them. User-defined stop words are supported.
Mmseg4j: ships with a Sogou dictionary. It supports user-defined dictionaries named wordsxxx.dic, in UTF-8 text format with one word per line. Automatic update detection is not supported. The dictionary path is set with -Dmmseg.dic.path.
Ik: supports loading user dictionaries through the API and specifying dictionary files in the configuration. Files must be UTF-8 without a BOM, with entries separated by \r\n. Automatic update detection is not supported.
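The plain-text format described above (UTF-8, one word per line, possibly with \r\n line endings) is simple enough to read with the standard library alone. The following is a minimal sketch of such a loader; the class name PlainTextDictionary and its behavior (skipping blank lines and tolerating a leading BOM) are assumptions for illustration and do not reproduce the loading code of any of the four segmenters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: reads a one-word-per-line dictionary file. */
final class PlainTextDictionary {
    private PlainTextDictionary() {}

    /** Loads a UTF-8 dictionary file from disk. */
    static Set<String> load(Path file) {
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            return load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one word per line; readLine() accepts \n, \r and \r\n endings. */
    static Set<String> load(Reader source) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            boolean firstLine = true;
            while ((line = reader.readLine()) != null) {
                // Tolerate a byte order mark at the start of the file.
                if (firstLine && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                firstLine = false;
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word); // duplicates collapse, file order is kept
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }
}
```

A caller would pass the resulting set to whatever segmenter it wraps, for example PlainTextDictionary.load(Paths.get("words-custom.dic")), where the file name is only a placeholder.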

4. Speed (from each project's official description, not my own tests)

Paoding: on a personal machine with a PIII CPU and 1 GB of memory, accurately segments 1 million Chinese characters per second.
Imdict: 483.64 (bytes/second), 259517 (Chinese characters/second).
Mmseg4j: complex mode about … kb/s, simple mode about 1900 kb/s.
Ik: high-speed processing of about 0.5 million words/second.

5. Algorithm and code complexity

Paoding: the SVN src directory totals 1.3 MB, with 6 properties files and 48 Java files (6895 lines). It uses different "knives" to cut different kinds of character streams, which is not very complicated.
Imdict: the dictionary is 6.7 MB (and is required); the src directory is 152 KB, with 20 Java files (2399 lines). It uses the ICTCLAS HHMM hidden Markov model: "a large corpus is used for training to compute the word frequencies and transition probabilities of Chinese words, and the likelihood of the whole Chinese sentence is then computed from these statistics."
Mmseg4j: the SVN src directory is 132 KB, with 23 Java files (2089 lines). The MMSEG algorithm is somewhat complicated.
Ik: the SVN src directory is 6.6 MB (including the dictionary files), with 22 Java files (4217 lines). Its multi-sub-processor analysis is similar to Paoding's; its ambiguity-resolution algorithm is not yet clear to me.
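To give a feel for what a dictionary-based segmenter does at its core, here is a minimal sketch of forward maximum matching, the idea behind the simple mode of MMSEG. The class ForwardMaxMatch and its parameters are hypothetical names chosen for illustration; this is not code from mmseg4j, and it leaves out the extra disambiguation rules that the complex mode applies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of forward maximum matching over a word set. */
final class ForwardMaxMatch {
    private ForwardMaxMatch() {}

    /**
     * At each position, takes the longest dictionary word of at most
     * maxWordLength chars; falls back to a single char when nothing matches.
     * Works on UTF-16 chars, so supplementary characters are not handled.
     */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        if (maxWordLength < 1) {
            throw new IllegalArgumentException("maxWordLength must be at least 1");
        }
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a known word or one char long.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }
}
```

With the dictionary {研究, 研究生, 生命, 起源} and a maximum word length of 3, the input 研究生命起源 comes out as 研究生 / 命 / 起源, because the greedy match takes 研究生 first. Ambiguities of this kind are why MMSEG also defines a complex mode that compares alternative chunks of words before committing to one.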

6. Documentation

Paoding: almost none. The code has some comments, but the implementation is complicated enough that it is still hard to read.
Imdict: almost none. ICTCLAS itself has no detailed documentation, and the HHMM hidden Markov model is heavily mathematical and hard to understand.
Mmseg4j: the MMSEG algorithm is documented in English, but its principle is relatively simple, and the implementation is also clear.
Ik: has a PDF user manual with examples and configuration instructions.

7. Other notes

Paoding: the design is built around a metaphor and is well thought out. It is used in search 1.0. Its main advantage is native support for dictionary update detection. Its main disadvantage is that the author no longer updates or even maintains it.
Imdict: it has entered Lucene trunk. The original ICTCLAS performs well in various evaluations and has a solid theoretical foundation; it is not a one-person knockoff. Its disadvantage is that user dictionaries are not currently supported.
Mmseg4j: implements max-word segmentation on top of the complex mode, but it is not yet mature and still has much room for improvement.
Ik: provides IKQueryParser, a query analyzer optimized for Lucene full-text search.

8. Conclusion

In my opinion, the choice comes down to mmseg4j or Paoding. For a comparison of their segmentation results, see:

http://blog.chenlb.com/2009/04/mmseg4j-max-word-segment-compare-with-paoding-in-effect.html

Alternatively, you can write your own wrapper and implement Paoding's dictionary update detection as a separate module. You could then switch seamlessly between all dictionary-based word segmentation algorithms.
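A standalone update-detection module of this kind could be as small as a background task that polls the dictionary file's modification time and calls back when it changes. The sketch below is one possible shape under that assumption; DictionaryWatcher, checkOnce and the polling approach are illustrative choices and are not taken from Paoding's source.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical module: notifies a listener when a dictionary file changes. */
final class DictionaryWatcher implements AutoCloseable {
    private final Path file;
    private final Consumer<Path> onChange;
    private final ScheduledExecutorService scheduler;
    private FileTime lastSeen;

    DictionaryWatcher(Path file, Consumer<Path> onChange) {
        this.file = file;
        this.onChange = onChange;
        this.lastSeen = modifiedTime();
        // Daemon thread, so the watcher never keeps the JVM alive.
        this.scheduler = Executors.newSingleThreadScheduledExecutor(task -> {
            Thread thread = new Thread(task, "dictionary-watcher");
            thread.setDaemon(true);
            return thread;
        });
    }

    /** Starts polling in the background at the given interval. */
    void start(long period, TimeUnit unit) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // Keep polling; a real module would log the failure here.
            }
        }, period, period, unit);
    }

    /** Compares the modification time with the last one seen; true if it changed. */
    synchronized boolean checkOnce() {
        FileTime current = modifiedTime();
        if (current.equals(lastSeen)) {
            return false;
        }
        lastSeen = current;
        onChange.accept(file); // e.g. reload the dictionary into the segmenter
        return true;
    }

    private FileTime modifiedTime() {
        try {
            return Files.getLastModifiedTime(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Because the listener only receives the changed path, the same watcher can sit in front of any dictionary-based segmenter: the callback decides how the new word list is rebuilt and swapped in.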

PS: Use different word segmenters for different fields. For example, the tag field should use the simplest segmenter, one that just splits on spaces.
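In Lucene itself, the PerFieldAnalyzerWrapper class is the usual way to assign a different analyzer to each field. The sketch below shows the same idea in plain Java so that it runs without Lucene on the classpath; PerFieldSegmenter and its methods are hypothetical names and are not Lucene's API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch: picks a segmenter by field name, with a fallback. */
final class PerFieldSegmenter {
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();
    private final Function<String, List<String>> fallback;

    PerFieldSegmenter(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    /** Registers a segmenter for one field; returns this for chaining. */
    PerFieldSegmenter register(String field, Function<String, List<String>> segmenter) {
        byField.put(field, segmenter);
        return this;
    }

    /** Segments text with the field's segmenter, or the fallback if none is set. */
    List<String> segment(String field, String text) {
        return byField.getOrDefault(field, fallback).apply(text);
    }

    /** The simplest possible segmenter: split on runs of whitespace. */
    static List<String> splitOnWhitespace(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }
}
```

A setup along these lines would register splitOnWhitespace for the tag field and use a dictionary-based Chinese segmenter as the fallback for body text.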
