Comparison of Chinese word segmenters (analyzers) for Lucene
Author: Claymore  Time: 2011-09-09 17:53:26
Several Chinese analyzers are compared here in terms of accuracy and efficiency. The analyzers are: StandardAnalyzer, ChineseAnalyzer, CJKAnalyzer, IK_CAnalyzer, MIK_CAnalyzer, MMAnalyzer (the JE segmenter), and PaodingAnalyzer.
Chinese word segmentation is generally implemented as either character-based indexing or word-based indexing. Character-based indexing, as the name suggests, indexes every single character. Word-based indexing splits the text into words according to the entries in a dictionary. Che Dong's overlapping two-character segmentation, also called bigram segmentation, is in my view an improvement on character-based indexing and still belongs to that category.
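The difference between single-character and overlapping two-character segmentation can be sketched in a few lines. This is only an illustration of the two ideas, not the code of StandardAnalyzer or CJKAnalyzer (which also treat Latin letters and digits specially); the function names and the sample phrase are assumptions made for the example.

```python
def unigrams(text):
    """Character-based indexing: every non-space character is its own token."""
    return [ch for ch in text if not ch.isspace()]


def bigrams(text):
    """Overlapping two-character segmentation: each adjacent pair is a token."""
    chars = unigrams(text)
    if len(chars) < 2:
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


sample = "国家体育场"  # "National Stadium", chosen only for illustration
unigram_tokens = unigrams(sample)
bigram_tokens = bigrams(sample)
print(unigram_tokens)
print(bigram_tokens)
```

A text of n characters yields n unigram tokens or n - 1 bigram tokens, and neither approach needs a dictionary.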
Segmentation accuracy is hard to evaluate: there is no unified standard, and different applications have different requirements. Here every analyzer is illustrated with the same sentence: "On the evening of August 8, 2008, the world-renowned opening ceremony of the 29th Olympic Games in Beijing was grandly held at the National Stadium."
For segmentation efficiency, the full text of "The Legend of the Condor Heroes" is used as the common input. For the word-based analyzers, the same base dictionary of 227,719 entries is used. The tests were run in a development environment, so the absolute numbers are not accurate, but the relative values can be compared.
Analyzers:
Type | Analyzer | Description
Character-based | StandardAnalyzer | Lucene's own standard analyzer.
Character-based | ChineseAnalyzer | An analyzer shipped in Lucene contrib, similar to StandardAnalyzer. Note that it is only similar; there are differences.
Character-based | CJKAnalyzer | Bigram segmentation, shipped in Lucene contrib.
Word-based | IK_CAnalyzer, MIK_CAnalyzer | http://lucene-group.group.javaeye.com/group/blog/165287. Version used: 2.0.2.
Word-based | MMAnalyzer | The latest version that can be found is 1.5.3. However, no download is available on the original website, and the author is said to have declared that maintenance and support are no longer provided. It is listed because many people discuss it, but in use it does not feel very stable.
Word-based | PaodingAnalyzer | Paoding Jieniu. http://code.google.com/p/paoding/downloads/list. Version used: 2.0.4beta.
Segmentation accuracy:
Analyzer | Output | Remarks
StandardAnalyzer | 2008/year/8/month/8/day/night/lift/World/eyes/head//North/BEIJING/two/10/Nine/session/AO/Lin/horse/g///////////////////////////////////////////////////// | Unigram (single-character) segmentation; nothing more to say.
ChineseAnalyzer | Year/month/day/night/lift/World/eye/head//North/BEIJING/two/10/Nine/session/AO/Lin/horse/G/Yun/move/meeting/Open/curtain///In/country/home////////////////////////////////// | There is a difference, because ChineseAnalyzer only handles Character.LOWERCASE_LETTER, Character.UPPERCASE_LETTER, and Character.OTHER_LETTER; all other character types are filtered out. See the code for details.
CJKAnalyzer | 2008/year/8/month/8//World/worldwide/attention/purpose/North/Beijing/boe/second/20/19/Nine session/Olympiad/Olin/Limpi/g/g/Sport/Sports/meeting/Open/opening/episodic/in/in country/country/home/sports/breeding field/field///grand/Heavy lift/Hold/ | Bigram segmentation. As an improvement on unigram segmentation, it builds a smaller index than unigram, queries more efficiently, and meets general query requirements.
PaodingAnalyzer | 2008/year/8/month/8/day/night/World/attention/focus/Purpose/BEIJING/two/second/10/20/20th/Nine/19/29/Nine/Olympic/Olympic/sports/games/Olympic Games/Opening/opening/country/sports/Stadium/Grand/held/Grand hold / | Fine-grained full segmentation. Words not in the dictionary are segmented as bigrams.
IK_CAnalyzer | 2008/2008/Year/8 months/8/Month/8th/8/night/world-renowned/world/attention/Purpose/BEIJING/29th/29th/20th/Second/29/20/19/Nine/Nine/Olympic/Olympic/Olympic/games/sports/opening/opening/open/in country/ Country/country/stadium/sports/Grand Hold/Grand/Hold/Line/ | Fine-grained full segmentation. Words not in the dictionary are segmented as bigrams.
MIK_CAnalyzer | 2008/8 months/8th/night/Worldwide attention/Purpose/BEIJING/29th/Olympic Games/opening/in country/country/stadium/Grand Hold/ | Maximum-matching segmentation, used in combination with fine-grained full segmentation.
MMAnalyzer | 2008/year/8/month/8/day/night/Worldwide attention/BEIJING/20th/Nine/Olympic Games/opening/country/stadium/Grand Hold/ | Content not found in the dictionary is segmented as unigrams.
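The dictionary-based behaviour described for the maximum-matching analyzers (take the longest dictionary word at each position, and fall back to single characters for anything not in the dictionary) can be sketched as follows. This is a minimal forward maximum-matching sketch, not the implementation of MMAnalyzer or MIK_CAnalyzer; the toy dictionary, the `max_len` limit, and the function name are assumptions for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted (unigram fallback).
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary; a real one has hundreds of thousands of entries.
toy_dict = {"国家", "体育", "体育场", "隆重", "举行"}
mm_tokens = forward_max_match("在国家体育场隆重举行", toy_dict)
print(mm_tokens)
```

Because the match is greedy, the result depends heavily on the dictionary: removing "体育场" from the toy dictionary would change the output to "体育" followed by the single character "场".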
Segmentation performance (ms):
Analyzer | First run | Second run | Third run | Token count
StandardAnalyzer | 243 | 246 | 241 | 767675
ChineseAnalyzer | 245 | 233 | 242 | 766298
CJKAnalyzer | 383 | 383 | 373 | 659264
PaodingAnalyzer | 927 | 899 | 909 | 482890
IK_CAnalyzer | 1842 | 1877 | 1855 | 530830
MIK_CAnalyzer | 2009 | 1978 | 1998 | 371013
MMAnalyzer | 2923 | 2933 | 2948 | 392521
Note that the performance of IK_CAnalyzer is quite sensitive to the dictionary.
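Since only the relative values of the timings are meaningful, it can help to reduce the three runs to a mean and a tokens-per-millisecond rate. The snippet below does that arithmetic on the figures from the performance table; the variable and function names are made up for this example, and the rate is only a rough comparison aid because the analyzers emit different numbers of tokens for the same text.

```python
# Timings (ms) of the three runs and token counts, copied from the table.
runs_ms = {
    "StandardAnalyzer": (243, 246, 241),
    "ChineseAnalyzer": (245, 233, 242),
    "CJKAnalyzer": (383, 383, 373),
    "PaodingAnalyzer": (927, 899, 909),
    "IK_CAnalyzer": (1842, 1877, 1855),
    "MIK_CAnalyzer": (2009, 1978, 1998),
    "MMAnalyzer": (2923, 2933, 2948),
}
token_counts = {
    "StandardAnalyzer": 767675,
    "ChineseAnalyzer": 766298,
    "CJKAnalyzer": 659264,
    "PaodingAnalyzer": 482890,
    "IK_CAnalyzer": 530830,
    "MIK_CAnalyzer": 371013,
    "MMAnalyzer": 392521,
}


def summarize(runs, counts):
    """Return (name, mean ms, tokens per ms) rows sorted from fastest to slowest."""
    rows = []
    for name, times in runs.items():
        mean = sum(times) / len(times)
        rows.append((name, mean, counts[name] / mean))
    return sorted(rows, key=lambda row: row[1])


summary = summarize(runs_ms, token_counts)
for name, mean, rate in summary:
    print(f"{name:18s} {mean:8.1f} ms {rate:10.1f} tokens/ms")
```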
Summary:
For general applications, bigram segmentation should be sufficient. If word-based segmentation is required, then weighing segmentation quality, performance, extensibility, and maintainability together, Paoding is the recommended choice.
Comparing the results of mmseg4j max-word segmentation with Paoding
Published: April 12, 2009
Copyright notice: This article may be reprinted. When reprinting, be sure to indicate the original source of the article with a hyperlink, as in the statement below.
Original source: http://blog.chenlb.com/2009/04/mmseg4j-max-word-segment-compare-with-paoding-in-effect.html
mmseg4j 1.6 supports max-word segmentation. At a user's request, its segmentation results are compared here with Paoding's. I looked at part of Paoding's segmentation output and summarize the observations below.
Paoding segmentation results:
--------------------------
Tsinghua University
Tsinghua | Big | Hua da | College |
--------------------------
South China University
Huanan | Science and Technology | Big | University |
--------------------------
Guangdong University of Technology
Guangdong | Industry | Big | The Tycoon | University |
--------------------------
Siberia
Hebrew | Bethlehem | Siberian |
--------------------------
Research life Origin
Research | Graduate Student | Life | Origin |
--------------------------
for primary consideration
LED | First | Consider |
--------------------------
makeup and apparel
Makeup | Kimono | Costume |
--------------------------
People's Bank
China | Countrymen | People | Bank |
--------------------------
People's Republic of China
China | Chinese | People | Republic | Republic |
--------------------------
Badminton Racket
Feather | Badminton | Racket |
--------------------------
RMB
people | RMB |
--------------------------
very nice
good | Nice |
--------------------------
Next
Next | A |
--------------------------
Why
Why |
--------------------------
Beijing Capital Airport
Beijing | Capital | Airport |
--------------------------
Things have been auctioned
things | Already | Auctions | Sold the |
--------------------------
host angry
Owner | Angry |
--------------------------
Although some animals are very vicious
animals | Rogue |
--------------------------
friends really betrayed you.
Friends | true | Betrayal |
--------------------------
Construction box Crab Social
Construction | Box crab | Social |
--------------------------
Construction Box Little Crab Society
Construction | Box less | Crab Less | Social |
--------------------------
The big ditch in front of our house is very sad.
US | House | Before | Front door | ago | Big | Flood Waters | Ditch | Hard to | Sad |
--------------------------
can not be as nutritious as fruit juice.
can | Better than | If | Fruit Juice | Nutrition | Rich |
--------------------------
It's really hot today, it's a good day to swim.
Today | Naïve | Heat | Swimming Pool | Days | Good day |
--------------------------
Sister's math is very, very embarrassing.
Sister | Math | Only Test | Very | true | Shame |
--------------------------
I do things first from easy.
Work | Things | It's All | First from | calmly | Easy | Easy to | As a |
--------------------------
teacher said that everyone should try their best in the Brigade relay tomorrow.
Teacher | Teacher said | Description | Tomorrow | each | Personal | Ginseng | To | Big | Increase | Brigade | Relay | When | certain | Set to | To do | Try to |
--------------------------
Xiao Ming take the stool as the first thing to do every morning to get up
Xiaoming | Big | Stool | Bento Box | As a | Every day | Morning | Up | Get Up | Bed Part | First | Something | To do | do | Thing |
mmseg4j max-word segmentation results:
--------------------------
Tsinghua University
Tsinghua | College |
--------------------------
South China University
Huanan | Science and Technology | Gong da | University |
--------------------------
Guangdong University of Technology
Guangdong | Industry | University |
--------------------------
Siberia
West | Bethlehem | Leah |
--------------------------
Research on the Origin of Life
Research | Life | Origin |
--------------------------
for primary consideration
LED | to | Consider |
--------------------------
makeup and apparel
Makeup | and | Costume |
--------------------------
People's Bank
China | Countrymen | People | Bank |
--------------------------
People's Republic of China
China | Chinese | People | Republic | National |
--------------------------
Badminton Racket
Feather | Racket |
--------------------------
RMB
people | Currency |
--------------------------
very nice
good | Nice |
--------------------------
Next
Next | A |
--------------------------
why
for | What |
--------------------------
Beijing Capital Airport
Beijing | Capital | Airport |
--------------------------
Things have been auctioned
things | Already | Auctions | The |
--------------------------
host angry
Owner | Because | of | Angry |
--------------------------
Although some animals are very vicious
although | some | Animal | Very | Rogue |
--------------------------
friends really betrayed you.
Friends | true | Betrayal | The | You got it. |
--------------------------
Construction box Crab Social
Construction | box | Crab | Social |
--------------------------
Construction Box Little Crab Society
Construction | box | Less | Crab | Social |
--------------------------
The big ditch in front of our house is very sad.
US | Home | Front door | of | Flood Waters | Ditch | Hard to | |
--------------------------
cans are not as nutritious as fruit juices.
can | Better than | Fruit Juice | Nutrition | Rich |
--------------------------
It's really hot today, it's a good day to swim.
Today | Naïve | Heat | is | Swimming Pool | of | Good | Day |
--------------------------
Sister's math is very, really embarrassing.
Sister | of | Math | Only | Test | Very | true | Shame |
--------------------------
I do things first from easy.
I do | Things | It's All | First | calmly | Yi | of | As a |
--------------------------
teacher said that everyone should try their best in the Brigade relay tomorrow.
Teacher | Teacher said | Tomorrow | each | Personal | Participate in | Brigade | Pick upForce | When | certain | to | Try to |
--------------------------
Xiao Ming take the stool as the first thing to do every morning to get up
Xiaoming | Put | Stool | As a | Every day | Morning | Get Up | First | Something | To do | of | Things |
Paoding splits out almost every sub-word and sometimes also emits the longest word; I still do not understand why "South China University" yields the single token "Big". mmseg4j's max-word mode takes the result of complex-mode segmentation and then splits out the sub-words. In version 1.6 this is done by bigrams: pairs that are not words are dropped or kept as single characters. The next version may behave a little differently. For example, "why" should not be split into "for | What"; that is, a three-character word whose leading and trailing parts are both words should not be split. This still needs study.
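The idea of splitting out every sub-word can be illustrated with a small sketch that emits each dictionary word found at each start position. It is a simplified reading of "fine-grained full segmentation", not Paoding's or mmseg4j's actual code: it ignores single characters, the bigram fallback for unknown text, and any ordering rules those tools apply. The dictionary and names below are assumptions for illustration.

```python
def full_segmentation(text, dictionary, max_len=4):
    """Emit every dictionary word (two or more characters) found at every position."""
    tokens = []
    for start in range(len(text)):
        for size in range(2, min(max_len, len(text) - start) + 1):
            candidate = text[start:start + size]
            if candidate in dictionary:
                tokens.append(candidate)
    return tokens


# Hypothetical toy dictionary built around the "Tsinghua University" example.
toy_dict = {"清华", "华大", "大学", "清华大学"}
all_words = full_segmentation("清华大学", toy_dict)
print(all_words)
```

Emitting overlapping words increases recall at query time, at the cost of more tokens in the index.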
For example, mmseg4j's complex mode segments "makeup and apparel" well ("Makeup | and | Costume"), whereas Paoding lacks character-frequency information and has trouble with this case. mmseg4j's complex mode also has a weakness: in "I do things first from easy." it cannot separate out "easy". This is because the MMSEG algorithm only looks at chunks of 3 words. I think that processing chunks covering the whole sentence, rather than just 3 words, would segment better, although at a higher cost; choosing 3 is probably a balance between quality and performance.
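The 3-word chunk idea can be sketched as follows. At each position the code enumerates every chunk of up to three consecutive words, picks the best chunk, and keeps only its first word. This is a simplified sketch of MMSEG's complex mode and not mmseg4j's code: it applies only the first three disambiguation rules (largest total length, largest average word length, smallest variance of word lengths) and omits the fourth rule, which relies on character-frequency data. The toy dictionary and function names are assumptions.

```python
def candidates(text, start, dictionary, max_len):
    """Words that may begin at `start`: the single character plus dictionary hits."""
    words = [text[start]]
    for size in range(2, min(max_len, len(text) - start) + 1):
        word = text[start:start + size]
        if word in dictionary:
            words.append(word)
    return words


def chunks(text, start, dictionary, max_len, depth=3):
    """All sequences of up to `depth` consecutive words beginning at `start`."""
    if depth == 0 or start >= len(text):
        return [[]]
    result = []
    for word in candidates(text, start, dictionary, max_len):
        for rest in chunks(text, start + len(word), dictionary, max_len, depth - 1):
            result.append([word] + rest)
    return result


def score(chunk):
    """Rank by total length, then average word length, then low length variance."""
    lengths = [len(word) for word in chunk]
    total = sum(lengths)
    average = total / len(lengths)
    variance = sum((n - average) ** 2 for n in lengths) / len(lengths)
    return (total, average, -variance)


def mmseg_complex(text, dictionary, max_len=4):
    tokens = []
    i = 0
    while i < len(text):
        best = max(chunks(text, i, dictionary, max_len), key=score)
        tokens.append(best[0])  # keep only the first word of the winning chunk
        i += len(best[0])
    return tokens


# Hypothetical toy dictionary for an ambiguous phrase ("research / origin of life").
toy_dict = {"研究", "研究生", "生命", "起源"}
complex_tokens = mmseg_complex("研究生命起源", toy_dict)
print(complex_tokens)
```

A plain greedy longest match would take "研究生" first; here the chunk with the more even word lengths wins on the variance rule.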
mmseg4j does not add any stopwords. This is left to the user, because I do not think adding stopwords by default is a good approach. In music search, for example, if "the", "this", and so on were treated as stopwords, could the songs still be found?
Of course, segmentation quality also depends on the dictionary. Sogou's dictionary is derived statistically, so some high-frequency character combinations have also become words, such as "our". To improve mmseg4j's results further, the dictionary also needs to be cleaned up.
Comparison of the main Lucene Chinese word segmenters currently available
Author: Tang Folin  Source: Fu Lin Rain blog  2009-08-04
1. Basic introduction:
Paoding: the Lucene Chinese word segmenter "Paoding Jieniu" (Paoding Analysis)
imdict: the intelligent Chinese word segmentation program used by the imdict intelligent dictionary
mmseg4j: a Chinese word segmenter based on Chih-Hao Tsai's MMSEG algorithm
IK: uses its own "forward iterative finest-grained segmentation algorithm", with a multi-sub-processor analysis mode
2. Developers and development activity:
Paoding: qieqie.wang; Google Code; last commit 2008-06-12; SVN revision 132
imdict: xiaopinggao; accepted into Lucene contrib (Lucene trunk, contrib/analyzers/smartcn/); last commit 2009-07-24
mmseg4j: chenlb2008; Google Code; last commit 2009-08-03 (yesterday); revision 57; log: mmseg4j-1.7 create a branch
IK: linliangyi2005; Google Code; last commit 2009-07-31; revision 41
3. User-defined dictionaries:
Paoding: supports an unlimited number of user-defined dictionaries in plain text format, one word per line. A background thread detects dictionary updates, automatically compiles the updated dictionary into a binary version, and loads it.
imdict: user-defined dictionaries are not supported for now, although the original ICTCLAS supports them. User-defined stop words are supported.
mmseg4j: ships with the Sogou dictionary and supports user-defined dictionaries named wordsxxx.dic, in UTF-8 text format, one word per line. Automatic update detection is not supported. The dictionary directory is set with -Dmmseg.dic.path.
IK: supports loading user dictionaries at the API level and specifying dictionary files at the configuration level; files are UTF-8 without BOM, with entries separated by \r\n. Automatic update detection is not supported.
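The two dictionary features discussed above, a plain-text word list with one word per line and detection of updates to that file, can be sketched together. This is a minimal illustration that polls the file's modification time; it is not how Paoding implements its background thread or binary compilation, and the class name, file name, and demo words are all hypothetical.

```python
import os
import tempfile


class ReloadingDictionary:
    """Word list loaded from a UTF-8 file, one word per line, reloaded on change."""

    def __init__(self, path):
        self.path = path
        self.words = set()
        self._mtime = None
        self.refresh()

    def refresh(self):
        """Reload if the file's modification time changed; return True if reloaded."""
        mtime = os.path.getmtime(self.path)
        if mtime == self._mtime:
            return False
        # "utf-8-sig" tolerates a leading BOM; strip() removes \r\n or \n endings.
        with open(self.path, encoding="utf-8-sig") as handle:
            self.words = {line.strip() for line in handle if line.strip()}
        self._mtime = mtime
        return True


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "words-demo.dic")  # hypothetical file name
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write("清华大学\r\n国家体育场\r\n")
    user_dict = ReloadingDictionary(path)
    first_load = sorted(user_dict.words)

    with open(path, "a", encoding="utf-8", newline="") as handle:
        handle.write("奥林匹克\r\n")
    # Push the modification time forward so the change is visible to the poll.
    os.utime(path, (os.path.getatime(path), os.path.getmtime(path) + 10))
    reloaded = user_dict.refresh()
    second_size = len(user_dict.words)

print(first_load, reloaded, second_size)
```

In a long-running service, `refresh()` could be called from a timer or background thread; keeping it separate from any particular segmenter is what would let several dictionary-based segmenters share one update mechanism.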
4. Speed (based on the official descriptions, not my own tests)
Paoding: on a PIII personal machine with 1 GB of memory, it can accurately segment 1 million Chinese characters per second
imdict: 483.64 (bytes/sec), 259517 (Chinese characters/sec)
mmseg4j: complex mode around 1200 KB/s, simple mode around 1900 KB/s
IK: high-speed processing at 500,000 characters per second
5. Algorithms and code complexity
Paoding: the SVN src directory totals 1.3 MB, with 6 properties files and 48 Java files, 6895 lines. It uses different Knife classes to cut different types of text, which is not very complicated.
imdict: the dictionary is 6.7 MB (this dictionary is required); the src directory is 152 KB, with 20 Java files, 2399 lines. It uses the ICTCLAS HHMM hidden Markov model: "we use a large corpus to train statistics on the word frequencies and transition probabilities of Chinese vocabulary, and from these statistics compute the most likely segmentation of a whole Chinese sentence."
mmseg4j: the SVN src directory totals 132 KB, with 23 Java files, 2089 lines. It uses the MMSEG algorithm, which is somewhat complicated.
IK: the SVN src directory totals 6.6 MB (dictionary files included), with 22 Java files, 4217 lines. It uses multi-sub-processor analysis, similar to Paoding; I have not yet figured out its disambiguation algorithm.
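The statistical approach quoted for imdict, choosing the most likely segmentation of a whole sentence from corpus statistics, can be illustrated with a much simpler model. The sketch below uses word frequencies only (a unigram model) and dynamic programming; it is not the HHMM used by ICTCLAS or imdict, which also uses transition probabilities. The frequencies are invented for the example, and the penalty for unseen characters is an arbitrary assumption.

```python
import math


def most_likely_segmentation(text, word_freq, max_len=4):
    """Pick the segmentation with the highest product of word probabilities."""
    total = sum(word_freq.values())
    unseen = math.log(0.5 / total)  # assumed penalty for an unknown single character
    # best[i] holds (log-probability, tokens) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, []) for _ in text]
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_freq:
                log_p = math.log(word_freq[word] / total)
            elif len(word) == 1:
                log_p = unseen
            else:
                continue  # multi-character strings must be dictionary words
            candidate = best[start][0] + log_p
            if candidate > best[end][0]:
                best[end] = (candidate, best[start][1] + [word])
    return best[len(text)][1]


# Invented frequencies, for illustration only.
toy_freq = {"研究": 50, "研究生": 10, "生命": 30, "命": 5, "起源": 20}
likely_tokens = most_likely_segmentation("研究生命起源", toy_freq)
print(likely_tokens)
```

Unlike greedy matching, the whole sentence is scored at once, so a locally attractive long word can lose to a split that is more probable overall.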
6. Documentation
Paoding: almost none. There are some comments in the code, but because the implementation is fairly complex, the code is somewhat difficult to read.
imdict: almost none. ICTCLAS has no detailed documentation either, and the HHMM hidden Markov model is too mathematical to understand easily.
mmseg4j: the description of the MMSEG algorithm is in English, but the principle is fairly simple and the implementation is fairly clear.
IK: there is a PDF manual with examples and configuration instructions.
7. Other
Paoding: introduces a metaphor, and the design is fairly reasonable. Version 1.0 of search used this one. Its main advantage is native support for dictionary update detection. Its main disadvantage is that the author no longer updates or even maintains it.
imdict: has entered the Lucene trunk. The original ICTCLAS has performed well in various evaluations and has a solid theoretical basis; it is not a one-person knockoff. Its drawback is that user dictionaries are not supported for now.
mmseg4j: implements max-word segmentation on top of complex mode, but it is not yet mature and there is much to improve.
IK: provides IKQueryParser, a query parser optimized for Lucene full-text search.
8. Conclusion
Personally, I feel the choice is between mmseg4j and Paoding. For a comparison of their segmentation results, see:
http://blog.chenlb.com/2009/04/mmseg4j-max-word-segment-compare-with-paoding-in-effect.html
Alternatively, wrap them yourself: implement Paoding's dictionary update detection as a separate module, and you can then switch seamlessly among all dictionary-based segmentation algorithms.
PS: using different segmenters for different fields is also worth considering. For example, a tag field should use a simple segmenter; splitting on whitespace is enough.
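The per-field idea can be sketched as a mapping from field name to tokenizer function. The field names and tokenizers below are assumptions for illustration, not part of any of the segmenters discussed; in Lucene itself, the PerFieldAnalyzerWrapper class serves this purpose.

```python
def whitespace_tokens(value):
    """Simple segmenter for tag-like fields: split on whitespace only."""
    return value.split()


def bigram_tokens(value):
    """Overlapping two-character tokens for free text."""
    chars = [ch for ch in value if not ch.isspace()]
    return ["".join(pair) for pair in zip(chars, chars[1:])] or chars


# Hypothetical field-to-tokenizer mapping.
FIELD_TOKENIZERS = {"tags": whitespace_tokens, "body": bigram_tokens}


def analyze(document):
    """Tokenize each field with its own tokenizer, defaulting to whitespace."""
    return {
        field: FIELD_TOKENIZERS.get(field, whitespace_tokens)(value)
        for field, value in document.items()
    }


analyzed = analyze({"tags": "lucene 分词 analyzer", "body": "中文分词"})
print(analyzed)
```

The same mapping must be used at index time and at query time, otherwise the query tokens will not match the indexed ones.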
This article is from: http://blog.fulin.org/2009/08/lucene_chinese_analyzer_compare.html