Comparison of various Chinese word segmenters for Lucene

Author: Claymore. Time: 2011-09-09 17:53:26

Several Chinese analyzers are compared here in terms of segmentation accuracy and efficiency. The analyzers are: StandardAnalyzer, ChineseAnalyzer, CJKAnalyzer, IK_CAnalyzer, MIK_CAnalyzer, MMAnalyzer (the JE segmenter), and PaodingAnalyzer.
Chinese word segmentation generally builds the index either by character or by word. Indexing by character, as the name suggests, indexes each single character. Indexing by word splits the text into words according to the entries in a dictionary. Che Dong's overlapping two-character segmentation, also called bigram segmentation, should in my view be considered an improvement on character indexing, so it still belongs to the by-character category.
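The difference between single-character (unigram) tokens and overlapping two-character (bigram) tokens can be illustrated with a small Java sketch. The class and method names are hypothetical, the code is not taken from StandardAnalyzer or CJKAnalyzer, and it assumes the input is a plain run of Chinese characters with no surrogate pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical and this is not Lucene code.
class NGramSketch {
    // Unigram: every character becomes its own token.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    // Overlapping bigram: each pair of adjacent characters becomes a token.
    // A one-character input is emitted as a single token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text);
        }
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }
}
```

For example, `NGramSketch.bigrams("北京奥运")` yields 北京, 京奥, 奥运, while `NGramSketch.unigrams` yields four single-character tokens for the same input.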
Segmentation accuracy is hard to evaluate: there is no unified standard, and different applications have different requirements. This comparison uses one sentence throughout as the example: "On the evening of August 8, 2008, the opening ceremony of the 29th Olympic Games in Beijing, which attracted worldwide attention, was grandly held at the National Stadium."
For segmentation efficiency, the full text of "The Legend of the Condor Heroes" is used throughout as the sample. For the dictionary-based (by-word) analyzers, the same basic dictionary of 227,719 entries is used. The tests were run in a development environment, so the absolute figures are not accurate, but the relative values can be compared.

Analyzers:


By character

StandardAnalyzer

Lucene's built-in standard analyzer.

ChineseAnalyzer

An analyzer included in Lucene contrib, similar to StandardAnalyzer. Note that it is only similar; there are differences.

CJKAnalyzer

The bigram segmenter included in Lucene contrib.

By word

IK_CAnalyzer, MIK_CAnalyzer

http://lucene-group.group.javaeye.com/group/blog/165287. Version 2.0.2 was used.

MMAnalyzer

The latest version that can be found now is 1.5.3. However, no download could be found on the original website, and it is said to have been declared no longer maintained or supported. It is listed here because many people talk about it, but in use it did not feel very stable.

PaodingAnalyzer

Paoding (庖丁解牛). http://code.google.com/p/paoding/downloads/list. Version 2.0.4beta was used.

Segmentation accuracy:


StandardAnalyzer

2008/year/8/month/8/day/night/lift/World/eyes/head//North/BEIJING/two/10/Nine/session/AO/Lin/horse/g/////////////////////////////////////////////////////

Unigram segmentation; there is nothing more to say.

ChineseAnalyzer

Year/month/day/night/lift/World/eye/head//North/BEIJING/two/10/Nine/session/AO/Lin/horse/G/Yun/move/meeting/Open/curtain///In/country/home//////////////////////////////////

The output differs because ChineseAnalyzer only handles characters of type Character.LOWERCASE_LETTER, Character.UPPERCASE_LETTER, and Character.OTHER_LETTER; all other types are filtered out. See the source code for details.
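The filtering rule described above can be sketched with the standard `Character.getType` method. This is a hypothetical helper written for illustration, not the ChineseAnalyzer source; it only shows why digits and punctuation disappear while Chinese characters (category OTHER_LETTER) are kept.

```java
// Hypothetical helper mirroring the rule described in the text;
// it is not the ChineseAnalyzer source.
class LetterTypeFilterSketch {
    // True only for the three character categories named in the text.
    static boolean isKept(char c) {
        int type = Character.getType(c);
        return type == Character.LOWERCASE_LETTER
            || type == Character.UPPERCASE_LETTER
            || type == Character.OTHER_LETTER;
    }

    // Drops every character outside those categories, such as digits
    // and punctuation.
    static String keepLetters(String text) {
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isKept(c)) {
                kept.append(c);
            }
        }
        return kept.toString();
    }
}
```

For example, `LetterTypeFilterSketch.keepLetters("2008年8月")` returns "年月", which matches the way the ChineseAnalyzer output starts with "Year/month" and has no numbers.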

CJKAnalyzer

2008/year/8/month/8//World/worldwide/attention/purpose/North/Beijing/boe/second/20/19/Nine session/Olympiad/Olin/Limpi/g/g/Sport/Sports/meeting/Open/opening/episodic/in/in country/country/home/sports/breeding field/field///grand/Heavy lift/Hold/

Bigram segmentation. As an improvement on unigram segmentation, it builds a smaller index than unigram segmentation, gives better query efficiency, and can meet general query requirements.

PaodingAnalyzer

2008/year/8/month/8/day/night/World/attention/focus/Purpose/BEIJING/two/second/10/20/20th/Nine/19/29/Nine/Olympic/Olympic/sports/games/Olympic Games/Opening/opening/country/sports/Stadium/Grand/held/Grand hold /

Fine-grained full segmentation. Words not in the dictionary are segmented as bigrams.

IK_CAnalyzer

2008/2008/Year/8 months/8/Month/8th/8/night/world-renowned/world/attention/Purpose/BEIJING/29th/29th/20th/Second/29/20/19/Nine/Nine/Olympic/Olympic/Olympic/games/sports/opening/opening/open/in country/ Country/country/stadium/sports/Grand Hold/Grand/Hold/Line/

Fine-grained full segmentation. Words not in the dictionary are segmented as bigrams.
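As a rough illustration of what "fine-grained full segmentation" means, the sketch below emits every dictionary word that occurs anywhere in the text, including words nested inside longer words. It is a minimal sketch under stated assumptions: the class name, the `maxWordLength` parameter, and the in-memory `Set` dictionary are hypothetical, it is not the Paoding or IK implementation, and it leaves out the bigram fallback for text that is not in the dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the Paoding or IK implementation.
class FullSegmentationSketch {
    // Emits every dictionary word found in the text by trying each start
    // position and every candidate length up to maxWordLength.
    static List<String> allWords(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            int limit = Math.min(text.length(), start + maxWordLength);
            for (int end = start + 1; end <= limit; end++) {
                String candidate = text.substring(start, end);
                if (dictionary.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }
}
```

With a dictionary containing both 运动 and 运动会, the input 运动会 would produce both tokens, which is why the full-segmentation outputs above contain overlapping entries.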

MIK_CAnalyzer

2008/8 months/8th/night/Worldwide attention/Purpose/BEIJING/29th/Olympic Games/opening/in country/country/stadium/Grand Hold/

Maximum-matching segmentation, used in combination with fine-grained full segmentation.

MMAnalyzer

2008/year/8/month/8/day/night/Worldwide attention/BEIJING/20th/Nine/Olympic Games/opening/country/stadium/Grand Hold/

Anything not in the dictionary is segmented as unigrams.
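The behaviour described here, longest dictionary match with a single-character fallback, can be sketched as a forward maximum match. This is an assumption-laden illustration: the class name, the `maxWordLength` parameter, and the left-to-right scan direction are hypothetical choices, and the code is not taken from MMAnalyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of forward maximum matching; not MMAnalyzer's source.
class ForwardMaxMatchSketch {
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Try the longest candidate first and shrink until a word matches.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            // If nothing matched, end == pos + 1 and one character is emitted.
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }
}
```

With the made-up dictionary {ab, abc, de} and the input "abcxde", the sketch returns abc, x, de: the longest match wins at each position, and the unknown character x falls back to a single-character token.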

Segmentation performance (ms):

Analyzer          First run   Second run   Third run   Number of tokens
StandardAnalyzer  243         246          241         767675
ChineseAnalyzer   245         233          242         766298
CJKAnalyzer       383         383          373         659264
PaodingAnalyzer   927         899          909         482890
IK_CAnalyzer      1842        1877         1855        530830
MIK_CAnalyzer     2009        1978         1998        371013
MMAnalyzer        2923        2933         2948        392521

Note that the performance of IK_CAnalyzer is quite sensitive to the dictionary.
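A measurement of this kind, three timed runs over the same text with a token count per run, could be set up roughly as follows. This is only a sketch: the `TimingSketch` class and its `measure` method are hypothetical, the tokenizer is passed in as a plain function that returns a token count, and, as with the figures above, wall-clock results in a development environment are only meaningful relative to each other.

```java
import java.util.function.ToIntFunction;

// Hypothetical timing harness; not the code used to produce the figures above.
class TimingSketch {
    // One run: elapsed wall-clock milliseconds and the number of tokens produced.
    static final class Run {
        final long millis;
        final int tokenCount;

        Run(long millis, int tokenCount) {
            this.millis = millis;
            this.tokenCount = tokenCount;
        }
    }

    // Runs the tokenizer the given number of times over the same text.
    static Run[] measure(String text, ToIntFunction<String> tokenCounter, int runs) {
        Run[] results = new Run[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            int count = tokenCounter.applyAsInt(text);
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            results[i] = new Run(elapsedMillis, count);
        }
        return results;
    }
}
```

A caller would pass a function that drives the analyzer under test over the text and returns how many tokens it produced, then print `millis` and `tokenCount` for each run.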
Summary:
For general applications, bigram segmentation should be enough to meet the need. If you need word-based segmentation, then considering segmentation quality, performance, extensibility, and maintainability together, Paoding is recommended.

Comparing the results of mmseg4j's max-word segmentation mode with paoding

Published: April 12, 2009

Copyright notice: reprinting is permitted, but when reprinting be sure to indicate the original source of the article with a hyperlink, as in the statement below.

Original source: http://blog.chenlb.com/2009/04/mmseg4j-max-word-segment-compare-with-paoding-in-effect.html

mmseg4j 1.6 supports max-word segmentation. At a user's request, its results are compared here with paoding's. Some of paoding's segmentation results were observed, and a summary follows.

Paoding segmentation results:

--------------------------
Tsinghua University
Tsinghua | Big | Hua da | College |
--------------------------
South China University
Huanan | Science and Technology | Big | University |
--------------------------
Guangdong University of Technology
Guangdong | Industry | Big | The Tycoon | University |
--------------------------
Siberia
Hebrew | Bethlehem | Siberian |
--------------------------
Research life Origin
Research | Graduate Student | Life | Origin |
--------------------------
for primary consideration
LED | First | Consider |
--------------------------
makeup and apparel
Makeup | Kimono | Costume |
--------------------------
People's Bank
China | Countrymen | People | Bank |
--------------------------
People's Republic of China
China | Chinese | People | Republic | Republic |
--------------------------
Badminton Racket
Feather | Badminton | Racket |
--------------------------
RMB
people | RMB |
--------------------------
very nice
good | Nice |
--------------------------
Next
Next | A |
--------------------------
Why
Why |
--------------------------
Beijing Capital Airport
Beijing | Capital | Airport |
--------------------------
Things have been auctioned
things | Already | Auctions | Sold the |
--------------------------
host angry
Owner | Angry |
--------------------------
Although some animals are very vicious
animals | Rogue |
--------------------------
friends really betrayed you.
Friends | true | Betrayal |
--------------------------
Construction box Crab Social
Construction | Box crab | Social |
--------------------------
Construction Box Little Crab Society
Construction | Box less | Crab Less | Social |
--------------------------
The big ditch in front of our house is very sad.
US | House | Before | Front door | ago | Big | Flood Waters | Ditch | Hard to | Sad |
--------------------------
can not be as nutritious as fruit juice.
can | Better than | If | Fruit Juice | Nutrition | Rich |
--------------------------
It's really hot today, it's a good day to swim.
Today | Naïve | Heat | Swimming Pool | Days | Good day |
--------------------------
Sister's math is very, very embarrassing.
Sister | Math | Only Test | Very | true | Shame |
--------------------------
I do things first from easy.
Work | Things | It's All | First from | calmly | Easy | Easy to | As a |
--------------------------
teacher said that everyone should try their best in the Brigade relay tomorrow.
Teacher | Teacher said | Description | Tomorrow | each | Personal | Ginseng | To | Big | Increase | Brigade | Relay | When | certain | Set to | To do | Try to |
--------------------------
Xiao Ming take the stool as the first thing to do every morning to get up
Xiaoming | Big | Stool | Bento Box | As a | Every day | Morning | Up | Get Up | Bed Part | First | Something | To do | do | Thing |

mmseg4j max-word segmentation results:

--------------------------
Tsinghua University
Tsinghua | College |
--------------------------
South China University
Huanan | Science and Technology | Gong da | University |
--------------------------
Guangdong University of Technology
Guangdong | Industry | University |
--------------------------
Siberia
West | Bethlehem | Leah |
--------------------------
Research on the Origin
Research of Life | Life | Origin |
--------------------------
for primary consideration
LED | to | Consider |
--------------------------
makeup and apparel
Makeup | and | Costume |
--------------------------
People's Bank
China | Countrymen | People | Bank |
--------------------------
People's Republic of China
China | Chinese | People | Republic | National |
--------------------------
Badminton Racket
Feather | Racket |
--------------------------
RMB
people | Currency |
--------------------------
very nice
good | Nice |
--------------------------
Next
Next | A |
--------------------------
why
for | What |
--------------------------
Beijing Capital Airport
Beijing | Capital | Airport |
--------------------------
Things have been auctioned
things | Already | Auctions | The |
--------------------------
host angry
Owner | Because | of | Angry |
--------------------------
Although some animals are very vicious
although | some | Animal | Very | Rogue |
--------------------------
friends really betrayed you.
Friends | true | Betrayal | The | You got it. |
--------------------------
Construction box Crab Social
Construction | box | Crab | Social |
--------------------------
Construction Box Little Crab Society
Construction | box | Less | Crab | Social |
--------------------------
The big ditch in front of our house is very sad.
US | Home | Front door | of | Flood Waters | Ditch | Hard to | |
--------------------------
cans are not as nutritious as fruit juices.
can | Better than | Fruit Juice | Nutrition | Rich |
--------------------------
It's really hot today, it's a good day to swim.
Today | Naïve | Heat | is | Swimming Pool | of | Good | Day |
--------------------------
Sister's math is very, really embarrassing.
Sister | of | Math | Only | Test | Very | true | Shame |
--------------------------
I do things first from easy.
I do | Things | It's All | First | calmly | Yi | of | As a |
--------------------------
teacher said that everyone should try their best in the Brigade relay tomorrow.
Teacher | Teacher said | Tomorrow | each | Personal | Participate in | Brigade | Pick upForce | When | certain | to | Try to |
--------------------------
Xiao Ming take the stool as the first thing to do every morning to get up
Xiaoming | Put | Stool | As a | Every day | Morning | Get Up | First | Something | To do | of | Things |

Paoding splits out almost all sub-words, and sometimes emits the longest word as well. I still do not understand why the "South China University" entry produces the token "Big". mmseg4j's max-word mode takes the result of complex segmentation and then splits out the sub-words. Version 1.6 does this by bigrams, dropping the ones that are not words or keeping the single characters. The next version may differ a little: "why" should not become "for | What"; that is, a three-character word should not be split into its leading and trailing two-character words. This still needs study. :)

For example, for "makeup and apparel", mmseg4j's complex mode segments it well ("Makeup | and | Costume"), while paoding lacks character-frequency information and has more difficulty here. mmseg4j complex also has a weakness: in "I do things first from easy" it cannot separate out "easy". This is because the MMSEG algorithm works on chunks of 3 words. I think processing chunks over the whole sentence, rather than only 3 words, would segment better, though of course at greater cost; choosing 3 is probably a balance between quality and performance.
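To make the "3-word chunk" idea concrete, here is a simplified Java sketch of MMSEG-style chunk selection. It is an illustration under stated assumptions, not mmseg4j's source: the class and method names are hypothetical, the dictionary is a plain `Set`, and only the first three MMSEG rules are applied (greatest total chunk length, then greatest average word length, then smallest variance of word lengths). The fourth rule, which relies on character-frequency data, is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of MMSEG-style chunk selection; not mmseg4j's source.
class MmsegChunkSketch {
    private static final List<Integer> NO_WORD = List.of(0);

    // Lengths of candidate words starting at pos: the single character at pos
    // plus every dictionary word that starts there. Empty at the end of text.
    static List<Integer> candidateLengths(String text, int pos, Set<String> dict, int maxLen) {
        List<Integer> lengths = new ArrayList<>();
        if (pos >= text.length()) {
            return lengths;
        }
        lengths.add(1);
        int limit = Math.min(maxLen, text.length() - pos);
        for (int len = 2; len <= limit; len++) {
            if (dict.contains(text.substring(pos, pos + len))) {
                lengths.add(len);
            }
        }
        return lengths;
    }

    // Represents "no word left" as a single placeholder of length 0.
    private static List<Integer> orNoWord(List<Integer> lengths) {
        return lengths.isEmpty() ? NO_WORD : lengths;
    }

    // Enumerates chunks of up to three words starting at pos and returns the
    // length of the first word of the best chunk.
    static int bestFirstWordLength(String text, int pos, Set<String> dict, int maxLen) {
        int best = 1;
        int bestTotal = -1;
        int bestCount = Integer.MAX_VALUE;
        int bestSumSq = Integer.MAX_VALUE;
        for (int a : candidateLengths(text, pos, dict, maxLen)) {
            for (int b : orNoWord(candidateLengths(text, pos + a, dict, maxLen))) {
                for (int c : orNoWord(candidateLengths(text, pos + a + b, dict, maxLen))) {
                    int total = a + b + c;
                    int count = 1 + (b > 0 ? 1 : 0) + (c > 0 ? 1 : 0);
                    int sumSq = a * a + b * b + c * c;
                    // Rule 1: longer chunk. Rule 2: at equal length, fewer words
                    // means a larger average. Rule 3: at equal length and count,
                    // a smaller sum of squares means a smaller variance.
                    boolean better = total > bestTotal
                        || (total == bestTotal && count < bestCount)
                        || (total == bestTotal && count == bestCount && sumSq < bestSumSq);
                    if (better) {
                        best = a;
                        bestTotal = total;
                        bestCount = count;
                        bestSumSq = sumSq;
                    }
                }
            }
        }
        return best;
    }

    // Repeatedly emits the first word of the best chunk and moves on.
    static List<String> segment(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int len = bestFirstWordLength(text, pos, dict, maxLen);
            tokens.add(text.substring(pos, pos + len));
            pos += len;
        }
        return tokens;
    }
}
```

With the made-up dictionary {ab, abc, cdef} and the input "abcdef", a plain forward maximum match would take abc first and be left with d, e, f, whereas this chunk rule picks ab followed by cdef. Because only three words are examined at a time, a better split further along the sentence can still be missed, which is the limitation discussed above.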

mmseg4j does not add any stop words; this is left to the user, because I do not think adding stop words is a good approach. In music search, for example, if "the", "this", and so on were added as stop words, could the songs still be found?
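The concern about stop words can be shown with a tiny filter. The class and method below are hypothetical and are not part of mmseg4j; they only sketch what a user-supplied stop-word step does to a token stream.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stop-word step supplied by the user; not part of mmseg4j.
class StopWordSketch {
    // Removes tokens whose lower-cased form is in the stop-word set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With the stop word "the", a song title tokenized as [The, Wall] is reduced to [Wall], and a title made up only of stop words would produce no tokens at all. With an empty stop-word set the tokens pass through unchanged.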

Of course, segmentation quality also depends on the dictionary. Sogou's dictionary is built statistically, so some high-frequency word combinations have also become entries, such as "our". To improve mmseg4j's results, the dictionary also needs to be curated.

Comparison of the main Lucene Chinese word segmenters at present

Author: Tang Folin. Source: Fu Lin Rain blog. Cool Network collection, 2009-08-04

1. Basic introduction:

paoding: the Lucene Chinese segmenter "Paoding Jieniu" (Paoding Analysis)
imdict: the intelligent Chinese word segmentation module used in the imdict intelligent dictionary
mmseg4j: a Chinese segmenter implemented with Chih-Hao Tsai's MMSEG algorithm
IK: uses its own "forward iterative finest-granularity segmentation algorithm", with a multi-sub-processor analysis mode

2. Developers and development activity:

paoding: qieqie.wang; last commit on Google Code: 2008-06-12, SVN revision 132
imdict: xiaopinggao; has entered Lucene contrib (Lucene trunk, contrib/analyzers/smartcn/); last commit: 2009-07-24
mmseg4j: chenlb2008; Google Code, 2009-08-03 (yesterday), revision 57, log: "mmseg4j-1.7 create a branch"
IK: linliangyi2005; Google Code, 2009-07-31, revision 41

3. User-defined dictionaries:

paoding: supports an unlimited number of user-defined dictionaries in plain-text format, one word per line. A background thread detects dictionary updates, automatically compiles the updated dictionary into a binary version, and loads it.
imdict: user-defined dictionaries are not supported for now, although the original ICTCLAS supports them. User-defined stop words are supported.
mmseg4j: ships with the Sogou dictionary and supports user-defined dictionaries in files named wordsxxx.dic, UTF-8 text format, one word per line. Automatic update detection is not supported. The dictionary path is set with -Dmmseg.dic.path.
IK: supports loading user dictionaries at the API level and specifying dictionary files at the configuration level; files are UTF-8 without BOM, with entries separated by \r\n. Automatic update detection is not supported.
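The plain-text dictionary format that these segmenters share (UTF-8, one word per line) is simple to read. The loader below is a hypothetical sketch, not the loading code of any of the four projects; it tolerates a leading byte order mark and blank lines, and `Files.readAllLines` accepts both \n and \r\n line endings.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical loader for a one-word-per-line UTF-8 dictionary file.
class DictionaryLoaderSketch {
    // Turns raw lines into a set of words, skipping blank lines.
    static Set<String> parse(List<String> lines) {
        Set<String> words = new LinkedHashSet<>();
        boolean firstLine = true;
        for (String line : lines) {
            // Strip a UTF-8 byte order mark if the file starts with one.
            if (firstLine && line.startsWith("\uFEFF")) {
                line = line.substring(1);
            }
            firstLine = false;
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Reads the file as UTF-8 and parses it.
    static Set<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }
}
```

Update detection of the kind paoding provides would sit on top of such a loader, for example by comparing the file's last-modified time on a background thread and calling `load` again when it changes.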

4. Speed (based on each project's official description, not my own tests):

paoding: on a PIII personal machine with 1 GB of memory, can accurately segment 1 million Chinese characters per second
imdict: 483.64 (bytes/sec), 259517 (Chinese characters/sec)
mmseg4j: complex mode about 1200 KB/s, simple mode about 1900 KB/s
IK: high-speed processing of 500,000 characters/sec

5. Algorithm and code complexity:

paoding: the SVN src directory totals 1.3 MB, with 6 properties files and 48 Java files, 6895 lines. It uses different "knives" to cut different types of streams; not very complex.
imdict: the dictionary is 6.7 MB (this dictionary is required); the src directory is 152 KB, with 20 Java files, 2399 lines. It uses the HHMM hidden Markov model of ICTCLAS: "we use a large corpus for training to gather statistics on the word frequencies and transition probabilities of Chinese vocabulary, and from these statistics compute the maximum-likelihood segmentation of a whole Chinese sentence".
mmseg4j: the SVN src directory totals 132 KB, with 23 Java files, 2089 lines. MMSEG algorithm; a bit complex.
IK: the SVN src directory totals 6.6 MB (the dictionary files are included), with 22 Java files, 4217 lines. Multi-sub-processor analysis, similar to paoding; I have not yet figured out its ambiguity-resolution algorithm.

6. Documentation

paoding: almost none. There are some comments in the code, but because the implementation is fairly complex, reading the code is somewhat difficult.
imdict: almost none. ICTCLAS also has no detailed documentation, and the HHMM hidden Markov model is too mathematical to understand easily.
mmseg4j: the MMSEG algorithm is described in English, but the principle is fairly simple. The implementation is also fairly clear.
IK: has a PDF manual with examples and configuration instructions.

7. Other

paoding: introduces a metaphor, and the design is fairly reasonable. Search version 1.0 used this one. Its main advantage is native support for dictionary update detection. Its main disadvantage is that the author no longer updates or even maintains it.
imdict: has entered the Lucene trunk. The original ICTCLAS has performed well in various evaluations and has a solid theoretical foundation; it is not one person's knock-off. The drawback is that user dictionaries are not supported for now.
mmseg4j: implements max-word segmentation on top of complex mode, but it is not yet mature and much remains to be improved.
IK: provides IKQueryParser, a query analyzer optimized for Lucene full-text search.

8. Conclusion

Personally, I feel you can choose between mmseg4j and paoding. For a comparison of the segmentation results of the two, see:

http://blog.chenlb.com/2009/04/mmseg4j-max-word-segment-compare-with-paoding-in-effect.html

Alternatively, write your own wrapper: implement paoding's dictionary update detection as a separate module, and then switch seamlessly among all the dictionary-based segmentation algorithms.

PS: using different segmenters for different fields is also an approach worth considering. For example, a tag field should use a simple segmenter; splitting on spaces is enough.
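The per-field idea can be sketched in plain Java as a small dispatcher that picks a tokenizer by field name. Everything here is hypothetical and independent of Lucene's API; in Lucene itself the same idea is usually expressed with PerFieldAnalyzerWrapper and a WhitespaceAnalyzer for the tag field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-field dispatcher; independent of Lucene's API.
class PerFieldSketch {
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();
    private final Function<String, List<String>> fallback;

    // The fallback tokenizer handles every field without its own entry.
    PerFieldSketch(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    void register(String field, Function<String, List<String>> tokenizer) {
        byField.put(field, tokenizer);
    }

    List<String> tokenize(String field, String text) {
        return byField.getOrDefault(field, fallback).apply(text);
    }

    // Simple tokenizer for tag-like fields: split on runs of whitespace.
    static List<String> splitOnWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }
}
```

A caller would register `PerFieldSketch::splitOnWhitespace` for the tag field and pass a dictionary-based segmenter as the fallback for body text.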

This article is from: http://blog.fulin.org/2009/08/lucene_chinese_analyzer_compare.html
