Evaluation of Lucene word segmentation components pangu and mmseg4j

Source: Internet
Author: User
Preface

There are not many word segmentation components for .NET. Baoyu recently released an improved version of the mmseg word segmenter, which is a good occasion to compare it with pangu, the segmenter I have been using for a long time.

Pangu implements mechanical (dictionary-based) word segmentation. For a more detailed analysis, see http://www.cnblogs.com/eaglet/archive/2008/10/02/1303142.html

The mmseg algorithm is somewhat more advanced. For a more detailed explanation, see http://www.coreseek.cn/opensource/mmseg/

Pangu is tested only in its default configuration, in which unigram (single-character) segmentation is not enabled, and mmseg is tested only in its maxword configuration. The goal is to compare the efficiency and quality of multi-word segmentation.

Efficiency Comparison

Hardware configuration: CPU i7 2.3 GHz, RAM 4 GB

Pangu Word Segmentation

Official figures: on a Core Duo 1.8 GHz, single-thread segmentation speed is 390 K characters per second and two-thread segmentation speed is 690 K characters per second.

 

Default configuration, single thread: 395 kw/s. The raw figure is close to the official one, but since the test hardware is faster than the official test machine, this is effectively somewhat slower than the official result.

Default configuration, multiple threads: 744.7, which is also close to the official figure.
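To put the single-thread and multi-thread figures above side by side, the speedup from threading can be computed directly. This is only a worked calculation on the numbers quoted above; the function name `speedup` is made up for illustration, and the thread count of the local multi-thread run is not stated in the test, so no per-thread efficiency is derived.

```python
def speedup(single_thread_rate, multi_thread_rate):
    """Ratio of multi-thread throughput to single-thread throughput."""
    return multi_thread_rate / single_thread_rate


# Official pangu figures: 390 K chars/s (1 thread), 690 K chars/s (2 threads).
official = speedup(390, 690)

# Figures measured in this test: 395 (1 thread), 744.7 (multiple threads).
measured = speedup(395, 744.7)

print(f"official: {official:.2f}x, measured: {measured:.2f}x")
# prints "official: 1.77x, measured: 1.89x"
```

Both ratios stay below 2, so threading helps pangu noticeably but not linearly.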

Mmseg Word Segmentation

Official figures (Java):

  • Version 1.5: word segmentation speed is about 2800 kb/s, and the complex algorithm about [figure missing] kb/s (test machine: AMD Athlon 64, 1 GB memory, Windows XP).
  • Version 1.6 implements max-word segmentation on top of the complex algorithm. "Nice to hear" -> "Good | nice to hear"; "People's Republic of China" -> "China | Chinese | Republican | national"; "People's Bank of China" -> "China | people's | bank".
  • Version 1.7-beta: complex currently runs at about [figure missing] kb/s, with a second figure of about [figure missing] kb/s, but the memory overhead is about 50 MB, whereas previous versions needed about 10 MB.
  • From 1.8 on, a CutLetterDigitFilter filter is added to split mixed "letter and digit" tokens. For example, "mb991ch" is split into "mb 991 ch".
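The letter and digit splitting described in the last item can be illustrated with a regular expression. This is a minimal sketch of the behaviour on the quoted example only, not the mmseg4j CutLetterDigitFilter implementation; the function name `cut_letter_digit` is an assumption made for illustration.

```python
import re


def cut_letter_digit(token):
    """Split a token into runs of ASCII letters and runs of digits.

    Characters that are neither ASCII letters nor digits are dropped
    by this sketch; a real filter may treat them differently.
    """
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


print(" ".join(cut_letter_digit("mb991ch")))
# prints "mb 991 ch"
```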

MaxWord, single thread: much slower than pangu. This may be related to the complexity of the mmseg algorithm, in which every candidate chunk has to be scored on four factors.
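The four factors mentioned above correspond to the four disambiguation rules of the published MMSEG algorithm: maximum chunk length, largest average word length, smallest variance of word lengths, and largest sum of log frequencies of single-character words. The sketch below shows why this costs more than plain dictionary matching: at every position it enumerates all chunks of up to three words and filters them rule by rule. It is an illustration only, not the mmseg4j or Baoyu code; the dictionary `DICT` and its frequencies are made-up values.

```python
import math

# Toy dictionary: word -> frequency (illustrative values only).
DICT = {"研究": 50, "研究生": 20, "生命": 40, "命": 10, "起源": 30, "生": 15}


def words_at(text, pos):
    """Dictionary words starting at pos, plus the single character as fallback."""
    found = [text[pos:end] for end in range(pos + 1, len(text) + 1)
             if text[pos:end] in DICT]
    if text[pos] not in found:
        found.append(text[pos])
    return found


def chunks_at(text, pos, depth=3):
    """All chunks of up to `depth` consecutive words starting at pos."""
    if pos >= len(text) or depth == 0:
        return [[]]
    return [[w] + rest
            for w in words_at(text, pos)
            for rest in chunks_at(text, pos + len(w), depth - 1)]


def total_length(chunk):
    return sum(len(w) for w in chunk)


def average_length(chunk):
    return total_length(chunk) / len(chunk)


def negative_variance(chunk):
    avg = average_length(chunk)
    return -sum((len(w) - avg) ** 2 for w in chunk) / len(chunk)


def morphemic_freedom(chunk):
    return sum(math.log(DICT.get(w, 1)) for w in chunk if len(w) == 1)


RULES = [total_length, average_length, negative_variance, morphemic_freedom]


def best_chunk(chunks):
    """Apply the four rules in order until one chunk remains."""
    for rule in RULES:
        best = max(rule(c) for c in chunks)
        chunks = [c for c in chunks if rule(c) == best]
        if len(chunks) == 1:
            break
    return chunks[0]


def segment(text):
    """Emit the first word of the best chunk, then continue after it."""
    pos, out = 0, []
    while pos < len(text):
        word = best_chunk(chunks_at(text, pos))[0]
        out.append(word)
        pos += len(word)
    return out


print(segment("研究生命起源"))
# prints ['研究', '生命', '起源']
```

In this example the first two rules tie between "研究 / 生命 / 起源" and "研究生 / 命 / 起源" (same total length, same average word length), and the variance rule decides in favour of the first.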

MaxWord, multiple threads: there is not much improvement, which also suggests that the bottleneck of mmseg is the computation itself.

Example analysis and comparison

Example 1

Original

[Use data to tell you how hot mobile games are.] Today, as part of this year's GMIC, the GGS global mobile games summit was held. Guests and game developers discussed the current state and development trends of mobile games, and mobile games were one of the most important keywords. Qian Donghai, president of Shanda Games, shared prediction data from the CEO of Japan's largest mobile game company: in 2015, 80% of the global games industry will be mobile games. http://t.cn/zTHdkFY

Pangu

Use/data/tell/you/how hot the mobile game is/today/AS/current/GMIC/part/GGS/global/mobile/GAME/summit/hold/guest/and /games/developers/people/discussion/mobile/games/status quo/AND/development trends/mobile games are/Most/important/1/large/keywords /Shanda/GAME/president/Qian/Donghai/share/Japan/max/mobile/GAME/Company/CEO/prediction/data/2015/year/global/Game /industry/pattern/medium/80/All/mobile phone/GAME/http/t/cn/zTHdkFY

Mmseg

Use/data/tell/your mobile/GAME/have multiple/hot/today/AS/current/gmic/part/ggs/global/mobile/GAME/summit/ hold/guest/AND/GAME/developer/people/discussion/mobile/GAME/current situation/AND/development/trend/mobile/GAME/is/Most /important/major/key/word/Grand/GAME/president/Qian/Donghai/share/Japan/Largest/mobile/GAME/Company/ceo/prediction/ /data/2015/year/global/GAME/industry/pattern/medium/80/All/mobile/GAME/http/t/cn/zthdkfy

Evaluation

Pangu's segmentation is close to ideal here, while mmseg's "your hand" token is not.
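When two segmenters process the same text, the places where they disagree can be located by comparing word boundary offsets instead of reading the two outputs by eye. The sketch below is a hypothetical helper, not part of pangu or mmseg; the function names and the placeholder tokens are assumptions made for illustration.

```python
def boundaries(tokens):
    """Character offsets at which each token ends."""
    pos, cuts = 0, set()
    for token in tokens:
        pos += len(token)
        cuts.add(pos)
    return cuts


def disagreements(tokens_a, tokens_b):
    """Offsets where exactly one of the two segmentations puts a boundary."""
    if "".join(tokens_a) != "".join(tokens_b):
        raise ValueError("both segmentations must cover the same text")
    return sorted(boundaries(tokens_a) ^ boundaries(tokens_b))


# Placeholder tokens standing in for two segmentations of the text "abcde".
first = ["ab", "c", "de"]
second = ["ab", "cd", "e"]
print(disagreements(first, second))
# prints [3, 4]
```

Note that this only works when both segmenters keep the text unchanged; outputs that are lower-cased or have punctuation removed would need to be normalised first.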

Example 2

Original

An article I read a few days ago mentioned that Context.Configuration.AutoDetectChangesEnabled = false must be added when doing batch inserts;

In the words of the original article: EF tracks data changes automatically by default. When the volume of changed data is large, EF's tracking workload surges and the operation becomes very slow (this is also one of the reasons some people doubt EF's performance). In fact, turning off automatic updates during batch operations is enough to solve the slowness.

You can read it yourself: http://www.cnblogs.com/guomingfeng/archive/2013/05/28/mvc-ef-repository.html. Because I had not tested it, I did not know whether the result was faster, or by how much.

Then this morning I found an article on the home page that tests Entity Framework, http://www.cnblogs.com/newton/archive/2013/06/06/3120497.html, which says EF performance is poor but does not add this statement. I was curious too, so I ran a test myself. This article is not aimed at that author; I am simply curious and would like to discuss it.

Pangu

A few days ago/saw/an article/mentioned/In/batch/inserted/accessed/required/added/Context/Configuration/AutoDetectChangesEnabled/=/false/article/original /EF/default/automatic/tracking/data/changes/when/change/data volume/large/when/EF/tracking/ workload/will/surge/but/specify/Operation/become/very/slow/This/also/part/students/suspect/EF/performance/problem/ one/suspect/point /) /actually/as long as/In/batch/Operation/when/update/Close/CAN/solve/slow/problem/everyone/yourself/go/check/ http/www/cnblogs/com/guomingfeng/archive/2013/05/28/mvc/ef/repository/html/due to/no/test/SO/no/know/result/is it/ change/fast/How many/results/morning/Discovery/homepage/One/test/Entity/Framework/article/http/www/cnblogs/com/newton/archive/ 2013/06/06/3120497. /html//EF/performance/no/exactly/not added this sentence/ME/also/curious/SO/yourself/test/down/Write/This/article/not/ /author/just/yourself/curious/mutual/discussion

Mmseg

Previous/days/saw/article/In/mentioned/In/batch/inserted/required/added/context/configuration/autodetectchangesenabled/false/article/original /ef/default/automatic/tracking/data/changes/when/change/data/volume/large/when/ef/ tracking/work/volume/will/surge/but/specify/Operation/become/very/slow/This also/Yes/part/students/suspect/ef/performance/ problem/One/suspect/point/Actual/as long as/batch/Operation/AT/time/automatic/update/Close/CAN/solve/slow/ question/everyone/View/http/www/cnblogs/com/guomingfeng/archive/2013/05/28/mvc/ef/repository/html/AS/no/test/ so/do not know/result/no/change/fast/How much/result/morning/Discovery/homepage/One/test/entity/framework/Article /http/www/cnblogs/com/newton/archive/2013/06/06/3120497/html//ef/performance/no/exactly/no/Add/This sentence/I also/curious /So/yourself/test/down/Write/This/article/not/For/author/just/yourself/curious/mutual/discussion

Evaluation

The mmseg segmentation is finer-grained, and its handling of "a few days ago" and "I don't know how much faster the result is" is better than pangu's.

Summary

This evaluation covers only a few test cases. For more, download the source code below and test it yourself. As a word segmenter, pangu is more complete and richer in features; mmseg focuses on the algorithm implementation but lacks features.

Download
