Document directory
- Example 1: Original, Pangu, Mmseg, Evaluation
- Example 2: Original, Pangu, Mmseg, Evaluation
- Download
Preface
There are not many word segmentation components for .NET. Baoyu recently released an improved version of the mmseg word segmenter, which is a good opportunity to compare it with Pangu, the segmenter I have used for a long time.
Pangu implements mechanical (dictionary-matching) word segmentation; for a more detailed analysis, see http://www.cnblogs.com/eaglet/archive/2008/10/02/1303142.html
The mmseg algorithm is relatively advanced; for a more detailed explanation, see http://www.coreseek.cn/opensource/mmseg/
For Pangu, only the default configuration is compared here, because unigram (single-character) segmentation is not enabled by default. For mmseg, only the maxword configuration is compared. The goal is to compare the efficiency and quality of multi-word segmentation.
Efficiency Comparison
Hardware configuration: CPU i7 2.3 GHz, RAM 4 GB
Pangu Word Segmentation
Official efficiency: on a Core Duo 1.8 GHz, single-thread segmentation speed is 390 K characters per second, and 2-thread speed is 690 K characters per second.
Default configuration, single thread: 395 kw/s. Since this machine is faster than the official test hardware, that is relatively slower than the official figure, though the numbers are close.
Default configuration, multithreaded: 744.7 kw/s, which is similar to the official figure.
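The figures above are throughput numbers (thousands of characters per second). The following is a minimal sketch of how such a number can be measured. It is not the benchmark used in this post: `toy_segment` is a hypothetical stand-in for a real segmenter, and `throughput_kchars` is an illustrative helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def toy_segment(text):
    # Hypothetical stand-in for a real segmenter: splits on whitespace only.
    return text.split()


def throughput_kchars(segment, docs, workers=1):
    """Return processing speed in thousands of characters per second."""
    total_chars = sum(len(d) for d in docs)
    start = time.perf_counter()
    if workers == 1:
        for d in docs:
            segment(d)
    else:
        # Each document is handed to a worker thread.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(segment, docs))
    elapsed = time.perf_counter() - start
    return total_chars / elapsed / 1000.0


docs = ["word segmentation speed test " * 200] * 500
single = throughput_kchars(toy_segment, docs, workers=1)
double = throughput_kchars(toy_segment, docs, workers=2)
print(f"1 thread: {single:.1f} K chars/s, 2 threads: {double:.1f} K chars/s")
```

Note that in CPython the global interpreter lock keeps pure-Python, CPU-bound work from scaling across threads, so this sketch only shows how the number is computed. It will not reproduce the scaling seen with the .NET components.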
Mmseg Word Segmentation
Official efficiency (Java):
- The word segmentation speed of version 1.5 is about 2800 kb/s, and the complex algorithm is about kb/s (test machine: AMD Athlon 64, 1 GB memory, Windows XP).
- Version 1.6 implements max-word segmentation on top of the complex algorithm. Examples: "Nice to hear" -> "Good | nice to hear"; "People's Republic of China" -> "China | Chinese | Republican | national"; "People's Bank of China" -> "China | people's | bank".
- Version 1.7-beta: currently, complex is about kb/s and about kb/s, but the memory overhead is about 50 MB, whereas previous versions used about 10 MB.
- From 1.8 on, a CutLetterDigitFilter filter was added to split mixed letter-and-digit tokens. For example, "mb991ch" is split into "mb 991 ch".
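The letter-and-digit split described in the last item can be illustrated with a short sketch. This is not the CutLetterDigitFilter code itself; `cut_letter_digit` is a hypothetical helper that only reproduces the behaviour of the "mb991ch" example, assuming ASCII letters and digits.

```python
import re


def cut_letter_digit(token):
    # Split a token into maximal runs of ASCII letters, ASCII digits,
    # or any other characters, preserving their order.
    return re.findall(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+", token)


print(" ".join(cut_letter_digit("mb991ch")))  # mb 991 ch
```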
MaxWord, single thread: much slower than Pangu. This may be related to the complexity of the mmseg algorithm, since four factors must be calculated for every candidate chunk.
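To make the per-chunk cost concrete, here is a minimal sketch of chunk scoring as the MMSEG algorithm is commonly described: enumerate chunks of up to three words, then prefer (1) the greatest total length, (2) the greatest average word length, (3) the smallest variance of word lengths, and (4) the greatest sum of log frequencies of single-character words. The dictionary, the frequencies, and all function names are assumptions for illustration; the library tested here may implement these steps differently.

```python
import math

# Toy dictionary (assumption): word -> frequency. Real dictionaries are far larger.
DICT = {"研究": 50, "研究生": 20, "生命": 40, "命": 30, "起源": 10}
MAX_WORD_LEN = max(len(w) for w in DICT)


def words_at(text, pos):
    """Dictionary words starting at pos, plus the single character as a fallback."""
    found = [text[pos]]
    for end in range(pos + 2, min(len(text), pos + MAX_WORD_LEN) + 1):
        if text[pos:end] in DICT:
            found.append(text[pos:end])
    return found


def chunks_at(text, pos, depth=3):
    """Enumerate chunks of up to `depth` consecutive words starting at pos."""
    if depth == 0 or pos >= len(text):
        return [[]]
    result = []
    for w in words_at(text, pos):
        for rest in chunks_at(text, pos + len(w), depth - 1):
            result.append([w] + rest)
    return result


def score(chunk):
    """The four factors, ordered so that a larger tuple means a better chunk."""
    lengths = [len(w) for w in chunk]
    total = sum(lengths)
    avg = total / len(chunk)
    variance = sum((n - avg) ** 2 for n in lengths) / len(chunk)
    freedom = sum(math.log(DICT.get(w, 1)) for w in chunk if len(w) == 1)
    return (total, avg, -variance, freedom)


def segment(text):
    pos, out = 0, []
    while pos < len(text):
        best = max(chunks_at(text, pos), key=score)
        out.append(best[0])  # keep only the first word of the winning chunk
        pos += len(best[0])
    return out


print("/".join(segment("研究生命起源")))  # 研究/生命/起源
```

In this example the chunks 研究/生命/起源 and 研究生/命/起源 tie on total length and average length, and the variance factor picks the first. Every position in the text repeats this enumeration and scoring, which is consistent with the algorithm being compute-heavy.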
MaxWord, multithreaded: there is not much improvement, which also suggests that the bottleneck of mmseg is computation.
Example analysis and comparison
Example 1
Original
[Using data to tell you how popular mobile games are] Today, the GGS global mobile games summit was held as part of this GMIC. Guests and game developers discussed the current state and development trends of mobile games. "Mobile games" was one of the most important keywords. Qian Donghai, president of Shanda Games, shared a prediction from the CEO of Japan's largest mobile game company: by 2015, 80% of the global gaming industry will be mobile games. http://t.cn/zTHdkFY
Pangu
Use/data/tell/you/how hot the mobile game is/today/AS/current/GMIC/part/GGS/global/mobile/GAME/summit/hold/guest/and /games/developers/people/discussion/mobile/games/status quo/AND/development trends/mobile games are/Most/important/1/large/keywords /Shanda/GAME/president/Qian/Donghai/share/Japan/max/mobile/GAME/Company/CEO/prediction/data/2015/year/global/Game /industry/pattern/medium/80/All/mobile phone/GAME/http/t/cn/zTHdkFY
Mmseg
Use/data/tell/your mobile/GAME/have multiple/hot/today/AS/current/gmic/part/ggs/global/mobile/GAME/summit/ hold/guest/AND/GAME/developer/people/discussion/mobile/GAME/current situation/AND/development/trend/mobile/GAME/is/Most /important/major/key/word/Grand/GAME/president/Qian/Donghai/share/Japan/Largest/mobile/GAME/Company/ceo/prediction/ /data/2015/year/global/GAME/industry/pattern/medium/80/All/mobile/GAME/http/t/cn/zthdkfy
Evaluation
Pangu's segmentation is close to ideal here, while mmseg's "your hand" token, which joins "you" to the first character of "mobile game", is not ideal.
Example 2
Original
A few days ago I saw an article mention that, for batch inserts, you need to add Context.Configuration.AutoDetectChangesEnabled = false;
The original article said: EF tracks data changes automatically by default. When the volume of changed data is large, EF's tracking workload surges and the operation becomes very slow (this is also one of the reasons some people doubt EF's performance). In fact, simply turning off automatic change detection during batch operations solves the slowness.
You can read it yourself: http://www.cnblogs.com/guomingfeng/archive/2013/05/28/mvc-ef-repository.html. Because I had not tested it, I did not know how much faster the result would be.
Then this morning I found an article on the home page that tests Entity Framework, http://www.cnblogs.com/newton/archive/2013/06/06/3120497.html, which says EF performance is poor; it happens not to include this line. I was curious too, so I ran a test myself. This article is not aimed at that author; I am just curious and would like to discuss it.
Pangu
A few days ago/saw/an article/mentioned/In/batch/inserted/accessed/required/added/Context/Configuration/AutoDetectChangesEnabled/=/false/article/original /EF/default/automatic/tracking/data/changes/when/change/data volume/large/when/EF/tracking/ workload/will/surge/but/specify/Operation/become/very/slow/This/also/part/students/suspect/EF/performance/problem/ one/suspect/point /) /actually/as long as/In/batch/Operation/when/update/Close/CAN/solve/slow/problem/everyone/yourself/go/check/ http/www/cnblogs/com/guomingfeng/archive/2013/05/28/mvc/ef/repository/html/due to/no/test/SO/no/know/result/is it/ change/fast/How many/results/morning/Discovery/homepage/One/test/Entity/Framework/article/http/www/cnblogs/com/newton/archive/ 2013/06/06/3120497. /html//EF/performance/no/exactly/not added this sentence/ME/also/curious/SO/yourself/test/down/Write/This/article/not/ /author/just/yourself/curious/mutual/discussion
Mmseg
Previous/days/saw/article/In/mentioned/In/batch/inserted/required/added/context/configuration/autodetectchangesenabled/false/article/original /ef/default/automatic/tracking/data/changes/when/change/data/volume/large/when/ef/ tracking/work/volume/will/surge/but/specify/Operation/become/very/slow/This also/Yes/part/students/suspect/ef/performance/ problem/One/suspect/point/Actual/as long as/batch/Operation/AT/time/automatic/update/Close/CAN/solve/slow/ question/everyone/View/http/www/cnblogs/com/guomingfeng/archive/2013/05/28/mvc/ef/repository/html/AS/no/test/ so/do not know/result/no/change/fast/How much/result/morning/Discovery/homepage/One/test/entity/framework/Article /http/www/cnblogs/com/newton/archive/2013/06/06/3120497/html//ef/performance/no/exactly/no/Add/This sentence/I also/curious /So/yourself/test/down/Write/This/article/not/For/author/just/yourself/curious/mutual/discussion
Evaluation
The mmseg segmentation is finer-grained here, and it handles "a few days ago" and "I don't know if the result is much faster" better than Pangu.
Summary
This evaluation uses only a few test cases. For more, you can download the source code below and test it yourself. As a word segmenter, Pangu is more complete and richer in features; mmseg is more of an algorithm implementation and lacks features.
Download