Part 3: Further Analysis of Baidu's Word Segmentation Algorithm
The previous article concluded that Baidu's word segmentation system uses bidirectional maximum matching. We later found a loophole in that reasoning, and the segmentation procedure we derived was also overly cumbersome, so this article looks again to check whether the earlier derivation was wrong.
So what was the loophole in the previous analysis? We inferred that Baidu uses reverse maximum matching because Baidu segments "北京华烟云" as <北, 京华烟云>. That looks like reverse maximum matching, since forward maximum matching should give <北京, 华, 烟云>. But concluding from this that Baidu uses bidirectional maximum matching was too hasty. As the earlier article also noted, Baidu has two dictionaries, an ordinary dictionary and a proprietary dictionary, and it first cuts out the words found in the proprietary dictionary and then hands the remaining fragments to the ordinary dictionary. So there is another possible explanation for <北, 京华烟云>: the word "京华烟云" (a novel title) is stored in the proprietary dictionary and is therefore cut out first. That leaves only "北", which cannot be split further, so the output is <北, 京华烟云>.
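The difference between the two matching directions can be sketched in a few lines of Python. This is a minimal illustration, not Baidu's implementation: the function names, the `max_len` window, and the toy `vocab` set are all assumptions, and the query is the Chinese string behind the "Beijing Hua Yun" example (北京华烟云).

```python
def forward_max_match(text, vocab, max_len=4):
    """Scan left to right, taking the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len=4):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()  # tokens were collected from the end of the string
    return tokens


# Toy single dictionary (an assumption for illustration, not Baidu's data).
vocab = {"北京", "京华烟云", "烟云"}

print(forward_max_match("北京华烟云", vocab))   # ['北京', '华', '烟云']
print(backward_max_match("北京华烟云", vocab))  # ['北', '京华烟云']
```

With a single dictionary, only the backward scan reproduces the observed <北, 京华烟云>, which is why the earlier article jumped to the reverse-matching conclusion.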
This is only a hypothesis, so is "京华烟云" really in the proprietary dictionary? Take another example, "山东北京华烟云", which Baidu segments as <山东, 北, 京华烟云>. If "京华烟云" were in the ordinary dictionary, reverse segmentation would give <山, 东北, 京华烟云> and forward segmentation would give <山东, 北京, 华, 烟云>. Neither is <山东, 北, 京华烟云>. What does that mean? "京华烟云" must be in the proprietary dictionary, so it is cut out first, and the remaining "山东北" is segmented with the ordinary dictionary. The result <山东, 北> is clearly what forward maximum matching would output. Admittedly, the algorithm derived in the first article also yields <山东, 北> for "山东北", but it needs several more decision steps than forward maximum matching. Since the result is the same, it is reasonable to assume the simpler method is the one used. So our preliminary judgment is that Baidu uses forward maximum matching.
We continue testing which segmentation algorithm is used. To reduce the influence of the proprietary dictionary being applied first, the query should contain no special vocabulary. We construct the query "天才能量级", which should contain no words from the proprietary dictionary. Baidu segments it as <天才, 能量, 级>, which looks like the result of forward maximum matching.
What method is used when the words in the query come from the proprietary dictionary? First we have to make sure the words really are in the proprietary dictionary. How? We construct the query "铺陈晓东方", which Baidu segments as <铺, 陈晓东, 方>. This shows that "陈晓东" is in the proprietary dictionary, because it is cut out first. As another example, "山东京城" is segmented as <山东, 京城>, which shows that "东京" is in the ordinary dictionary. Now construct the query "陈晓东京华烟云". From the analysis above, both of its words are in the proprietary dictionary, and Baidu segments it as <陈晓东, 京华烟云>. So segmentation with the proprietary dictionary also uses either forward or bidirectional maximum matching.
Could it be reverse maximum matching? Construct the query "陈晓东方不败". We are sure that "陈晓东" and "东方不败" both appear in the proprietary dictionary. Forward segmentation should give <陈晓东, 方, 不败> or <陈晓东, 方, 不, 败>, while reverse segmentation should give <陈, 晓, 东方不败>. Baidu's segmentation is <陈晓东, 方, 不败> or <陈晓东, 方, 不, 败>, which shows that forward maximum matching is used.
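The overlap test on the last query ("Chen Xiao East unbeaten", 陈晓东方不败) can be replayed with a small Python sketch. It is only an illustration under stated assumptions: `max_match`, its `backward` flag, and the two-entry `proper` set are hypothetical names and data, not anything known about Baidu's system.

```python
def max_match(text, vocab, max_len=4, backward=False):
    """Maximum-matching segmentation; unknown characters become single tokens."""
    if backward:
        # Right-to-left matching equals left-to-right matching on mirrored input.
        mirrored = max_match(text[::-1], {w[::-1] for w in vocab}, max_len)
        return [tok[::-1] for tok in reversed(mirrored)]
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical proprietary dictionary holding the two overlapping names.
proper = {"陈晓东", "东方不败"}
query = "陈晓东方不败"

print(max_match(query, proper))                 # ['陈晓东', '方', '不', '败']
print(max_match(query, proper, backward=True))  # ['陈', '晓', '东方不败']
```

Because the two names share the character 东, the scan direction decides which name survives, and that is what makes this query a useful probe.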
Further analysis shows that Baidu's dictionary does not contain the word "不败", so Baidu's actual segmentation result is <陈晓东, 方, 不, 败>. This clearly contradicts the algorithm we derived earlier, so that earlier analysis was indeed flawed. The conclusion is that Baidu uses the forward maximum matching algorithm.
To summarize Baidu's word segmentation system: it first segments the query against the proprietary dictionary using forward maximum matching, which cuts out part of the result. The remaining unsegmented text is then passed to the ordinary dictionary and likewise segmented with forward maximum matching, and the combined result is the final output.
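The two-stage procedure summarized above could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the function names, the tiny `proper` and `common` dictionaries, and the 4-character window. It only shows that the hypothesis reproduces the segmentations reported in this article.

```python
def forward_max_match(text, vocab, max_len=4):
    """Forward maximum matching; unknown characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def longest_proper_word(text, start, proper, max_len):
    """Return the longest proprietary-dictionary word starting at `start`, or None."""
    for size in range(min(max_len, len(text) - start), 0, -1):
        piece = text[start:start + size]
        if piece in proper:
            return piece
    return None


def two_stage_segment(text, proper, common, max_len=4):
    """Stage 1 cuts out proprietary words; stage 2 segments the leftover runs."""
    tokens, gap, i = [], "", 0
    while i < len(text):
        word = longest_proper_word(text, i, proper, max_len)
        if word is None:
            gap += text[i]  # not a proprietary word: defer to stage 2
            i += 1
        else:
            tokens.extend(forward_max_match(gap, common, max_len))
            tokens.append(word)
            gap = ""
            i += len(word)
    tokens.extend(forward_max_match(gap, common, max_len))
    return tokens


# Hypothetical dictionaries; "不败" is left out of `common` on purpose,
# matching the observation that Baidu's dictionary lacks it.
proper = {"京华烟云", "陈晓东", "东方不败"}
common = {"北京", "山东", "东北", "东京", "京城", "烟云",
          "天才", "才能", "能量", "量级", "铺陈", "东方"}

for query in ["山东北京华烟云", "天才能量级", "铺陈晓东方", "陈晓东方不败"]:
    print(query, two_stage_segment(query, proper, common))
# 山东北京华烟云 ['山东', '北', '京华烟云']
# 天才能量级 ['天才', '能量', '级']
# 铺陈晓东方 ['铺', '陈晓东', '方']
# 陈晓东方不败 ['陈晓东', '方', '不', '败']
```

Collecting the unmatched characters into a `gap` string keeps the two stages separate, so either dictionary or matching rule can be swapped out to test a different hypothesis against the same queries.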
In addition, Google also uses forward maximum matching for word segmentation, but it does not appear to have a proprietary dictionary, so many proper nouns get chopped into pieces.
From this point of view, Google lags behind Baidu in building its Chinese dictionaries and needs to put in more effort, although this is not a particularly hard problem.