A summary of resources on the MMSEG Chinese word segmentation algorithm

Source: Internet
Author: User
Tags: natural logarithm

Brief introductions to MMSEG: http://www.byywee.com/page/M0/S602/602088.html and http://hi.baidu.com/catro/item/5c76247c0ff6a9376f29f6ed. An explanation of the maximum forward matching algorithm: http://blog.csdn.net/yangyan19870319/article/details/6399871

A simple Python implementation of MMSEG (source download): https://pypi.python.org/pypi/mmseg/1.3.0

MMSEG is a common dictionary-based Chinese word segmentation algorithm (author's homepage: http://chtsai.org/index_tw.html). It is simple and performs reasonably well; because it is simple and intuitive, the implementation is not very complex and it runs fairly fast. For the original description of the algorithm, see: http://technology.chtsai.org/mmseg/

Broadly speaking, current Chinese word segmentation algorithms fall into two categories: dictionary-based and non-dictionary-based.

Dictionary-based segmentation algorithms are the more common kind, for example forward/backward maximum matching and minimum segmentation (minimizing the number of words in a sentence). In practice, several algorithms are usually combined, or one serves as the main method with others supplementing it, and attributes such as part of speech and word frequency are added to assist processing (using some simple mathematical models).

Non-dictionary-based algorithms mainly use probabilistic/statistical and machine-learning methods; the most common today is the CRF (Conditional Random Field, http://en.wikipedia.org/wiki/Conditional_random_field). Such methods let the computer "learn" how to segment from existing annotated data. For a concrete implementation, see http://nlp.stanford.edu/software/segmenter.shtml.

Generally speaking, each approach has pros and cons. Dictionary-based methods are easy to implement and deploy, but their segmentation accuracy is limited and they handle out-of-vocabulary words (words not in the dictionary) poorly. Non-dictionary-based methods are faster, recognize out-of-vocabulary words better, and can achieve higher accuracy, but they are more complex to implement and usually require a lot of upfront work.

MMSEG is a dictionary-based segmentation algorithm that takes forward maximum matching as its core, supplemented by several disambiguation rules. Let's look at it concretely.

As the author explains in the original paper, MMSEG consists of two parts: the "matching algorithm" and the "ambiguity resolution rules". The matching algorithm specifies how the sentence to be segmented is matched against the words stored in the dictionary (forward? backward? at what granularity?). The disambiguation rules decide, when a sentence can be split in more than one way, which segmentation to use. For example, the phrase 设施和服务 ("facilities and services") can be split as 设施_和服_务 or as 设施_和_服务; choosing between these candidate segmentations is the job of the disambiguation rules.

MMSEG has two matching methods:

1. The simple method is plain forward matching: list all dictionary words that begin with the current character. For example, starting from 一 in 一个劲儿 ("persistently"), you might get:

一

一个

一个劲

一个劲儿

These four matching results (assuming all four words are in the dictionary).
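As a minimal sketch of the simple matching method (the toy dictionary and the `simple_match` function name are assumptions for illustration, not from a real lexicon):

```python
# Toy dictionary for illustration only; a real segmenter would load a large lexicon.
DICT = {"一", "一个", "一个劲", "一个劲儿"}
MAX_WORD_LEN = 4  # longest word we bother to look up

def simple_match(text, start=0):
    """Return every dictionary word that begins at position `start` of `text`."""
    matches = []
    for end in range(start + 1, min(start + MAX_WORD_LEN, len(text)) + 1):
        word = text[start:end]
        if word in DICT:
            matches.append(word)
    return matches

print(simple_match("一个劲儿地做事"))  # ['一', '一个', '一个劲', '一个劲儿']
```

With this toy dictionary, all four prefixes are matches, exactly as in the example above.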

2. The complex method matches all "chunks of three words" (the original paper uses the term "chunk"; "phrase" also fits): starting from a given position, enumerate every possible combination of three consecutive words. For example, 研究生命起源 ("research the origin of life") might yield:

研究_生_命

研究_生命_起源

研究生_命_起源

研_究_生

These chunks (depending on the dictionary, there may be many more; this is just an example).
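The chunk enumeration can be sketched with a small recursive function. The toy dictionary below is an assumption; unknown characters fall back to themselves, a common convention in real implementations:

```python
# Toy dictionary; the chunks produced depend entirely on its contents.
DICT = {"研究", "研究生", "生", "生命", "命", "起源"}
MAX_WORD_LEN = 3

def words_at(text, start):
    """All dictionary words starting at `start`; fall back to the single character."""
    ws = [text[start:end]
          for end in range(start + 1, min(start + MAX_WORD_LEN, len(text)) + 1)
          if text[start:end] in DICT]
    return ws or [text[start:start + 1]]

def chunks(text, start=0, depth=3):
    """Every combination of up to three consecutive words beginning at `start`."""
    if start == len(text) or depth == 0:
        return [[]]
    out = []
    for w in words_at(text, start):
        for rest in chunks(text, start + len(w), depth - 1):
            out.append([w] + rest)
    return out

for c in chunks("研究生命起源"):
    print("_".join(c))
```

Note that a chunk need not cover the whole sentence; it is only a window of (up to) three words used to decide the next cut.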

There are four disambiguation rules, applied in turn as filters until only one result remains or the fourth rule has been used. The four rules are:

1. Maximum matching. There are two cases, corresponding to the simple and complex matching methods. For the simple method, choose the longest word; in the example above, 一个劲儿. For the complex method, choose the chunk with the greatest total length and take its first word as the word to cut off; in the example above that would be 研究生 from 研究生_命_起源, or 研究 from 研究_生命_起源 (both chunks have total length 6, so the tie is passed on to the next rule).
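Rule 1 for the complex method can be sketched as a filter over chunks (a minimal sketch; the candidate chunks come from the example in the text, the function name is my own):

```python
def rule1_max_matching(cands):
    """Keep only the chunks whose total character length is the largest."""
    best = max(sum(len(w) for w in c) for c in cands)
    return [c for c in cands if sum(len(w) for w in c) == best]

cands = [["研究", "生命", "起源"],   # total length 6
         ["研究生", "命", "起源"],   # total length 6
         ["研究", "生", "命"]]       # total length 4
print(rule1_max_matching(cands))   # the two length-6 chunks survive
```

Because two chunks tie at length 6, the filter returns both, which is exactly why the later rules are needed.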

2. Largest average word length. If more than one chunk survives rule 1, choose the one with the greatest average word length (average = total characters / number of words). For example, 生活水平 ("living standards") might yield the chunks:

生_活水_平 (4/3 = 1.33)

生活_水_平 (4/3 = 1.33)

生活_水平 (4/2 = 2)

By this rule, the chunk 生活_水平 is chosen.
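Rule 2 can be sketched the same way (the candidate chunks are the text's 生活水平 example; the function name is mine):

```python
def rule2_largest_avg_len(cands):
    """Keep only the chunks with the largest average word length."""
    def avg(c):
        return sum(len(w) for w in c) / len(c)  # total characters / word count
    best = max(avg(c) for c in cands)
    return [c for c in cands if avg(c) == best]

cands = [["生", "活水", "平"],   # 4/3 = 1.33
         ["生活", "水", "平"],   # 4/3 = 1.33
         ["生活", "水平"]]       # 4/2 = 2
print(rule2_largest_avg_len(cands))  # [['生活', '水平']]
```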

3. Smallest variance of word lengths. Since the variation in word lengths can be measured by the standard deviation (http://baike.baidu.com/view/78339.htm), the standard deviation formula is applied directly here. For example:

研究_生命_起源 (standard deviation = sqrt(((2-2)² + (2-2)² + (2-2)²) / 3) = 0)

研究生_命_起源 (standard deviation = sqrt(((3-2)² + (1-2)² + (2-2)²) / 3) ≈ 0.8165)

So the chunk 研究_生命_起源 is chosen.
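Rule 3 is a direct application of the population standard deviation to the word lengths (a sketch; the function name is my own):

```python
import math

def rule3_smallest_stddev(cands):
    """Keep only the chunks whose word lengths have the smallest standard deviation."""
    def stddev(c):
        mean = sum(len(w) for w in c) / len(c)
        return math.sqrt(sum((len(w) - mean) ** 2 for w in c) / len(c))
    best = min(stddev(c) for c in cands)
    return [c for c in cands if stddev(c) == best]

cands = [["研究", "生命", "起源"],   # lengths 2,2,2 -> stddev 0
         ["研究生", "命", "起源"]]   # lengths 3,1,2 -> stddev ~0.8165
print(rule3_smallest_stddev(cands))  # [['研究', '生命', '起源']]
```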

4. Largest sum of degree of morphemic freedom of one-character words. The "degree of morphemic freedom" can be expressed mathematically as log(frequency), the natural logarithm of the word frequency (log here means ln). The rule says: take the natural logarithm of the frequency of every single-character word in each chunk, sum them, and choose the chunk with the largest sum. For example:

设施_和服_务

设施_和_服务

These two chunks contain the single-character words 务 and 和 respectively. Suppose 务 has frequency 5 as a single-character word and 和 has frequency 10. Taking natural logarithms, ln(10) > ln(5), so the chunk containing 和 is chosen, i.e. 设施_和_服务.

Why take the natural logarithm of the frequency? The intuition is that the raw frequency sums of the single-character words in two chunks may be equal even though the actual effects differ. For example:

A_BBB_C (single-character frequencies: A: 3, C: 7)

DD_E_F (single-character frequencies: E: 5, F: 5)

If we simply summed the raw frequencies, the two chunks would tie (3 + 7 = 5 + 5), but in fact different frequency distributions have different effects, so the natural logarithm is taken to tell them apart (ln(3) + ln(7) < ln(5) + ln(5), i.e. 3.0445 < 3.2189).
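Rule 4 can be sketched as follows (the single-character frequency table is the hypothetical one assumed in the text, and the function name is my own):

```python
import math

# Hypothetical single-character word frequencies, as assumed in the text.
FREQ = {"和": 10, "务": 5}

def rule4_largest_log_freq(cands):
    """Keep only the chunks with the largest sum of ln(frequency) over single-character words."""
    def score(c):
        # Unknown single characters default to frequency 1, i.e. ln(1) = 0.
        return sum(math.log(FREQ.get(w, 1)) for w in c if len(w) == 1)
    best = max(score(c) for c in cands)
    return [c for c in cands if score(c) == best]

cands = [["设施", "和服", "务"],   # single char 务: ln(5)  ~ 1.609
         ["设施", "和", "服务"]]   # single char 和: ln(10) ~ 2.303
print(rule4_largest_log_freq(cands))  # [['设施', '和', '服务']]
```

The log also illustrates the tie-breaking point above: summing ln of frequencies separates distributions (such as 3+7 vs 5+5) that raw sums cannot.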

Of these four filtering rules, only the first applies if the simple matching method is used; with the complex method, all four can be used. In practice, the usual configuration is the complex matching method plus all four filtering rules. (The simple method is essentially forward maximum matching and is rarely used in practice.)

By now you should have a general picture of the MMSEG segmentation method. As noted at the beginning, it is an "intuitive" method: it cuts a sentence into words that are "as long as possible" (long here meaning the segmented words themselves are as long as possible) and "as even as possible", which, with a little imagination, feels fairly consistent with Chinese grammar. If you do not require especially high segmentation accuracy, MMSEG is a simple, practical and fast method.

When actually implementing a segmenter, there are roughly the following points to consider:

1. The "method" determines the "speed". In dictionary-based segmentation algorithms, the dictionary's data structure has a relatively large effect on speed (it generally determines the matching method and its speed). There are many ways to structure a dictionary, such as "first-character hash index + binary search over the word body": index the first character of every word with a hash, keep the remaining characters of each word in sorted order, and use binary search. This works, but it is not the fastest. For this kind of dictionary matching, a trie structure is generally preferred. Tries have several variants and implementations; for matching against a large amount of static data (a dictionary is "static" in the sense that once built it is rarely modified), the double-array trie is generally used. There is plenty of material about it online; Google or Baidu it. A few library references:

Darts, http://chasen.org/~taku/software/darts/, C++

Darts-clone, http://code.google.com/p/darts-clone/, C++, better than Darts in some respects.

2. MMSEG's segmentation quality depends heavily on the dictionary (which words it contains and how accurate the word frequencies are), especially the frequencies of single-character words. You can build custom dictionaries for the application domain (a computer thesaurus, a daily-life information thesaurus, a tourism thesaurus, etc.); the more finely you specialize the dictionary, the better the results. The dictionary can also serve special purposes (address segmentation, etc.). For lexicons, see the Sogou cell thesaurus (http://pinyin.sogou.com/dict/) and the corpora Sogou provides (you can compute domain word frequencies from its pre-segmented corpus, http://www.sogou.com/labs/resources.html).

3. Chinese word processing is closely tied to the text encoding (GBK, GB2312, BIG5, UTF-8). UTF-8 is generally used as the primary encoding to reduce complexity.

4. Obtaining all the "chunks" is the more complex part of the MMSEG algorithm, and the approach may differ with the dictionary structure. With a double-array trie it is fairly simple: either recursion or a three-layer for loop. For performance reasons, the for loop is generally used.
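The three-layer for loop can be sketched as follows (a minimal sketch using a plain set as the lexicon in place of a trie; with a real double-array trie, `words_at` would be a common-prefix search):

```python
MAX_WORD_LEN = 4

def words_at(text, start, lexicon):
    """All lexicon words starting at `start`; fall back to the single character."""
    ws = [text[start:end]
          for end in range(start + 1, min(start + MAX_WORD_LEN, len(text)) + 1)
          if text[start:end] in lexicon]
    return ws or [text[start:start + 1]]

def chunks3(text, lexicon):
    """Enumerate all chunks of up to three words with three nested loops."""
    result = []
    for w1 in words_at(text, 0, lexicon):
        p1 = len(w1)
        if p1 == len(text):
            result.append((w1,))
            continue
        for w2 in words_at(text, p1, lexicon):
            p2 = p1 + len(w2)
            if p2 == len(text):
                result.append((w1, w2))
                continue
            for w3 in words_at(text, p2, lexicon):
                result.append((w1, w2, w3))
    return result

lex = {"研究", "研究生", "生命", "命", "起源"}
print(chunks3("研究生命起源", lex))
```

The early `continue` branches handle sentences shorter than three words; otherwise each chunk is exactly three words, matching the complex method described earlier.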

Finally, here is a PHP Chinese word segmentation extension based on the MMSEG algorithm, with some common functions added: http://code.google.com/p/xsplit/

