MMSEG is a widely used dictionary-based algorithm for Chinese word segmentation (author's homepage: http://chtsai.org/index_tw.html). It is simple and gives relatively good results; because it is simple and intuitive, it is not hard to implement and runs relatively fast. For the original description of the algorithm, see: http://technology.chtsai.org/mmseg/
Chinese word segmentation algorithms fall into two broad categories: dictionary-based and non-dictionary-based.
Common dictionary-based algorithms include forward/reverse maximum matching and minimum segmentation (producing the fewest words per sentence). In practice, several algorithms are usually combined, with one as the primary method and the others as supplements, and attributes such as part of speech and word frequency are added to assist processing (using some simple mathematical models).
Non-dictionary-based algorithms generally rely on probability and statistics, machine learning, and similar methods; a common one at present is the CRF (Conditional Random Field, http://en.wikipedia.org/wiki/Conditional_random_field). Such methods let the computer "learn" how to segment text from existing material. For a concrete implementation, see http://nlp.stanford.edu/software/segmenter.shtml.
Each approach has its own advantages and disadvantages. Dictionary-based methods are easy to implement and deploy, but their accuracy is limited and they recognize out-of-vocabulary words (words not in the dictionary) poorly. Non-dictionary-based methods are fast, recognize out-of-vocabulary words well, and can reach high accuracy when tuned to the field of use, but they are complicated to implement and usually require a lot of preparatory work.
MMSEG is a dictionary-based word segmentation algorithm built mainly on forward maximum matching and supplemented by several ambiguity resolution rules. Let's take a closer look:
According to the author's original article, MMSEG has two parts: the "matching algorithm" and the "ambiguity resolution rules". The matching algorithm determines how the text to be segmented is matched against the words stored in the dictionary (forward or reverse? at what granularity?). The ambiguity resolution rules decide which segmentation to use when a sentence can be split in more than one way. For example, "设施和服务" ("facilities and services") can be split as "设施_和服_务" or as "设施_和_服务"; choosing between these results is the job of the ambiguity resolution rules.
MMSEG has two matching methods:
1. Simple method: plain forward matching. Starting from a given character, list every dictionary word that begins there. For example, from "一个劲儿的说话" you can get
一
一个
一个劲
一个劲儿
These are the four matching results (assuming all four words are in the dictionary).
2. Complex method: match all "three-word phrases" (the original text uses the term "chunk"; "phrase" seems the more fitting word here), that is, starting from a given character, find every possible combination of three consecutive words. For example, from "研究生命起源" you can get
研_究_生
研_究_生命
研究生_命_起源
研究_生命_起源
These are some of the possible "phrases" (there may be more, depending on the dictionary; these are only examples).
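The two matching methods can be sketched in Python as follows. This is only an illustration under stated assumptions, not the original implementation: the function names, the `max_len` limit, and the rule that a single character always counts as a word (so matching never stalls on an unknown character) are choices made for this sketch.

```python
def simple_matches(text, pos, dictionary, max_len=4):
    """Simple method: every dictionary word that starts at text[pos]."""
    words = []
    for end in range(pos + 1, min(len(text), pos + max_len) + 1):
        word = text[pos:end]
        # Assumption: a single character is always accepted as a word.
        if len(word) == 1 or word in dictionary:
            words.append(word)
    return words


def chunks(text, pos, dictionary, max_len=4):
    """Complex method: all chunks of up to three consecutive words from pos."""
    result = []
    for w1 in simple_matches(text, pos, dictionary, max_len):
        p2 = pos + len(w1)
        if p2 >= len(text):          # text ended after one word
            result.append((w1,))
            continue
        for w2 in simple_matches(text, p2, dictionary, max_len):
            p3 = p2 + len(w2)
            if p3 >= len(text):      # text ended after two words
                result.append((w1, w2))
                continue
            for w3 in simple_matches(text, p3, dictionary, max_len):
                result.append((w1, w2, w3))
    return result


print(simple_matches("一个劲儿的说话", 0, {"一个", "一个劲", "一个劲儿"}))
# ['一', '一个', '一个劲', '一个劲儿']
```

With a dictionary containing "研究", "研究生", "生命", "命" and "起源", `chunks("研究生命起源", 0, dictionary)` includes both `("研究", "生命", "起源")` and `("研究生", "命", "起源")`.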
There are four ambiguity resolution rules. They are applied in order as filters, until only one result remains or the fourth rule has been applied. The four rules are as follows:
1. Maximum matching. This rule applies differently to the "simple" and "complex" matching methods. With simple matching, select the longest word; among the candidates "一", "一个", "一个劲" and "一个劲儿", that is "一个劲儿". With complex matching, select the phrase with the greatest total length and take its first word as the next segmented word; for "研究生命起源" that means "研究生" from "研究生_命_起源" or "研究" from "研究_生命_起源", since both phrases span six characters.
2. Largest average word length. If more than one phrase remains after rule 1, select the one with the largest average word length (average word length = total number of characters in the phrase / number of words). For example, "生活水平" ("living standard") may produce the following phrases:
生_活水_平 (4/3 = 1.33)
生活_水_平 (4/3 = 1.33)
生活_水平 (4/2 = 2)
By this rule, the phrase "生活_水平" is selected.
3. Smallest variance of word lengths. The variation in word length can be measured with the standard deviation (http://baike.baidu.com/view/78339.htm), so the standard deviation formula can be applied directly. For example:
研究_生命_起源 (standard deviation = sqrt(((2-2)^2 + (2-2)^2 + (2-2)^2) / 3) = 0)
研究生_命_起源 (standard deviation = sqrt(((2-3)^2 + (2-1)^2 + (2-2)^2) / 3) = 0.8165)
So the phrase "研究_生命_起源" is selected.
4. Largest sum of degree of morphemic freedom of one-character words. Here the degree of morphemic freedom is expressed as log(frequency), the natural logarithm of the word frequency (log here means ln). The rule says: take the natural logarithm of the frequency of every one-character word in a phrase, add the values, and choose the phrase with the largest sum. For example:
设施_和服_务
设施_和_服务
The one-character words in these two phrases are "务" and "和" respectively. Suppose "务" has a frequency of 5 as a one-character word and "和" has a frequency of 10. Taking the natural logarithm of 5 and of 10 (about 1.609 and 2.303) and choosing the larger, the phrase containing "和" wins, that is, "设施_和_服务".
You may ask why the natural logarithm of the word frequency is used. The reason is that the summed frequencies of two phrases can be equal even though the phrases differ in practice. For example:
A_BBB_C (word frequency, A: 3, C: 7)
DD_E_F (word frequency, E: 5, F: 5)
These are two phrases in which A, C, E and F are different one-character words. If the raw frequencies were summed, the two phrases would tie (3 + 7 = 5 + 5), yet the spread of the frequencies is different. Taking the natural logarithm separates them: ln(3) + ln(7) < ln(5) + ln(5), that is, 3.0445 < 3.2189.
Of the four rules, only the first can be applied when the simple matching method is used; with the complex matching method, all four can be applied. In practice, the complex matching method together with all four rules is what is generally used. (The simple matching method is really just forward maximum matching, and it is rarely used on its own in practice.)
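The rule chain can be sketched as a sequence of filters over candidate chunks. This is a minimal illustration rather than the reference implementation: the function names are hypothetical, variance is used in place of standard deviation (the ordering is the same), and a character missing from the frequency table is assumed to have frequency 1.

```python
import math


def total_length(chunk):
    """Rule 1: total number of characters in the chunk."""
    return sum(len(w) for w in chunk)


def average_length(chunk):
    """Rule 2: average word length."""
    return total_length(chunk) / len(chunk)


def negative_variance(chunk):
    """Rule 3: negated variance, so that 'largest' means 'most even'."""
    mean = average_length(chunk)
    return -sum((len(w) - mean) ** 2 for w in chunk) / len(chunk)


def make_freedom_score(freq):
    """Rule 4: sum of ln(frequency) over the one-character words."""
    def score(chunk):
        # Assumption: unknown characters get frequency 1, i.e. ln(1) = 0.
        return sum(math.log(freq.get(w, 1)) for w in chunk if len(w) == 1)
    return score


def filter_chunks(candidates, freq):
    """Apply the four rules in order; stop once a single chunk remains."""
    rules = [total_length, average_length, negative_variance,
             make_freedom_score(freq)]
    for rule in rules:
        best = max(rule(c) for c in candidates)
        candidates = [c for c in candidates
                      if math.isclose(rule(c), best, abs_tol=1e-9)]
        if len(candidates) == 1:
            break
    return candidates


print(filter_chunks([("设施", "和服", "务"), ("设施", "和", "服务")],
                    {"务": 5, "和": 10}))
# [('设施', '和', '服务')]
```

In this example the two chunks tie on rules 1 to 3 (same total length, same average, same variance), so the decision falls to rule 4, where ln(10) beats ln(5).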
By now the MMSEG method should be broadly clear. As said at the start of the article, it is an intuitive approach: it splits a sentence into words that are "as long as possible" (each segmented word should be as long as it can be) and "as even as possible", which, if you think about it, matches the habits of Chinese grammar quite well. If the accuracy requirements are not especially high, MMSEG is a simple, practical, and fast method.
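Putting the pieces together, a whole sentence can be segmented by repeatedly choosing the best chunk at the current position, emitting only its first word, and moving on. The sketch below is one compact way to do this, under several assumptions: `freq` is a hypothetical word-to-frequency table that doubles as the dictionary, a single character is always accepted as a word, unknown characters count as frequency 1, and the four rules are expressed as one tuple sort key (comparing tuples left to right gives the same priority order as filtering rule by rule).

```python
import math


def words_at(text, pos, freq, max_len=4):
    """Dictionary words starting at pos; a single character is always allowed."""
    return [text[pos:end]
            for end in range(pos + 1, min(len(text), pos + max_len) + 1)
            if end == pos + 1 or text[pos:end] in freq]


def chunks_at(text, pos, freq, depth=3):
    """Recursively build chunks of up to `depth` words starting at pos."""
    if depth == 0 or pos >= len(text):
        return [()]
    return [(w,) + rest
            for w in words_at(text, pos, freq)
            for rest in chunks_at(text, pos + len(w), freq, depth - 1)]


def rank(chunk, freq):
    """Sort key: rules 1 to 4 in priority order (larger is better)."""
    lengths = [len(w) for w in chunk]
    total = sum(lengths)
    mean = total / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    freedom = sum(math.log(freq.get(w, 1)) for w in chunk if len(w) == 1)
    # Rounding keeps floating-point noise from breaking genuine ties.
    return (total, round(mean, 6), -round(variance, 6), freedom)


def segment(text, freq):
    """Segment text by taking the first word of the best chunk each time."""
    pos, words = 0, []
    while pos < len(text):
        best = max(chunks_at(text, pos, freq), key=lambda c: rank(c, freq))
        words.append(best[0])
        pos += len(best[0])
    return words


freq = {"设施": 100, "和服": 20, "服务": 100, "和": 10, "务": 5}
print(segment("设施和服务", freq))
# ['设施', '和', '服务']
```

The frequencies for the multi-character words are made up for the example; only the one-character frequencies (5 and 10) affect rule 4.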
When implementing a word segmentation program, consider the following:
1. The method determines the speed. In a dictionary-based segmentation algorithm, the structure of the dictionary has a large effect on speed, since it generally determines how matching is done and how fast it is. There are many ways to build a dictionary. One is "first-character index + binary search": index the first character of every word with a hash, sort the remaining parts of the words, and look them up with binary search. This works, but it is not the fastest. For this kind of dictionary matching, a trie is generally the first choice. The trie has several variants and implementations; for matching against a large amount of static data (a dictionary is rarely modified once built, hence "static"), the double-array trie is generally used. Here are a few libraries for reference:
Darts, http://chasen.org/~taku/software/darts/, C++
Darts-clone, http://code.google.com/p/darts-clone/, C++, better than Darts in some respects
2. The quality of MMSEG's output depends heavily on the dictionary: which words it contains and, above all, how accurate the word frequencies are. You can build custom dictionaries for the field of use (for example a computing dictionary, a daily-life dictionary, or a tourism dictionary) to get better results, and you can also use a dictionary for special purposes such as address segmentation. For word lists, see the Sogou cell dictionaries (http://pinyin.sogou.com/dict/) and the corpora Sogou provides (http://www.sogou.com/labs/resources.html), which can be segmented to compute word frequencies for a particular domain.
3. Chinese word segmentation is closely tied to the character encoding (GBK, GB2312, Big5, UTF-8). Standardizing on UTF-8 is the usual choice and reduces encoding complexity.
4. Enumerating all the "chunks" is the more complicated part of the MMSEG algorithm, and the approach depends on the dictionary structure. With a double-array trie it becomes simpler: you can use either recursion or three nested for loops. For performance reasons, the for loops are generally preferred.
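To illustrate why point 1 favors a trie, here is a small sketch of prefix lookup: one walk down the tree finds every dictionary word that starts at a given position, with no repeated hashing of substrings. Note that this is a plain nested-dict trie written for illustration, not the compact double-array layout that Darts implements; the class and method names are hypothetical.

```python
class Trie:
    """A plain nested-dict trie (illustrative; not a double-array trie)."""

    _END = object()  # sentinel key marking the end of a stored word

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.insert(word)

    def insert(self, word):
        """Add one word, creating child nodes as needed."""
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self._END] = True

    def prefixes(self, text, pos=0):
        """Return every stored word that starts at text[pos], shortest first."""
        node, found = self.root, []
        for end in range(pos, len(text)):
            node = node.get(text[end])
            if node is None:      # no stored word continues with this character
                break
            if self._END in node:
                found.append(text[pos:end + 1])
        return found


trie = Trie(["一", "一个", "一个劲", "一个劲儿", "研究"])
print(trie.prefixes("一个劲儿的说话"))
# ['一', '一个', '一个劲', '一个劲儿']
```

The result of `prefixes` is exactly what the simple matching method needs at each position, so a structure like this can serve as the lookup step when enumerating chunks.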
Here is a PHP Chinese word segmentation extension based on the MMSEG algorithm, with some commonly used functions added: http://code.google.com/p/xsplit/