I don't understand the complex maximum matching algorithm in MMSEG. Can someone explain it to me with an example?
Reply content:
First of all, an important concept in MMSEG is the chunk: a group of three consecutive words.
MMSEG is a conventional, dictionary-based method, and its rules were derived from analysing a corpus and adapting to it.
It has four rules:
Rule 1: Maximum matching (take the chunk with the greatest total length)
Rule 2: Largest average word length (take the chunk whose words are longest on average)
Rule 3: Smallest variance of word lengths (take the chunk whose word lengths vary the least)
Rule 4: Largest sum of degree of morphemic freedom of one-character words
(take the chunk with the greatest morphemic freedom, that is, the largest sum of the logarithms of the frequencies of the words in the chunk)
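The measures behind Rules 1 to 3 can be sketched in a few lines of Python. This is only an illustration, not MMSEG's own code: the function names are made up here, a chunk is assumed to be a tuple of words, and length is assumed to be counted in characters.

```python
from fractions import Fraction

def total_length(chunk):
    """Rule 1 measure: total number of characters in the chunk."""
    return sum(len(word) for word in chunk)

def average_length(chunk):
    """Rule 2 measure: mean word length, kept exact with Fraction."""
    return Fraction(total_length(chunk), len(chunk))

def length_variance(chunk):
    """Rule 3 measure: population variance of the word lengths."""
    mean = average_length(chunk)
    return sum((len(word) - mean) ** 2 for word in chunk) / len(chunk)
```

Fractions are used instead of floats so that two chunks with the same average or variance compare as exactly equal.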
An example makes this quicker to understand.
We use the MMSEG example 研究生命起源 ("study the origins of life"). Note first that the chunks you get depend heavily on your own dictionary, so you will not necessarily get exactly the chunks below.
(研究 = research, 研究生 = postgraduate, 生命 = life, 起源 = origin; length is counted in characters.)
研 | 究 | 生 (length = 3)
研 | 究 | 生命 (length = 4)
研究 | 生 | 命 (length = 4)
研究 | 生命 | 起 (length = 5)
研究 | 生命 | 起源 (length = 6)
研究生 | 命 | 起 (length = 5)
研究生 | 命 | 起源 (length = 6)
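To see where such a list of chunks comes from, here is a rough sketch of enumerating them. It assumes a toy dictionary containing only 研究, 研究生, 生命 and 起源, and it assumes any single character may stand as a word; the helper names `words_at` and `chunks` are invented for this illustration. A real dictionary will produce different chunks.

```python
def words_at(text, pos, dictionary):
    """Candidate words starting at pos: any single character, plus dictionary entries."""
    max_len = max(map(len, dictionary), default=1)
    found = []
    for end in range(pos + 1, min(len(text), pos + max_len) + 1):
        word = text[pos:end]
        if len(word) == 1 or word in dictionary:
            found.append(word)
    return found

def chunks(text, dictionary, size=3):
    """All chunks of up to `size` consecutive words starting at the beginning of text."""
    def extend(pos, prefix):
        if len(prefix) == size or pos == len(text):
            yield tuple(prefix)
            return
        for word in words_at(text, pos, dictionary):
            yield from extend(pos + len(word), prefix + [word])
    return list(extend(0, []))

for chunk in chunks("研究生命起源", {"研究", "研究生", "生命", "起源"}):
    print(" | ".join(chunk), "(length = %d)" % sum(map(len, chunk)))
```

With this toy dictionary the loop prints seven chunks, one per possible three-word split from the start of the text.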
Then apply the four rules to them in turn.
By Rule 1, keep the chunks with the greatest total length:
研究 | 生命 | 起源 (research | life | origin; length = 6, average length = 2)
研究生 | 命 | 起源 (postgraduate | life | origin; length = 6, average length = 2)
By Rule 2, keep the chunks with the largest average length. Both average 2, so both stay:
研究 | 生命 | 起源 (length = 6, average length = 2, variance = 0)
研究生 | 命 | 起源 (length = 6, average length = 2, variance = ((3-2)^2 + (1-2)^2 + (2-2)^2) / 3 = 2/3)
By Rule 3, keep the chunk with the smallest variance:
研究 | 生命 | 起源 (length = 6, average length = 2, variance = 0)
Only one chunk is left, so Rule 4 does not need to be applied.
The final result is "研究 | 生命 | 起源" (research | life | origin).
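The filtering steps above can be sketched as a small cascade: each rule keeps only the chunks that score best, and the process stops once a single chunk remains. The names `RULES` and `resolve` are made up for this sketch, and the seven candidate chunks are simply the ones from the example.

```python
from statistics import mean, pvariance

# Each rule: (name, scoring function, whether the best score is the max or the min).
RULES = [
    ("rule 1", lambda c: sum(len(w) for w in c), max),            # total length
    ("rule 2", lambda c: mean([len(w) for w in c]), max),         # average word length
    ("rule 3", lambda c: pvariance([len(w) for w in c]), min),    # variance of word lengths
]

def resolve(chunks, rules=RULES):
    """Apply the rules in order, keeping only the best-scoring chunks each time."""
    remaining = list(chunks)
    for _name, score, pick in rules:
        if len(remaining) == 1:
            break
        best = pick(score(c) for c in remaining)
        remaining = [c for c in remaining if score(c) == best]
    return remaining

candidates = [
    ("研", "究", "生"), ("研", "究", "生命"), ("研究", "生", "命"),
    ("研究", "生命", "起"), ("研究", "生命", "起源"),
    ("研究生", "命", "起"), ("研究生", "命", "起源"),
]
print(resolve(candidates))
```

If `resolve` returns more than one chunk, Rules 1 to 3 could not separate them and Rule 4 would be needed.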
If you do need Rule 4, the word frequencies have to be defined by you in your dictionary: each word in the dictionary gets a frequency taken from your own data at some point in time.
For example, for chunk number 5, 研究 | 生命 | 起源: the frequency of 研究 (research) = 3, of 生命 (life) = 5, and of 起源 (origin) = 7.
The sum of the logarithms = ln 3 + ln 5 + ln 7 = ln 105 ≈ 4.65
The sums of logarithms for the other chunks are computed the same way.
Finally, compare the remaining chunks and take the one with the largest sum.
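The calculation above can be sketched as follows. The frequencies are the made-up ones from the example, and the function names are invented here. Note that this follows the example and sums over every word in the chunk, whereas Rule 4 as stated applies to one-character words only, so treat it as an illustration of the arithmetic rather than of MMSEG itself.

```python
from math import log

# Hypothetical frequencies taken from the example; a real dictionary supplies its own.
FREQ = {"研究": 3, "生命": 5, "起源": 7}

def log_freq_sum(chunk, freq):
    """Sum of the natural logs of the word frequencies in the chunk."""
    return sum(log(freq[word]) for word in chunk)

def pick_by_log_freq(chunks, freq):
    """Keep the chunk(s) with the largest sum of log frequencies."""
    scores = [log_freq_sum(c, freq) for c in chunks]
    best = max(scores)
    return [c for c, s in zip(chunks, scores) if s == best]

print(round(log_freq_sum(("研究", "生命", "起源"), FREQ), 2))
```

Summing logarithms is the same as comparing the products of the frequencies (here 3 × 5 × 7 = 105), but it avoids very large numbers.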
But if more than one chunk still remains after all four rules have been applied, MMSEG cannot resolve the ambiguity.