Problems in the MMSEG algorithm

Source: Internet
Author: User
I don't understand the complex maximum matching algorithm in MMSEG. Can someone explain it to me with an example?

Reply content:

First of all, an important concept in MMSEG is the chunk: a candidate segmentation made up of three consecutive words.
The MMSEG rules are empirical: they were derived by analysing and generalising over a language corpus.

It has four rules:
Rule 1: Maximum matching (take the chunks with the greatest total length)
Rule 2: Largest average word length (take the chunks whose words are longest on average)
Rule 3: Smallest variance of word lengths (take the chunks whose word lengths vary the least)
Rule 4: Largest sum of degree of morphemic freedom of one-character words

    (Take the chunk with the largest frequency-based freedom, that is, the largest sum of the logarithms of the frequencies of the words in the chunk.)

An example makes this quicker to understand.
We use MMSEG's own example, "研究生命起源" ("study the origin of life"). Note first that which chunks get cut depends heavily on your own dictionary, so you will not necessarily end up with the chunks below.

    1. 研 | 究 | 生 (length = 3)

    2. 研 | 究 | 生命 (length = 4)

    3. 研究 | 生 | 命 (length = 4)

    4. 研究 | 生命 | 起 (length = 5)

    5. 研究 | 生命 | 起源 (length = 6)

    6. 研究生 | 命 | 起 (length = 5)

    7. 研究生 | 命 | 起源 (length = 6)
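How such a list of chunks comes out of a dictionary can be sketched roughly as follows. This is only a minimal illustration, not MMSEG's actual code: the four-entry `DICTIONARY`, the `MAX_WORD_LEN` limit and the helper names `words_at` and `chunks_from` are all assumptions made up for this example, and the input is assumed to be the Chinese sentence "研究生命起源".

```python
# Hypothetical toy dictionary (an assumption); a real dictionary is far larger.
DICTIONARY = {"研究", "研究生", "生命", "起源"}
MAX_WORD_LEN = 3  # longest dictionary entry, in characters


def words_at(text, start):
    """Return every candidate word beginning at `start`, shortest first."""
    # A single character always counts as a word, even without a dictionary entry.
    candidates = [text[start]]
    for end in range(start + 2, min(start + MAX_WORD_LEN, len(text)) + 1):
        if text[start:end] in DICTIONARY:
            candidates.append(text[start:end])
    return candidates


def chunks_from(text, start=0, size=3):
    """Enumerate all chunks of up to `size` consecutive words from `start`."""
    if size == 0 or start >= len(text):
        return [[]]
    result = []
    for word in words_at(text, start):
        for rest in chunks_from(text, start + len(word), size - 1):
            result.append([word] + rest)
    return result


chunks = chunks_from("研究生命起源")
for number, chunk in enumerate(chunks, start=1):
    print(f"{number}. {' | '.join(chunk)} (length = {sum(map(len, chunk))})")
```

With this toy dictionary the sketch yields seven chunks with lengths 3, 4, 4, 5, 6, 5 and 6; a different dictionary would give a different list.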

Then apply the four rules to them in turn.

Rule 1: keep the chunks with the greatest total length

    1. 研究 | 生命 | 起源 (length = 6, average length = 2)

    2. 研究生 | 命 | 起源 (length = 6, average length = 2)

Rule 2: keep the chunks with the largest average word length. Both chunks have an average length of 2, so both remain.

    1. 研究 | 生命 | 起源 (length = 6, average length = 2, variance = 0)

    2. 研究生 | 命 | 起源 (length = 6, average length = 2, variance = 2/3)
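The variance figures can be checked with a few lines of code. This is a small sketch that assumes "variance" means the population variance of the word lengths (dividing by the number of words); the function name `length_variance` is made up for illustration, and the two chunks are assumed to be the Chinese segmentations with word lengths 2, 2, 2 and 3, 1, 2.

```python
from fractions import Fraction


def length_variance(chunk):
    """Population variance of the word lengths in a chunk, as an exact fraction."""
    lengths = [len(word) for word in chunk]
    mean = Fraction(sum(lengths), len(lengths))
    return sum((n - mean) ** 2 for n in lengths) / len(lengths)


even = ["研究", "生命", "起源"]      # word lengths 2, 2, 2
uneven = ["研究生", "命", "起源"]    # word lengths 3, 1, 2

print(length_variance(even))    # 0
print(length_variance(uneven))  # 2/3
```

Both chunks have a mean word length of 2, so the second one's squared deviations are 1, 1 and 0, which average to 2/3.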

Rule 3: keep the chunk with the smallest variance

    1. 研究 | 生命 | 起源 (length = 6, average length = 2, variance = 0)

Because only one chunk is left, rule 4 does not have to be applied.
The final result is "研究 | 生命 | 起源" (research | life | origin).
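The filtering by rules 1 to 3 can be sketched as a simple cascade. This is only an illustration under assumptions, not a reference implementation: `CHUNKS` hard-codes the seven candidate chunks for "研究生命起源", and `RULES` and `disambiguate` are names invented for this example.

```python
from fractions import Fraction

# The seven candidate chunks of the example (assumed, dictionary-dependent).
CHUNKS = [
    ["研", "究", "生"],
    ["研", "究", "生命"],
    ["研究", "生", "命"],
    ["研究", "生命", "起"],
    ["研究", "生命", "起源"],
    ["研究生", "命", "起"],
    ["研究生", "命", "起源"],
]


def total_length(chunk):
    return sum(len(word) for word in chunk)


def average_length(chunk):
    return Fraction(total_length(chunk), len(chunk))


def variance(chunk):
    mean = average_length(chunk)
    return sum((len(word) - mean) ** 2 for word in chunk) / len(chunk)


# Each rule is a score to maximise; variance is negated so smaller is better.
RULES = [
    ("rule 1", total_length),
    ("rule 2", average_length),
    ("rule 3", lambda chunk: -variance(chunk)),
]


def disambiguate(chunks, rules=RULES):
    """Apply the rules in order, keeping only the best-scoring chunks each time."""
    candidates = list(chunks)
    for name, score in rules:
        if len(candidates) == 1:
            break  # nothing left to disambiguate
        best = max(score(c) for c in candidates)
        candidates = [c for c in candidates if score(c) == best]
        print(name, "->", [" | ".join(c) for c in candidates])
    return candidates


result = disambiguate(CHUNKS)
```

Rule 1 leaves two chunks, rule 2 cannot separate them, and rule 3 leaves a single one, so `result` holds just that chunk.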

If you do need rule 4, the word frequencies are something you have to define yourself: each word in your dictionary must carry its own frequency.
For example, for chunk 5: the frequency of "研究" (research) = 3, of "生命" (life) = 5, and of "起源" (origin) = 7.
The sum of their logarithms = ln 3 + ln 5 + ln 7.
The sum for every other remaining chunk is calculated in the same way.
Finally, compare the remaining chunks and keep whichever has the largest sum.
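The sum for that chunk can be worked out in a couple of lines. This sketch follows the numbers given here and sums over every word in the chunk; going by the rule's name, it is meant for one-character words, so a stricter version would skip longer words. The `FREQUENCY` table and the function name `log_frequency_sum` are assumptions for illustration only.

```python
import math

# Hypothetical frequencies taken from the example; real values come from your dictionary.
FREQUENCY = {"研究": 3, "生命": 5, "起源": 7}


def log_frequency_sum(chunk, frequency=FREQUENCY):
    """Sum of the natural logarithms of the word frequencies in a chunk."""
    return sum(math.log(frequency[word]) for word in chunk)


score = log_frequency_sum(["研究", "生命", "起源"])
print(round(score, 4))  # 4.654, i.e. ln 3 + ln 5 + ln 7 = ln 105
```

Summing logarithms is the same as taking the logarithm of the product of the frequencies, so the chunk with the largest sum is also the one with the largest product.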

But if more than one chunk still remains after all four rules have been applied, MMSEG cannot resolve the ambiguity.
