Mmseg Chinese Word Segmentation Algorithm

Last Update:2018-12-05 Source: Internet

Author: User

Java has several open-source word segmentation projects, such as IK, paoding, and mmseg4j. This article focuses on the MMSEG algorithm used in mmseg4j. The original description of the algorithm, written in English, is at http://policy.chtsai.org/mmseg/; this article is only a summary of it.

Why Chinese Word Segmentation
Chinese and English are written differently. English words are separated by spaces, and each word carries a meaning (phrases exist too, but they are not the main case). Chinese is written without separators, and a single character often combines with adjacent characters to express a meaning. For example, "Middle School Principal" cannot be divided into the two words "Middle School" and "principal". If Chinese words were separated by spaces, no segmentation would be needed.

I. Division rules
The sentence to be segmented is divided into chunks, each of which contains three words. Each word is either a single character or a multi-character word found in the dictionary. The remaining unsegmented part of the sentence is then divided by the same rule.
Why does a chunk consist of three words rather than some other number?
Perhaps because the basic structure of a Chinese sentence is subject, predicate, and object; not using more words is also a compromise between accuracy and performance.
For example, the following chunks are available:
1. It's coming soon
2. It's coming soon
3. You can see it.
4. Eye Recognition
5. Just watch
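
The division rule can be sketched in code. The following is a minimal illustration, not the mmseg4j implementation: Latin letters stand in for Chinese characters, the dictionary is a made-up toy set, and the names ChunkEnumerationSketch, wordsAt, and chunksAt are hypothetical. It assumes that a single character always counts as a word, even when it is not in the dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Sketch only: enumerates every chunk of up to three words that starts at a
// given position. The dictionary and all names here are illustrative.
final class ChunkEnumerationSketch {

    // Words starting at `start`: the single character itself (assumed to
    // always be a valid word) plus every longer dictionary match.
    static List<String> wordsAt(String text, int start, Set<String> dict, int maxLen) {
        List<String> words = new ArrayList<>();
        words.add(text.substring(start, start + 1));
        int limit = Math.min(text.length(), start + maxLen);
        for (int end = start + 2; end <= limit; end++) {
            String candidate = text.substring(start, end);
            if (dict.contains(candidate)) {
                words.add(candidate);
            }
        }
        return words;
    }

    static List<List<String>> chunksAt(String text, int start, Set<String> dict, int maxLen) {
        List<List<String>> chunks = new ArrayList<>();
        extend(text, start, dict, maxLen, new ArrayList<>(), chunks);
        return chunks;
    }

    private static void extend(String text, int pos, Set<String> dict, int maxLen,
                               List<String> current, List<List<String>> out) {
        // A chunk is complete after three words or when the text is exhausted.
        if (current.size() == 3 || pos == text.length()) {
            if (!current.isEmpty()) {
                out.add(new ArrayList<>(current));
            }
            return;
        }
        for (String word : wordsAt(text, pos, dict, maxLen)) {
            current.add(word);
            extend(text, pos + word.length(), dict, maxLen, current, out);
            current.remove(current.size() - 1);
        }
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("ab", "cd", "ef", "abc");
        for (List<String> chunk : chunksAt("abcdef", 0, dict, 4)) {
            System.out.println(String.join("_", chunk));
        }
    }
}
```

With this toy dictionary the text "abcdef" yields seven candidate chunks, among them ab_cd_ef, which is the only one that covers all six characters.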

II. Filter rules
The division above produces multiple chunks. To select a single chunk, four rules are applied as successive filters. If only one chunk remains after a rule, filtering stops. If more than one chunk still remains after all four rules, the case is treated as an exception. The four rules are: maximum matching, largest average word length, smallest variance of word lengths, and largest sum of the degree of morphemic freedom of one-character words.
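
The successive filtering with early exit can be sketched as follows. This is an illustrative outline under stated assumptions, not the mmseg4j code: each chunk is represented only by its word lengths, each rule is modeled as a score where higher is better, and the names FilterPipelineSketch, keepBest, and select are hypothetical. Only two of the rules from the following subsections are plugged in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Sketch only: applies scoring rules one after another and stops as soon as
// a single chunk remains. A chunk is modeled as an array of word lengths.
final class FilterPipelineSketch {

    // Keeps every chunk whose score ties for the highest value.
    static List<int[]> keepBest(List<int[]> chunks, ToDoubleFunction<int[]> score) {
        double best = Double.NEGATIVE_INFINITY;
        for (int[] chunk : chunks) {
            best = Math.max(best, score.applyAsDouble(chunk));
        }
        List<int[]> kept = new ArrayList<>();
        for (int[] chunk : chunks) {
            if (score.applyAsDouble(chunk) == best) {
                kept.add(chunk);
            }
        }
        return kept;
    }

    // Applies the rules in order; exits early once one chunk is left.
    static List<int[]> select(List<int[]> chunks, List<ToDoubleFunction<int[]>> rules) {
        List<int[]> remaining = chunks;
        for (ToDoubleFunction<int[]> rule : rules) {
            if (remaining.size() <= 1) {
                break;
            }
            remaining = keepBest(remaining, rule);
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Hypothetical chunks, given as word lengths.
        List<int[]> chunks = List.of(new int[]{3, 3}, new int[]{2, 2, 2}, new int[]{2, 2, 1});
        ToDoubleFunction<int[]> totalLength = c -> Arrays.stream(c).sum();
        ToDoubleFunction<int[]> averageLength = c -> (double) Arrays.stream(c).sum() / c.length;
        List<int[]> result = select(chunks, List.of(totalLength, averageLength));
        System.out.println(Arrays.toString(result.get(0))); // prints [3, 3]
    }
}
```

If the returned list still holds more than one chunk, the caller has to decide how to handle that leftover ambiguity.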

2.1. Maximum matching
Add up the number of characters in the three words and take the chunk with the greatest total length.
The first chunk above has a total length of 6 characters, so it is selected.

2.2. Maximum Average Word Length
That is, the total number of characters in the chunk divided by the number of words. For example:
1. Internationalization
2. Internationalization
3. Internationalization
The average length of the three chunks is 1.
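
As a small illustration of the average word length computation, the sketch below scores three hypothetical ways of splitting the same three characters. Latin letters stand in for Chinese characters, the class and method names are made up for this example, and it assumes each character occupies one Java char.

```java
import java.util.List;

// Sketch only: average word length of a chunk, where a chunk is a list of words.
final class AverageWordLengthSketch {

    // Total number of characters in the chunk divided by the number of words.
    static double averageWordLength(List<String> chunk) {
        int totalChars = 0;
        for (String word : chunk) {
            totalChars += word.length();
        }
        return (double) totalChars / chunk.size();
    }

    public static void main(String[] args) {
        System.out.println(averageWordLength(List.of("abc")));         // prints 3.0
        System.out.println(averageWordLength(List.of("ab", "c")));     // prints 1.5
        System.out.println(averageWordLength(List.of("a", "b", "c"))); // prints 1.0
    }
}
```

Under this rule the split with the fewest words for the same text scores highest.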

2.3. Minimum word length variance
First, recall what variance is.
Variance is the mean of the squared differences between each value and the average. Formula:
variance = (1/n) [(x1 - s)^2 + (x2 - s)^2 + ... + (xn - s)^2], where s is the average of x1 ~ xn.
(The sample variance divides by n - 1 instead; for chunks with the same number of words, either form ranks the chunks identically.)
Variance measures how far x1 ~ xn deviate from s. The smaller the variance, the more tightly x1 ~ xn cluster around s. When x1 ~ xn all equal s, the variance is 0, meaning all values coincide at a single point.

For example, consider two chunks:
1. Children
2. Girls

x1 ~ xn are the lengths of the words in the chunk, and s is the chunk's average word length.
The variance of the first chunk is:
[(2 - 5/3)^2 + (2 - 5/3)^2 + (1 - 5/3)^2] / 3 = [(1/3)^2 + (1/3)^2 + (-2/3)^2] / 3 =
(0.1111 + 0.1111 + 0.4444) / 3 = 0.2222
The variance of the second chunk is:
[(1 - 5/3)^2 + (3 - 5/3)^2 + (1 - 5/3)^2] / 3 = [(-2/3)^2 + (4/3)^2 + (-2/3)^2] / 3 =
(0.4444 + 1.7778 + 0.4444) / 3 = 0.8889
So the first chunk is selected. In this example the rule actually filters incorrectly, because the second chunk is closer to the intended meaning.
Why take the smallest variance? Because that choice has a relatively high probability of being correct.
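
The worked arithmetic above can be checked with a short sketch. It divides by the number of words, as the worked example does, and takes the word lengths (2, 2, 1) and (1, 3, 1) from that example; the class and method names are hypothetical.

```java
import java.util.Locale;

// Sketch only: variance of the word lengths in one chunk, dividing by n
// to match the worked example in the text.
final class WordLengthVarianceSketch {

    static double variance(int[] wordLengths) {
        double mean = 0;
        for (int length : wordLengths) {
            mean += length;
        }
        mean /= wordLengths.length;

        double sumOfSquares = 0;
        for (int length : wordLengths) {
            sumOfSquares += (length - mean) * (length - mean);
        }
        return sumOfSquares / wordLengths.length;
    }

    public static void main(String[] args) {
        // Word lengths of the two chunks from the example.
        System.out.println(String.format(Locale.ROOT, "%.4f", variance(new int[]{2, 2, 1})));
        System.out.println(String.format(Locale.ROOT, "%.4f", variance(new int[]{1, 3, 1})));
    }
}
```

The two results are 2/9 and 8/9, roughly 0.22 and 0.89, so the chunk with lengths (2, 2, 1) has the smaller variance and is the one this rule keeps.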

2.4. Largest sum of the degree of morphemic freedom of one-character words
Take the one-character words in each chunk and sum their degrees of morphemic freedom, then select the chunk with the largest sum. A high-frequency Chinese character is more likely to stand alone as a word, so it has a higher degree of morphemic freedom. The character frequencies are computed in advance and stored in a predefined dictionary. For example:
1. Considerations
2. Top priority
In chunk 1, the degree of morphemic freedom of the one-character word is 13.84, while in chunk 2 it is 13.64. This indicates that the character in chunk 1 is more likely to stand alone as a word, so the first chunk is the one the algorithm selects here.

The formula for calculating the degree of morphemic freedom in mmseg is:
freq = (int) (Math.log(Integer.parseInt(rate)) * 100);
The purpose of this formula is to give characters whose frequencies differ only slightly the same degree of morphemic freedom.
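
A minimal sketch of this rule follows. It assumes the formula reads as the natural logarithm of the raw frequency, multiplied by 100 and truncated to an integer, which is one reading of the line quoted above. The frequency table, the choice to score unknown characters as 0, and all names are hypothetical; Latin letters stand in for Chinese characters.

```java
import java.util.List;
import java.util.Map;

// Sketch only: sums the log-scaled frequencies of the one-character words
// in a chunk. Frequencies and names are illustrative, not mmseg4j data.
final class MorphemicFreedomSketch {

    // Log-scaled frequency, truncated to an integer so that characters
    // with nearly equal raw frequencies map to the same value.
    static int freedom(int rawFrequency) {
        return (int) (Math.log(rawFrequency) * 100);
    }

    // Only one-character words contribute; longer words are ignored.
    // Assumption: a character missing from the table counts as frequency 1,
    // which contributes 0.
    static int freedomSum(List<String> chunk, Map<String, Integer> charFrequency) {
        int sum = 0;
        for (String word : chunk) {
            if (word.length() == 1) {
                sum += freedom(charFrequency.getOrDefault(word, 1));
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> charFrequency = Map.of("a", 1000, "d", 10);
        System.out.println(freedomSum(List.of("a", "bc", "ef"), charFrequency));
        System.out.println(freedomSum(List.of("ab", "c", "d"), charFrequency));
    }
}
```

The chunk with the larger sum would be kept; here that is the first one, because its only one-character word has the higher raw frequency.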

As the above shows, the MMSEG algorithm is not completely accurate. The original article states: "In a sample composed of 1013 words, the correct recognition rate of this system has reached 98.41%." Currently no algorithm achieves 100% accuracy, because language is too complex for computers.

Source code: http://code.google.com/p/pymmseg-cpp/downloads/detail?name=pymmseg-cpp-win32-1.0.1.tar.gz&can=2&q=
