I remember learning Lucene.Net for a while some time back and writing a word segmenter. It was fairly simple: a dictionary plus maximum reverse matching. But my hard disk died earlier this year and all the data was lost.
Last week a company project needed search, so I naturally thought of Lucene again, and came across a Chinese word segmentation algorithm called MMSeg. According to the official description, its accuracy is as high as 98%.
For the details, please read the original article: http://www.solol.org/projects/mmseg/
Without further ado:
MMSeg is actually easy to understand; it mainly consists of chunks and four rules.
Chunk:
A chunk is one candidate way of segmenting a piece of text. It contains an array of words plus four attributes.
For example:
The string "研究生命" ("research life") can be matched in at least two ways:
研究/生命 (research / life) vs. 研究生/命 (graduate student / fate)
These are two chunks.
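The code snippets below come from the implementation linked at the end of this post. To make them readable on their own, here is a minimal sketch of the Word and Chunk classes they belong to; the field and getter names are taken straight from the snippets, while the constructors and the getWordCount() accessor are my own additions for illustration:

class Word {
    private final String text;
    private final int frequency; // corpus frequency, used by rule 4

    Word(String text, int frequency) {
        this.text = text;
        this.frequency = frequency;
    }

    int getLength() { return text.length(); }
    int getFrequency() { return frequency; }
}

class Chunk {
    private final Word[] words; // one candidate segmentation
    // -1 marks "not yet computed"; each getter caches its result.
    private int length = -1;
    private double averageLength = -1D;
    private double variance = -1D;
    private double degreeMorphemicFreedom = -1D;

    Chunk(Word... words) { this.words = words; }

    int getWordCount() { return words.length; }

    // The four attribute getters are shown one by one below.
}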
A chunk has four important attributes:
Length: the sum of the lengths of the words in the chunk. Both chunks here have length 4.
The code is as follows:
public int getLength() {
    if (length == -1) { // -1 means not yet computed; cache the result
        length = 0;
        for (int i = 0; i < words.length; i++) {
            length += words[i].getLength();
        }
    }
    return length;
}
Average length: length / number of words. Here that is 4 / 2 = 2.
The code is as follows:
public double getAverageLength() {
    if (averageLength == -1D) { // lazily computed and cached
        averageLength = (double) getLength() / (double) words.length;
    }
    return averageLength;
}
Standard deviation of word lengths: subtract the average length from each word's length, square the differences, sum them, divide by the number of words, and take the square root. That sentence is a mouthful, so just read the code:
public double getVariance() {
    if (variance == -1D) {
        double tempVariance = 0D;
        for (int i = 0; i < words.length; i++) {
            double temp = (double) words[i].getLength() - getAverageLength();
            tempVariance += temp * temp;
        }
        // Despite the field name, the square root makes this the standard
        // deviation rather than the variance; for ranking chunks the two
        // orderings are identical anyway.
        variance = Math.sqrt(tempVariance / (double) words.length);
    }
    return variance;
}
Degree of morphemic freedom: the sum of the logarithms of the word frequencies of the single-character words in the chunk.
See the code:
public double getDegreeOfMorphemicFreedom() {
    if (degreeMorphemicFreedom == -1D) {
        degreeMorphemicFreedom = 0D;
        for (int i = 0; i < words.length; i++) {
            // Only single-character words contribute.
            if (words[i].getLength() == 1) {
                degreeMorphemicFreedom += Math.log((double) words[i].getFrequency());
            }
        }
    }
    return degreeMorphemicFreedom;
}
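To tie the four attributes together, here is a quick check against the "研究生命" example above (the word frequencies are made-up illustrative numbers, not real corpus counts):

public class ChunkDemo {
    public static void main(String[] args) {
        // 研究/生命: two 2-character words.
        Chunk a = new Chunk(new Word("研究", 500), new Word("生命", 400));
        // 研究生/命: a 3-character word plus a 1-character word.
        Chunk b = new Chunk(new Word("研究生", 300), new Word("命", 100));

        // Both chunks have length 4 and average length 2.0, but their
        // standard deviations differ (0.0 vs 1.0), and only b contains
        // a single-character word, so only b has a non-zero degree of
        // morphemic freedom (ln 100, about 4.6).
        System.out.println(a.getLength() + " " + a.getAverageLength()
                + " " + a.getVariance() + " " + a.getDegreeOfMorphemicFreedom());
        System.out.println(b.getLength() + " " + b.getAverageLength()
                + " " + b.getVariance() + " " + b.getDegreeOfMorphemicFreedom());
    }
}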
Having understood the chunk, the most important concept, we move on to the four rules.
Rule 1: take the chunk with the maximum total length (Maximum matching).
This rule is easy to understand: keep the chunks whose total length is largest.
Rule 2: take the chunk with the largest average word length (Largest average word length).
Also straightforward: among the survivors of rule 1, keep the chunks with the largest average word length.
Rule 3: take the chunk with the smallest standard deviation of word lengths (Smallest variance of word lengths).
The corresponding attribute is getVariance; keep the chunks with the smallest value. In our example, 研究/生命 has word lengths 2 and 2 (deviation 0) while 研究生/命 has lengths 3 and 1 (deviation 1), so this rule prefers the former.
Rule 4: take the chunk with the largest sum of degree of morphemic freedom of single-character words (Largest sum of degree of morphemic freedom of one-character words).
Likewise, the corresponding attribute is getDegreeOfMorphemicFreedom, and the largest value wins.
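The four rules are applied as successive filters: each rule keeps only the chunks that score best under its criterion and hands the survivors to the next rule, stopping as soon as a single chunk remains. The linked source has its own rule classes; what follows is only a minimal sketch of that filtering idea, with an interface and method names of my own invention:

import java.util.ArrayList;
import java.util.List;

interface Rule {
    double score(Chunk chunk); // higher is better for this rule
}

class Disambiguator {
    // Keep only the chunks with the best score under the given rule.
    static List<Chunk> filter(List<Chunk> chunks, Rule rule) {
        double best = Double.NEGATIVE_INFINITY;
        List<Chunk> kept = new ArrayList<>();
        for (Chunk c : chunks) {
            double s = rule.score(c);
            if (s > best) { // new best score: drop everything kept so far
                best = s;
                kept.clear();
            }
            if (s >= best) { // exact equality is fine for a sketch;
                kept.add(c); // a real implementation might use a tolerance
            }
        }
        return kept;
    }

    static List<Chunk> disambiguate(List<Chunk> chunks) {
        List<Rule> rules = List.of(
                c -> c.getLength(),                // rule 1: maximum matching
                Chunk::getAverageLength,           // rule 2: largest average length
                c -> -c.getVariance(),             // rule 3: smallest deviation, negated
                Chunk::getDegreeOfMorphemicFreedom // rule 4: largest freedom
        );
        for (Rule rule : rules) {
            if (chunks.size() <= 1) {
                break;
            }
            chunks = filter(chunks, rule);
        }
        return chunks; // more than one element means even rule 4 could not decide
    }
}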
If more than one chunk still survives after filtering with these four rules, the segmenter can do nothing further on its own; we have to extend it ourselves. Extending it is easy: just come up with another rule, as sketched below. Of course, the way a self-defined rule is invoked differs slightly from the built-in ones.
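Under that sketch, a home-grown rule is just one more Rule instance. For example, a hypothetical fifth rule that prefers segmentations containing fewer words (it relies on the getWordCount() accessor I added to the Chunk sketch above) could look like this:

import java.util.List;

class CustomRuleDemo {
    // Hypothetical extra rule: fewer words is better, so negate the count.
    static final Rule FEWER_WORDS = chunk -> -chunk.getWordCount();

    public static void main(String[] args) {
        Chunk a = new Chunk(new Word("研究", 500), new Word("生命", 400));
        Chunk b = new Chunk(new Word("研究", 500), new Word("生", 50), new Word("命", 100));
        // Prints 1: only chunk a survives, since two words beat three.
        System.out.println(Disambiguator.filter(List.of(a, b), FEWER_WORDS).size());
    }
}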
Enough talk; here are the downloads:
Java code: http://files.cnblogs.com/bqrm/mmseg-v0.3.zip
.Net code: http://files.cnblogs.com/bqrm/mmseg-v0.1.net.zip