MMSeg Word Segmentation

Source: Internet
Author: User

I remember studying Lucene.Net for a while and writing a word segmenter for it. It was fairly simple: reverse maximum matching against a dictionary. However, my hard disk failed earlier this year and all the data was lost.

Last week a company project needed search, so I naturally thought of Lucene again, and I found a Chinese word segmenter called MMSeg. According to the official claim, its accuracy is as high as 98%.

 

I am not good at explaining things, so please read the original article: http://www.solol.org/projects/mmseg/

 

Straight to the point:

MMSeg word segmentation is actually easy to understand. It consists mainly of the chunk concept and four rules.

Chunk:

A chunk is one way of splitting a piece of text into words. It holds an array of words (dictionary entries) and four attributes.

For example:

The four-character phrase "study life" (presumably 研究生命 in the original Chinese) can be matched in at least two ways:

research / life (研究 / 生命) versus graduate student / life (研究生 / 命)

These are two chunks.

A chunk has four important attributes:

Length: the sum of the lengths of the words in the chunk. Here both chunks have a length of 4.

The code is as follows:

public int getLength() {
    // -1 marks "not computed yet"; the result is cached after the first call.
    if (length == -1) {
        length = 0;
        for (int i = 0; i < words.length; i++) {
            length += words[i].getLength();
        }
    }

    return length;
}

Average length: the chunk length divided by the number of words. Here 4 / 2 = 2.

The code is as follows:

public double getAverageLength() {
    // Computed lazily; -1 marks "not computed yet".
    if (averageLength == -1D) {
        averageLength = (double) getLength() / (double) words.length;
    }

    return averageLength;
}

 

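Variance of word lengths: for each word in the chunk, take the difference between its length and the average length and square it; add up these squares and divide by the number of words. The code below then takes the square root, so what getVariance returns is strictly the standard deviation. Because the square root is monotonic, ranking chunks by this value gives the same order as ranking by the variance. That sentence is hard to follow, so read the code:

public double getVariance() {
    if (variance == -1D) {
        double tempVariance = 0D;
        for (int i = 0; i < words.length; i++) {
            // Deviation of this word's length from the average word length.
            double temp = (double) words[i].getLength() - getAverageLength();
            tempVariance += temp * temp;
        }

        // Mean squared deviation, then its square root.
        variance = Math.sqrt(tempVariance / (double) words.length);
    }
    return variance;
}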
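To make the first three attributes concrete, here is a minimal sketch that computes them from word lengths alone, assuming the two example chunks consist of words of 2+2 and 3+1 characters. The class ChunkStats and its static methods are hypothetical and not part of the mmseg source; the deviation measure mirrors the getVariance code above, including its final square root.

```java
/** Hypothetical helper, not part of mmseg: chunk attributes computed from word lengths alone. */
final class ChunkStats {

    /** Sum of the word lengths in the chunk. */
    static int length(int[] wordLengths) {
        int total = 0;
        for (int len : wordLengths) {
            total += len;
        }
        return total;
    }

    /** Chunk length divided by the number of words. */
    static double averageLength(int[] wordLengths) {
        return (double) length(wordLengths) / wordLengths.length;
    }

    /** Square root of the mean squared deviation from the average, as in getVariance above. */
    static double variance(int[] wordLengths) {
        double average = averageLength(wordLengths);
        double sumOfSquares = 0.0;
        for (int len : wordLengths) {
            double diff = len - average;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / wordLengths.length);
    }

    public static void main(String[] args) {
        int[] evenSplit = {2, 2};   // assumed word lengths of the first example chunk
        int[] unevenSplit = {3, 1}; // assumed word lengths of the second example chunk
        for (int[] chunk : new int[][] {evenSplit, unevenSplit}) {
            System.out.println(length(chunk) + " " + averageLength(chunk) + " " + variance(chunk));
        }
    }
}
```

Under these assumptions both chunks tie on length (4) and average length (2.0), so only the third attribute tells them apart: 0.0 for the 2+2 split and 1.0 for the 3+1 split.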

 

Degree of morphemic freedom: the sum of the logarithms of the word frequencies of the one-character words in the chunk. Longer words do not contribute.

Check the code:

public double getDegreeOfMorphemicFreedom() {
    if (degreeMorphemicFreedom == -1D) {
        degreeMorphemicFreedom = 0D;
        for (int i = 0; i < words.length; i++) {
            // Only one-character words count towards the sum.
            if (words[i].getLength() == 1) {
                degreeMorphemicFreedom += Math.log((double) words[i].getFrequency());
            }
        }
    }
    return degreeMorphemicFreedom;
}
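The following sketch isolates that computation so it can be run on its own. The class MorphemicFreedom, its degree method, and the frequency values are all made up for illustration; real frequencies would come from the segmenter's dictionary.

```java
/** Hypothetical helper, not part of mmseg. */
final class MorphemicFreedom {

    /**
     * Sums log(frequency) over the one-character words only.
     * wordLengths[i] and frequencies[i] describe the same word.
     */
    static double degree(int[] wordLengths, int[] frequencies) {
        double sum = 0.0;
        for (int i = 0; i < wordLengths.length; i++) {
            if (wordLengths[i] == 1) {
                sum += Math.log((double) frequencies[i]);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up frequencies, for illustration only.
        int[] lengths = {3, 1};
        int[] frequencies = {50, 1000};
        // Only the one-character word contributes, so this is log(1000).
        System.out.println(degree(lengths, frequencies));
    }
}
```

A chunk without any one-character word scores 0. Math.log is the natural logarithm, and a frequency of 0 would yield negative infinity, so a real dictionary presumably avoids zero frequencies or guards against them.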

 

With the central concept of the chunk understood, we can turn to the four rules.

Rule 1: take the chunks with the greatest length (Rule 1: Maximum matching)

This rule is easy to understand: keep the chunks whose length, as defined above, is the greatest.

Rule 2: take the chunks with the largest average word length (Rule 2: Largest average word length)

This rule is just as easy: keep the chunks whose average word length is the greatest.

Rule 3: take the chunks with the smallest variance of word lengths (Rule 3: Smallest variance of word lengths)

Needless to say, the corresponding property is getVariance, and the chunks with the smallest value are kept.

Rule 4: take the chunks with the largest sum of degree of morphemic freedom of one-character words (Rule 4: Largest sum of degree of morphemic freedom of one-character words)

Similarly, the corresponding property is getDegreeOfMorphemicFreedom, and the chunks with the largest value are kept.
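The four rules can be read as a cascade of filters, each applied only while more than one chunk is left. The sketch below shows that idea under simplifying assumptions: the attributes are precomputed numbers, and the names RuleCascade, Candidate, keepBest and filter are hypothetical rather than taken from the mmseg source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch of the rule cascade; none of these names come from mmseg. */
final class RuleCascade {

    /** Precomputed attributes of one candidate chunk. */
    static final class Candidate {
        final String label;
        final double length;
        final double averageLength;
        final double variance;
        final double freedom;

        Candidate(String label, double length, double averageLength, double variance, double freedom) {
            this.label = label;
            this.length = length;
            this.averageLength = averageLength;
            this.variance = variance;
            this.freedom = freedom;
        }
    }

    /** Keeps only the candidates whose score equals the best score in the list. */
    static List<Candidate> keepBest(List<Candidate> in, ToDoubleFunction<Candidate> score) {
        double best = Double.NEGATIVE_INFINITY;
        for (Candidate c : in) {
            best = Math.max(best, score.applyAsDouble(c));
        }
        List<Candidate> out = new ArrayList<>();
        for (Candidate c : in) {
            if (score.applyAsDouble(c) == best) {
                out.add(c);
            }
        }
        return out;
    }

    /** Applies rules 1 to 4 in order, stopping as soon as a single candidate remains. */
    static List<Candidate> filter(List<Candidate> candidates) {
        List<ToDoubleFunction<Candidate>> rules = new ArrayList<>();
        rules.add(c -> c.length);        // rule 1: greatest chunk length
        rules.add(c -> c.averageLength); // rule 2: largest average word length
        rules.add(c -> -c.variance);     // rule 3: smallest variance, negated so larger is better
        rules.add(c -> c.freedom);       // rule 4: largest degree of morphemic freedom

        List<Candidate> remaining = candidates;
        for (ToDoubleFunction<Candidate> rule : rules) {
            if (remaining.size() <= 1) {
                break;
            }
            remaining = keepBest(remaining, rule);
        }
        return remaining;
    }

    public static void main(String[] args) {
        // Attribute values are assumed for illustration (word lengths 2+2 and 3+1).
        List<Candidate> candidates = Arrays.asList(
                new Candidate("2+2", 4, 2.0, 0.0, 0.0),
                new Candidate("3+1", 4, 2.0, 1.0, 6.9));
        System.out.println(filter(candidates).get(0).label);
    }
}
```

With these assumed values the two candidates tie on rules 1 and 2, and rule 3 keeps the 2+2 chunk, so rule 4 is never consulted. If several candidates still tie after rule 4, filter returns all of them.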

 

If more than one chunk remains after filtering by these four rules, this segmenter cannot decide between them, and we have to extend it ourselves. That is easy enough: just come up with another rule. Of course, self-written rules are invoked in a slightly different way.
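As an illustration of what such an extra rule might look like, the sketch below breaks ties by the total log frequency of all words in each chunk, not only the one-character ones. This rule is invented here for demonstration; it is not defined by mmseg, the class TotalFrequencyRule is hypothetical, and the sketch does not use mmseg's actual extension mechanism.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical fifth rule, invented for illustration; mmseg does not define it. */
final class TotalFrequencyRule {

    /** Sum of log(frequency) over every word of one chunk, whatever its length. */
    static double score(int[] wordFrequencies) {
        double sum = 0.0;
        for (int frequency : wordFrequencies) {
            sum += Math.log((double) frequency);
        }
        return sum;
    }

    /** Returns the indices of the chunks that share the highest score. */
    static List<Integer> apply(int[][] chunks) {
        double best = Double.NEGATIVE_INFINITY;
        for (int[] chunk : chunks) {
            best = Math.max(best, score(chunk));
        }
        List<Integer> winners = new ArrayList<>();
        for (int i = 0; i < chunks.length; i++) {
            if (score(chunks[i]) == best) {
                winners.add(i);
            }
        }
        return winners;
    }

    public static void main(String[] args) {
        // Made-up word frequencies for two chunks assumed to have survived rules 1 to 4.
        int[][] survivors = { {10, 10}, {100, 10} };
        System.out.println(apply(survivors)); // the chunk at index 1 has the larger total
    }
}
```

Whether this particular tie-breaker improves accuracy is untested; the point is only that an additional rule is one more scoring function applied to the chunks that are still tied.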

 

Without further ado, the downloads:

Java code: http://files.cnblogs.com/bqrm/mmseg-v0.3.zip

.NET code: http://files.cnblogs.com/bqrm/mmseg-v0.1.net.zip
