I remember learning Lucene.Net for a while some time back and writing a word segmenter. It was fairly simple: a dictionary plus maximum reverse matching. But my hard disk died earlier this year and all the data was lost.
Last week a company project needed search, so I naturally thought of Lucene again, and came across a Chinese word segmentation algorithm called MMSeg. According to the official description, its accuracy is as high as 98%.
For the details, please read the original article: http://www.solol.org/projects/mmseg/
Without further ado:
MMSeg is actually easy to understand; it mainly consists of chunks and four rules.
Chunk:
A chunk is one candidate way of segmenting a piece of text. It contains an array of words plus four attributes.
For example:
The string "研究生命" ("research life") can be matched in at least two ways:
研究/生命 (research / life) vs. 研究生/命 (graduate student / fate)
These are two chunks.
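The code snippets below come from the implementation linked at the end of this post. To make them readable on their own, here is a minimal sketch of the Word and Chunk classes they belong to; the field and getter names are taken straight from the snippets, while the constructors and the getWordCount() accessor are my own additions for illustration:

class Word {
    private final String text;
    private final int frequency; // corpus frequency, used by rule 4

    Word(String text, int frequency) {
        this.text = text;
        this.frequency = frequency;
    }

    int getLength() { return text.length(); }
    int getFrequency() { return frequency; }
}

class Chunk {
    private final Word[] words; // one candidate segmentation
    // -1 marks "not yet computed"; each getter caches its result.
    private int length = -1;
    private double averageLength = -1D;
    private double variance = -1D;
    private double degreeMorphemicFreedom = -1D;

    Chunk(Word... words) { this.words = words; }

    int getWordCount() { return words.length; }

    // The four attribute getters are shown one by one below.
}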
A chunk has four important attributes:
Length: the sum of the lengths of the words in the chunk. Both chunks here have length 4.
The code is as follows:
public int getLength() {
    if (length == -1) { // -1 means not yet computed; cache the result
        length = 0;
        for (int i = 0; i < words.length; i++) {
            length += words[i].getLength();
        }
    }
    return length;
}
Average length: length / number of words. Here that is 4 / 2 = 2.
The code is as follows:
public double getAverageLength() {
    if (averageLength == -1D) { // lazily computed and cached
        averageLength = (double) getLength() / (double) words.length;
    }
    return averageLength;
}
Standard deviation of word lengths: subtract the average length from each word's length, square the differences, sum them, divide by the number of words, and take the square root. That sentence is a mouthful, so just read the code:
public double getVariance() {
    if (variance == -1D) {
        double tempVariance = 0D;
        for (int i = 0; i < words.length; i++) {
            double temp = (double) words[i].getLength() - getAverageLength();
            tempVariance += temp * temp;
        }
        // Despite the field name, the square root makes this the standard
        // deviation rather than the variance; for ranking chunks the two
        // orderings are identical anyway.
        variance = Math.sqrt(tempVariance / (double) words.length);
    }
    return variance;
}
Degree of morphemic freedom: the sum of the logarithms of the word frequencies of the single-character words in the chunk.
See the code:
public double getDegreeOfMorphemicFreedom() {
    if (degreeMorphemicFreedom == -1D) {
        degreeMorphemicFreedom = 0D;
        for (int i = 0; i < words.length; i++) {
            // Only single-character words contribute.
            if (words[i].getLength() == 1) {
                degreeMorphemicFreedom += Math.log((double) words[i].getFrequency());
            }
        }
    }
    return degreeMorphemicFreedom;
}
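To tie the four attributes together, here is a quick check against the "研究生命" example above (the word frequencies are made-up illustrative numbers, not real corpus counts):

public class ChunkDemo {
    public static void main(String[] args) {
        // 研究/生命: two 2-character words.
        Chunk a = new Chunk(new Word("研究", 500), new Word("生命", 400));
        // 研究生/命: a 3-character word plus a 1-character word.
        Chunk b = new Chunk(new Word("研究生", 300), new Word("命", 100));

        // Both chunks have length 4 and average length 2.0, but their
        // standard deviations differ (0.0 vs 1.0), and only b contains
        // a single-character word, so only b has a non-zero degree of
        // morphemic freedom (ln 100, about 4.6).
        System.out.println(a.getLength() + " " + a.getAverageLength()
                + " " + a.getVariance() + " " + a.getDegreeOfMorphemicFreedom());
        System.out.println(b.getLength() + " " + b.getAverageLength()
                + " " + b.getVariance() + " " + b.getDegreeOfMorphemicFreedom());
    }
}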
Having understood the chunk, the most important concept, we move on to the four rules.
Rule 1: take the chunk with the maximum total length (Maximum matching).
This rule is easy to understand: keep the chunks whose total length is largest.
Rule 2: take the chunk with the largest average word length (Largest average word length).
Also straightforward: among the survivors of rule 1, keep the chunks with the largest average word length.
Rule 3: take the chunk with the smallest standard deviation of word lengths (Smallest variance of word lengths).
The corresponding attribute is getVariance; keep the chunks with the smallest value. In our example, 研究/生命 has word lengths 2 and 2 (deviation 0) while 研究生/命 has lengths 3 and 1 (deviation 1), so this rule prefers the former.
Rule 4: take the chunk with the largest sum of degree of morphemic freedom of single-character words (Largest sum of degree of morphemic freedom of one-character words).
Likewise, the corresponding attribute is getDegreeOfMorphemicFreedom, and the largest value wins.
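The four rules are applied as successive filters: each rule keeps only the chunks that score best under its criterion and hands the survivors to the next rule, stopping as soon as a single chunk remains. The linked source has its own rule classes; what follows is only a minimal sketch of that filtering idea, with an interface and method names of my own invention:

import java.util.ArrayList;
import java.util.List;

interface Rule {
    double score(Chunk chunk); // higher is better for this rule
}

class Disambiguator {
    // Keep only the chunks with the best score under the given rule.
    static List<Chunk> filter(List<Chunk> chunks, Rule rule) {
        double best = Double.NEGATIVE_INFINITY;
        List<Chunk> kept = new ArrayList<>();
        for (Chunk c : chunks) {
            double s = rule.score(c);
            if (s > best) { // new best score: drop everything kept so far
                best = s;
                kept.clear();
            }
            if (s >= best) { // exact equality is fine for a sketch;
                kept.add(c); // a real implementation might use a tolerance
            }
        }
        return kept;
    }

    static List<Chunk> disambiguate(List<Chunk> chunks) {
        List<Rule> rules = List.of(
                c -> c.getLength(),                // rule 1: maximum matching
                Chunk::getAverageLength,           // rule 2: largest average length
                c -> -c.getVariance(),             // rule 3: smallest deviation, negated
                Chunk::getDegreeOfMorphemicFreedom // rule 4: largest freedom
        );
        for (Rule rule : rules) {
            if (chunks.size() <= 1) {
                break;
            }
            chunks = filter(chunks, rule);
        }
        return chunks; // more than one element means even rule 4 could not decide
    }
}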
If more than one chunk still survives after filtering with these four rules, the segmenter can do nothing further on its own; we have to extend it ourselves. Extending it is easy: just come up with another rule, as sketched below. Of course, the way a self-defined rule is invoked differs slightly from the built-in ones.
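Under that sketch, a home-grown rule is just one more Rule instance. For example, a hypothetical fifth rule that prefers segmentations containing fewer words (it relies on the getWordCount() accessor I added to the Chunk sketch above) could look like this:

import java.util.List;

class CustomRuleDemo {
    // Hypothetical extra rule: fewer words is better, so negate the count.
    static final Rule FEWER_WORDS = chunk -> -chunk.getWordCount();

    public static void main(String[] args) {
        Chunk a = new Chunk(new Word("研究", 500), new Word("生命", 400));
        Chunk b = new Chunk(new Word("研究", 500), new Word("生", 50), new Word("命", 100));
        // Prints 1: only chunk a survives, since two words beat three.
        System.out.println(Disambiguator.filter(List.of(a, b), FEWER_WORDS).size());
    }
}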
Enough talk; here are the downloads:
Java code: http://files.cnblogs.com/bqrm/mmseg-v0.3.zip
.Net code: http://files.cnblogs.com/bqrm/mmseg-v0.1.net.zip