At present, most Chinese word segmentation systems are based on dictionary-matching algorithms, the most common of which is the maximum matching algorithm (Maximum Matching, hereinafter the MM algorithm). The MM algorithm has three variants: forward maximum matching, reverse maximum matching, and bidirectional matching. This article uses the forward maximum matching algorithm as an example to introduce its basic idea and implementation.
First, the basic idea
(1) Suppose the longest word in the dictionary has W characters (W is usually set to 8 or 4).
(2) Check whether the sentence to be segmented has at least W characters. If so, go to (3); if it is shorter than W, take the whole remaining sentence as the candidate and go to (6).
(3) Take the first W characters of the sentence as the candidate word.
(4) Look up the candidate in the dictionary. If it is present, output it as a word, remove it from the sentence, and repeat the process starting from the character that follows it.
(5) If it is not present, remove the last character of the candidate.
(6) Check whether the candidate is now a single character or empty. If it is, exit the inner loop.
(7) If not, look up the shortened candidate in the dictionary again, and repeat (5) to (7) until a word is output.
(8) Continue by taking the first W characters of the remaining sentence and looping as above. In this way the sentence is divided into a sequence of words.
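The steps above can be sketched compactly when the sentence is held as a sequence of code points, so that one element is one character. The following is a minimal sketch under that assumption, not the implementation given in this article: the function name `forward_max_match`, its parameters, and the use of `std::u32string` are chosen here for illustration only. Unlike the description, it also emits every unmatched single character as its own token, so that no input is dropped.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: forward maximum matching over code points.
// max_word_len plays the role of W in the steps above.
std::vector<std::u32string> forward_max_match(
    const std::u32string& sentence,
    const std::set<std::u32string>& dict,
    std::size_t max_word_len) {
    // Guard against W == 0, which would otherwise never advance.
    max_word_len = std::max<std::size_t>(max_word_len, 1);

    std::vector<std::u32string> words;
    std::size_t pos = 0;
    while (pos < sentence.size()) {
        // Steps (2)-(3): take at most W characters from the front.
        std::size_t len = std::min(max_word_len, sentence.size() - pos);
        // Steps (4)-(7): drop the last character until the candidate is
        // in the dictionary or only one character is left.
        while (len > 1 && dict.count(sentence.substr(pos, len)) == 0) {
            --len;
        }
        // Step (8): emit the candidate and continue right after it.
        words.push_back(sentence.substr(pos, len));
        pos += len;
    }
    return words;
}
```

Calling it with a dictionary containing "中国" and "中国人", W = 4, and the sentence "我是中国人" would first fail on every candidate starting at "我" and at "是", emitting each as a single character, and then match "中国人" in preference to the shorter "中国".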
Second, a simple implementation
The code is as follows:
#include <cstdio>
#include <set>
#include <string>
using namespace std;

// Lengths are in bytes: one Chinese character takes 3 bytes in UTF-8.
const int MAX_LEN = 12; // longest word in the dictionary (4 characters)
const int MIN_LEN = 3;  // a single character (same reasoning)

// The word dictionary. This file must be saved as UTF-8.
// The Chinese literals below are back-translated from the English glosses
// in the comments; any UTF-8 Chinese text works the same way.
set<string> g_setWordDictionary;

void construct()
{
    g_setWordDictionary.insert("中国");   // China
    g_setWordDictionary.insert("中国人"); // Chinese (person)
    g_setWordDictionary.insert("纽约");   // New York
    g_setWordDictionary.insert("北京");   // Beijing
}

// Returns true if word is in the dictionary.
bool match(const string &word)
{
    return g_setWordDictionary.find(word) != g_setWordDictionary.end();
}

void forward_maximum_matching(const string &content, set<string> &keywords)
{
    int len = static_cast<int>(content.length());
    int right_len = len; // bytes not yet segmented
    int start_pos = 0;
    bool ret = false;
    string kw_value = "";
    int kw_len = 0;
    int kw_pos = 0;

    // Stop when only a single character or an empty string remains
    while (right_len > MIN_LEN) {
        if (right_len >= MAX_LEN) {
            // The remaining text is at least as long as the longest word
            kw_value = content.substr(start_pos, MAX_LEN);
        } else {
            // The remaining text is shorter than the longest word
            kw_value = content.substr(start_pos, right_len);
        }
        // Look the candidate up in the dictionary
        ret = match(kw_value);
        kw_len = static_cast<int>(kw_value.length());
        kw_pos = 0;
        while (!ret && kw_len > 2 * MIN_LEN) {
            // Drop the rightmost Chinese character from the candidate
            kw_len -= MIN_LEN;
            kw_value = kw_value.substr(kw_pos, kw_len);
            // Try to match again
            ret = match(kw_value);
        }
        if (ret) {
            // A word was matched: record it and skip past it
            keywords.insert(kw_value);
            start_pos += kw_len;
            right_len = len - start_pos;
        } else {
            // No word was matched: move forward by one character
            start_pos += MIN_LEN;
            right_len = len - start_pos;
        }
    } // while (right_len > MIN_LEN)
}

int main()
{
    // Build the dictionary
    construct();
    // Text to segment: "I am Chinese, I am a Chinese from Beijing, China, working in New York"
    string content = "我是中国人，我是来自中国北京的中国人，在纽约工作";
    set<string> keywords;
    forward_maximum_matching(content, keywords);
    // Print the matched words
    set<string>::iterator itor;
    for (itor = keywords.begin(); itor != keywords.end(); ++itor) {
        printf("Result:%s\n", (*itor).c_str());
    }
    return 0;
}
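The implementation above advances a fixed three bytes per character, which holds only when every character in the text is a 3-byte UTF-8 sequence. For text that mixes ASCII with Chinese, one option is to read each character's byte length from its lead byte instead. The helpers below are a hypothetical sketch of that idea; the names `utf8_char_len` and `utf8_prefix_bytes` are not part of the original code, and the sketch does not validate continuation bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Byte length of the UTF-8 character that starts with lead byte `lead`.
// Returns 1 for ASCII and also for continuation or invalid bytes,
// so that a caller always advances.
std::size_t utf8_char_len(unsigned char lead) {
    if (lead < 0x80) return 1;          // 0xxxxxxx: ASCII
    if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
    if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx: most Chinese characters
    if ((lead >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;
}

// Number of bytes covered by the first `n` characters of `s`, starting at
// byte offset `pos` (fewer characters if the string ends first).
std::size_t utf8_prefix_bytes(const std::string& s, std::size_t pos, std::size_t n) {
    if (pos >= s.size()) return 0;
    std::size_t end = pos;
    while (n > 0 && end < s.size()) {
        end += utf8_char_len(static_cast<unsigned char>(s[end]));
        --n;
    }
    // Clamp in case the last character is truncated.
    return std::min(end, s.size()) - pos;
}
```

With helpers like these, the candidate could be taken as `content.substr(start_pos, utf8_prefix_bytes(content, start_pos, 4))`, which covers up to four characters regardless of how many bytes each one occupies.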