Implementing Chinese word segmentation for search engines in C++

Source: Internet
Author: User

At present, most Chinese word segmentation systems are based on dictionary matching. The most common approach is the maximum matching algorithm (Maximum Matching, hereinafter the MM algorithm), which comes in three variants: forward maximum matching, reverse maximum matching, and bidirectional matching. This article uses forward maximum matching as an example to introduce its basic idea and implementation.


First, the basic idea

(1) Let W be the length of the longest word in the dictionary (usually set to 8 characters, or 4 characters).
(2) Check whether the remaining sentence is longer than W characters. If it is, go to (3); if it is not, take the whole remainder as the candidate and go to (6).
(3) Take the first W characters of the sentence to be segmented as the candidate word.
(4) Look the candidate up in the dictionary. If it is present, output it as a word, remove it from the sentence, and repeat the process from the character that follows it.
(5) If it is not present, remove the last character of the candidate.
(6) Check whether what is left is a single character or empty. If it is, exit.
(7) If it is not, look the shortened candidate up in the dictionary again, and repeat until a word is output.
(8) Keep taking the first W characters of the remaining sentence and repeat the loop, so that the sentence is divided into a sequence of words.
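
The steps above can be sketched as follows. This is a minimal illustration rather than the article's own implementation: it works on `std::u32string` so that one element is one character, the names `fmm_segment` and `max_word_len` (W in the steps) are hypothetical, and it assumes that a character which starts no dictionary word is emitted on its own.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Forward maximum matching over code points (one element = one character).
// max_word_len plays the role of W in the steps above.
std::vector<std::u32string> fmm_segment(const std::u32string& sentence,
                                        const std::set<std::u32string>& dict,
                                        std::size_t max_word_len) {
    // Guard against W = 0, which would never advance.
    max_word_len = std::max<std::size_t>(max_word_len, 1);

    std::vector<std::u32string> words;
    std::size_t pos = 0;
    while (pos < sentence.size()) {
        // Steps (2)-(3): take at most W characters from the current position.
        std::size_t len = std::min(max_word_len, sentence.size() - pos);

        // Steps (4)-(7): drop the last character until the candidate is in
        // the dictionary or only one character is left.
        while (len > 1 && dict.count(sentence.substr(pos, len)) == 0) {
            --len;
        }

        // Assumption: a lone character is emitted even if it is not in the dictionary.
        words.push_back(sentence.substr(pos, len));

        // Step (8): continue right after the emitted word.
        pos += len;
    }
    return words;
}
```

With a dictionary containing 中国, 中国人 and 北京 and W = 4, the sentence 中国人在北京 is split into 中国人 / 在 / 北京. The longer entry 中国人 wins over 中国 because longer candidates are tried first.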

Second, simple implementation

The code is as follows:
#include <stdio.h>
#include <string>
#include <set>
using namespace std;

// Global word dictionary
set<string> g_setworddictionary;

// Build the dictionary from a few sample words.
// This file must be saved as UTF-8 so that each Chinese character takes 3 bytes.
void construct()
{
    g_setworddictionary.insert("中国");   // China
    g_setworddictionary.insert("中国人"); // Chinese (person)
    g_setworddictionary.insert("纽约");   // New York
    g_setworddictionary.insert("北京");   // Beijing
}

// Return true if the word is in the dictionary
bool match(const string &word)
{
    set<string>::iterator itor = g_setworddictionary.find(word);
    if (itor == g_setworddictionary.end())
    {
        return false;
    }

    return true;
}

void forward_maximum_matching(const string &content, set<string> &keywords)
{
#define MAX_LEN 12 // Longest word in the dictionary, in bytes (a Chinese character is 3 bytes in UTF-8)
#define MIN_LEN 3  // A single character, in bytes (same principle)
    int len = (int)content.length();
    int right_len = len;
    int start_pos = 0;
    bool ret = false;
    string kw_value = "";
    int kw_len = 0;
    int kw_pos = 0;
    // Stop when only a single character or an empty string is left
    while (right_len > MIN_LEN)
    {
        // The remaining sentence is at least as long as the longest dictionary word
        if (right_len >= MAX_LEN)
        {
            kw_value = content.substr(start_pos, MAX_LEN);
        }
        // The remaining sentence is shorter than the longest dictionary word
        else
        {
            kw_value = content.substr(start_pos, right_len);
        }

        // Look the candidate up in the dictionary
        ret = match(kw_value);
        kw_len = (int)kw_value.length();
        kw_pos = 0;
        // Candidates are never shortened below two characters
        while (!ret && kw_len > 2 * MIN_LEN)
        {
            // Remove the rightmost Chinese character of the candidate
            kw_len -= MIN_LEN;
            kw_value = kw_value.substr(kw_pos, kw_len);
            // Try to match again
            ret = match(kw_value);
        }

        // A word was matched
        if (ret)
        {
            keywords.insert(kw_value);
            // Remove the matched word from the sentence
            start_pos += kw_len;
            right_len = len - start_pos;
        }
        // No word was matched: move forward by one character
        else
        {
            start_pos += MIN_LEN;
            right_len = len - start_pos;
        }
    } // while (right_len > MIN_LEN)
}

int main()
{
    // Build the dictionary
    construct();

    // Sentence to segment:
    // "I am Chinese, I am a Chinese person from Beijing, China, working in New York"
    string content = "我是中国人,我是来自中国北京的中国人,在纽约工作";
    set<string> keywords;
    forward_maximum_matching(content, keywords);
    set<string>::iterator itor;

    // Output the segmented words
    for (itor = keywords.begin(); itor != keywords.end(); ++itor)
    {
        printf("Result: %s\n", itor->c_str());
    }

    return 0;
}
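
The listing above steps through the sentence 3 bytes at a time, which holds only while every character is a 3-byte UTF-8 sequence. ASCII letters take 1 byte and some characters take 2 or 4, so mixed text would be cut in the middle of a character. One possible way around this, sketched here with the hypothetical helper names `utf8_char_len` and `utf8_split`, is to derive each character's length from its lead byte. The sketch assumes well-formed input and does not validate continuation bytes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Byte length of the UTF-8 sequence that starts with `lead`.
// Returns 1 for ASCII and for unexpected bytes, so a scan always advances.
std::size_t utf8_char_len(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx: common Chinese characters
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // continuation or invalid lead byte
}

// Split a UTF-8 string into one std::string per character.
std::vector<std::string> utf8_split(const std::string& text) {
    std::vector<std::string> chars;
    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t n = utf8_char_len(static_cast<unsigned char>(text[pos]));
        // Clamp a truncated sequence at the end of the string.
        if (n > text.size() - pos) n = text.size() - pos;
        chars.push_back(text.substr(pos, n));
        pos += n;
    }
    return chars;
}
```

With the sentence split this way, candidate lengths can be counted in characters instead of in multiples of 3 bytes.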
