Chinese
Admittedly, doing Chinese word segmentation in PHP is not the most sensible choice. :p
Below is a simple word segmentation program built around a dictionary I found on the Internet.
(Note: the dictionary file is in gdbm format. Each key is a word and its value is that word's frequency; it holds about 40,000 common words.)
For the complete program, a demo, and downloads, see: http://root.twomice.net/my_php4/dict/chinese_segment.php
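To make the role of the word frequencies concrete, here is a minimal sketch of the reverse matching idea used by the program. It assumes an in-memory PHP array in place of the gdbm file and a sentence that has already been split into characters. The function name `reverse_segment`, the `$maxlen` parameter, and the toy dictionary are hypothetical and are not part of the program below.

```php
<?php
// Illustrative sketch only: reverse_segment() and the toy dictionary are
// hypothetical stand-ins; the real program looks frequencies up in a gdbm file.
function reverse_segment(array $chars, array $dict, $maxlen = 4) {
    $words = array();
    for ($i = count($chars) - 1; $i >= 0; $i--) {
        $best = $chars[$i];   // fall back to a single character
        $best_len = 1;
        $best_freq = 0;
        // Try every candidate that ends at position $i, growing leftwards
        for ($j = 1; $j < $maxlen && $i - $j >= 0; $j++) {
            $cand = implode("", array_slice($chars, $i - $j, $j + 1));
            $freq = isset($dict[$cand]) ? $dict[$cand] : -1;
            if ($freq > $best_freq) {
                $best = $cand;
                $best_len = $j + 1;
                $best_freq = $freq;
            }
        }
        $words[] = $best;
        $i -= $best_len - 1;  // skip the characters consumed by this word
    }
    return array_reverse($words);
}

// Toy dictionary: word => frequency
$dict = array("cd" => 5, "bcd" => 3, "ab" => 2);
print_r(reverse_segment(array("a", "b", "c", "d"), $dict));
```

With this toy dictionary the candidate "cd" wins over "bcd" because its frequency is higher, so the sketch picks the most frequent candidate rather than the longest one.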
<?php
// A simple implementation of a Chinese word segmentation system.
// Delimiters: any character whose ASCII value is < 128,
// plus the common double-byte symbols: ,。、?“”;:!¥…%$#@^&*()[]{}|\/
// One could also treat the most common Chinese characters (such as "and", "is", "not", "ah") as delimiters,
// but they also occur inside real words, such as "dozen" and "Zheng He". :p
// Timing
function getmicrotime() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float) $usec + (float) $sec);
}
$time_start = getmicrotime();
// Dictionary class
class Ch_dictionary {
    var $_id;
    function ch_dictionary($fname = "") {
        if ($fname != "") {
            $this->load($fname);
        }
    }
    // Load a dictionary from a file name (gdbm data file)
    function load($fname) {
        $this->_id = dba_popen($fname, "r", "gdbm");
        if (!$this->_id) {
            echo "Failed to open the dictionary. ($fname) <br>\n";
            exit;
        }
    }
    // Return the frequency of a word, or -1 if the word is not in the dictionary
    function find($word) {
        $freq = dba_fetch($word, $this->_id);
        if (is_bool($freq)) $freq = -1;
        return $freq;
    }
}
// Word segmentation class (reverse matching)
///////////////////////////////////////////////
class Ch_word_split {
    var $_mb_mark_list; // full-width punctuation commonly used to split sentences
    var $_word_maxlen;  // maximum possible length of a single word (in Chinese characters)
    var $_dic;          // dictionary object
    var $_ignore_mark;  // true or false: whether to leave punctuation out of the result
    function ch_word_split() {
        // The code treats these as two-byte characters, so this file and the input
        // text must share the same double-byte encoding (for example GBK).
        $this->_mb_mark_list = array(",", " ", "。", "!", "?", ":", "……", "、", "“", "”", "《", "》", "(", ")");
        $this->_word_maxlen = 12; // 12 Chinese characters
        $this->_dic = NULL;
        $this->_ignore_mark = true;
    }
    // Set the dictionary
    function set_dic($fname) {
        $this->_dic = new Ch_dictionary($fname);
    }
    function set_ignore_mark($set) {
        if (is_bool($set)) $this->_ignore_mark = $set;
    }
    // Split the string into sentences, then split each sentence into words
    function string_split($str, $func = "") {
        $ret = array();
        if ($func == "" || !function_exists($func)) $func = "";
        $len = strlen($str);
        $qtr = "";
        for ($i = 0; $i < $len; $i++) {
            $char = $str[$i];
            if (ord($char) < 0xA1) {
                // Read a half-width (single-byte) character: flush the pending sentence
                if (!empty($qtr)) {
                    $tmp = $this->_sen_split($qtr);
                    $qtr = "";
                    if ($func != "") call_user_func($func, $tmp);
                    else $ret = array_merge($ret, $tmp);
                }
                // If it is a letter or digit, read ahead byte by byte to collect the whole run
                if ($this->_is_alnum($char)) {
                    do {
                        if (($i + 1) >= $len) break;
                        $char2 = substr($str, $i + 1, 1);
                        if (!$this->_is_alnum($char2)) break;
                        $char .= $char2;
                        $i++;
                    } while (1);
                    if ($func != "") call_user_func($func, array($char));
                    else $ret[] = $char;
                }
                elseif ($char == ' ' || $char == "\t") {
                    // Nothing to do for whitespace
                    continue;
                }
                elseif (!$this->_ignore_mark) {
                    if ($func != "") call_user_func($func, array($char));
                    else $ret[] = $char;
                }
            }
            else {
                // Double-byte character: take the second byte as well
                if (($i + 1) >= $len) break; // truncated double-byte character at the end of the input
                $i++;
                $char .= $str[$i];
                if (in_array($char, $this->_mb_mark_list)) {
                    if (!empty($qtr)) {
                        $tmp = $this->_sen_split($qtr);
                        $qtr = "";
                        if ($func != "") call_user_func($func, $tmp);
                        else $ret = array_merge($ret, $tmp);
                    }
                    if (!$this->_ignore_mark) {
                        if ($func != "") call_user_func($func, array($char));
                        else $ret[] = $char;
                    }
                }
                else {
                    $qtr .= $char;
                }
            }
        }
        if (strlen($qtr) > 0) {
            $tmp = $this->_sen_split($qtr);
            if ($func != "") call_user_func($func, $tmp);
            else $ret = array_merge($ret, $tmp);
        }
        // Return value: the word array, or true when a callback handled the words
        if ($func == "") {
            return $ret;
        }
        else {
            return true;
        }
    }
    // Split a sentence into words by reverse (right-to-left) matching
    function _sen_split($sen) {
        $len = strlen($sen) / 2;
        $ret = array();
        for ($i = $len - 1; $i >= 0; $i--) {
            // Example sentence: "This is a word breaker procedure"
            // Take the last character first
            $w = substr($sen, $i * 2, 2);
            // Length of the word finally chosen
            $wlen = 1;
            // Match backwards, up to the maximum word length
            $lf = 0; // best frequency seen so far
            for ($j = 1; $j <= $this->_word_maxlen; $j++) {
                $o = $i - $j;
                if ($o < 0) break;
                $w2 = substr($sen, $o * 2, ($j + 1) * 2);
                $tmp_f = $this->_dic->find($w2);
                //echo "{$i}.{$j}: $w2 (f: $tmp_f)\n";
                if ($tmp_f > $lf) {
                    $lf = $tmp_f;
                    $wlen = $j + 1;
                    $w = $w2;
                }
            }
            // Move $i back according to $wlen
            $i = $i - $wlen + 1;
            array_push($ret, $w);
        }
        $ret = array_reverse($ret);
        return $ret;
    }
    // Check whether the character is alphanumeric, "_" or "-": [0-9a-zA-Z_-]
    function _is_alnum($char) {
        $ord = ord($char);
        if ($ord == 45 || $ord == 95 || ($ord >= 48 && $ord <= 57))
            return true;
        if (($ord >= 97 && $ord <= 122) || ($ord >= 65 && $ord <= 90))
            return true;
        return false;
    }
}
// Callback function invoked after segmentation
function call_back($ar) {
    foreach ($ar as $tmp) {
        echo $tmp . " ";
        flush();
    }
}
// Example (reads from Sample.txt if there is no input):
$wp = new Ch_word_split();
$wp->set_dic("dic.db");
if (!isset($_REQUEST['testdat']) || empty($_REQUEST['testdat'])) {
    $data = file_get_contents("Sample.txt");
}
else {
    $data = & $_REQUEST['testdat'];
}
// Output
echo "Segmentation result (" . strlen($data) . " chars): <br>\n<textarea cols=100 rows=10>\n";
// Set whether to leave the separator symbols (punctuation and other common marks) out of the result
$wp->set_ignore_mark(false);
// Run the segmentation; if no callback function is set, an array of words is returned
$wp->string_split($data, "call_back");
$time_end = getmicrotime();
$time = $time_end - $time_start;
echo "</textarea><br>\n Time taken: $time seconds <br>\n";
?>
<form method=post>
You can also enter text in the box below and submit it to test the segmentation:<br>
<textarea Name=testdat cols=100 rows=10></textarea><br>
<input type=submit>
</form>
Attachments: <br>
<li> Source code of this program: <a href="Chinese_segment.phps">chinese_segment.php</a> (the simple implementation) </li>
<li> Required dictionary: <a href="dic.db">dic.db</a> (gdbm format) </li>
See also:
(Simple Chinese word segmentation: complete code and dictionary download)
http://php.twomice.net/show_hdr.php?xname=BORRG11&dname=P7SRG11&xpos=19
(Simple Chinese word segmentation service program, C version (CSCWSD))
http://php.twomice.net/show_hdr.php?xname=BORRG11&dname=P7SRG11&xpos=40