Simple implementation of Chinese word segmentation with PHP

Source: Internet
Author: User

Admittedly, doing Chinese word segmentation in PHP is not the most sensible choice. :p

Below is a simple word segmentation program, built around a dictionary that I found on the Internet.

(Note: the dictionary file is in GDBM format. Each key is a word and each value is that word's frequency. It holds about 40,000 common words.)
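To make the key/value layout concrete, here is a minimal sketch that uses a plain PHP array as a stand-in for the GDBM file. The array contents and the helper name `dict_find` are assumptions made up for illustration; the actual program reads the data through the dba extension instead.

```php
<?php
// Hypothetical in-memory stand-in for the GDBM dictionary: word => frequency.
// The words and counts below are invented for illustration only.
$dict = array(
    "word"    => 120,
    "segment" => 45,
);

// Return the frequency of $word, or -1 when the word is not in the dictionary.
function dict_find(array $dict, $word) {
    return isset($dict[$word]) ? (int)$dict[$word] : -1;
}

echo dict_find($dict, "word"), "\n";     // prints 120
echo dict_find($dict, "missing"), "\n";  // prints -1
```

Returning -1 for a missing word lets the caller compare frequencies with a plain `>` without a separate existence check.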

For the complete program demo and download, see: http://root.twomice.net/my_php4/dict/chinese_segment.php
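The program scans its input one byte at a time: a byte below 0xA1 is treated as a single-byte character, and any other byte is treated as the first half of a two-byte character. The sketch below shows only that scan. It assumes GB2312-style input, and `split_bytes` is a hypothetical helper name that does not appear in the program itself.

```php
<?php
// Split a byte string into single-byte and two-byte characters.
// Assumption: a GB2312-style encoding in which any byte >= 0xA1 starts a
// two-byte character. split_bytes() is a hypothetical helper name.
function split_bytes($str) {
    $chars = array();
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $char = $str[$i];
        if (ord($char) >= 0xA1 && $i + 1 < $len) {
            // Lead byte of a double-byte character: consume the next byte too.
            $i++;
            $char .= $str[$i];
        }
        $chars[] = $char;
    }
    return $chars;
}

// "ab" followed by two double-byte characters, written as hex escapes
// so the example does not depend on the encoding of this file.
$parts = split_bytes("ab\xD6\xD0\xCE\xC4");
echo count($parts), "\n";       // prints 4
echo bin2hex($parts[2]), "\n";  // prints d6d0
```

Unlike the full listing, this sketch checks that a second byte is available before consuming it, so a truncated trailing byte is returned on its own.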

<?php
// A simple implementation of a Chinese word segmentation system.
// Segmentation unit: any character whose ASCII value is < 128
// Common double-byte symbols: "",. 、? “”;:! ¥ ... %$#@^&* () []{}|\/
// One could also split on the most common Chinese characters (glossed here as "and", "is", "not", "ah"),
// but that would break special words such as "dozen" and "Zheng He". :p

// Timing helper
function getmicrotime() {
    // microtime() returns "usec sec" separated by a single space
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}
$time_start = getmicrotime();


// Dictionary class
class ch_dictionary {
    var $_id;

    // Constructor: optionally load a dictionary right away
    function __construct($fname = "") {
        if ($fname != "") {
            $this->load($fname);
        }
    }

    // Load a dictionary from a file name (GDBM data file)
    function load($fname) {
        $this->_id = dba_popen($fname, "r", "gdbm");
        if (!$this->_id) {
            echo "Failed to open the dictionary. ($fname)<br>\n";
            exit;
        }
    }

    // Return the frequency of a word, or -1 if the word does not exist
    function find($word) {
        $freq = dba_fetch($word, $this->_id);
        if (is_bool($freq)) $freq = -1;
        return $freq;
    }
}

// Word segmentation (reverse matching)
///////////////////////////////////////////////
class ch_word_split {
    var $_mb_mark_list;   // full-width punctuation commonly used to split sentences
    var $_word_maxlen;    // maximum possible length of a single word (in Chinese characters)
    var $_dic;            // dictionary ...
    var $_ignore_mark;    // true or false

    function __construct() {
        // These must be full-width (double-byte) marks in the same encoding as the input text
        $this->_mb_mark_list = array(",", " ", "。", "!", "?", ":", "……", "、", "“", "”", "《", "》", "(", ")");
        $this->_word_maxlen  = 12;   // 12 Chinese characters
        $this->_dic = NULL;
        $this->_ignore_mark = true;
    }

    // Set up the dictionary
    function set_dic($fname) {
        $this->_dic = new ch_dictionary($fname);
    }

    function set_ignore_mark($set) {
        if (is_bool($set)) $this->_ignore_mark = $set;
    }

    // Cut the string into sentences, then cut each sentence into words
    function string_split($str, $func = "") {
        $ret = array();

        if ($func == "" || !function_exists($func)) $func = "";

        $len = strlen($str);
        $qtr = "";

        for ($i = 0; $i < $len; $i++) {
            $char = $str[$i];

            if (ord($char) < 0xA1) {
                // Read a half-width (single-byte) character: flush the pending sentence
                if (!empty($qtr)) {
                    $tmp = $this->_sen_split($qtr);
                    $qtr = "";

                    if ($func != "") call_user_func($func, $tmp);
                    else $ret = array_merge($ret, $tmp);
                }

                // If it is a letter or a digit, keep reading while the next character is alphanumeric
                if ($this->_is_alnum($char)) {
                    do {
                        if (($i + 1) >= $len) break;
                        $char2 = substr($str, $i + 1, 1);
                        if (!$this->_is_alnum($char2)) break;

                        $char .= $char2;
                        $i++;
                    } while (1);

                    if ($func != "") call_user_func($func, array($char));
                    else $ret[] = $char;
                }
                elseif ($char == ' ' || $char == "\t") {
                    // Nothing to do.
                    continue;
                }
                elseif (!$this->_ignore_mark) {
                    if ($func != "") call_user_func($func, array($char));
                    else $ret[] = $char;
                }
            }
            else {
                // Double-byte character: take the next byte as well.
                $i++;
                $char .= $str[$i];

                if (in_array($char, $this->_mb_mark_list)) {
                    if (!empty($qtr)) {
                        $tmp = $this->_sen_split($qtr);
                        $qtr = "";

                        if ($func != "") call_user_func($func, $tmp);
                        else $ret = array_merge($ret, $tmp);
                    }

                    if (!$this->_ignore_mark) {
                        if ($func != "") call_user_func($func, array($char));
                        else $ret[] = $char;
                    }
                }
                else {
                    $qtr .= $char;
                }
            }
        }

        // Flush whatever is left at the end of the input
        if (strlen($qtr) > 0) {
            $tmp = $this->_sen_split($qtr);

            if ($func != "") call_user_func($func, $tmp);
            else $ret = array_merge($ret, $tmp);
        }

        // Return value
        if ($func == "") {
            return $ret;
        }
        else {
            return true;
        }
    }

    // Cut a sentence into words, matching in reverse (from the end of the sentence)
    function _sen_split($sen) {
        $len = strlen($sen) / 2;
        $ret = array();

        for ($i = $len - 1; $i >= 0; $i--) {
            // e.g. "This is a word segmentation program"

            // Take the last character first.
            $w = substr($sen, $i * 2, 2);

            // Length of the word finally chosen
            $wlen = 1;

            // Match backwards, up to the maximum word length.
            $lf = 0; // last (best) frequency
            for ($j = 1; $j <= $this->_word_maxlen; $j++) {
                $o = $i - $j;
                if ($o < 0) break;
                $w2 = substr($sen, $o * 2, ($j + 1) * 2);

                $tmp_f = $this->_dic->find($w2);
                //echo "{$i}.{$j}: $w2 (f: $tmp_f)\n";
                if ($tmp_f > $lf) {
                    $lf = $tmp_f;
                    $wlen = $j + 1;
                    $w = $w2;
                }
            }
            // Move $i back according to $wlen
            $i = $i - $wlen + 1;
            array_push($ret, $w);
        }

        $ret = array_reverse($ret);
        return $ret;
    }

    // Determine whether the character is alphanumeric, "_" or "-": [0-9a-zA-Z_-]
    function _is_alnum($char) {
        $ord = ord($char);
        // 45 = "-", 95 = "_", 48..57 = "0".."9"
        if ($ord == 45 || $ord == 95 || ($ord >= 48 && $ord <= 57))
            return true;
        // 97..122 = "a".."z", 65..90 = "A".."Z"
        if (($ord >= 97 && $ord <= 122) || ($ord >= 65 && $ord <= 90))
            return true;
        return false;
    }
}


// Callback function invoked with each batch of segmented words
function call_back($ar) {
    foreach ($ar as $tmp) {
        echo $tmp . " ";
        flush();
    }
}

// Example (reads from sample.txt if there is no input):
$wp = new ch_word_split();
$wp->set_dic("dic.db");

if (!isset($_REQUEST['testdat']) || empty($_REQUEST['testdat'])) {
    $data = file_get_contents("sample.txt");
}
else {
    $data = $_REQUEST['testdat'];
}

// Output
echo "Segmentation result (" . strlen($data) . " chars): <br>\n<textarea cols=100 rows=10>\n";

// Set whether segmentation marks (punctuation and the like) are ignored instead of returned
$wp->set_ignore_mark(false);

// Run the segmentation; if no callback function is set, an array of words is returned
$wp->string_split($data, "call_back");

$time_end = getmicrotime();
$time = $time_end - $time_start;

echo "</textarea><br>\nTime taken: $time seconds <br>\n";
?>
<form method=post>
You can also enter text in the box below and submit it to test the segmentation:<br>
<textarea Name=testdat cols=100 rows=10></textarea><br>
<input type=submit>
</form>
Attached: <br>
<li> Source code of this program: <a href="chinese_segment.phps">chinese_segment.php</a> (simple implementation) </li>
<li> Dictionary required: <a href="dic.db">dic.db</a> (GDBM format) </li>
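The core of the listing is the reverse matching done in `_sen_split`: starting from the last character, it tries ever longer candidates that end at the current position and keeps the one with the highest dictionary frequency. The following is a minimal, encoding-independent sketch of that loop. It assumes a plain array as the dictionary, lets each ASCII letter play the role of one Chinese character, and uses the hypothetical function name `reverse_split`; none of these appear in the original program.

```php
<?php
// Reverse dictionary matching over an array of characters.
// Assumptions: $dict is a hypothetical word => frequency array standing in
// for the GDBM file, and each letter stands for one Chinese character.
function reverse_split(array $chars, array $dict, $maxlen = 12) {
    $ret = array();
    for ($i = count($chars) - 1; $i >= 0; $i--) {
        $w = $chars[$i];   // fall back to the single character
        $wlen = 1;
        $lf = 0;           // best frequency seen so far
        for ($j = 1; $j <= $maxlen && $i - $j >= 0; $j++) {
            // Candidate word: $j + 1 characters ending at position $i.
            $w2 = implode("", array_slice($chars, $i - $j, $j + 1));
            $f = isset($dict[$w2]) ? $dict[$w2] : -1;
            if ($f > $lf) {
                $lf = $f;
                $wlen = $j + 1;
                $w = $w2;
            }
        }
        $i -= $wlen - 1;   // skip the characters consumed by this word
        $ret[] = $w;
    }
    // Words were collected from the end, so restore the reading order.
    return array_reverse($ret);
}

$dict = array("cd" => 5, "bcd" => 9, "ab" => 3);
echo implode(" ", reverse_split(str_split("abcd"), $dict)), "\n";  // prints "a bcd"
```

Note that the candidate with the highest frequency wins, not necessarily the longest one: with `"cd" => 9` and `"bcd" => 5` the same input would split into "ab" and "cd".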



Appendix:
(Simple Chinese word segmentation: complete code and dictionary download)
http://php.twomice.net/show_hdr.php?xname=BORRG11&dname=P7SRG11&xpos=19
(C version: Simple Chinese Word Segmentation Service Program (CSCWSD))
http://php.twomice.net/show_hdr.php?xname=BORRG11&dname=P7SRG11&xpos=40



