Simple implementation of Chinese word segmentation with PHP

Last Update:2017-02-28 Source: Internet

Author: User




Heh, doing Chinese word segmentation in PHP is not exactly a sensible move :p

Below is a simple word segmentation program built around a dictionary I found on the Internet.

(Note: the dictionary file is in gdbm format. Each key is a word and its value is that word's frequency; it holds about 40,000 common words.)

For the complete program, a demo, and the download, see: http://root.twomice.net/my_php4/dict/chinese_segment.php
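
The listing below walks the input one byte at a time: a byte below 0xA1 is handled as a single half-width character, and anything else is taken as the first byte of a two-byte character. Here is a minimal sketch of just that classification step. The helper name split_units is hypothetical and not part of the original program, and the sample bytes assume GB2312-encoded text.

```php
<?php
// Hypothetical helper (not from the original program): split a byte string
// into one-byte and two-byte units using the same 0xA1 threshold.
function split_units(string $bytes): array
{
    $units = [];
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $unit = $bytes[$i];
        // A byte >= 0xA1 is assumed to start a two-byte character.
        if (ord($unit) >= 0xA1 && $i + 1 < $len) {
            $unit .= $bytes[++$i];
        }
        $units[] = $unit;
    }
    return $units;
}

// "PHP" followed by the GB2312 bytes of two Chinese characters.
$units = split_units("PHP\xD6\xD0\xCE\xC4");
echo count($units), "\n";        // 5
echo bin2hex($units[3]), "\n";   // d6d0
```

Text in another encoding, UTF-8 for example, does not fit this two-byte assumption and would have to be converted first.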

<?php
// A simple way to implement a Chinese word segmentation system
// Splitting unit: any character whose ASCII value is below 128
// Common double-byte symbols: "",. 、？ “”；：！　￥ ... %$#@^&* () []{}|\/]
// One could also split on the most common Chinese characters (and, is, not, ah), but there are special words such as "dozen" and "Zheng He". :p

// Timing
function getmicrotime() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float) $usec + (float) $sec);
}
$time_start = getmicrotime();

// Dictionary class
class ch_dictionary {
    var $_id;

    function __construct($fname = "") {
        if ($fname != "") {
            $this->load($fname);
        }
    }

    // Load a dictionary from a file name (gdbm data file)
    function load($fname) {
        $this->_id = dba_popen($fname, "r", "gdbm");
        if (!$this->_id) {
            echo "Failed to open the dictionary. ($fname) <br>\n";
            exit;
        }
    }

    // Return the frequency of a word, or -1 if the word does not exist
    function find($word) {
        $freq = dba_fetch($word, $this->_id);
        if (is_bool($freq)) $freq = -1;
        return $freq;
    }
}

// Word segmentation class: (reverse)
class ch_word_split {
    var $_mb_mark_list;   // common full-width punctuation used to split sentences
    var $_word_maxlen;    // maximum possible length of a single word (in Chinese characters)
    var $_dic;            // dictionary ...
    var $_ignore_mark;    // true or false

    function __construct() {
        $this->_mb_mark_list = array("，", "　", "。", "！", "？", "：", "……", "、", "“", "”", "《", "》", "（", "）");
        $this->_word_maxlen = 12;   // 12 Chinese characters
        $this->_dic = NULL;
        $this->_ignore_mark = true;
    }

    // Set the dictionary
    function set_dic($fname) {
        $this->_dic = new ch_dictionary($fname);
    }

    function set_ignore_mark($set) {
        if (is_bool($set)) $this->_ignore_mark = $set;
    }

    // Cut the string into sentences, then cut the sentences into words
    function string_split($str, $func = "") {
        $ret = array();

        if ($func == "" || !function_exists($func)) $func = "";

        $len = strlen($str);
        $qtr = "";

        for ($i = 0; $i < $len; $i++) {
            $char = $str[$i];

            if (ord($char) < 0xA1) {
                // Read a half-width character: flush the pending sentence first
                if (!empty($qtr)) {
                    $tmp = $this->_sen_split($qtr);
                    $qtr = "";

                    if ($func != "") call_user_func($func, $tmp);
                    else $ret = array_merge($ret, $tmp);
                }

                // If it is a letter or a digit, keep reading byte by byte
                // until a non-alphanumeric byte (or a byte >= 0xA1) is reached
                if ($this->_is_alnum($char)) {
                    do {
                        if (($i + 1) >= $len) break;
                        $char2 = substr($str, $i + 1, 1);
                        if (!$this->_is_alnum($char2)) break;

                        $char .= $char2;
                        $i++;
                    } while (1);

                    if ($func != "") call_user_func($func, array($char));
                    else $ret[] = $char;
                }
                elseif ($char == ' ' || $char == "\t") {
                    // Nothing to do for whitespace.
                    continue;
                }
                elseif (!$this->_ignore_mark) {
                    if ($func != "") call_user_func($func, array($char));
                    else $ret[] = $char;
                }
            }
            else {
                // Double-byte character: take the next byte as well.
                $i++;
                $char .= $str[$i];

                if (in_array($char, $this->_mb_mark_list)) {
                    // Full-width punctuation ends the current sentence
                    if (!empty($qtr)) {
                        $tmp = $this->_sen_split($qtr);
                        $qtr = "";

                        if ($func != "") call_user_func($func, $tmp);
                        else $ret = array_merge($ret, $tmp);
                    }

                    if (!$this->_ignore_mark) {
                        if ($func != "") call_user_func($func, array($char));
                        else $ret[] = $char;
                    }
                }
                else {
                    $qtr .= $char;
                }
            }
        }

        // Flush whatever is left after the last character
        if (strlen($qtr) > 0) {
            $tmp = $this->_sen_split($qtr);

            if ($func != "") call_user_func($func, $tmp);
            else $ret = array_merge($ret, $tmp);
        }

        // Return value
        if ($func == "") {
            return $ret;
        }
        else {
            return true;
        }
    }

    // Cut a sentence into words, working backwards from its end
    function _sen_split($sen) {
        $len = strlen($sen) / 2;
        $ret = array();

        for ($i = $len - 1; $i >= 0; $i--) {
            // For example: "This is a word segmentation program"

            // Take the last character first.
            $w = substr($sen, $i * 2, 2);

            // Final word length
            $wlen = 1;

            // Match backwards, up to the maximum word length.
            $lf = 0;   // last freq
            for ($j = 1; $j <= $this->_word_maxlen; $j++) {
                $o = $i - $j;
                if ($o < 0) break;
                $w2 = substr($sen, $o * 2, ($j + 1) * 2);

                $tmp_f = $this->_dic->find($w2);
                // echo "{$i}.{$j}: $w2 (f: $tmp_f)\n";
                if ($tmp_f > $lf) {
                    $lf = $tmp_f;
                    $wlen = $j + 1;
                    $w = $w2;
                }
            }
            // Move $i back according to $wlen
            $i = $i - $wlen + 1;
            array_push($ret, $w);
        }

        $ret = array_reverse($ret);
        return $ret;
    }

    // Check whether the character is a letter, a digit, "_" or "-": [0-9a-zA-Z_-]
    function _is_alnum($char) {
        $ord = ord($char);
        if ($ord == 45 || $ord == 95 || ($ord >= 48 && $ord <= 57))
            return true;
        if (($ord >= 97 && $ord <= 122) || ($ord >= 65 && $ord <= 90))
            return true;
        return false;
    }
}

// Callback function invoked after segmentation
function call_back($ar) {
    foreach ($ar as $tmp) {
        echo $tmp . " ";
        flush();
    }
}

// Example (reads from sample.txt if there is no input):
$wp = new ch_word_split();
$wp->set_dic("dic.db");

if (!isset($_REQUEST['testdat']) || empty($_REQUEST['testdat'])) {
    $data = file_get_contents("sample.txt");
}
else {
    $data = & $_REQUEST['testdat'];
}

// Output
echo "Segmentation result (" . strlen($data) . " chars): <br>\n<textarea cols=100 rows=10>\n";

// Set whether separators (punctuation, common symbols) are ignored and not returned
$wp->set_ignore_mark(false);

// Run the segmentation; if no callback function is set, an array of words is returned
$wp->string_split($data, "call_back");

$time_end = getmicrotime();
$time = $time_end - $time_start;

echo "</textarea><br>\nTime taken: $time seconds <br>\n";
?>
<form method=post>
You can also enter text in the box below and submit it to test the segmentation:<br>
<textarea name=testdat cols=100 rows=10></textarea><br>
<input type=submit>
</form>
Attachments: <br>
<li>Source code of this program: <a href="chinese_segment.phps">chinese_segment.php</a> (the simple implementation)</li>
<li>Required dictionary: <a href="dic.db">dic.db</a> (gdbm format)</li>
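
The reverse matching done by _sen_split above can be tried without a gdbm file. The sketch below is only an approximation under stated assumptions: it works on UTF-8 text (split with preg_split) rather than the two-byte encoding the listing expects, it replaces the dictionary file with a small made-up array, and the names find_freq and reverse_split are hypothetical.

```php
<?php
// Hypothetical in-memory stand-in for the gdbm dictionary: word => frequency.
$dict = ['中文' => 50, '分词' => 30, '程序' => 40];

// Return the frequency of $word, or -1 when it is not in the dictionary.
function find_freq(array $dict, string $word): int
{
    return $dict[$word] ?? -1;
}

// Segment one punctuation-free sentence by scanning from its end.
function reverse_split(array $dict, string $sentence, int $maxLen = 12): array
{
    // Split the UTF-8 string into an array of characters.
    $chars = preg_split('//u', $sentence, -1, PREG_SPLIT_NO_EMPTY);
    $words = [];

    for ($i = count($chars) - 1; $i >= 0; $i--) {
        $best = $chars[$i];   // fall back to the single last character
        $bestLen = 1;
        $bestFreq = 0;

        // Grow the candidate backwards and keep the most frequent match.
        for ($j = 1; $j <= $maxLen && $i - $j >= 0; $j++) {
            $candidate = implode('', array_slice($chars, $i - $j, $j + 1));
            $freq = find_freq($dict, $candidate);
            if ($freq > $bestFreq) {
                $best = $candidate;
                $bestLen = $j + 1;
                $bestFreq = $freq;
            }
        }

        $words[] = $best;
        $i -= $bestLen - 1;   // skip the characters consumed by this word
    }

    return array_reverse($words);
}

echo implode(' ', reverse_split($dict, '中文分词程序')), "\n";   // 中文 分词 程序
```

As in the listing, the candidate with the highest frequency wins, not necessarily the longest one, and a character with no dictionary match is emitted on its own.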

Report:
(Complete code and dictionary download for the simple Chinese word segmentation implementation)
http://php.twomice.net/show_hdr.php?xname=BORRG11&dname=P7SRG11&xpos=19
(C version of the simple Chinese word segmentation service program (CSCWSD))
http://php.twomice.net/show_hdr.php?xname=BORRG11&dname=P7SRG11&xpos=40

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
