Simple Chinese word segmentation in PHP. For a Chinese search engine, word segmentation is one of the most fundamental parts of the whole system, because search algorithms based on single Chinese characters do not perform well at present. This article is not about building a full Chinese search engine, though; it is one article in a series on using PHP to build an in-site search engine.
The PHP class for Chinese word segmentation is listed below. It uses the proc_open() function to launch the segmentation program and interact with it through pipes: the text to be segmented is written to the program's stdin, and the segmentation result is read back from its stdout.
<?php
class NLP {
    private static $cmd_path;

    // Path to the segmenter directory; must not end with '/'.
    static function set_cmd_path($path) {
        self::$cmd_path = $path;
    }

    private static function cmd($str) {
        $descriptorspec = array(
            0 => array('pipe', 'r'),  // stdin: we write the text to segment
            1 => array('pipe', 'w'),  // stdout: we read the segmented result
        );
        $cmd = self::$cmd_path . '/ictclas';
        $output = '';
        $process = proc_open($cmd, $descriptorspec, $pipes);
        if (is_resource($process)) {
            // ICTCLAS expects GBK-encoded input.
            $str = iconv('utf-8', 'gbk', $str);
            fwrite($pipes[0], $str);
            // Close stdin first so the child sees EOF and flushes its output.
            fclose($pipes[0]);
            $output = stream_get_contents($pipes[1]);
            fclose($pipes[1]);
            proc_close($process);
        }
        /*
        // Alternative using exec():
        $cmd = "printf '$str' | " . self::$cmd_path . "/ictclas";
        exec($cmd, $output, $ret);
        $output = join("\n", $output);
        */
        $output = trim($output);
        $output = iconv('gbk', 'utf-8', $output);
        return $output;
    }

    /**
     * Returns the word list.
     */
    static function tokenize($str) {
        $tokens = array();
        $output = self::cmd($str);
        if ($output) {
            $pairs = preg_split('/\s+/', $output);
            foreach ($pairs as $p) {
                list($seg, $tag) = explode('/', $p);
                $tokens[] = array(
                    'seg' => $seg,
                    'tag' => $tag,
                );
            }
        }
        return $tokens;
    }
}
NLP::set_cmd_path(dirname(__FILE__));
?>
Using it is simple (make sure the compiled ICTCLAS executable and its dictionaries are in the current directory):
<?php
require_once('NLP.php');
var_dump(NLP::tokenize('Hello, World!'));
?>
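Since the segmenter binary emits whitespace-separated `word/tag` pairs, the parsing step can be exercised on its own, without ICTCLAS installed. Below is a minimal sketch of that parsing logic as a standalone function; the function name `parse_segments` and the sample output string are illustrative, not part of ICTCLAS itself:

```php
<?php
// Hypothetical helper: parse ICTCLAS-style "word/tag" output into token arrays,
// mirroring what NLP::tokenize() does with the segmenter's stdout.
function parse_segments($output) {
    $tokens = array();
    foreach (preg_split('/\s+/', trim($output)) as $pair) {
        if ($pair === '') {
            continue;
        }
        // Split on the first '/' only; pad with null if a pair has no tag.
        list($seg, $tag) = array_pad(explode('/', $pair, 2), 2, null);
        $tokens[] = array('seg' => $seg, 'tag' => $tag);
    }
    return $tokens;
}

// A made-up sample line in the segmenter's output format.
$tokens = parse_segments("你好/l ，/w 世界/n ！/w");
// e.g. $tokens[0] is array('seg' => '你好', 'tag' => 'l')
```

Keeping the parsing separate from the process handling like this also makes it easy to swap in a different segmenter later, as long as it produces the same `word/tag` format.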
Webmaster's experience:
If you want search-engine-grade word segmentation, you need a powerful dictionary plus smarter supporting libraries that model Chinese pinyin, writing conventions, usage habits, and other language features.