PHP Chinese word segmentation: a simple implementation. This article is not about building a Chinese search engine from scratch; rather, it shares how I used PHP to implement an in-site search engine, and is one article in that series.
The word segmentation tool I used is the open-source ICTCLAS from the Institute of Computing Technology, Chinese Academy of Sciences. I also plan to look into Bamboo, another open-source tool, later.
ICTCLAS is a good starting point: its algorithms are well known, open academic papers describe them, and it compiles easily with few library dependencies. However, it only ships C/C++, Java, and C# versions; there is no PHP version. What to do? You could study its C/C++ source and the papers and port it to PHP, but instead I chose to use inter-process communication, calling the compiled C/C++ executable from PHP code.
After downloading and unpacking the source code, run make directly on a machine with a C++ compiler and development libraries. Note that its Makefile has a small error: the test command is not prefixed with './', so (unlike on Windows) the test step cannot execute on Linux. This does not affect the compiled result, however.
The PHP class for Chinese word segmentation is shown below. It uses the proc_open() function to launch the word segmentation program and communicates with it through pipes: the text to be segmented is written to the program's stdin, and the segmentation result is read back from its stdout.
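Before the full class, here is a minimal sketch of that proc_open() pipe pattern. It uses the standard `cat` command as a stand-in for the ictclas binary, so it can be run anywhere; the real class below only differs in which executable it launches and in its character-set conversion.

```php
<?php
// Minimal proc_open() pipe round-trip, with `cat` standing in for ictclas.
$descriptorspec = array(
    0 => array("pipe", "r"),  // child's stdin: we write to it
    1 => array("pipe", "w"),  // child's stdout: we read from it
);
$process = proc_open('cat', $descriptorspec, $pipes);
if (is_resource($process)) {
    fwrite($pipes[0], "hello pipe");
    fclose($pipes[0]);                       // close stdin so the child sees EOF
    $output = stream_get_contents($pipes[1]); // read everything the child wrote
    fclose($pipes[1]);
    proc_close($process);
    echo $output;
}
?>
```

Closing the write pipe before reading is important: the child process may not produce its full output (or exit) until it sees end-of-file on stdin.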
The code is as follows:
<?php
class NLP {
    private static $cmd_path;

    // Path must not end with '/'
    static function set_cmd_path($path) {
        self::$cmd_path = $path;
    }

    private static function cmd($str) {
        $descriptorspec = array(
            0 => array("pipe", "r"),
            1 => array("pipe", "w"),
        );
        $cmd = self::$cmd_path . "/ictclas";
        $process = proc_open($cmd, $descriptorspec, $pipes);
        if (is_resource($process)) {
            $str = iconv('utf-8', 'gbk', $str);
            fwrite($pipes[0], $str);
            fclose($pipes[0]);
            $output = stream_get_contents($pipes[1]);
            fclose($pipes[1]);
            $return_value = proc_close($process);
        }
        /*
        $cmd = "printf '$str' | " . self::$cmd_path . "/ictclas";
        exec($cmd, $output, $ret);
        $output = join("\n", $output);
        */
        $output = trim($output);
        $output = iconv('gbk', 'utf-8', $output);
        return $output;
    }

    /**
     * Returns the word list.
     */
    static function tokenize($str) {
        $tokens = array();
        $output = self::cmd($str);
        if ($output) {
            $ps = preg_split('/\s+/', $output);
            foreach ($ps as $p) {
                list($seg, $tag) = explode('/', $p);
                $tokens[] = array(
                    'seg' => $seg,
                    'tag' => $tag,
                );
            }
        }
        return $tokens;
    }
}
NLP::set_cmd_path(dirname(__FILE__));
?>
It is easy to use (make sure the executable and dictionary files produced by the ICTCLAS build are in the current directory):
The code is as follows:
<?php
require_once('NLP.php');
var_dump(NLP::tokenize('Hello, World!'));
?>