A simple PHP implementation of Chinese word segmentation

Source: Internet
Author: User
Chinese word segmentation is one of the most basic components of a Chinese search engine, because search algorithms based on single Chinese characters currently do not perform very well. This article is not a study of Chinese search engines, however; it describes part of building an in-site search engine with PHP, and is one article in a series on that system.
The word segmentation tool I use is the open-source version of ICTCLAS, from the Institute of Computing Technology of the Chinese Academy of Sciences. I will also look into Bamboo, another open-source tool, later.
ICTCLAS is a good place to start: its algorithm is widely known, the academic papers describing it are public, and it is simple to compile, with few library dependencies. Currently, however, only C/C++, Java, and C# versions are provided; there is no PHP version. What can we do? One option is to study the C/C++ source code and the academic papers and then write a PHP version. Instead, I chose to call the C/C++ executable from PHP code through inter-process communication.
After downloading and extracting the source code, run make ictclas on a machine that has a C++ development library and build environment. The Makefile contains one error: the command that runs the test is not prefixed with './', so the test does not execute the way it would on Windows. This does not affect the compiled result.
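Before calling the compiled program from PHP, it can help to confirm that the build actually produced an executable. The following is a minimal sketch of such a check; the helper name ictclas_binary_ready is hypothetical and is not part of ICTCLAS or of the class in this article, and the file name ictclas is assumed from the make target.

```php
<?php
// Hypothetical helper (not part of ICTCLAS): returns true when $dir
// contains an executable file named "ictclas" (name assumed from the
// make target used above).
function ictclas_binary_ready($dir) {
    $binary = rtrim($dir, '/') . '/ictclas';
    return is_file($binary) && is_executable($binary);
}

// Check the directory that holds this script.
echo ictclas_binary_ready(__DIR__)
    ? "ictclas is ready\n"
    : "ictclas is missing or not executable\n";
```

A check like this turns a silent proc_open() failure into a message that is easier to diagnose.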
The PHP class for Chinese word segmentation is shown below. It uses the proc_open() function to launch the segmentation program and communicates with it through pipes: the text to be segmented is written to the program's input, and the segmentation result is read from its output.
The code is as follows:
<?php
class NLP {
    // Directory that contains the ictclas executable (no trailing '/').
    private static $cmd_path;

    static function set_cmd_path($path) {
        self::$cmd_path = $path;
    }

    // Pipes $str through the ictclas executable and returns its output as UTF-8.
    private static function cmd($str) {
        $descriptorspec = array(
            0 => array("pipe", "r"), // stdin of the child process
            1 => array("pipe", "w"), // stdout of the child process
        );
        $cmd = self::$cmd_path . "/ictclas";
        $output = '';
        $process = proc_open($cmd, $descriptorspec, $pipes);
        if (is_resource($process)) {
            // Convert the text to GBK before handing it to ictclas.
            $str = iconv('utf-8', 'gbk', $str);
            fwrite($pipes[0], $str);
            // Close stdin first so the child sees end-of-input before we read.
            fclose($pipes[0]);
            $output = stream_get_contents($pipes[1]);
            fclose($pipes[1]);
            $return_value = proc_close($process);
        }
        /*
        // Alternative using exec() instead of proc_open():
        $cmd = "printf '$str' | " . self::$cmd_path . "/ictclas";
        exec($cmd, $output, $ret);
        $output = join("\n", $output);
        */
        $output = trim($output);
        $output = iconv('gbk', 'utf-8', $output);
        return $output;
    }

    /**
     * Segments the text and returns the word list.
     */
    static function tokenize($str) {
        $tokens = array();
        $output = self::cmd($str);
        if ($output) {
            $ps = preg_split('/\s+/', $output);
            foreach ($ps as $p) {
                // Each piece has the form "word/tag"; pad in case the tag is missing.
                list($seg, $tag) = array_pad(explode('/', $p), 2, '');
                $item = array(
                    'seg' => $seg,
                    'tag' => $tag,
                );
                $tokens[] = $item;
            }
        }
        return $tokens;
    }
}
NLP::set_cmd_path(dirname(__FILE__));
?>
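The pipe interaction in the class can be tried on its own, without a compiled ICTCLAS binary. The sketch below assumes a POSIX system where the cat command is available and uses it as a stand-in for ictclas; the helper name pipe_through is hypothetical.

```php
<?php
// Hypothetical helper: writes $input to the stdin of $command and
// returns everything the command prints to stdout, or false on failure.
function pipe_through($command, $input) {
    $descriptorspec = array(
        0 => array('pipe', 'r'), // child's stdin: the parent writes to it
        1 => array('pipe', 'w'), // child's stdout: the parent reads from it
    );
    $process = proc_open($command, $descriptorspec, $pipes);
    if (!is_resource($process)) {
        return false;
    }
    fwrite($pipes[0], $input);
    fclose($pipes[0]);                        // signal end-of-input to the child
    $output = stream_get_contents($pipes[1]); // read until the child closes stdout
    fclose($pipes[1]);
    proc_close($process);
    return $output;
}

// `cat` simply echoes its input back (assumption: it exists on this system).
var_dump(pipe_through('cat', "hello pipes\n"));
```

Writing all of the input before reading any output is fine for short strings, but with input larger than the operating system's pipe buffer the write can block, so this sketch is only suitable for small pieces of text.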

Usage is simple (make sure the compiled ICTCLAS executable and its dictionaries are in the current directory):
The code is as follows:
<?php
require_once('NLP.php');
var_dump(NLP::tokenize('Hello, World!'));
?>
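To see the shape of the array that tokenize() returns without running ICTCLAS, the parsing step can be sketched separately. This assumes, as the class above does, that the program prints whitespace-separated word/tag pairs; the helper name parse_tagged_output and the sample string are made up for illustration. Unlike the class, this variant splits on the last '/' so that a word containing a slash is kept intact.

```php
<?php
// Hypothetical helper mirroring the parsing step of tokenize().
function parse_tagged_output($output) {
    $tokens = array();
    $parts = preg_split('/\s+/', trim($output), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($parts as $part) {
        // Split on the last '/', so a word that itself contains '/' keeps it.
        $pos = strrpos($part, '/');
        if ($pos === false) {
            // No tag present: keep the word and leave the tag empty.
            $tokens[] = array('seg' => $part, 'tag' => '');
        } else {
            $tokens[] = array(
                'seg' => substr($part, 0, $pos),
                'tag' => substr($part, $pos + 1),
            );
        }
    }
    return $tokens;
}

// Made-up sample in "word/tag" form, not real ICTCLAS output.
print_r(parse_tagged_output("你好/a 世界/n"));
```

Each element of the result is an array with a 'seg' key holding the word and a 'tag' key holding its tag, which is the same structure the var_dump() call above prints.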