Php simple Chinese word segmentation code _ PHP Tutorial

Source: Internet
Author: User
Php's simple Chinese word segmentation code. For Chinese search engines, Chinese word segmentation is one of the most basic parts of the system, because the Chinese word-based search algorithm is not very good at present. of course, this article is not for Chinese search engines. chinese word segmentation is one of the most basic parts of the entire system, because the Chinese search algorithm based on single words is not very good at present. of course, this article is not about researching Chinese search engines, but about using PHP as an in-site search engine. this is an article in this system.

The PHP class for Chinese word segmentation is located below. the proc_open () function is used to execute the word segmentation program and interact with the program through pipelines. the text to be segmented is input to read the word segmentation result.

Class NLP {
Private static $ pai_path;

// Does not end '/'
Static function set_cmd_path ($ path ){
Self: $ pai_path = $ path;
}

Private function cmd ($ str ){
$ Descriptorspec = array (
0 => array ("pipe", "r "),
1 => array ("pipe", "w "),
);
$ Cmd = self: $ export _path. "/ictclas ";
$ Process = proc_open ($ cmd, $ descriptorspec, $ pipes );

If (is_resource ($ process )){
$ Str = iconv ('utf-8', 'gbk', $ str );
Fwrite ($ pipes [0], $ str );
$ Output = stream_get_contents ($ pipes [1]);

Fclose ($ pipes [0]);
Fclose ($ pipes [1]);

$ Return_value = proc_close ($ process );
}

/*
$ Cmd = "printf '$ input' |". self: $ pai_path. "/ictclas ";
Exec ($ cmd, $ output, $ ret );
$ Output = join ("n", $ output );
*/

$ Output = trim ($ output );
$ Output = iconv ('gbk', 'utf-8', $ output );

Return $ output;
}

/**
* Returns the word list.
*/
Function tokenize ($ str ){
$ Tokens = array ();

$ Output = self: cmd ($ input );
If ($ output ){
$ Pstutorial = preg_split ('/s +/', $ output );
Foreach ($ ps as $ p ){
List ($ seg, $ tag) = explode ('/', $ p );
$ Item = array (
'Seg' => $ seg,
'Tag' => $ tag,
);
$ Tokens [] = $ item;
}
}

Return $ tokens;
}
}
NLP: set_pai_path (dirname (_ FILE __));
?>
It is easy to use (ensure that the executable files and dictionaries after ICTCLAS compilation are in the current directory ):

Require_once ('NLP. php ');
Var_dump (NLP: tokenize ('Hello, World! '));
?>


Webmaster experience,

If you want to achieve word segmentation for search engines, you need a powerful dictionary and more intelligent Chinese pinyin, writing, habits, and other functional libraries.


Chinese word segmentation is one of the most basic parts of the entire system, because the Chinese word-based Chinese search algorithm is not very good at present. of course, this article is not about Chinese search...

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.