PHP Chinese Word segmentation simple Implementation code sharing _php skills

Source: Internet
Author: User
Tags php class
Of course, this article is not to do research on Chinese search engines, but to share if using PHP to do a site search engine. This article is an article in this system.
The word breaker I use is the Ictclas of the open source version of the CAS Computing Institute. There are also open source Bamboo, and I will then investigate the tool.
Starting from Ictclas is a good choice, because its algorithm spread more widely, there are open academic documents, and the compilation is simple, the library relies on less. But currently only provides the C + +, Java and C # version of the code, and no PHP version of the code. What do we do? It may be possible to learn from its C + + source and academic documentation, and then develop a PHP version out of it. However, I want to use interprocess communication to invoke the C + + version of the executable file in the PHP code.
Download the source code after decompression, in the C + + Development Library and the compilation environment of the machine directly make Ictclas can. Its Makefile script has an error, and the code that executes the test does not add '. /', of course not as successful as under Windows execution. However, it does not affect the compilation results.
Chinese word of the PHP class is in the following, with the Proc_open () function to execute the word breaker, and through the pipeline and its interaction, input to the text to be participle, read the word segmentation results.
Copy Code code as follows:

<?php
Class nlp{
private static $cmd _path;
Do not end with '/'
static function Set_cmd_path ($path) {
Self:: $cmd _path = $path;
}
Private function cmd ($STR) {
$descriptorspec = Array (
0 => Array ("Pipe", "R"),
1 => Array ("Pipe", "w"),
);
$cmd = self:: $cmd _path. "/ictclas";
$process = Proc_open ($cmd, $descriptorspec, $pipes);
if (Is_resource ($process)) {
$str = Iconv (' utf-8 ', ' GBK ', $str);
Fwrite ($pipes [0], $STR);
$output = Stream_get_contents ($pipes [1]);
Fclose ($pipes [0]);
Fclose ($pipes [1]);
$return _value = Proc_close ($process);
}
/*
$cmd = "printf ' $input ' |". Self:: $cmd _path. "/ictclas";
EXEC ($cmd, $output, $ret);
$output = Join ("\ n", $output);
*/
$output = Trim ($output);
$output = Iconv (' GBK ', ' utf-8 ', $output);
return $output;
}
/**
* To the word, return the list of words.
*/
function Tokenize ($STR) {
$tokens = Array ();
$output = Self::cmd ($input);
if ($output) {
$ps = Preg_split ('/\s+/', $output);
foreach ($ps as $p) {
List ($seg, $tag) = explode ('/', $p);
$item = Array (
' Seg ' => $seg,
' Tag ' => $tag,
);
$tokens [] = $item;
}
}
return $tokens;
}
}
Nlp::set_cmd_path (DirName (__file__));
?>

It is simple to use (make sure that the Ictclas compiled executables and dictionaries are in the current directory):
Copy Code code as follows:

<?php
Require_once (' nlp.php ');
Var_dump (Nlp::tokenize (' Hello, world! '));
?>

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.