Introduction to automatic keyword extraction with PHP Chinese word segmentation

Source: Internet
Author: User
This example uses two well-known segmenters, SCWS and PhpAnalysis; both are described after the code. The code is as follows:
header("Content-Type: text/html; charset=utf-8");
// Directory of this script, with backslashes normalized to forward slashes.
define('APP_ROOT', str_replace('\\', '/', dirname(__FILE__)));
$test = 'Here is a Chinese test code!';

// Return the top five keywords of $title as an array, using PSCWS4.
function get_tags_arr($title)
{
    // require_once so that calling the function twice does not redeclare the class.
    require_once(APP_ROOT . '/pscws4.class.php');
    $pscws = new PSCWS4();
    $pscws->set_dict(APP_ROOT . '/scws/dict.utf8.xdb');
    $pscws->set_rule(APP_ROOT . '/scws/rules.utf8.ini');
    $pscws->set_ignore(true);
    $pscws->send_text($title);
    $words = $pscws->get_tops(5);
    $tags = array();
    foreach ($words as $val) {
        $tags[] = $val['word'];
    }
    $pscws->close();
    return $tags;
}
print_r(get_tags_arr($test));
// ======================================================================================
// Return the segmentation result of $content as a string, using PhpAnalysis.
function get_keywords_str($content)
{
    require_once(APP_ROOT . '/phpanalysis.class.php');
    PhpAnalysis::$loadInit = false;
    $pa = new PhpAnalysis('utf-8', 'utf-8', false);
    $pa->LoadDict();
    $pa->SetSource($content);
    $pa->StartAnalysis(false);
    $tags = $pa->GetFinallyResult();
    return $tags;
}
print(get_keywords_str($test));
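
Once a segmenter has produced a list of words, picking keywords comes down to ranking them. The following is a minimal sketch of that idea using raw frequency only; it is not how SCWS or PhpAnalysis rank words internally. The function name top_keywords, the stop-word list, and the sample words are all hypothetical, and the code assumes PHP 7 or later.

```php
<?php
// Rank already-segmented words by raw frequency and keep the top $limit.
// Hypothetical helper for illustration; not part of SCWS or PhpAnalysis.
function top_keywords(array $words, int $limit = 5, array $stopWords = []): array
{
    $stop = array_flip($stopWords);
    $counts = [];
    foreach ($words as $word) {
        // Skip empty tokens and stop words.
        if ($word === '' || isset($stop[$word])) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }
    // Most frequent first; the order of equally frequent words is not guaranteed.
    arsort($counts);
    // Numeric-looking keys are cast to int by PHP, so convert back to strings.
    return array_map('strval', array_slice(array_keys($counts), 0, $limit));
}

// Sample input: words as a segmenter might return them (made-up data).
$words = ['分词', '系统', '分词', '的', '中文', '分词', '系统'];
echo implode(',', top_keywords($words, 2, ['的'])), "\n";
```

The comma-separated string is the usual shape for a page's keywords meta tag; a real system would normally also weigh words by how rare they are in general text.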

Related

SCWS: Simple Chinese Word Segmentation system

Conceptually, SCWS offers nothing novel. It uses a word-frequency dictionary collected by its author, supplemented by rule sets for proper nouns, personal names, place names, numbers, and dates. In small-scale tests its accuracy is roughly 90% to 95%, which is generally enough for small and medium-sized search engines, keyword extraction, and similar uses. SCWS is written in pure C, targets Unix-like operating systems as its main platform, and provides a shared library so that it can be embedded in existing software systems. It supports GBK, UTF-8, BIG5, and other Chinese character encodings, and it segments text efficiently.

System Platform: Windows/Unix
Development Language: C
Usage: PHP extension

Demo URL: http://www.ftphp.com/scws/demo.php
Open Source official website: http://www.ftphp.com/scws/

Qingfeng's note: because it is available as a PHP extension, it integrates easily with existing PHP-based web systems, which is a major advantage.

PhpAnalysis: a PHP word segmentation system that needs no extension component

PhpAnalysis segments text by string matching, a method also called mechanical word segmentation. Following a chosen strategy, it compares the Chinese character string under analysis with the entries of a "sufficiently large" machine dictionary; when a substring is found in the dictionary, the match succeeds and a word is recognized. String-matching methods can be classified in three ways: by scanning direction, into forward and reverse matching; by which length is tried first, into maximum (longest) and minimum (shortest) matching; and by whether they are combined with part-of-speech tagging, into pure segmentation methods and integrated methods that segment and tag together.
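
To make the forward/reverse and maximum-matching distinction concrete, here is a minimal sketch of both scanning directions. It is not PhpAnalysis's actual code: the function names fmm_segment and rmm_segment, the toy dictionary, and the maximum word length are assumptions made for illustration, and the code assumes PHP 7 or later with UTF-8 input.

```php
<?php
// Split a UTF-8 string into an array of single characters.
function split_chars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Forward maximum matching: scan left to right and, at each position,
// take the longest dictionary word; fall back to a single character.
function fmm_segment(string $text, array $dict, int $maxLen = 4): array
{
    $chars = split_chars($text);
    $total = count($chars);
    $words = [];
    $start = 0;
    while ($start < $total) {
        $len = min($maxLen, $total - $start);
        while ($len > 1) {
            if (isset($dict[implode('', array_slice($chars, $start, $len))])) {
                break;
            }
            $len--;
        }
        // $len is 1 here when no longer dictionary word matched.
        $words[] = implode('', array_slice($chars, $start, $len));
        $start += $len;
    }
    return $words;
}

// Reverse maximum matching: the same idea, scanning right to left.
function rmm_segment(string $text, array $dict, int $maxLen = 4): array
{
    $chars = split_chars($text);
    $words = [];
    $end = count($chars);
    while ($end > 0) {
        $len = min($maxLen, $end);
        while ($len > 1) {
            if (isset($dict[implode('', array_slice($chars, $end - $len, $len))])) {
                break;
            }
            $len--;
        }
        array_unshift($words, implode('', array_slice($chars, $end - $len, $len)));
        $end -= $len;
    }
    return $words;
}

// Toy dictionary (hypothetical): keys are the known words.
$dict = ['研究' => true, '研究生' => true, '生命' => true, '起源' => true];
echo implode(' / ', fmm_segment('研究生命起源', $dict, 3)), "\n"; // 研究生 / 命 / 起源
echo implode(' / ', rmm_segment('研究生命起源', $dict, 3)), "\n"; // 研究 / 生命 / 起源
```

The two directions disagree on this input: the forward scan greedily takes 研究生 and strands 命, while the reverse scan recovers 研究 / 生命 / 起源. Differences of this kind are one reason segmenters can split the same text differently.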

System platform: PHP environment

Development Language: PHP

Usage: HTTP service

Demo URL: http://www.itgrass.com/phpanalysis/
Open Source official website: http://www.itgrass.com/phpanalysis/

Qingfeng's note: it is simple to deploy and easy to use, and it suits simple applications, but on large volumes of data it is less efficient than the system described above.

I tried several systems and found that basic word segmentation works in all of them, but they split some words differently, and their part-of-speech tagging also differs.

Http://www.bitsCN.com/codes/40139.html
