SCWS, an open-source Chinese word segmentation system for PHP: installation and usage example

Last Update:2017-01-19 Source: Internet

Author: User

I. Introduction to SCWS

SCWS stands for Simple Chinese Word Segmentation, that is, a simple Chinese word segmentation system.
It is a mechanical Chinese word segmentation engine based on a word-frequency dictionary, and it can split a whole passage of Chinese text into words with reasonable accuracy. The word is the smallest unit of meaning in Chinese, but written Chinese, unlike English, does not separate words with spaces, so segmenting text accurately and quickly has always been a hard problem in Chinese text processing.
SCWS is written in pure C and does not depend on any external libraries. It can be embedded in an application directly as a dynamic link library, and it supports Chinese encodings including GBK and UTF-8. A PHP extension module is also provided, so it can be used quickly and conveniently from PHP.
The segmentation algorithm itself is not particularly innovative. It uses a word-frequency dictionary collected by the author, supplemented by rules for proper nouns, personal names, place names, numbers, dates and the like, to achieve basic segmentation. In small-scale tests the accuracy was between 90% and 95%, which is generally adequate for small search engines, keyword extraction and similar uses. The first prototype was released at the end of 2005.
SCWS was developed by hightman and is released under the BSD license; the source code is hosted on GitHub.
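
To make the idea of "mechanical" dictionary-based segmentation concrete, here is a minimal sketch of forward maximum matching, a classic technique of that family. It is not SCWS's actual algorithm, which also uses word frequencies and rules. The dictionary entries and the function name `fmm_segment` are hypothetical, and the code assumes the mbstring extension is available.

```php
<?php
// Toy dictionary: hypothetical entries, not taken from the SCWS dictionary.
$dict = ['中国' => true, '中国人' => true, '人' => true, '你好' => true];

// Forward maximum matching: at each position take the longest dictionary
// word that starts there, falling back to a single character.
function fmm_segment($text, array $dict, $maxLen = 4)
{
    $words = [];
    $n = mb_strlen($text, 'UTF-8');
    $i = 0;
    while ($i < $n) {
        // Default: a single character when no longer word matches.
        $matched = mb_substr($text, $i, 1, 'UTF-8');
        for ($len = min($maxLen, $n - $i); $len > 1; $len--) {
            $candidate = mb_substr($text, $i, $len, 'UTF-8');
            if (isset($dict[$candidate])) {
                $matched = $candidate;
                break;
            }
        }
        $words[] = $matched;
        $i += mb_strlen($matched, 'UTF-8');
    }
    return $words;
}

print_r(fmm_segment('你好中国人', $dict)); // yields "你好" and "中国人"
```

A purely greedy matcher like this cannot resolve ambiguous splits, which is one reason a real engine adds word frequencies and rules on top of the dictionary.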

II. Installing SCWS

# wget -c http://www.xunsearch.com/scws/down/scws-1.2.1.tar.bz2
# tar jxvf scws-1.2.1.tar.bz2
# cd scws-1.2.1
# ./configure --prefix=/usr/local/scws
# make && make install

III. Installing the SCWS PHP extension

# cd ./phpext
# phpize
# ./configure --with-php-config=/usr/local/php5410/bin/php-config
# make && make install
# echo '[scws]' >> /usr/local/php5410/etc/php.ini
# echo "extension = scws.so" >> /usr/local/php5410/etc/php.ini
# echo "scws.default.charset = utf-8" >> /usr/local/php5410/etc/php.ini
# echo "scws.default.fpath = /usr/local/scws/etc/" >> /usr/local/php5410/etc/php.ini

IV. Installing the dictionary

# wget http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2
# tar jxvf scws-dict-chs-utf8.tar.bz2 -C /usr/local/scws/etc/
# chown www:www /usr/local/scws/etc/dict.utf8.xdb
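
Before writing any segmentation code, it can help to confirm that the extension is loaded and the dictionary is readable. The script below is a small sanity check under the paths used in the steps above; the helper name `check_scws_setup` is hypothetical. Run it with the same PHP binary that was configured (for example `/usr/local/php5410/bin/php`), since other PHP installations may read a different php.ini.

```php
<?php
// Returns a list of human-readable problems; an empty list means the
// basic checks passed.
function check_scws_setup($dictPath)
{
    $problems = [];
    if (!extension_loaded('scws')) {
        $problems[] = 'scws extension is not loaded (check "extension = scws.so" in php.ini)';
    }
    if (!is_readable($dictPath)) {
        $problems[] = "dictionary not readable: $dictPath";
    }
    return $problems;
}

// Dictionary path taken from the installation steps above.
$problems = check_scws_setup('/usr/local/scws/etc/dict.utf8.xdb');
echo $problems ? implode("\n", $problems) . "\n" : "SCWS setup looks OK\n";
```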

V. PHP example code. See the official SCWS API documentation for details.

Instantiate the core class of Word breaker
$so = Scws_new ();
Encoding used when setting participle
$so->set_charset (' utf-8 ');
Set up a dictionary for participle (use UTF8 dictionary here)
$so->set_dict ('/usr/local/scws/etc/dict.utf8.xdb ');
Set the rules for participle
$so->set_rule ('/usr/local/scws/etc/rules.utf8.ini ');
Remove punctuation before participle
$so->set_ignore (TRUE);
Whether duplex division, such as "Chinese" return "Chinese + people + Chinese" three words.
$so->set_multi (TRUE);
Set the text automatically to the two Word segmentation method aggregation
$so->set_duality (TRUE);
The statement to be participle
$so->send_text ("Welcome to IT development in the Martian Era");
Get the result of word segmentation, if extracting high-frequency words using Get_tops method
while ($tmp = $so->get_result ())
{
Print_r ($TMP);
}
$so->close ();
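
The get_tops method mentioned in the comment above can be sketched as follows. This is a hedged example rather than official usage: the result keys (`word`, `times`, `weight`, `attr`) are assumed from the official API description, the sample sentence and the helper `format_tops` are hypothetical, and the SCWS part runs only when the extension is loaded.

```php
<?php
// Turn get_tops() entries into printable lines. The keys used here
// (word, times, weight, attr) are an assumption based on the API docs.
function format_tops(array $tops)
{
    $lines = [];
    foreach ($tops as $top) {
        $lines[] = sprintf('%s x%d (weight %.2f, attr %s)',
            $top['word'], $top['times'], $top['weight'], $top['attr']);
    }
    return $lines;
}

if (extension_loaded('scws')) {
    $so = scws_new();
    $so->set_charset('utf-8');
    $so->set_dict('/usr/local/scws/etc/dict.utf8.xdb');
    $so->set_rule('/usr/local/scws/etc/rules.utf8.ini');
    $so->set_ignore(true);
    // Hypothetical sample sentence.
    $so->send_text('我是一个中国人，我热爱中国');
    // Ask for at most 5 keywords; fall back to an empty list on failure.
    $tops = $so->get_tops(5) ?: [];
    echo implode("\n", format_tops($tops)), "\n";
    $so->close();
}
```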

Fields of each entry in the returned array:

word _string_ the word itself
idf _float_ inverse document frequency (IDF)
off _int_ offset of the word in the original text
attr _string_ part of speech
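
Each call to get_result returns a batch of such entries, so a consumer usually loops over the batch. The sketch below filters a batch by IDF. The batch is hand-written and hypothetical (its numbers are made up, and `off` is assumed to be a byte offset), and `words_above_idf` is an illustrative helper, not part of SCWS. Real results may carry additional keys, such as `len`.

```php
<?php
// Hypothetical batch shaped like one return value of get_result();
// the idf values are invented for illustration.
$batch = [
    ['word' => '中国', 'off' => 0, 'idf' => 4.5, 'attr' => 'ns'],
    ['word' => '人',   'off' => 6, 'idf' => 0.0, 'attr' => 'n'],
];

// Keep only the words whose IDF is at least $minIdf.
function words_above_idf(array $batch, $minIdf)
{
    $words = [];
    foreach ($batch as $item) {
        if ($item['idf'] >= $minIdf) {
            $words[] = $item['word'];
        }
    }
    return $words;
}

print_r(words_above_idf($batch, 1.0));
```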

VI. Online API

You can also perform Chinese word segmentation through the online API at http://www.xunsearch.com/scws/api.php; detailed instructions are available at the same address.
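
A call to the online API might look like the sketch below. The request parameter names (`data`, `respond`, `charset`) and the JSON response format are assumptions that should be checked against the instructions at the API address; the function names are hypothetical. The network request is left commented out so the script has no side effects when run as is.

```php
<?php
// Build the POST body. The parameter names are assumptions; verify them
// against the instructions published at the API address.
function build_scws_api_body($text)
{
    return http_build_query([
        'data'    => $text,
        'respond' => 'json',
        'charset' => 'utf8',
    ]);
}

// Send the text to the online API and decode the response.
// Returns null when the request fails.
function scws_api_segment($text, $url = 'http://www.xunsearch.com/scws/api.php')
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => build_scws_api_body($text),
            'timeout' => 5,
        ],
    ]);
    $raw = @file_get_contents($url, false, $context);
    return $raw === false ? null : json_decode($raw, true);
}

// Usage (performs a network request):
// print_r(scws_api_segment('我是一个中国人'));
```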

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
