Installing and Using SCWS, an Open-Source Chinese Word Segmentation System for PHP

Source: Internet
Author: User
This article describes how to install and use SCWS, an open-source Chinese word segmentation system for PHP.

I. Introduction to SCWS

SCWS stands for Simple Chinese Word Segmentation.
It is a mechanical Chinese word segmentation engine based on a word-frequency dictionary, and it splits a run of Chinese text into words. Words are the smallest meaningful unit of Chinese, but unlike English words they are not separated by spaces, so segmenting text both accurately and quickly is difficult.
SCWS is written in pure C and does not depend on any external libraries. Applications can embed it directly by linking against its dynamic library. The supported Chinese encodings include GBK and UTF-8. A PHP extension module is also provided, so word segmentation can be used quickly and conveniently from PHP.
The segmentation algorithm itself contains little that is new. It uses a word-frequency dictionary collected by the author, supplemented by rules for proper nouns, personal names, place names, numbers, dates, and similar items, to achieve basic word segmentation. In small-scale tests the accuracy was between 90% and 95%, which is generally enough for small search engines, keyword extraction, and similar uses. The first prototype was released at the end of 2005.
SCWS was developed by hightman and is released as open source under the BSD license. The source code is hosted on GitHub.

II. SCWS installation

The code is as follows:
# wget -c http://www.xunsearch.com/scws/down/scws-1.2.1.tar.bz2
# tar jxvf scws-1.2.1.tar.bz2
# cd scws-1.2.1
# ./configure --prefix=/usr/local/scws
# make && make install

III. SCWS PHP extension installation

The code is as follows:
# cd ./phpext
# phpize
# ./configure --with-php-config=/usr/local/php5410/bin/php-config
# make && make install
# echo "[scws]" >> /usr/local/php5410/etc/php.ini
# echo "extension = scws.so" >> /usr/local/php5410/etc/php.ini
# echo "scws.default.charset = utf-8" >> /usr/local/php5410/etc/php.ini
# echo "scws.default.fpath = /usr/local/scws/etc/" >> /usr/local/php5410/etc/php.ini
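
The four settings can also be appended to php.ini in one step. The following is a minimal sketch, not part of the original instructions: the PHP_INI variable is a name introduced here for illustration, and it defaults to a scratch file so the snippet can be tried safely before pointing it at the real /usr/local/php5410/etc/php.ini. It appends (never overwrites) and skips the write if the extension line is already present.

```shell
#!/bin/sh
# PHP_INI is a hypothetical variable for this sketch; it defaults to a
# temporary file so that nothing real is modified unless you set it.
PHP_INI="${PHP_INI:-$(mktemp)}"

# Append the block only if the extension line is not there yet,
# so running the script twice does not duplicate the settings.
if ! grep -q '^extension *= *scws\.so' "$PHP_INI" 2>/dev/null; then
    cat >> "$PHP_INI" <<'EOF'

[scws]
extension = scws.so
scws.default.charset = utf-8
scws.default.fpath = /usr/local/scws/etc/
EOF
fi

# Show the SCWS-related lines now present in the file.
grep -n 'scws' "$PHP_INI"
```

To apply it to the installation used in this article, run it as root with PHP_INI=/usr/local/php5410/etc/php.ini, then restart the PHP service you use so the extension is loaded.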

IV. Dictionary installation

The code is as follows:
# wget http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2
# tar jxvf scws-dict-chs-utf8.tar.bz2 -C /usr/local/scws/etc/
# chown www:www /usr/local/scws/etc/dict.utf8.xdb
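
Before moving on to the PHP code, it can help to confirm that the dictionary and rule files are where the later steps expect them. This is a minimal sketch under stated assumptions: SCWS_ETC and check_scws_files are names made up for this example, and it assumes rules.utf8.ini sits in the same etc directory as the dictionary, as the paths in the PHP example suggest.

```shell
#!/bin/sh
# SCWS_ETC is a hypothetical variable for this sketch; the default matches
# the --prefix used in this article.
SCWS_ETC="${SCWS_ETC:-/usr/local/scws/etc}"

# check_scws_files DIR
# Prints the status of each expected file and returns 0 only if all are readable.
check_scws_files() {
    dir="$1"
    missing=0
    for f in dict.utf8.xdb rules.utf8.ini; do
        if [ -r "$dir/$f" ]; then
            echo "ok: $dir/$f"
        else
            echo "missing or unreadable: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

check_scws_files "$SCWS_ETC" || echo "Some files are missing; re-check the steps above."
```

Note that the check runs as the current user. If PHP runs as the www user, as the chown step implies, that account also needs read access to both files.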

V. PHP example code. For details, refer to the official SCWS API documentation.

The code is as follows:
// Create the core word segmentation object
$so = scws_new();
// Set the character encoding used for segmentation
$so->set_charset('utf-8');
// Set the dictionary used for segmentation (the UTF-8 dictionary here)
$so->set_dict('/usr/local/scws/etc/dict.utf8.xdb');
// Set the rule file used for segmentation
$so->set_rule('/usr/local/scws/etc/rules.utf8.ini');
// Ignore punctuation in the segmentation results
$so->set_ignore(true);
// Enable compound segmentation: for example, "中国人" is returned as the three words "中国", "人", and "中国人"
$so->set_multi(true);
// Automatically combine leftover single characters into two-character words
$so->set_duality(true);
// The text to be segmented
$so->send_text("Welcome to IT development in the Fire Age");
// Fetch the segmentation results; to extract high-frequency words instead, use the get_tops method
while ($tmp = $so->get_result())
{
    print_r($tmp);
}
$so->close();
Description of the fields in each returned array element:
word   string   the word itself
idf    float    inverse document frequency (IDF) of the word
off    int      offset of the word in the original text
attr   string   part of speech

VI. Online API

You can also perform Chinese word segmentation through the online API. The API address is http://www.xunsearch.com/scws/api.php, and a detailed description is available at the same address.
