This article describes how to install SCWS, an open-source Chinese word segmentation system, and how to use it from PHP. Details follow.
I. Introduction to SCWS
SCWS is short for Simple Chinese Word Segmentation, i.e. a simple Chinese word segmentation system.
It is a mechanical Chinese word segmentation engine based on a word frequency dictionary; it splits a whole passage of Chinese text into words. The word is the smallest meaningful unit in Chinese, but unlike English, Chinese text does not separate words with spaces, so segmenting it accurately and quickly is difficult.
SCWS is written in pure C and does not depend on any external libraries. It can be embedded directly into an application as a dynamic link library. Supported Chinese encodings include GBK and UTF-8. A PHP extension module is also provided, so segmentation can be used quickly and conveniently from PHP.
The segmentation algorithm itself contains little that is novel. It uses a word frequency dictionary collected by the author, supplemented by rules for proper nouns, personal names, place names, numbers, and years, to achieve basic segmentation. In small-scale tests the accuracy is between 90% and 95%, which is generally enough for small search engines, keyword extraction, and similar uses. The first prototype was released at the end of 2005.
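To make the idea of dictionary-based ("mechanical") segmentation concrete, here is a toy forward maximum-matching segmenter. It is only an illustration and is not the SCWS algorithm, which also weighs word frequencies and applies rules for names and numbers. The function name `toy_segment` and the three-word dictionary are hypothetical, and the sketch assumes the mbstring extension is available.

```php
<?php
// Toy forward maximum matching. NOT the SCWS algorithm; illustration only.
// toy_segment() and $dict are hypothetical names made up for this sketch.
function toy_segment($text, $dict, $maxLen = 4)
{
    $words = array();
    $n = mb_strlen($text, 'UTF-8');
    $i = 0;
    while ($i < $n) {
        // Fallback: emit a single character when no dictionary word matches.
        $matched = mb_substr($text, $i, 1, 'UTF-8');
        // Try the longest candidate first, then shrink it one character at a time.
        for ($len = min($maxLen, $n - $i); $len > 1; $len--) {
            $cand = mb_substr($text, $i, $len, 'UTF-8');
            if (isset($dict[$cand])) {
                $matched = $cand;
                break;
            }
        }
        $words[] = $matched;
        $i += mb_strlen($matched, 'UTF-8');
    }
    return $words;
}

// Hand-made dictionary: the keys are the known words.
$dict = array('中文' => true, '分词' => true, '系统' => true);
print_r(toy_segment('中文分词系统', $dict));
```

With this dictionary the sample string is split into 中文, 分词, and 系统. A real engine needs a far larger dictionary and a way to choose between competing matches, which is where word frequencies come in.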
SCWS was developed by hightman and is released as open source under the BSD license. The source code is hosted on GitHub.
II. SCWS installation
The code is as follows:
# wget -c http://www.xunsearch.com/scws/down/scws-1.2.1.tar.bz2
# tar jxvf scws-1.2.1.tar.bz2
# cd scws-1.2.1
# ./configure --prefix=/usr/local/scws
# make && make install
III. SCWS PHP extension installation
The code is as follows:
# cd ./phpext
# phpize
# ./configure --with-php-config=/usr/local/php5410/bin/php-config
# make && make install
# echo "[scws]" >> /usr/local/php5410/etc/php.ini
# echo "extension = scws.so" >> /usr/local/php5410/etc/php.ini
# echo "scws.default.charset = utf-8" >> /usr/local/php5410/etc/php.ini
# echo "scws.default.fpath = /usr/local/scws/etc/" >> /usr/local/php5410/etc/php.ini
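After editing php.ini it can be useful to confirm that PHP actually picked up the extension and the two settings. The following is a minimal sketch using only standard PHP functions; the function name `scws_install_report` is made up for this example. Run it with the same PHP binary whose php.ini was edited, and restart the web server or PHP-FPM if PHP runs there.

```php
<?php
// Sanity check for the extension install. scws_install_report() is a
// hypothetical helper name; it only uses standard PHP functions.
function scws_install_report()
{
    $lines = array();
    if (!extension_loaded('scws')) {
        $lines[] = 'scws extension is NOT loaded; check the extension line in php.ini';
        return $lines;
    }
    $lines[] = 'scws extension loaded';
    // Show the two ini settings written in the step above.
    foreach (array('scws.default.charset', 'scws.default.fpath') as $key) {
        $lines[] = $key . ' = ' . var_export(ini_get($key), true);
    }
    return $lines;
}

echo implode("\n", scws_install_report()), "\n";
```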
IV. Dictionary installation
The code is as follows:
# wget http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2
# tar jxvf scws-dict-chs-utf8.tar.bz2 -C /usr/local/scws/etc/
# chown www:www /usr/local/scws/etc/dict.utf8.xdb
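A missing or unreadable dictionary is an easy mistake to make at this step, so a quick check can save time. The sketch below assumes the paths used in this article; `dict_status` is a hypothetical helper name. Note that it reports readability for the user running the script, which on the command line may differ from the `www` user.

```php
<?php
// Check that a dictionary or rule file exists and is readable.
// dict_status() is a hypothetical helper; paths are the ones used in this article.
function dict_status($path)
{
    if (!is_file($path)) {
        return 'missing';
    }
    if (!is_readable($path)) {
        return 'unreadable';
    }
    return 'ok (' . filesize($path) . ' bytes)';
}

echo 'dict:  ', dict_status('/usr/local/scws/etc/dict.utf8.xdb'), "\n";
echo 'rules: ', dict_status('/usr/local/scws/etc/rules.utf8.ini'), "\n";
```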
V. PHP example code. For details, refer to the official SCWS API documentation.
The code is as follows:
<?php
// Create the segmenter object
$so = scws_new();
// Set the character encoding used for segmentation
$so->set_charset('utf-8');
// Set the dictionary used for segmentation (the UTF-8 dictionary here)
$so->set_dict('/usr/local/scws/etc/dict.utf8.xdb');
// Set the rule file used for segmentation
$so->set_rule('/usr/local/scws/etc/rules.utf8.ini');
// Strip punctuation before segmenting
$so->set_ignore(true);
// Enable multi-level segmentation: for example, the word for "Chinese people"
// yields "China" + "people" + "Chinese people"
$so->set_multi(true);
// Automatically join leftover single characters into two-character words
$so->set_duality(true);
// The text to segment
$so->send_text("Welcome to IT development in the Fire Age");
// Fetch the segmentation results; to extract high-frequency words, use get_tops() instead
while ($tmp = $so->get_result()) {
    print_r($tmp);
}
$so->close();
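The last comment in the example mentions `get_tops` for high-frequency words. A minimal sketch of that call follows. It assumes the method takes a result limit as its first argument and returns entries with `word` and `times` keys, as I understand the official SCWS PHP documentation; verify both against that documentation. The helper `top_words` and the sample sentence are made up for illustration.

```php
<?php
// Sketch: keyword extraction with get_tops(). top_words() is a hypothetical
// helper. Assumption to verify: get_tops($limit) returns entries that
// carry 'word' and 'times' keys.
function top_words($text, $limit)
{
    if (!extension_loaded('scws')) {
        return null; // extension not installed: nothing to do
    }
    $so = scws_new();
    $so->set_charset('utf-8');
    $so->set_dict('/usr/local/scws/etc/dict.utf8.xdb');
    $so->set_rule('/usr/local/scws/etc/rules.utf8.ini');
    $so->set_ignore(true);
    $so->send_text($text);
    $tops = $so->get_tops($limit);
    $so->close();
    return $tops;
}

// Sample sentence written for this example.
$tops = top_words('中文分词是中文搜索引擎的基础，分词的好坏直接影响搜索结果。', 5);
if (is_array($tops)) {
    foreach ($tops as $t) {
        echo $t['word'], "\t", $t['times'], "\n";
    }
}
```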
Each word in the array returned by get_result() is itself an array with the following fields:
word  _string_  the word itself
idf   _float_   inverse document frequency (IDF) of the word
off   _int_     position of the word in the original text
attr  _string_  part of speech
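As a small illustration of how these fields might be consumed, the sketch below turns one batch of results into `word/attr@off` strings. The helper `format_words` is hypothetical, and the sample batch is hand-written with placeholder values; it is not real SCWS output.

```php
<?php
// Format one batch of segmentation results as "word/attr@off" strings.
// format_words() is a hypothetical helper name.
function format_words($batch)
{
    $out = array();
    foreach ($batch as $w) {
        $out[] = sprintf('%s/%s@%d', $w['word'], $w['attr'], $w['off']);
    }
    return $out;
}

// Hand-written sample in the shape described above; every value is a
// placeholder chosen for illustration, not output produced by SCWS.
$sample = array(
    array('word' => '中文', 'idf' => 0.0, 'off' => 0, 'attr' => 'n'),
    array('word' => '分词', 'idf' => 0.0, 'off' => 6, 'attr' => 'n'),
);
echo implode(' ', format_words($sample)), "\n";
```

In real use the same function would be called on each `$tmp` inside the `while ($tmp = $so->get_result())` loop.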
VI. Online API
You can also perform Chinese word segmentation through the online API. The API address is http://www.xunsearch.com/scws/api.php, and a detailed description is available at the same address.
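Calling the online API from PHP could look roughly like the sketch below. The POST parameter names `data` (the text) and `respond` (the response format) are assumptions based on my recollection of the API page, as is the JSON response; check the description at the address above before relying on them. The helper names are hypothetical.

```php
<?php
// Sketch of a POST request to the online API.
// ASSUMPTIONS: the parameter names 'data' and 'respond' and the JSON
// response format are not confirmed by this article; verify on the API page.
function scws_api_context($text)
{
    $body = http_build_query(array('data' => $text, 'respond' => 'json'));
    return array('http' => array(
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => $body,
        'timeout' => 10,
    ));
}

function scws_api_segment($text)
{
    $ctx = stream_context_create(scws_api_context($text));
    $raw = @file_get_contents('http://www.xunsearch.com/scws/api.php', false, $ctx);
    // Returns null when the request fails or the response is not valid JSON.
    return $raw === false ? null : json_decode($raw, true);
}

// Usage (performs a network request):
// var_dump(scws_api_segment('中文分词'));
```

The online API avoids installing anything locally, at the cost of a network round trip per request, so it suits occasional use better than bulk indexing.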