I. Introduction to SCWS
SCWS stands for Simple Chinese Word Segmentation, that is, a simple Chinese word segmentation system.
It is a mechanical Chinese word segmentation engine based on a word-frequency dictionary, and it can split a passage of Chinese text into words with reasonable accuracy. The word is the smallest meaningful unit in Chinese, but written Chinese, unlike English, does not separate words with spaces, so segmenting text both accurately and quickly has always been a hard problem.
SCWS is written in pure C and does not depend on any external library. It can be embedded directly into an application as a dynamic link library, and the supported Chinese encodings include GBK and UTF-8. A PHP extension module is also provided, so it can be used quickly and conveniently from PHP.
The segmentation algorithm is not particularly innovative. It uses a word-frequency dictionary collected by the author, supplemented by rules for proper nouns, personal names, place names, numbers and dates, to achieve basic segmentation. In small-scale tests the accuracy was between 90% and 95%, which is adequate for uses such as small search engines and keyword extraction. The first prototype version was released at the end of 2005.
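To make "mechanical, dictionary-based segmentation" more concrete, here is a minimal sketch of forward maximum matching, a classic dictionary method. It is not SCWS's actual algorithm, which according to the description above also relies on word frequencies and rules. The function name `fmm_segment` and the four-entry dictionary are hypothetical and exist only for this illustration.

```php
<?php
// Toy forward-maximum-matching segmenter (illustration only, NOT SCWS's algorithm).
function fmm_segment(string $text, array $dict, int $maxLen = 4): array
{
    // Split the UTF-8 string into an array of single characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $matched = $chars[$i]; // fall back to a single character
        $step = 1;
        // Try the longest candidate first, then shrink it one character at a time.
        for ($len = min($maxLen, $n - $i); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $i, $len));
            if (isset($dict[$candidate])) {
                $matched = $candidate;
                $step = $len;
                break;
            }
        }
        $words[] = $matched;
        $i += $step;
    }
    return $words;
}

// Hypothetical mini-dictionary: word => frequency (frequencies are unused in this sketch).
$dict = ['中国' => 100, '中国人' => 80, '我' => 500, '是' => 400];
$result = fmm_segment('我是中国人', $dict);
echo implode(' / ', $result), "\n"; // prints: 我 / 是 / 中国人
```

Longest-match-first is why "中国人" comes out as one word rather than "中国" plus "人". A real engine has to resolve such ambiguities with more than length alone, which is where frequency data and rules come in.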
SCWS was developed by hightman and released under the BSD license; the source code is hosted on GitHub.
II. Installing SCWS
# wget -c http://www.xunsearch.com/scws/down/scws-1.2.1.tar.bz2
# tar jxvf scws-1.2.1.tar.bz2
# cd scws-1.2.1
# ./configure --prefix=/usr/local/scws
# make && make install
III. Installing the SCWS PHP extension
# cd ./phpext
# phpize
# ./configure --with-php-config=/usr/local/php5410/bin/php-config
# make && make install
# echo '[scws]' >> /usr/local/php5410/etc/php.ini
# echo "extension = scws.so" >> /usr/local/php5410/etc/php.ini
# echo "scws.default.charset = utf-8" >> /usr/local/php5410/etc/php.ini
# echo "scws.default.fpath = /usr/local/scws/etc/" >> /usr/local/php5410/etc/php.ini
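After editing php.ini it can be useful to confirm that PHP actually sees the extension and the two defaults. The following is a small sketch for that purpose; the helper name `scws_install_report` is hypothetical, while `extension_loaded()` and `ini_get()` are standard PHP functions.

```php
<?php
// Report whether the scws extension is loaded and which defaults PHP sees.
function scws_install_report(): array
{
    return [
        'loaded'  => extension_loaded('scws'),
        // ini_get() returns false when the directive is unknown,
        // which is what happens if the extension did not load.
        'charset' => ini_get('scws.default.charset'),
        'fpath'   => ini_get('scws.default.fpath'),
    ];
}

$report = scws_install_report();
var_dump($report);
```

Run it with the PHP build whose php.ini was edited above. A web server may need a restart before it picks up the new settings.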
IV. Installing the dictionary
# wget http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2
# tar jxvf scws-dict-chs-utf8.tar.bz2 -C /usr/local/scws/etc/
# chown www:www /usr/local/scws/etc/dict.utf8.xdb
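The chown step matters because the PHP process has to be able to read the dictionary. A quick way to check this from PHP is sketched below; the helper name `dict_status` is hypothetical, and the script should be run as the same user as the web server (www in the commands above) for the result to be meaningful.

```php
<?php
// Check that a dictionary file exists and is readable by the current user.
function dict_status(string $path): string
{
    if (!file_exists($path)) {
        return 'missing';
    }
    return is_readable($path) ? 'ok' : 'unreadable';
}

echo dict_status('/usr/local/scws/etc/dict.utf8.xdb'), "\n";
```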
V. PHP example code (see the official SCWS API documentation for details)
<?php
// Instantiate the core segmenter object
$so = scws_new();
// Set the character encoding used for segmentation
$so->set_charset('utf-8');
// Set the dictionary used for segmentation (the UTF-8 dictionary here)
$so->set_dict('/usr/local/scws/etc/dict.utf8.xdb');
// Set the segmentation rules file
$so->set_rule('/usr/local/scws/etc/rules.utf8.ini');
// Strip punctuation from the results
$so->set_ignore(true);
// Enable compound splitting, e.g. "中国人" yields the three words "中国", "人" and "中国人"
$so->set_multi(true);
// Automatically aggregate stray single characters using bigram segmentation
$so->set_duality(true);
// The text to segment
$so->send_text("Welcome to IT development in the Martian Era");
// Fetch the segmentation results in a loop; to extract high-frequency words, use the get_tops method instead
while ($tmp = $so->get_result())
{
    print_r($tmp);
}
$so->close();
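The example above mentions get_tops for extracting high-frequency words. The sketch below shows one way it might be used. It assumes that get_tops accepts a limit as its first argument and returns entries with the keys word, times, weight and attr; check both against the official API description. The helper `format_tops` and the fallback sample data are hypothetical and only let the script run when the extension is not installed.

```php
<?php
// Format entries shaped like get_tops() results as "word (times)" strings.
// Assumed keys per entry: word, times, weight, attr.
function format_tops(array $tops): array
{
    $lines = [];
    foreach ($tops as $item) {
        $lines[] = sprintf('%s (%d)', $item['word'], $item['times']);
    }
    return $lines;
}

if (extension_loaded('scws')) {
    $so = scws_new();
    $so->set_charset('utf-8');
    $so->set_dict('/usr/local/scws/etc/dict.utf8.xdb');
    $so->set_rule('/usr/local/scws/etc/rules.utf8.ini');
    $so->send_text('我是一个中国人,我会C++语言,我也有很多T恤衣服');
    // Assumption: the first argument limits the number of returned words.
    $tops = $so->get_tops(5) ?: [];
    $so->close();
} else {
    // Hypothetical sample so the sketch still runs without the extension.
    $tops = [
        ['word' => '中国人', 'times' => 1, 'weight' => 5.0, 'attr' => 'n'],
    ];
}
print_r(format_tops($tops));
```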
Each entry in the returned array has the following fields:
word _string_ the word itself
idf _float_ inverse document frequency (IDF)
off _int_ position of the word in the original text
attr _string_ part of speech
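As an illustration of how these fields might be used, the sketch below collects the words whose part-of-speech tag starts with a given prefix. It assumes each get_result() call returns a batch (an array) of entries with the fields listed above. The helper `words_by_attr`, the sample batch, its idf values and its tags are all hypothetical.

```php
<?php
// Flatten batches shaped like get_result() output and keep the words whose
// part-of-speech tag (attr) starts with the given prefix.
function words_by_attr(array $batches, string $prefix): array
{
    $words = [];
    foreach ($batches as $batch) {
        foreach ($batch as $item) {
            if (strncmp($item['attr'], $prefix, strlen($prefix)) === 0) {
                $words[] = $item['word'];
            }
        }
    }
    return $words;
}

// Hypothetical batch; the idf values and tags are made up for illustration.
$batches = [[
    ['word' => '我',     'idf' => 0.0, 'off' => 0, 'attr' => 'r'],
    ['word' => '是',     'idf' => 0.0, 'off' => 3, 'attr' => 'v'],
    ['word' => '中国人', 'idf' => 5.0, 'off' => 6, 'attr' => 'n'],
]];
print_r(words_by_attr($batches, 'n'));
```

With real output, `$batches` would be built by appending each `$tmp` from the `while` loop in the example code.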
VI. Online API
You can also use the online API for Chinese word segmentation. The API address is http://www.xunsearch.com/scws/api.php, and detailed instructions are available at the same address.
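A call to the online API could look roughly like the sketch below. The POST parameter names (data, respond, charset) and their values are assumptions that should be verified against the instructions at the API address; the helper names `build_scws_request` and `call_scws_api` are hypothetical. The network call is left commented out so the script has no side effects by default.

```php
<?php
// Build the POST body for the online API.
// ASSUMPTION: the parameter names and values below are not confirmed here.
function build_scws_request(string $text): string
{
    return http_build_query([
        'data'    => $text,
        'respond' => 'json',
        'charset' => 'utf8',
    ]);
}

// Send the request; returns the raw response body, or false on failure.
function call_scws_api(string $text)
{
    $context = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
        'content' => build_scws_request($text),
        'timeout' => 10,
    ]]);
    return @file_get_contents('http://www.xunsearch.com/scws/api.php', false, $context);
}

// Show the encoded body; uncomment the last line to actually call the service.
echo build_scws_request('我是中国人'), "\n";
// var_dump(call_scws_api('我是中国人'));
```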