Installing and Using SCWS, the Open-Source Chinese Word Segmentation System, with PHP: A Worked Example

Source: Internet
Author: User

I. Introduction to SCWS

SCWS stands for Simple Chinese Word Segmentation, a simple Chinese word segmentation system.
It is a mechanical Chinese word segmentation engine based on a word-frequency dictionary, and it can split a whole passage of Chinese text into words with reasonable accuracy. Words are the basic units of meaning in Chinese, but unlike English, written Chinese does not separate them with spaces, so segmenting text both accurately and quickly has long been a hard problem.
SCWS is written in pure C and does not depend on any external libraries. It can be embedded in an application directly as a dynamic link library, and it supports Chinese encodings including GBK and UTF-8. A PHP extension module is also provided, so it can be used quickly and conveniently from PHP.
The segmentation algorithm itself is not particularly innovative. It uses the author's own collected word-frequency dictionary, supplemented by rules for proper nouns, personal names, place names, numbers and dates, to perform basic segmentation. In small-scale testing its accuracy was between 90% and 95%, which is generally adequate for small search engines, keyword extraction and similar uses. The first prototype was released at the end of 2005.
SCWS was developed by hightman and is released under the BSD license; the source code is hosted on GitHub.
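
To make the idea of "mechanical", dictionary-based segmentation more concrete, here is a minimal sketch of forward maximum matching over a toy dictionary. This is not SCWS's actual algorithm, which also relies on word frequencies and rule sets; the function name fmm_segment and the four-word dictionary are hypothetical and exist only for illustration.

```php
<?php
// Toy forward-maximum-matching segmenter. It only illustrates dictionary-based
// segmentation in general; it is NOT how SCWS works internally.

/**
 * @param string $text   UTF-8 text to segment
 * @param array  $dict   list of known words (hypothetical toy dictionary)
 * @param int    $maxLen longest dictionary word, in characters
 * @return array segmented words
 */
function fmm_segment($text, array $dict, $maxLen = 4)
{
    $lookup = array_flip($dict); // word => index, for fast membership tests
    // Split the UTF-8 string into single characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $total = count($chars);
    $words = [];
    $pos   = 0;

    while ($pos < $total) {
        // Try the longest candidate first, then shrink one character at a time.
        for ($len = min($maxLen, $total - $pos); $len >= 1; $len--) {
            $candidate = implode('', array_slice($chars, $pos, $len));
            // A single character is always accepted as a fallback.
            if ($len === 1 || isset($lookup[$candidate])) {
                $words[] = $candidate;
                $pos += $len;
                break;
            }
        }
    }
    return $words;
}

$dict = ['中国', '中国人', '喜欢', '分词'];
print_r(fmm_segment('中国人喜欢分词', $dict));
```

With this toy dictionary the call yields the three words 中国人, 喜欢 and 分词. A real engine such as SCWS has to resolve ambiguous cuts that a purely greedy match gets wrong, which is where the frequency data and rules come in.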

II. Installing SCWS


# wget -c http://www.xunsearch.com/scws/down/scws-1.2.1.tar.bz2
# tar jxvf scws-1.2.1.tar.bz2
# cd scws-1.2.1
# ./configure --prefix=/usr/local/scws
# make && make install

III. Installing the SCWS PHP extension


# cd ./phpext
# phpize
# ./configure --with-php-config=/usr/local/php5410/bin/php-config
# make && make install
# echo '[scws]' >> /usr/local/php5410/etc/php.ini
# echo "extension = scws.so" >> /usr/local/php5410/etc/php.ini
# echo "scws.default.charset = utf-8" >> /usr/local/php5410/etc/php.ini
# echo "scws.default.fpath = /usr/local/scws/etc/" >> /usr/local/php5410/etc/php.ini
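
After editing php.ini it can be useful to confirm that PHP actually picked up the extension and the two settings. The following is a small sketch of such a check; the helper name check_scws_status is hypothetical, and the script should be run with the same PHP installation whose php.ini was edited (web server processes usually need a restart first).

```php
<?php
// Sanity check for the SCWS extension and its php.ini settings.

/**
 * Collect the SCWS-related runtime state.
 * ini_get() returns false when a directive does not exist,
 * which is what happens if the extension is not loaded.
 */
function check_scws_status()
{
    return [
        'loaded'  => extension_loaded('scws'),
        'charset' => ini_get('scws.default.charset'),
        'fpath'   => ini_get('scws.default.fpath'),
    ];
}

$status = check_scws_status();
if ($status['loaded']) {
    echo "scws loaded; charset={$status['charset']}, fpath={$status['fpath']}\n";
} else {
    echo "scws is NOT loaded; check the extension line in php.ini\n";
}
```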

IV. Installing the dictionary


# wget http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2
# tar jxvf scws-dict-chs-utf8.tar.bz2 -C /usr/local/scws/etc/
# chown www:www /usr/local/scws/etc/dict.utf8.xdb
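
Since the dictionary is owned by the www user here, a quick way to catch permission or path mistakes is to test readability from PHP itself. This is only a sketch: the helper name find_unreadable is hypothetical, the paths are the ones used in this article, and the check is only meaningful when run as the user that will execute your PHP code.

```php
<?php
// Check that the SCWS data files are present and readable by the current user.
// Paths follow the layout used in this article; adjust them if yours differ.

/** Return the subset of $paths that is missing or not readable. */
function find_unreadable(array $paths)
{
    $bad = [];
    foreach ($paths as $path) {
        if (!is_file($path) || !is_readable($path)) {
            $bad[] = $path;
        }
    }
    return $bad;
}

$bad = find_unreadable([
    '/usr/local/scws/etc/dict.utf8.xdb',
    '/usr/local/scws/etc/rules.utf8.ini',
]);

if ($bad) {
    echo "Missing or unreadable: " . implode(', ', $bad) . "\n";
} else {
    echo "All SCWS data files are readable\n";
}
```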

V. PHP example code. See the official SCWS API documentation for details.


<?php
// Instantiate the core segmenter object
$so = scws_new();
// Set the character encoding used for segmentation
$so->set_charset('utf-8');
// Set the dictionary used for segmentation (the UTF-8 dictionary here)
$so->set_dict('/usr/local/scws/etc/dict.utf8.xdb');
// Set the segmentation rule file
$so->set_rule('/usr/local/scws/etc/rules.utf8.ini');
// Strip punctuation from the results
$so->set_ignore(true);
// Enable compound segmentation: e.g. "中国人" yields the three words "中国", "人" and "中国人"
$so->set_multi(true);
// Automatically aggregate stray single characters into two-character words
$so->set_duality(true);
// The text to segment (in practice this would be Chinese text)
$so->send_text("Welcome to IT development in the Martian Era");
// Fetch the segmentation results; to extract high-frequency words, use get_tops() instead
while ($tmp = $so->get_result())
{
    print_r($tmp);
}
$so->close();
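
The get_tops method mentioned above can be sketched as follows. This assumes the scws extension is loaded and the dictionary and rule files are installed at the paths used in this article; the wrapper name top_words is hypothetical, and the exact shape of the returned entries should be checked against the official API description.

```php
<?php
// Sketch: extract high-frequency words with get_tops().

/**
 * Return up to $limit top words for $text, or null when the
 * scws extension is not available in this PHP installation.
 */
function top_words($text, $limit = 10)
{
    if (!extension_loaded('scws')) {
        return null; // nothing to do without the extension
    }
    $so = scws_new();
    $so->set_charset('utf-8');
    $so->set_dict('/usr/local/scws/etc/dict.utf8.xdb');
    $so->set_rule('/usr/local/scws/etc/rules.utf8.ini');
    $so->set_ignore(true);
    $so->send_text($text);
    $tops = $so->get_tops($limit); // first argument: maximum number of words
    $so->close();
    return $tops;
}

$tops = top_words('Welcome to IT development in the Martian Era', 5);
if ($tops === null) {
    echo "scws extension not loaded; skipping\n";
} else {
    print_r($tops);
}
```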

Fields of each entry in the returned array:

word _string_ the word itself
idf _float_ inverse document frequency (IDF)
off _int_ offset of the word in the original text
attr _string_ part of speech
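
Given the fields above, results are easy to post-process in plain PHP. The sketch below keeps only entries with a given part-of-speech prefix and sorts them by idf. The helper names and the sample data are hypothetical; the sample only mirrors the documented fields, and treating "n" as the noun tag and "v" as the verb tag is an assumption about the dictionary's tag set.

```php
<?php
// Sketch: filter and rank entries shaped like the arrays from get_result().
// The sample data is made up; real values come from the dictionary.

/** Keep only entries whose attr starts with $prefix (e.g. 'n'). */
function filter_by_attr(array $items, $prefix)
{
    $kept = [];
    foreach ($items as $item) {
        if (strpos($item['attr'], $prefix) === 0) {
            $kept[] = $item;
        }
    }
    return $kept;
}

/** Sort entries by idf, highest first. */
function sort_by_idf(array $items)
{
    usort($items, function ($a, $b) {
        if ($a['idf'] == $b['idf']) {
            return 0;
        }
        return ($a['idf'] > $b['idf']) ? -1 : 1;
    });
    return $items;
}

$sample = [
    ['word' => '欢迎', 'idf' => 5.0, 'off' => 0,  'attr' => 'v'],
    ['word' => '时代', 'idf' => 4.2, 'off' => 18, 'attr' => 'n'],
    ['word' => '火星', 'idf' => 7.5, 'off' => 12, 'attr' => 'n'],
];

foreach (sort_by_idf(filter_by_attr($sample, 'n')) as $item) {
    echo $item['word'], "\t", $item['idf'], "\n";
}
```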

VI. Online API

You can also use the online API to segment Chinese text. The API address is http://www.xunsearch.com/scws/api.php, and detailed instructions are available at the same address.
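
A call to the online API might look roughly like the sketch below. Note the assumptions: the POST parameter names (data, respond, charset) and the json response format are not given in this article and should be confirmed on the API page before use; the helper names are hypothetical; and fetching URLs with file_get_contents requires allow_url_fopen to be enabled.

```php
<?php
// Sketch: call the online SCWS API over HTTP POST.
// ASSUMPTION: parameter names 'data', 'respond' and 'charset' and the 'json'
// format are not confirmed by this article; verify them on the API page.

/** Build the stream context options for the POST request. */
function build_scws_api_request($text)
{
    $body = http_build_query([
        'data'    => $text,   // text to segment (assumed parameter name)
        'respond' => 'json',  // desired response format (assumed)
        'charset' => 'utf8',  // input encoding (assumed)
    ]);
    return [
        'http' => [
            'method'  => 'POST',
            'header'  => "Content-Type: application/x-www-form-urlencoded\r\n",
            'content' => $body,
            'timeout' => 10,
        ],
    ];
}

/** Send the request; returns the decoded response, or null on failure. */
function call_scws_api($text, $url = 'http://www.xunsearch.com/scws/api.php')
{
    $context = stream_context_create(build_scws_api_request($text));
    $raw = @file_get_contents($url, false, $context);
    return $raw === false ? null : json_decode($raw, true);
}

// Usage (performs a network request, so it is left commented out):
// print_r(call_scws_api('欢迎使用中文分词'));
```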
