PHP Chinese Word Segmentation Extension SCWS
1. Introduction to scws
SCWS is the abbreviation of Simple Chinese Word Segmentation, i.e., a simple Chinese word segmentation system.
It is a mechanical Chinese word segmentation engine based on a word-frequency dictionary, and it can split a run of Chinese text into words. Words are the smallest units of meaning in Chinese, but unlike in English they are not separated by spaces when written, so segmenting text both accurately and quickly is difficult.
SCWS is written in pure C and does not depend on any external libraries. It can be embedded in an application directly as a dynamic link library. The supported Chinese encodings are GBK and UTF-8. A PHP extension module is also provided so that word segmentation can be used quickly and conveniently from PHP.
The segmentation algorithm contains little that is novel. It uses a word-frequency dictionary that the project collected itself, supplemented by rules for proper nouns, personal names, place names, numbers and dates, and so on, to achieve basic word segmentation. In small-scale tests the accuracy was between 90% and 95%, which is generally sufficient for small search engines, keyword extraction, and similar uses.
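To make "mechanical, dictionary-based segmentation" more concrete, here is a minimal sketch of forward maximum matching, a classic mechanical technique. It is an illustration only and not the SCWS implementation: the function name fmm_segment, the four-word dictionary, and the maximum word length are all made up for this example, and word frequencies and rule sets are ignored entirely. It assumes the mbstring extension is available.

```php
<?php
// Toy forward-maximum-matching segmenter (illustration only, NOT how SCWS works internally).
// $dict maps known words to any truthy value; $maxLen is the longest word to try, in characters.
function fmm_segment(string $text, array $dict, int $maxLen = 4): array
{
    $words = [];
    $pos = 0;
    $n = mb_strlen($text, 'UTF-8');
    while ($pos < $n) {
        // Try the longest candidate first, then shrink it one character at a time.
        $len = min($maxLen, $n - $pos);
        for (; $len > 1; $len--) {
            if (isset($dict[mb_substr($text, $pos, $len, 'UTF-8')])) {
                break;
            }
        }
        // If nothing matched, $len is 1 here and a single character is emitted.
        $words[] = mb_substr($text, $pos, $len, 'UTF-8');
        $pos += $len;
    }
    return $words;
}

// Hypothetical miniature dictionary, for demonstration only.
$dict = ['简易' => 1, '中文' => 1, '分词' => 1, '系统' => 1];
echo implode(' / ', fmm_segment('简易中文分词系统', $dict)), "\n";
// prints: 简易 / 中文 / 分词 / 系统
```

A real engine such as SCWS additionally weighs candidates by word frequency and applies name, place, and number rules, which this sketch leaves out.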
2. Download and install scws
SCWS supports Windows and Linux/Unix platforms. The following steps show how to install it on Windows:
[1] Download the php_scws.dll file, the XDB dictionary file, and the rule set file. For details, see: http://www.xunsearch.com/scws/download.php
[2] Copy the php_scws.dll file to the PHP installation path, into the extension directory that extension_dir in php.ini points to.
[3] Decompress the XDB dictionary file and the rule set file to a directory of your choice, for example D:/ceshi
[4] Modify php.ini and add the following lines:
extension = php_scws.dll
scws.default.charset = gbk
scws.default.fpath = "D:\ceshi\"
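After restarting PHP (or the web server that hosts it), a quick check like the following can confirm that the settings took effect. This is a sketch rather than part of SCWS: the helper name scws_install_report is hypothetical, and it assumes the extension registers itself under the name scws.

```php
<?php
// Report whether the scws extension is loaded and which defaults php.ini supplies.
// scws_install_report is a hypothetical helper name used only for this example.
function scws_install_report(): array
{
    return [
        'loaded'  => extension_loaded('scws'),
        // ini_get() returns false when the directive is unknown,
        // which is what happens if the extension did not load.
        'charset' => ini_get('scws.default.charset'),
        'fpath'   => ini_get('scws.default.fpath'),
    ];
}

var_dump(scws_install_report());
```

If 'loaded' is false, check that php_scws.dll matches the PHP build in use and that the edited php.ini is the one PHP actually reads (php --ini lists it).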
3. Simple case of scws Word Segmentation
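<?php
// Open a segmenter handle; the dictionary and rule settings come from php.ini
$sh = scws_open();
scws_set_charset($sh, 'gbk');
$text = "I am a Chinese, I will use the C++ language, I also have many T-shirt clothes";
scws_send_text($sh, $text);
// Fetch the 5 top-ranked keywords from the text
$top = scws_get_tops($sh, 5);
print_r($top);
// Release the handle when done
scws_close($sh);
?>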
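scws_get_tops() returns only the highest-ranked keywords. To list every word in order, the extension also offers scws_get_result(), which, as far as I understand the SCWS PHP documentation, returns the result in batches and returns false once the text is exhausted. The sketch below wraps that loop; the helper name scws_words is hypothetical, and the code returns null instead of failing when the extension is not loaded.

```php
<?php
// Return every segmented word of $text in order, or null if the scws
// extension is not loaded. scws_words is a hypothetical helper name.
function scws_words(string $text, string $charset = 'gbk'): ?array
{
    if (!extension_loaded('scws')) {
        return null;
    }
    $sh = scws_open();
    scws_set_charset($sh, $charset);
    scws_send_text($sh, $text);

    $words = [];
    // Each batch is an array of items; each item carries the word under the 'word' key.
    while ($batch = scws_get_result($sh)) {
        foreach ($batch as $item) {
            $words[] = $item['word'];
        }
    }
    scws_close($sh);
    return $words;
}

var_export(scws_words("I will use the C++ language, I also have many T-shirt clothes"));
```

The charset passed in must match both the encoding of the text and the dictionary being loaded, so a script saved as UTF-8 would pass 'utf8' here and needs a UTF-8 dictionary.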