PHP Chinese Word Segmentation Extension SCWS

Source: Internet
Author: User

1. Introduction to SCWS
SCWS stands for Simple Chinese Word Segmentation, that is, a simple Chinese word segmentation system.
It is a mechanical Chinese word segmentation engine based on a word-frequency dictionary. It splits a run of Chinese text into words. The word is the smallest meaningful unit in Chinese, but unlike English, Chinese text does not separate words with spaces, so segmenting it accurately and quickly is difficult.
SCWS is written in pure C and does not depend on any external library. It can be embedded in an application directly as a dynamic link library. Supported Chinese encodings include GBK and UTF-8. A PHP extension module is also provided so that word segmentation can be used quickly and conveniently from PHP.

The segmentation algorithm has little that is novel. It uses the project's own collected word-frequency dictionary, supplemented by rules for proper nouns, personal names, place names, numbers and dates, and so on, to achieve basic word segmentation. In small-scale tests the accuracy is between 90% and 95%, which is generally enough for small search engines, keyword extraction, and similar uses.
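To make "mechanical, dictionary-based segmentation" concrete, here is a minimal sketch of forward maximum matching over a toy dictionary. It is not the SCWS implementation, which also uses word frequencies and rule sets. The function name `segment_fmm`, the `$maxLen` parameter, and the two-entry dictionary are hypothetical and exist only for illustration.

```php
<?php
// Illustrative only: forward maximum matching, NOT the actual SCWS algorithm.
// $dict maps known words to true; $maxLen is the longest word tried (in characters).
function segment_fmm(string $text, array $dict, int $maxLen = 4): array
{
    // Split the UTF-8 string into individual characters.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $matched = false;
        // Try the longest candidate first, then shrink it one character at a time.
        for ($len = min($maxLen, $n - $i); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $i, $len));
            if (isset($dict[$candidate])) {
                $words[] = $candidate;
                $i += $len;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            // No dictionary word starts here: emit a single character.
            $words[] = $chars[$i];
            $i++;
        }
    }
    return $words;
}

// Toy dictionary (hypothetical data).
$dict = ['中国' => true, '中国人' => true];
echo implode('/', segment_fmm('我是中国人', $dict)), "\n"; // prints 我/是/中国人
```

The longest-match-first loop is why 中国人 wins over the shorter entry 中国. A real engine needs frequency data and rules to resolve the ambiguous cases this greedy approach gets wrong.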

2. Download and install SCWS
SCWS supports Windows and Linux/Unix platforms. The following steps show how to install SCWS on Windows:
[1] Download the php_scws.dll file, the XDB dictionary file, and the rule set file. For details, see http://www.xunsearch.com/scws/download.php
[2] Copy the php_scws.dll file to the PHP installation path.
[3] Decompress the XDB dictionary file and the rule set file to a directory of your choice, for example D:/ceshi
[4] Modify php.ini and add the following lines:

extension = php_scws.dll
scws.default.charset = gbk
scws.default.fpath = "D:/ceshi"
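After editing php.ini (and restarting the web server, if PHP runs under one), it can help to confirm that PHP actually sees the settings above. The following is a minimal sketch using only the standard PHP functions `extension_loaded()` and `ini_get()`. The helper name `check_scws_config` is hypothetical and not part of SCWS.

```php
<?php
// Hypothetical helper (not part of SCWS): report what PHP sees for the extension.
function check_scws_config(): array
{
    return [
        'loaded'  => extension_loaded('scws'),
        // ini_get() returns false when the setting is unknown to PHP.
        'charset' => ini_get('scws.default.charset'),
        'fpath'   => ini_get('scws.default.fpath'),
    ];
}

var_dump(check_scws_config());
```

If `loaded` is false, the extension line or the location of php_scws.dll is the first thing to check.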

3. A simple SCWS word segmentation example

<?php
// Open a segmenter handle and declare the encoding of the input text.
$sh = scws_open();
scws_set_charset($sh, 'gbk');
$text = "I am a Chinese, I will use the C++ language, I also have many T-shirt clothes";
// Send the text, then fetch the 5 highest-ranked keywords.
scws_send_text($sh, $text);
$top = scws_get_tops($sh, 5);
print_r($top);
?>
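The example above only extracts the top keywords. To list every segmented word in order, the extension's `scws_get_result()` can be called in a loop until it returns false. The sketch below assumes the scws extension is loaded and that a dictionary matching the chosen charset is configured. The helper name `list_words` and the sample sentence are hypothetical.

```php
<?php
// Hypothetical helper: return every word SCWS finds in $text, in order.
// Assumes the scws extension is loaded; $charset must match the text encoding.
function list_words(string $text, string $charset = 'gbk'): array
{
    $sh = scws_open();
    scws_set_charset($sh, $charset);
    scws_send_text($sh, $text);

    $words = [];
    // scws_get_result() returns one batch of results per call and false when done.
    while ($batch = scws_get_result($sh)) {
        foreach ($batch as $item) {
            $words[] = $item['word'];
        }
    }
    scws_close($sh);
    return $words;
}

if (extension_loaded('scws')) {
    print_r(list_words("I will use the C++ language"));
} else {
    echo "scws extension not loaded\n";
}
```

Closing the handle with `scws_close()` releases the resources held by the segmenter once the results have been read.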




