Implementation of Chinese Word Segmentation

Source: Internet
Author: User
Post to a netizen on the phpe.net forum.

The current search engine technology mainly includes four links: web page capturing, hyperlink analysis, web page retrieval and search services. Word segmentation refers to dividing a complete sentence into several words, and the search engine finds out the keywords for retrieval. Chinese word segmentation is the entry point of the Search Service and the foundation of the Chinese search engine. With good word segmentation technology, search engines can truly understand what information users need.

PHP implementation idea for writing Chinese word segmentation (because a project is being written recentlySource code, But it is not difficult). Although it remains to be improved in all aspects, the entire process is still relatively complete.

First, basic knowledge about word segmentation:

Word Segmentation technology research report
Http://www.lw86.com/lunwen/computer/ai/3818.html

Chinese search engine technology unveiling: Chinese Word Segmentation
Http://www.shi8.com/286.html

We recommend that you read the courseware (graduate courses of Peking University's Chinese Department ):
Http://ccl.pku.edu.cn/doubtfire/Course/Chinese%20Information%20Processing/2002_2003_1.htm
Http://ccl.pku.edu.cn/doubtfire/Course/Chinese%20Information%20Processing/contents/Chapter_07_1.ppt

Ii. Corpus preparation
The People's Daily corpus can be used. It is a product jointly developed by the Institute of computational linguistics of Peking University and Fujitsu. It processes 27 million words of the corpus of People's Daily, processing items include word segmentation, part-of-speech tagging, and exclusive nouns (exclusive noun phrases) tagging. It can be downloaded online, but it is not clear whether it is free. Please search for it by yourself.

The corpus should be processed as a dictionary for PHPProgram.

Third, Word SegmentationAlgorithmPrinciple
The most common method is the maximum matching method and the maximum probability method. To enhance accuracy and avoid ambiguity, you can combine multiple algorithms. The combination of multiple algorithms will lead to a decrease in speed and will be adopted as required by the project.

Fourth, PHP implementation of Word Segmentation
For more information about the principles, see the PPT slides recommended above. They only use PHP to load dictionary files and search strings. At present, the effect is good, but the efficiency is not very high. There is no way, such as PHP script language, the requirement cannot be too high. Next, try to write the word segmentation section in C and call it in PHP to test the efficiency.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.