Posted in reply to a user on the phpe.net forum.
Current search engine technology consists of four main stages: web page crawling, hyperlink analysis, web page retrieval, and the search service. Word segmentation means dividing a complete sentence into individual words, from which the search engine picks out the keywords to use for retrieval. Chinese word segmentation is the entry point of the search service and the foundation of a Chinese search engine. Only with good word segmentation can a search engine really understand what information the user needs.
This post describes the idea behind a PHP implementation of Chinese word segmentation (it comes from a project I am writing at the moment, so the project's source code is not included here, but it is not difficult). Although every part of it can still be improved, the overall process is reasonably complete.
First, basic knowledge about word segmentation:
Word segmentation technology research report
http://www.lw86.com/lunwen/computer/ai/3818.html
Chinese search engine technology unveiled: Chinese word segmentation
http://www.shi8.com/286.html
I also recommend reading this courseware (a graduate course in Peking University's Chinese Department):
http://ccl.pku.edu.cn/doubtfire/Course/Chinese%20Information%20Processing/2002_2003_1.htm
http://ccl.pku.edu.cn/doubtfire/Course/Chinese%20Information%20Processing/contents/Chapter_07_1.ppt
Second, corpus preparation
The People's Daily corpus can be used. It was developed jointly by the Institute of Computational Linguistics at Peking University and Fujitsu, and covers 27 million words of People's Daily text. The annotation includes word segmentation, part-of-speech tagging, and tagging of proper nouns (proper noun phrases). It can be downloaded online, but I am not sure whether it is free. Please search for it yourself.
The corpus then has to be processed into a dictionary that the PHP program can use.
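As a rough illustration of that processing step, here is a minimal sketch. It assumes the corpus lines consist of whitespace-separated `word/tag` tokens, with an article ID as the first token and square brackets around compound phrases; check this against the copy of the corpus you actually download. The function name `buildDictionary` and the sample lines are made up for the example.

```php
<?php
// Hypothetical helper: build a word => frequency array from tagged corpus lines.
// Assumption: tokens look like "word/tag" and are separated by whitespace.
function buildDictionary(array $lines): array
{
    $dict = [];
    foreach ($lines as $line) {
        $tokens = preg_split('/\s+/u', trim($line), -1, PREG_SPLIT_NO_EMPTY) ?: [];
        foreach ($tokens as $token) {
            // Strip the "[" that opens a bracketed phrase and the "]tag" that closes it.
            $token = ltrim($token, '[');
            $token = preg_replace('/\].*$/', '', $token);
            $slash = strrpos($token, '/');
            if ($slash === false || $slash === 0) {
                continue; // not a word/tag token
            }
            $word = substr($token, 0, $slash);
            // Assumption: article IDs look like 19980101-01-001-001 and are not words.
            if (preg_match('/^\d{8}-\d{2}-\d{3}-\d{3}$/', $word)) {
                continue;
            }
            $dict[$word] = ($dict[$word] ?? 0) + 1;
        }
    }
    arsort($dict); // most frequent words first
    return $dict;
}

// Made-up sample lines in the assumed format.
$sample = [
    '19980101-01-001-001/m  迈向/v  充满/v  希望/n  的/u  新/a  世纪/n',
    '[中央/n  人民/n  广播/vn  电台/n]nt  希望/n',
];
$dict = buildDictionary($sample);
```

One way to store the result is `file_put_contents($path, '<?php return ' . var_export($dict, true) . ';')`, so that a later `require $path` returns the array directly. Keeping the frequencies costs little and is needed if the maximum probability method is used later.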
Third, principles of the word segmentation algorithm
The most common methods are the maximum matching method and the maximum probability method. To improve accuracy and reduce ambiguity, several algorithms can be combined. Combining algorithms slows segmentation down, however, so whether to do it depends on what the project requires.
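To make the two methods concrete, here is a minimal sketch of both. It is not the code from my project: the function names, the four-character window, the penalty for unknown characters, and the tiny dictionary with its frequencies are all assumptions chosen for the example. The maximum probability version uses a simple unigram model (the product of relative word frequencies) solved by dynamic programming.

```php
<?php
// Split a UTF-8 string into characters without requiring the mbstring extension.
function utf8Chars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Forward maximum matching: at each position take the longest dictionary word.
function forwardMaxMatch(string $text, array $dict, int $maxLen = 4): array
{
    $chars = utf8Chars($text);
    $n = count($chars);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $matched = $chars[$i]; // fall back to a single character
        $step = 1;
        for ($len = min($maxLen, $n - $i); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $i, $len));
            if (isset($dict[$candidate])) {
                $matched = $candidate;
                $step = $len;
                break;
            }
        }
        $words[] = $matched;
        $i += $step;
    }
    return $words;
}

// Maximum probability (unigram model): pick the split with the highest
// sum of log relative frequencies, using dynamic programming.
function maxProbabilitySegment(string $text, array $dict, int $maxLen = 4): array
{
    $chars = utf8Chars($text);
    $n = count($chars);
    $total = array_sum($dict) ?: 1;
    $unknown = log(0.5 / $total); // assumed penalty for a character not in the dictionary
    $best = [0 => 0.0];           // best score for the first $end characters
    $prev = [];                   // start position of the last word in that best split
    for ($end = 1; $end <= $n; $end++) {
        $best[$end] = -INF;
        for ($len = 1; $len <= min($maxLen, $end); $len++) {
            $start = $end - $len;
            $word = implode('', array_slice($chars, $start, $len));
            if (isset($dict[$word])) {
                $score = log($dict[$word] / $total);
            } elseif ($len === 1) {
                $score = $unknown;
            } else {
                continue; // multi-character strings must be dictionary words
            }
            if ($best[$start] + $score > $best[$end]) {
                $best[$end] = $best[$start] + $score;
                $prev[$end] = $start;
            }
        }
    }
    $words = [];
    for ($end = $n; $end > 0; $end = $prev[$end]) {
        array_unshift($words, implode('', array_slice($chars, $prev[$end], $end - $prev[$end])));
    }
    return $words;
}

// Hypothetical dictionary: word => frequency (the numbers are invented).
$demoDict = ['有' => 180, '有意' => 5, '意见' => 10, '见' => 2, '分歧' => 1];
$greedy = forwardMaxMatch('有意见分歧', $demoDict);         // 有意 / 见 / 分歧
$probable = maxProbabilitySegment('有意见分歧', $demoDict); // 有 / 意见 / 分歧
```

With these invented frequencies, forward maximum matching grabs "有意" first and is left with a stray "见", while the probability-based version prefers "有 / 意见 / 分歧". This is the kind of ambiguity that combining methods is meant to reduce, at the cost of doing more work per sentence.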
Fourth, PHP implementation of word segmentation
For the principles, see the PPT slides recommended above. The implementation only uses PHP to load the dictionary file and search strings. At present the results are good, but the efficiency is not very high. That cannot be helped: PHP is a scripting language, so expectations should not be too high. The next step is to write the word segmentation part in C and call it from PHP to compare the efficiency.
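For the "load the dictionary file and search strings" part, a minimal sketch might look like the following. The file format (one `word<TAB>frequency` entry per line) and the name `loadDictionary` are assumptions made for the example, and a temporary file stands in for the real dictionary.

```php
<?php
// Hypothetical loader: read "word<TAB>frequency" lines into a word => frequency array.
function loadDictionary(string $path): array
{
    $dict = [];
    $lines = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return $dict; // unreadable file: return an empty dictionary
    }
    foreach ($lines as $line) {
        $parts = explode("\t", $line);
        $word = trim($parts[0]);
        if ($word === '') {
            continue;
        }
        // A missing frequency column defaults to 1.
        $dict[$word] = isset($parts[1]) ? (int) $parts[1] : 1;
    }
    return $dict;
}

// Demo: a temporary file stands in for the real dictionary file.
$path = tempnam(sys_get_temp_dir(), 'dict');
file_put_contents($path, "中文\t12\n分词\t7\n搜索引擎\t3\n");
$dict = loadDictionary($path);
unlink($path);

// With words as array keys, each search is a hash lookup rather than a scan.
$found = isset($dict['分词']);
$missing = isset($dict['网页']);
```

Using the words as array keys keeps each lookup cheap, but the whole dictionary still has to be loaded into memory on every request, which is likely one reason a pure PHP version is slow.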