Principle and source code of high-speed Chinese word segmentation in PHP

Source: Internet
Author: User

1. Drawbacks of the forward and reverse maximum matching algorithms

Forward maximum matching: scanning from left to right, runs of consecutive characters in the text are looked up in the dictionary, and a word is cut off when a match is found. The catch is that, to obtain the longest match, the text cannot be cut at the first match; longer candidates have to be tried as well. Take the sentence "the People's Republic of China was established today" as an example. Scanning from left to right, successively longer candidates are looked up from each starting point: first those beginning with "China", up to "People's Republic of China", then those beginning with "today", then those beginning with "established". That takes 14 dictionary lookups, and the final segmentation is: People's Republic of China/today/established. So whenever a long word is encountered, the dictionary is queried many times over, which is very inefficient. A more serious problem is that the maximum word length has to be capped: for the sake of efficiency the cap cannot be set very large, so words longer than the cap cannot be segmented correctly.
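The scan described above can be sketched in PHP roughly as follows. This is a minimal illustration, not the author's code: the function name `forwardMaxMatch`, the Chinese sample sentence, the tiny dictionary and the length caps are all assumptions made for the example, and the `mbstring` extension is assumed to be available.

```php
<?php
// Forward maximum matching (illustrative sketch, hypothetical names).
// $dict maps each known word to true; $maxLen caps the candidate length.
function forwardMaxMatch(string $text, array $dict, int $maxLen): array
{
    $tokens = [];
    $n = mb_strlen($text, 'UTF-8');
    $pos = 0;
    while ($pos < $n) {
        $best = 1; // fall back to a single character if nothing matches
        $limit = min($maxLen, $n - $pos);
        // Each candidate length costs one dictionary lookup.
        for ($len = 2; $len <= $limit; $len++) {
            $candidate = mb_substr($text, $pos, $len, 'UTF-8');
            if (isset($dict[$candidate])) {
                $best = $len; // keep scanning: a longer word may still match
            }
        }
        $tokens[] = mb_substr($text, $pos, $best, 'UTF-8');
        $pos += $best;
    }
    return $tokens;
}

// Assumed sample data, not taken from the original dictionary.
$dict = array_fill_keys(
    ['中华', '人民', '共和国', '中华人民共和国', '今天', '成立', '成立了'],
    true
);
$text = '中华人民共和国今天成立了';

echo implode('/', forwardMaxMatch($text, $dict, 7)), "\n";
echo implode('/', forwardMaxMatch($text, $dict, 4)), "\n";
```

With a cap of 7 this prints 中华人民共和国/今天/成立了. With a cap of 4 the seven-character word can no longer match and the output becomes 中华/人民/共和国/今天/成立了, which illustrates the length-limit problem.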

Reverse maximum matching, in turn, splits long words apart and so produces wrong segmentations. For the same text, scanned from right to left, candidates are looked up in the same way, starting from "established" and working back through "today", "Republic", "People" and "China". That takes 17 dictionary lookups, and the final segmentation is: China/People/Republic/today/established. "People's Republic of China" is cut into 3 words.
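A comparable sketch of the right-to-left scan is given below. The function name `reverseMaxMatch`, the sample sentence, the dictionary and the cap of four characters are assumptions chosen so that the example reproduces the kind of split described above; the source does not state which cap it used.

```php
<?php
// Reverse maximum matching (illustrative sketch, hypothetical names).
// Works backwards from the end of the text, longest candidate first.
function reverseMaxMatch(string $text, array $dict, int $maxLen): array
{
    $tokens = [];
    $end = mb_strlen($text, 'UTF-8');
    while ($end > 0) {
        $take = 1; // fall back to a single character
        for ($len = min($maxLen, $end); $len >= 2; $len--) {
            $candidate = mb_substr($text, $end - $len, $len, 'UTF-8');
            if (isset($dict[$candidate])) {
                $take = $len; // longest word ending at $end
                break;
            }
        }
        // Tokens are found last-to-first, so prepend to keep text order.
        array_unshift($tokens, mb_substr($text, $end - $take, $take, 'UTF-8'));
        $end -= $take;
    }
    return $tokens;
}

// Assumed sample data, not taken from the original dictionary.
$dict = array_fill_keys(
    ['中华', '人民', '共和国', '中华人民共和国', '今天', '成立', '成立了'],
    true
);

echo implode('/', reverseMaxMatch('中华人民共和国今天成立了', $dict, 4)), "\n";
```

Under these assumptions the output is 中华/人民/共和国/今天/成立了: the long word is present in the dictionary but exceeds the cap, so it is cut into three words.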

2. An algorithm that overcomes the drawbacks of maximum matching

To overcome the inefficiency of maximum matching and its inability to segment long words, every Chinese character that can begin a word is indexed as a first character. The words beginning with each character are then grouped together and sorted from longest to shortest. The dictionary structure is as follows:
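A minimal sketch of such a structure as a PHP associative array, using a handful of hypothetical entries (the variable name `$lexicon` and the sample words are assumptions; the real dictionary is far larger):

```php
<?php
// First character => words beginning with it, longest first.
// Sample entries are assumed for illustration only.
$lexicon = [
    '中' => ['中华人民共和国', '中华'],
    '人' => ['人民'],
    '共' => ['共和国'],
    '今' => ['今天'],
    '成' => ['成立了', '成立'],
];

// The candidate words for a text position are found
// from the character at that position alone.
$first = mb_substr('成立了', 0, 1, 'UTF-8');
print_r($lexicon[$first]);
```

Because each group is ordered from longest to shortest, the first word in a group that matches the text is also the longest match.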

During segmentation, the current Chinese character is used to find the group of words that begin with it (a linear search over roughly 3,000 entries). That group is then searched from longest to shortest (5, 4, 3, 2 characters), comparing each word against the sentence being segmented (also linearly). If a word matches, it is cut off as a token and matching continues with the next word. In this way dictionary lookups become much more efficient, and words of arbitrary length can be matched.
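The procedure above might look roughly like this in PHP. It is a sketch under stated assumptions rather than the original source: `segmentByFirstChar` and the sample `$lexicon` are hypothetical, and the first-character step uses PHP's hashed array lookup instead of the linear search mentioned in the text.

```php
<?php
// Segmentation using a first-character index (illustrative sketch).
// $lexicon: first character => words beginning with it, longest first.
function segmentByFirstChar(string $text, array $lexicon): array
{
    $tokens = [];
    $n = mb_strlen($text, 'UTF-8');
    $pos = 0;
    while ($pos < $n) {
        $first = mb_substr($text, $pos, 1, 'UTF-8');
        $token = $first; // default: emit the single character
        foreach ($lexicon[$first] ?? [] as $word) {
            $len = mb_strlen($word, 'UTF-8');
            // The group is sorted longest first, so the first hit wins.
            if (mb_substr($text, $pos, $len, 'UTF-8') === $word) {
                $token = $word;
                break;
            }
        }
        $tokens[] = $token;
        $pos += mb_strlen($token, 'UTF-8');
    }
    return $tokens;
}

// Assumed sample data, not taken from the original dictionary.
$lexicon = [
    '中' => ['中华人民共和国', '中华'],
    '人' => ['人民'],
    '共' => ['共和国'],
    '今' => ['今天'],
    '成' => ['成立了', '成立'],
];

echo implode('/', segmentByFirstChar('中华人民共和国今天成立了', $lexicon)), "\n";
```

With this sample data the output is 中华人民共和国/今天/成立了. No maximum word length appears anywhere in the loop: the longest word actually stored for a character sets the only bound.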

In the PHP implementation, to speed up online matching, the dictionary structure above is built as a PHP associative array and loaded entirely into memory. To make it easy to add and remove dictionary entries, a string-processing program was written that automatically generates the dictionary in this PHP associative-array form. For implementation details, see the PHP source code.
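One way such a generator could work is sketched below. The function names `buildLexicon` and `exportLexicon` are hypothetical and the flat word list is assumed input; the author's actual generator may differ. `var_export()` is a standard PHP function that returns a parsable string representation of a value.

```php
<?php
// Group a flat word list by first character, longest words first.
function buildLexicon(array $words): array
{
    $lexicon = [];
    foreach (array_unique($words) as $word) {
        if (mb_strlen($word, 'UTF-8') < 2) {
            continue; // single characters need no dictionary entry
        }
        $lexicon[mb_substr($word, 0, 1, 'UTF-8')][] = $word;
    }
    foreach ($lexicon as &$bucket) {
        usort($bucket, function ($a, $b) {
            return mb_strlen($b, 'UTF-8') <=> mb_strlen($a, 'UTF-8');
        });
    }
    unset($bucket); // break the reference left by the loop
    return $lexicon;
}

// Render the array as PHP source that can be saved and loaded with require.
function exportLexicon(array $lexicon): string
{
    return "<?php\nreturn " . var_export($lexicon, true) . ";\n";
}

// Assumed sample word list.
$words = ['成立', '中华', '成立了', '中华人民共和国', '今天'];
echo exportLexicon(buildLexicon($words));
```

Saving the generated string to a file and loading it with `require` yields the in-memory array directly, so adding or removing a word only means editing the flat list and regenerating the file.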

PHP word segmentation source code download: HTTP://WWW.BOX.NET/SHARED/GRYSPZPPSB
