Principles and Source Code of High-Speed Chinese Word Segmentation in PHP
1. Drawbacks of the forward and reverse maximum matching algorithms
Forward maximum matching scans the text from left to right, looks up runs of consecutive characters in the dictionary, and cuts a word wherever a run matches. The catch is that, to find the longest match, the algorithm cannot cut at the first hit; it has to keep probing longer candidates. Take the sentence "The People's Republic of China was established today". Scanning from left to right, the algorithm looks up progressively longer candidates, starting at "China" and ending at "People's Republic of China", then the candidates for "today", then those for "established": 14 dictionary lookups in all. The final segmentation is People's Republic of China / today / established. Long words therefore make the algorithm very inefficient, because each one requires querying the dictionary database many times. A more serious problem is that the maximum word length has to be capped: to keep the algorithm efficient, the cap cannot be set very high, so words longer than the cap cannot be segmented correctly.
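The forward scan described above can be sketched in PHP as follows. This is a minimal illustration, not the article's source code: the function names, the in-memory `$dict` set, and the sample sentence 中华人民共和国今天成立了 (an assumed back-translation of the example) are all hypothetical. It assumes PHP 7.0 or later.

```php
<?php
// Split a UTF-8 string into an array of characters (no mbstring needed).
function utf8_chars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Forward maximum matching: at each position, probe progressively longer
// candidates (up to $maxLen characters) and keep the longest dictionary hit.
// $dict is a set of words stored as array keys; $lookups counts the probes.
function forward_max_match(string $text, array $dict, int $maxLen, int &$lookups = 0): array
{
    $chars = utf8_chars($text);
    $n = count($chars);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $bestLen = 1; // fall back to a single character when nothing matches
        for ($len = 2; $len <= $maxLen && $i + $len <= $n; $len++) {
            $candidate = implode('', array_slice($chars, $i, $len));
            $lookups++;
            if (isset($dict[$candidate])) {
                $bestLen = $len; // a longer hit replaces a shorter one
            }
        }
        $words[] = implode('', array_slice($chars, $i, $bestLen));
        $i += $bestLen;
    }
    return $words;
}

// Hypothetical toy dictionary for the example sentence.
$dict = array_fill_keys(['中华', '人民', '共和国', '中华人民共和国', '今天', '成立'], true);
$lookups = 0;
$words = forward_max_match('中华人民共和国今天成立了', $dict, 7, $lookups);
echo implode('/', $words), "\n";
```

Lowering `$maxLen` to 4 makes the same call return 中华/人民/共和国/今天/成立/了, which illustrates the length-cap problem: the seven-character word can no longer be found.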
Reverse maximum matching, by contrast, breaks long words apart and so produces wrong segmentations. Scanning the same sentence from right to left, the algorithm looks up the candidates ending at "established", then those ending at "today", "Republic", "People", and "China": 17 dictionary queries in all. The final segmentation is China / People / Republic / today / established, which cuts "People's Republic of China" into three words.
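A right-to-left scan can be sketched the same way. The code below is an assumption-laden illustration: the function names, the toy dictionary, and the cap of 4 characters are hypothetical, and the cap is only one plausible reason for the three-way split reported above. It assumes PHP 7.0 or later.

```php
<?php
// Split a UTF-8 string into an array of characters (no mbstring needed).
function utf8_chars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Reverse maximum matching: starting from the end of the text, probe
// progressively longer candidates that end at the current position and
// keep the longest dictionary hit. $dict is a set of words stored as keys.
function reverse_max_match(string $text, array $dict, int $maxLen): array
{
    $chars = utf8_chars($text);
    $words = [];
    $end = count($chars); // exclusive end of the still-unsegmented prefix
    while ($end > 0) {
        $bestLen = 1; // fall back to a single character when nothing matches
        for ($len = 2; $len <= $maxLen && $len <= $end; $len++) {
            $candidate = implode('', array_slice($chars, $end - $len, $len));
            if (isset($dict[$candidate])) {
                $bestLen = $len;
            }
        }
        // Words are found back to front, so prepend each one.
        array_unshift($words, implode('', array_slice($chars, $end - $bestLen, $bestLen)));
        $end -= $bestLen;
    }
    return $words;
}

// Hypothetical toy dictionary; the long word is present but exceeds the cap.
$dict = array_fill_keys(['中华', '人民', '共和国', '中华人民共和国', '今天', '成立了'], true);
$words = reverse_max_match('中华人民共和国今天成立了', $dict, 4);
echo implode('/', $words), "\n";
```

With the cap at 4, the seven-character word is never probed, so the sketch splits it into 中华/人民/共和国 even though it is in the dictionary.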
2. An algorithm that overcomes the drawbacks of maximum matching
To overcome the inefficiency of maximum matching and its inability to segment long words, every Chinese character that can begin a dictionary word is indexed as a first character. The words that begin with each character are then grouped together and sorted from longest to shortest. The dictionary is thus a two-level structure: an index of first characters, each pointing to its own group of words ordered by decreasing length.
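One way to picture this structure is as a PHP array keyed by first character. The layout below is a hypothetical sketch inferred from the description; the variable name `$index` and the sample words are assumptions, not the article's actual dictionary.

```php
<?php
// Hypothetical index: first character => words that start with it,
// sorted from longest to shortest.
$index = [
    '中' => ['中华人民共和国', '中华人民', '中华'],
    '今' => ['今天'],
    '成' => ['成立'],
];

// All candidate words for a sentence position whose first character is 中.
// A character with no entry simply yields an empty candidate list.
$candidates = $index['中'] ?? [];
echo implode(', ', $candidates), "\n";
```

Because each group is already ordered by length, the first candidate that matches the text at a given position is also the longest possible match there.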
During segmentation, the current character is used to find the group of words that begin with it (a linear search over an index of roughly 3,000 characters). That group is then scanned linearly from long to short (lengths 5, 4, 3, 2) and each word is compared against the sentence being segmented. On a match, the word is cut off and matching continues with the next word. This greatly improves the efficiency of dictionary lookups and solves the problem of matching words of arbitrary length.
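The lookup procedure just described can be sketched as follows. This is an illustrative reading of the description, not the downloadable source: `segment`, `utf8_chars`, and the toy `$index` are hypothetical names, and a PHP array key lookup stands in for the linear search over first characters. It assumes PHP 7.0 or later.

```php
<?php
// Split a UTF-8 string into an array of characters (no mbstring needed).
function utf8_chars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Segment $text using an index of first character => words (longest first).
function segment(string $text, array $index): array
{
    $chars = utf8_chars($text);
    $n = count($chars);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $match = $chars[$i]; // default: emit the single character
        $matchLen = 1;
        foreach ($index[$chars[$i]] ?? [] as $word) {
            $len = count(utf8_chars($word));
            if ($i + $len <= $n && implode('', array_slice($chars, $i, $len)) === $word) {
                $match = $word;
                $matchLen = $len;
                break; // groups are sorted longest first, so the first hit is the longest
            }
        }
        $words[] = $match;
        $i += $matchLen;
    }
    return $words;
}

// Hypothetical toy index for the example sentence.
$index = [
    '中' => ['中华人民共和国', '中华人民', '中华'],
    '今' => ['今天'],
    '成' => ['成立'],
];
$words = segment('中华人民共和国今天成立了', $index);
echo implode('/', $words), "\n";
```

No maximum word length appears anywhere in this sketch: the longest word in each group bounds the work, so long entries cost one comparison rather than a series of growing lookups.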
In the PHP implementation, the dictionary structure above is implemented as a PHP associative array and loaded entirely into memory to speed up online matching. To make it easy to add and remove dictionary entries, a string-processing program generates the dictionary's PHP associative array automatically. For the detailed implementation, see the PHP source code.
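A generator of this kind might look like the sketch below. It is an assumption about how such a tool could work, not the article's program: the newline-separated word-list format, the function names, and the use of `var_export` plus `include` are all hypothetical choices. It assumes PHP 7.0 or later.

```php
<?php
// Split a UTF-8 string into an array of characters (no mbstring needed).
function utf8_chars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Build the first-character index from a newline-separated word list.
function build_index(string $wordList): array
{
    $index = [];
    foreach (preg_split('/\R/u', $wordList, -1, PREG_SPLIT_NO_EMPTY) ?: [] as $line) {
        $word = trim($line);
        $chars = utf8_chars($word);
        if (count($chars) < 2) {
            continue; // blank lines and single characters need no entry
        }
        $index[$chars[0]][$word] = count($chars); // word => length, deduplicated
    }
    foreach ($index as $first => $group) {
        arsort($group); // longest words first
        $index[$first] = array_map('strval', array_keys($group));
    }
    return $index;
}

// Emit PHP source that returns the index as an associative array literal.
function generate_dictionary_source(string $wordList): string
{
    return "<?php\nreturn " . var_export(build_index($wordList), true) . ";\n";
}

// Generate the dictionary file, then load it into memory the way a
// segmenter would at startup.
$source = generate_dictionary_source("中华\n中华人民共和国\n今天\n成立\n");
$file = tempnam(sys_get_temp_dir(), 'dict');
file_put_contents($file, $source);
$index = include $file;
unlink($file);
echo implode(', ', $index['中']), "\n";
```

Adding or removing a word then only means editing the word list and regenerating the file; the segmenter itself does not change.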
PHP word segmentation source code download: HTTP://WWW.BOX.NET/SHARED/GRYSPZPPSB