Principle and source code of a high-speed Chinese word segmenter in PHP

Last Update:2016-06-13 Source: Internet

Author: User

Tags php source code


1. Disadvantages of the forward and reverse maximum matching algorithms

Forward maximum matching: scanning from left to right, runs of consecutive characters in the text are matched against the dictionary, and when a run matches, a word is cut off. The problem is that to find the longest match, the algorithm cannot cut at the first match; it has to keep trying longer candidates. Take the sentence "the People's Republic of China was established today" as an example. Scanning from left to right, the algorithm looks up progressively longer candidates that start with "China" and grow until they reach "People's Republic of China", and then does the same for "today" and "established". That takes 14 dictionary lookups, and the final segmentation is: People's Republic of China / today / established. Long words therefore make the algorithm inefficient, because the dictionary is queried repeatedly for each of them. A more serious problem is that the maximum word length has to be capped: for the sake of efficiency it cannot be set very large, so words longer than the cap cannot be segmented correctly.
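A minimal sketch of forward maximum matching may make the cost visible. It is not the article's source code: the function name `forwardMaxMatch`, the tiny dictionary, and the Chinese sentence (my reconstruction of the translated example) are all assumptions, and the lookup count it reports depends on the length cap, so it need not equal the 14 quoted above.

```php
<?php
// Forward maximum matching (illustrative sketch, hypothetical names).
// $dict is a set of words (word => true); $maxLen caps the candidate length.
function forwardMaxMatch(string $text, array $dict, int $maxLen): array {
    // Split the UTF-8 text into characters (PCRE is bundled with PHP).
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $words = [];
    $lookups = 0;
    $i = 0;
    while ($i < $n) {
        $best = 1; // fall back to a single character when nothing matches
        // Grow the candidate; the first hit is not final, a longer one may follow.
        for ($len = 2; $len <= min($maxLen, $n - $i); $len++) {
            $lookups++;
            if (isset($dict[implode('', array_slice($chars, $i, $len))])) {
                $best = $len;
            }
        }
        $words[] = implode('', array_slice($chars, $i, $best));
        $i += $best;
    }
    return ['words' => $words, 'lookups' => $lookups];
}

// Assumed example data, not taken from the original dictionary.
$dict = array_fill_keys(['中华', '人民', '共和国', '中华人民共和国', '今天', '成立'], true);
$result = forwardMaxMatch('中华人民共和国今天成立了', $dict, 7);
// $result['words'] is ['中华人民共和国', '今天', '成立', '了']
```

With a cap of 4 instead of 7, the same call can no longer reach the seven-character word and returns 中华 / 人民 / 共和国 / 今天 / 成立 / 了, which is the length-cap problem described above.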

Reverse maximum matching, on the other hand, tends to break long words apart, which produces wrong segmentations. For the same sentence, scanning from right to left, the algorithm looks up candidates ending at "established", then at "today", then at "Republic", "people", and "China" in turn. That takes 17 dictionary lookups, and the final segmentation is: China / people / Republic / today / established. The single word "People's Republic of China" has been cut into 3 words.
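The reverse scan can be sketched the same way. The function name `reverseMaxMatch`, the dictionary, and the Chinese sentence are again assumptions, and so is the cap of four characters: it is my reading of why the example splits the long word, not something the article states.

```php
<?php
// Reverse maximum matching (illustrative sketch, hypothetical names).
// Candidates end at the current position and grow to the left.
function reverseMaxMatch(string $text, array $dict, int $maxLen): array {
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $end = count($chars);
    $words = [];
    while ($end > 0) {
        $best = 1; // fall back to a single character
        for ($len = 2; $len <= min($maxLen, $end); $len++) {
            if (isset($dict[implode('', array_slice($chars, $end - $len, $len))])) {
                $best = $len; // keep the longest candidate found so far
            }
        }
        // Words are found right to left, so prepend to keep reading order.
        array_unshift($words, implode('', array_slice($chars, $end - $best, $best)));
        $end -= $best;
    }
    return $words;
}

// Assumed example data; the cap of 4 is an assumption as well.
$dict = array_fill_keys(['中华', '人民', '共和国', '中华人民共和国', '今天', '成立'], true);
$words = reverseMaxMatch('中华人民共和国今天成立了', $dict, 4);
// $words is ['中华', '人民', '共和国', '今天', '成立', '了']
```

Even though the seven-character word is in the dictionary, the capped window never covers it, so it comes out as three shorter words.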

2. An algorithm that overcomes the disadvantages of maximum matching

To overcome the inefficiency of maximum matching and its inability to segment long words, every Chinese character that can begin a dictionary word is indexed as a first character. The words that begin with each such character are then grouped together and sorted from longest to shortest. The resulting dictionary is a first-character index in which each entry holds its own group of words, longest first.
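One way to picture this structure is as a PHP array keyed by first character. The sketch below builds it from a flat word list; the names `charLength` and `buildFirstCharIndex` and the sample words are hypothetical, and the original source may lay the data out differently. It uses arrow functions, so it assumes PHP 7.4 or later.

```php
<?php
// Number of UTF-8 characters in a string.
function charLength(string $s): int {
    return (int) preg_match_all('/./us', $s);
}

// Group words by first character, longest words first (hypothetical helper).
function buildFirstCharIndex(array $words): array {
    $index = [];
    foreach ($words as $word) {
        if (charLength($word) < 2) {
            continue; // single characters need no dictionary entry
        }
        preg_match('/^./us', $word, $m); // $m[0] is the first character
        $index[$m[0]][] = $word;
    }
    foreach ($index as &$group) {
        usort($group, fn(string $a, string $b): int => charLength($b) <=> charLength($a));
    }
    unset($group); // break the reference left by the loop
    return $index;
}

$index = buildFirstCharIndex(['中华', '中华人民共和国', '人民', '中国人']);
// $index['中'] is ['中华人民共和国', '中国人', '中华']
// $index['人'] is ['人民']
```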

To segment a sentence, the algorithm takes the current Chinese character and finds the group of words that begin with it (a linear search over an index of about 3000 characters). It then tries the words in that group from longest to shortest (lengths 5, 4, 3, 2) against the sentence at the current position (also a linear scan). If a word matches, it is cut off, and matching continues from the next position. This makes dictionary lookups far more efficient and allows words of arbitrary length to be matched.
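The procedure above can be sketched as follows. The function name `segmentWithIndex` and the hand-written index are assumptions for illustration. The sketch also relies on PHP's hashed array keys to find the group for a character, instead of the linear search over first characters that the text mentions.

```php
<?php
// Segment text with a first-character index (illustrative sketch).
// $index maps a first character to its words, already sorted longest first.
function segmentWithIndex(string $text, array $index): array {
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($chars);
    $words = [];
    $i = 0;
    while ($i < $n) {
        $word = $chars[$i]; // fall back to the single character
        $step = 1;
        foreach ($index[$chars[$i]] ?? [] as $candidate) {
            $len = (int) preg_match_all('/./us', $candidate);
            // The first candidate that fits is the longest, so stop there.
            if ($len <= $n - $i && implode('', array_slice($chars, $i, $len)) === $candidate) {
                $word = $candidate;
                $step = $len;
                break;
            }
        }
        $words[] = $word;
        $i += $step;
    }
    return $words;
}

// Assumed sample index, not the article's dictionary.
$index = [
    '中' => ['中华人民共和国', '中华'],
    '人' => ['人民'],
    '共' => ['共和国'],
    '今' => ['今天'],
    '成' => ['成立'],
];
$words = segmentWithIndex('中华人民共和国今天成立了', $index);
// $words is ['中华人民共和国', '今天', '成立', '了']
```

Because each group is sorted longest first, no length cap is needed: the number of comparisons depends on how many words share a first character, not on how long the longest word is.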

In the PHP implementation, to speed up online matching, the dictionary structure described above is implemented as a PHP associative array and loaded entirely into memory. To make it easy to add and remove dictionary words, a string-processing program automatically generates the dictionary in the form of a PHP associative array. For the detailed implementation, see the PHP source code.
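A generator of this kind might look like the sketch below. It is a guess at the approach, not the downloadable program: the function name `generateDictionarySource` and the newline-separated input format are assumptions. It emits PHP source through `var_export`, so the dictionary can later be loaded with `include`.

```php
<?php
// Turn a newline-separated word list into PHP source that returns a
// first-character index (hypothetical generator, assumed input format).
function generateDictionarySource(string $wordList): string {
    $index = [];
    foreach (preg_split('/\R/u', $wordList, -1, PREG_SPLIT_NO_EMPTY) as $line) {
        $word = trim($line);
        $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
        if (count($chars) >= 2) {
            // word => length; using the word as key also removes duplicates
            $index[$chars[0]][$word] = count($chars);
        }
    }
    foreach ($index as $first => $group) {
        arsort($group); // longest words first
        $index[$first] = array_map('strval', array_keys($group));
    }
    return "<?php\nreturn " . var_export($index, true) . ";\n";
}

// Write the generated dictionary to a file and load it back into memory.
$path = tempnam(sys_get_temp_dir(), 'dict');
file_put_contents($path, generateDictionarySource("中华\n中华人民共和国\n人民\n"));
$dictionary = include $path;
unlink($path);
// $dictionary['中'] is ['中华人民共和国', '中华']
```

Adding or removing a word then only means editing the word list and regenerating the file.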

PHP word segmentation source download: HTTP://WWW.BOX.NET/SHARED/GRYSPZPPSB



This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
