An in-depth look at how PHP can extract keywords automatically

Source: Internet
Author: User

Most current CMSs provide a content collection (scraping) feature. Titles and body text are usually handled fairly well, but in most cases keywords are hard to extract. Automatic keyword extraction has therefore become a "long-standing problem" for PHP CMSs.

So how can automatic keyword extraction be implemented in PHP? The main steps are as follows:

Step 1: Split the title and the content with a word segmentation algorithm, and extract the candidate keywords together with their frequencies.
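The frequency part of this step can be sketched in a few lines of PHP. This is only an illustration: the function name `topKeywords` and the sample tokens are made up, and the token list is assumed to come from whichever word segmenter is used.

```php
<?php
// Count how often each token occurs and return the most frequent ones.
// $tokens is assumed to be the output of a word segmenter (not shown here).
function topKeywords(array $tokens, int $limit = 5): array
{
    $counts = array_count_values($tokens); // token => frequency
    arsort($counts);                       // highest frequency first, keys kept
    return array_slice($counts, 0, $limit, true);
}

// Hypothetical segmenter output for a short text.
$tokens = ['PHP', '关键词', '提取', 'PHP', '分词', '关键词', 'PHP'];
print_r(topKeywords($tokens, 2)); // PHP => 3, 关键词 => 2
```

Running the same function separately on the title tokens and on the content tokens gives one frequency table for each.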

For content segmentation, the two main approaches are ICTCLAS from the Chinese Academy of Sciences and hidden Markov models. However, both are fairly advanced and have a real learning curve, and both support only C++/Java. For PHP, two options are currently recommended: PSCWS and HTTPCWS.

SCWS has had an official 1.0.0 release, and the latest version is now 1.0.4. PSCWS is its PHP version. HTTPCWS was developed by Zhang Yan and was previously called PHPCWS.

PHPCWS first performs an initial segmentation through the API of the "ICTCLAS 3.0 shared-edition Chinese word segmentation algorithm". It then applies the author's own "reverse maximum matching algorithm" to split and merge words, and adds punctuation filtering to produce the final segmentation result. Currently, only Linux/Unix systems are supported.
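To make the term concrete, here is a generic textbook sketch of reverse maximum matching in PHP. It is not the PHPCWS implementation. The function name, the toy dictionary, and the maximum word length of 4 characters are all assumptions chosen for illustration.

```php
<?php
// Reverse (backward) maximum matching over a toy dictionary.
// $dict maps known words to true; $maxLen is the longest word length in characters.
function reverseMaxMatch(string $text, array $dict, int $maxLen = 4): array
{
    // Split into UTF-8 characters without requiring the mbstring extension.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // input was not valid UTF-8
    }
    $words = [];
    $end = count($chars);
    while ($end > 0) {
        $matched = false;
        // Try the longest candidate ending at $end first, then shrink it.
        for ($len = min($maxLen, $end); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $end - $len, $len));
            if (isset($dict[$candidate])) {
                $words[] = $candidate;
                $end -= $len;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            // No dictionary word ends here: emit a single character.
            $words[] = $chars[$end - 1];
            $end--;
        }
    }
    // Words were collected from the end of the text, so restore reading order.
    return array_reverse($words);
}

$dict = array_fill_keys(['研究', '研究生', '生命', '起源'], true);
echo implode(' / ', reverseMaxMatch('研究生命的起源', $dict)), PHP_EOL;
// prints: 研究 / 生命 / 的 / 起源
```

Scanning from the end of the sentence is what keeps "生命" together in this example; a forward scan with the same dictionary would match "研究生" first and leave "命" stranded.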

Step 2: Compare the extracted results against an existing dictionary to find the keywords that best match the rules.

The key here is the dictionary: we can define our own, or use an existing, mature one.
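One possible way to do the dictionary comparison is sketched below. The article does not specify the matching rule, so the rule used here (frequency multiplied by a per-word dictionary weight) is an assumption, as are the function name and the sample data.

```php
<?php
// Keep only candidates that appear in a keyword dictionary and score them.
// $candidates: word => frequency (from the segmentation step).
// $dictionary: word => weight; the words and weights below are made up.
function matchDictionary(array $candidates, array $dictionary): array
{
    $scores = [];
    foreach ($candidates as $word => $freq) {
        if (isset($dictionary[$word])) {
            // Assumed rule: score = frequency * dictionary weight.
            $scores[$word] = $freq * $dictionary[$word];
        }
    }
    arsort($scores); // best score first
    return $scores;
}

$candidates = ['我们' => 6, 'PHP' => 3, '分词' => 2, '关键词' => 2];
$dictionary = ['PHP' => 1.0, '分词' => 2.0, '关键词' => 1.5];
print_r(matchDictionary($candidates, $dictionary));
```

Words that are frequent but absent from the dictionary, such as "我们" here, drop out entirely, which is the main benefit of this step.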

Step 3: Compare the two sets of keywords to find the ones that best match the current content.
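The article does not say exactly which two sets are compared or how. As a rough sketch, assume they are a set of title keywords and a set of content keywords, and that a word found in both deserves a higher rank. The function name and the bonus factor are hypothetical.

```php
<?php
// Merge two keyword sets (word => score), favouring words found in both.
// The bonus factor is an arbitrary illustration value.
function combineKeywordSets(array $a, array $b, float $bonus = 2.0): array
{
    $combined = [];
    foreach (array_keys($a + $b) as $word) { // union of the words in both sets
        $score = ($a[$word] ?? 0) + ($b[$word] ?? 0);
        if (isset($a[$word], $b[$word])) {
            $score *= $bonus; // word appears in both sets
        }
        $combined[$word] = $score;
    }
    arsort($combined); // best score first
    return $combined;
}

$titleKeywords   = ['PHP' => 1, '关键词' => 1];
$contentKeywords = ['PHP' => 3, '分词' => 2, '关键词' => 2];
print_r(combineKeywordSets($titleKeywords, $contentKeywords));
// order: PHP (8), 关键词 (6), 分词 (2)
```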

How to do this depends on the specific case. Today's PHP CMSs each have their own keyword extraction system. Among them, the DEDECMS word segmentation source code circulates widely online. I tested it on my own POPCMS and the results were very good. However, meaningless words such as "we" are often extracted and listed as keywords, and HTML space markup is sometimes picked up as a keyword too, so it needs improvement. As an auxiliary feature, though, it works quite well.
In addition, PHPCMS and Discuz also provide automatic keyword extraction in PHP.
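The two problems mentioned above (stop words such as "we", and HTML residue ending up among the keywords) could be reduced with a cleanup pass around the segmenter. The following is a minimal sketch and not code from DEDECMS or any other CMS; the function names and the tiny stop-word list are assumptions.

```php
<?php
// Remove HTML tags and entities before the text is segmented.
function cleanHtml(string $html): string
{
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // &nbsp; decodes to U+00A0; collapse it together with ordinary whitespace.
    return trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $text));
}

// Drop stop words from a word => frequency map after segmentation.
function removeStopWords(array $keywords, array $stopWords): array
{
    return array_diff_key($keywords, array_flip($stopWords));
}

echo cleanHtml('<p>PHP&nbsp;&nbsp;keyword <b>extraction</b></p>'), PHP_EOL;
// prints: PHP keyword extraction

print_r(removeStopWords(['我们' => 6, 'PHP' => 3], ['我们', '的']));
// leaves only PHP => 3
```

A real stop-word list would be much longer and is usually loaded from a file.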

