Automatic keyword extraction and analysis in PHP

Source: Internet
Author: User
Most CMSs now ship with a content collection (scraping) feature. The title and body are relatively easy to process, but in most cases it is difficult to extract keywords. Automatic keyword extraction has therefore become a long-standing problem for PHP CMSs. How can keywords be obtained automatically? The main steps are as follows:

1. Run the title and the content through a word segmentation algorithm, separately, to extract candidate keywords and their frequencies. For the segmentation phase, the two main approaches today are ICTCLAS from the Chinese Academy of Sciences and hidden Markov models. However, both are fairly sophisticated and have a steep learning curve, and both only support C++/Java. Two PHP-oriented tools, PSCWS and HTTPCWS, are recommended instead. SCWS has released its official version 1.0.0, and the latest version is now 1.0.4; PSCWS is its PHP version. HTTPCWS was developed by Zhang Yan and was previously called PHPCWS. PHPCWS first calls the "ICTCLAS 3.0 shared Chinese word segmentation algorithm" API for an initial segmentation, then applies its own "reverse maximum matching algorithm" to merge the segments into words, and adds punctuation filtering to produce the final segmentation result. It currently supports only Linux/Unix systems.
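As a rough illustration of the segmentation and frequency step, the sketch below implements a plain reverse maximum matching pass over a tiny in-memory dictionary. It is not the PSCWS or HTTPCWS implementation; the function names `rmmSegment` and `wordFrequency`, the sample dictionary, and the maximum word length of 4 characters are all assumptions made for this example.

```php
<?php
// Hypothetical helper: reverse maximum matching over a small word list.
// Scans the text from the end and always tries the longest candidate first.
function rmmSegment(string $text, array $dict, int $maxLen = 4): array
{
    // Split into UTF-8 characters without requiring the mbstring extension.
    $chars  = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    $lookup = array_flip($dict);
    $words  = [];
    $end    = count($chars);

    while ($end > 0) {
        $matched = false;
        for ($len = min($maxLen, $end); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $end - $len, $len));
            if (isset($lookup[$candidate])) {
                $words[] = $candidate;
                $end    -= $len;
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            // No dictionary word ends here: emit the single character.
            $words[] = $chars[$end - 1];
            $end--;
        }
    }
    // Words were collected back to front, so restore reading order.
    return array_reverse($words);
}

// Hypothetical helper: count multi-character words, most frequent first.
function wordFrequency(array $words): array
{
    $counts = [];
    foreach ($words as $word) {
        // Skip single characters, which are mostly particles or punctuation.
        if (count(preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY)) < 2) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }
    arsort($counts);
    return $counts;
}

// Sample dictionary, for illustration only.
$dict  = ['研究', '研究生', '生命', '起源'];
$words = rmmSegment('研究生命起源', $dict);
print_r($words);
print_r(wordFrequency($words));
```

Scanning from the end lets the sample sentence split into 研究 / 生命 / 起源, whereas a forward longest-match pass would take 研究生 first. A real segmenter would also need a much larger dictionary and handling for numbers, Latin text, and unknown words.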

2. Compare the extracted results with an existing dictionary to find the keywords that match best. The quality of this step depends mainly on the dictionary: you can define your own or use an existing, mature one.
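A minimal sketch of the dictionary comparison, assuming the candidates arrive as a word-to-frequency array and the dictionary is a flat list of accepted keywords. The function name `filterByDictionary` and the sample data are hypothetical.

```php
<?php
// Hypothetical helper: keep only the candidates that appear in the dictionary,
// ordered by frequency (highest first).
function filterByDictionary(array $freq, array $dictionary): array
{
    // array_flip turns the word list into keys so the intersection is by word.
    $kept = array_intersect_key($freq, array_flip($dictionary));
    arsort($kept);
    return $kept;
}

// Sample data, for illustration only.
$freq       = ['我们' => 5, '分词' => 3, '关键词' => 2];
$dictionary = ['关键词', '分词', '词典'];
print_r(filterByDictionary($freq, $dictionary));
```

A self-defined dictionary can be as simple as a text file with one word per line; a mature dictionary usually also carries weights or parts of speech, which this sketch ignores.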

3. Compare the two sets of keywords (from the title and from the content) to find the keywords that best match the current content. How to do this depends on the specific case. Most PHP CMSs today have their own keyword extraction system. Among them, the word segmentation source code from DEDECMS is widely circulated online. I have tested it on my own POPCMS and the results are very good. However, meaningless words such as "we" are frequently extracted and listed as keywords, and sometimes HTML fragments containing spaces are extracted as keywords too, so it still needs improvement. As an auxiliary function, though, it works quite well. The automatic keyword extraction features of PHPCMS and DISCUZ are also very powerful.
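The two problems mentioned in step 3, meaningless words and leftover HTML, can be reduced with a cleanup pass and a stop-word list before the title and content keywords are compared. The sketch below is one possible approach, not the DEDECMS code: `cleanHtml`, `pickKeywords`, the stop-word list, and the title boost of 3 are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: strip tags and entities such as &nbsp; before segmentation.
function cleanHtml(string $html): string
{
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse whitespace, including the non-breaking space that &nbsp; decodes to.
    return trim(preg_replace('/[\s\x{00A0}]+/u', ' ', $text));
}

// Hypothetical helper: rank content keywords, dropping stop words and
// boosting any word that also appears in the title.
function pickKeywords(
    array $titleWords,
    array $contentFreq,
    array $stopWords,
    int $limit = 5,
    int $titleBoost = 3
): array {
    $stop    = array_flip($stopWords);
    $inTitle = array_flip($titleWords);
    $scores  = [];

    foreach ($contentFreq as $word => $count) {
        if (isset($stop[$word])) {
            continue; // e.g. "we" and other words with no topical meaning
        }
        $scores[$word] = $count + (isset($inTitle[$word]) ? $titleBoost : 0);
    }
    arsort($scores);
    return array_map('strval', array_slice(array_keys($scores), 0, $limit));
}

// Sample data, for illustration only.
echo cleanHtml('<p>PHP&nbsp;&nbsp;CMS</p>'), "\n";
$titleWords  = ['关键词', '提取'];
$contentFreq = ['我们' => 9, '关键词' => 4, '分词' => 5, '提取' => 1];
print_r(pickKeywords($titleWords, $contentFreq, ['我们'], 2));
```

The additive title boost is only one way to combine the two sets; intersecting them, or weighting by position in the text, are equally reasonable choices depending on the content.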
