An Analysis of Automatic Keyword Extraction in PHP

Source: Internet
Author: User
Tags php class
Most CMSs now ship with their own content-collection feature. The title and body are relatively easy to handle, but in most cases keywords are hard to extract, so automatic keyword extraction has become a long-standing problem for PHP-based CMSs. Extracting keywords automatically can be broken into the following three steps:

1. Segment the title and the content separately, and extract the keywords in the content together with their frequencies. For the word segmentation phase, the two main approaches at present are ICTCLAS from the Chinese Academy of Sciences and the hidden Markov model. Both are fairly advanced, have a certain barrier to entry, and only support C++/Java. Among the PHP-based options, PSCWS and HTTPCWS are worth recommending. SCWS released its official 1.0.0 version on 2008-03-08, and its latest version at the time of writing is 1.0.4; PSCWS is its PHP version. HTTPCWS, formerly called PHPCWS, was developed by Zhang Yan. PHPCWS first calls the API of the "ICTCLAS 3.0 shared edition Chinese word segmentation algorithm" for an initial segmentation pass, then applies a "reverse maximum matching algorithm" to split and merge words, and adds punctuation filtering to produce the segmentation result. It currently supports only Linux/Unix systems.

2. Compare the extraction results with an existing thesaurus to get the best-matching keywords. The outcome here depends on the thesaurus: you can define your own, or use an existing mature one.

3. Compare the two sets of keywords to get the ones that best match the current content. This stage has to be analyzed case by case.

Current PHP-based CMSs each have their own keyword extraction system. The one most widely circulated online is the DedeCMS word segmentation source code. I also tested it on my popcms, and the results are very good. However, meaningless words such as "we" are extracted and ranked as keywords too often because of their high frequency, and sometimes even HTML spaces are extracted as keywords, so it urgently needs improvement. As an auxiliary feature, though, it is already good enough. In addition, the automatic keyword extraction in PHPCMS and Discuz is also very powerful.
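The three steps above can be sketched in plain PHP as follows. This is only a minimal illustration under stated assumptions, not the code used by PSCWS, HTTPCWS, or DedeCMS: the function names, the tiny thesaurus, and the stop-word list are all hypothetical, "the two sets of keywords" is assumed to mean the title keywords and the content keywords, and a simple dictionary-based reverse maximum matching pass stands in for a real segmenter. It also strips HTML tags and entities first and filters stop words, which addresses the "we" and HTML-space problems mentioned above.

```php
<?php
// Toy thesaurus and stop-word list (hypothetical sample data, not a real lexicon).
const THESAURUS  = ['关键词', '自动', '提取', '分词', '算法', '我们'];
const STOP_WORDS = ['我们'];

/** Split a UTF-8 string into an array of single characters. */
function utf8Chars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/**
 * Step 1: reverse maximum matching. Scan from the end of the text and, at
 * each position, take the longest thesaurus word that ends there; if none
 * matches, emit a single character.
 */
function segmentReverseMax(string $text, array $thesaurus, int $maxLen = 4): array
{
    $dict  = array_flip($thesaurus);
    $chars = utf8Chars($text);
    $words = [];
    $end   = count($chars);
    while ($end > 0) {
        for ($len = min($maxLen, $end); $len >= 1; $len--) {
            $candidate = implode('', array_slice($chars, $end - $len, $len));
            if ($len === 1 || isset($dict[$candidate])) {
                array_unshift($words, $candidate);
                $end -= $len;
                break;
            }
        }
    }
    return $words;
}

/**
 * Steps 1 and 2: strip HTML, split on punctuation and whitespace, segment
 * each piece, then count only words found in the thesaurus and not listed
 * as stop words. Returns word => frequency, highest frequency first.
 */
function keywordFrequencies(string $html, array $thesaurus, array $stopWords): array
{
    $plain  = html_entity_decode(strip_tags($html), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $pieces = preg_split('/[\p{P}\p{Z}\s]+/u', $plain, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    $dict   = array_flip($thesaurus);
    $stop   = array_flip($stopWords);
    $freq   = [];
    foreach ($pieces as $piece) {
        foreach (segmentReverseMax($piece, $thesaurus) as $word) {
            if (isset($dict[$word]) && !isset($stop[$word])) {
                $freq[$word] = ($freq[$word] ?? 0) + 1;
            }
        }
    }
    arsort($freq);
    return $freq;
}

/**
 * Step 3: prefer keywords that occur in both the title and the content,
 * ordered by content frequency; fall back to content keywords alone.
 */
function pickKeywords(string $title, string $content, array $thesaurus, array $stopWords, int $limit = 5): array
{
    $titleFreq   = keywordFrequencies($title, $thesaurus, $stopWords);
    $contentFreq = keywordFrequencies($content, $thesaurus, $stopWords);
    $shared      = array_intersect_key($contentFreq, $titleFreq);
    $ranked      = $shared !== [] ? $shared : $contentFreq;
    return array_slice(array_keys($ranked), 0, $limit);
}

$title   = '自动提取关键词';
$content = '<p>我们使用分词算法&nbsp;自动提取关键词。关键词提取依赖分词。</p>';

echo implode(' / ', segmentReverseMax($title, THESAURUS)), PHP_EOL;
// prints: 自动 / 提取 / 关键词
echo implode(', ', pickKeywords($title, $content, THESAURUS, STOP_WORDS, 3)), PHP_EOL;
```

In practice the thesaurus would be loaded from a file or database, and the segmentation function would be replaced by a call to a real segmenter such as PSCWS; the frequency counting, stop-word filtering, and title/content comparison would stay largely the same.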


