Automatically extracting article keywords and tags in PHP: an in-depth analysis

Source: Internet
Author: User
Tags php class
The programs I wrote previously all sidestepped this problem: tags had to be typed in by the user. For lazy users, and for a better user experience, it would be nice if the program could generate article keywords and tags automatically. A new project needs exactly this, so I spent an evening tinkering with it and studying how it could work.
Automatic keyword extraction can be divided into three steps:
1. Run a word segmentation algorithm over the title and the content separately to extract the words and their frequencies.
At present the two main approaches are ICTCLAS, from the Chinese Academy of Sciences, and hidden Markov models. Both are rather high-end, have a real learning curve, and only support C++/Java. For PHP, two options are worth recommending: PSCWS and HTTPCWS. SCWS released its official 1.0.0 version on 2008-03-08, and the latest version is now 1.0.4; PSCWS is its PHP version. HTTPCWS was developed by Zhang Yan and was previously called PHPCWS. PHPCWS first calls the API of the "ICTCLAS 3.0 shared edition Chinese word segmentation algorithm" for an initial segmentation, then applies its own "reverse maximum matching algorithm" to split and merge words, and adds punctuation filtering to produce the final segmentation. Unfortunately it currently supports only Linux and has not been ported to Windows.
2. Compare the extracted words against an existing word list (lexicon) and remove the useless ones, keeping the keywords that best fit the rules. The quality of this step depends mainly on the lexicon. You can define your own or use an existing mature one. The Sina and NetEase blogs, for example, have this feature, and as large sites they presumably have good lexicons. As a small-time programmer I cannot get hold of any authoritative lexicon, so I can only look at the ones shipped with existing open source programs.
3. From the filtered results, select the words that best match the current content as the final keywords. This stage has to be handled case by case, and it will never reach human-level judgment. Most current PHP CMSs ship with their own keyword extraction system.
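Steps 1 and 2 can be sketched roughly as follows. This is only an illustration: a regular-expression split stands in for a real segmenter such as PSCWS (which Chinese text would need), and the function name `wordFrequencies` and the stop-word list are assumptions made for this example, not part of any library mentioned above.

```php
<?php
// Hypothetical helper: count word frequencies in a text and drop stop words.
// The regex split is a stand-in for a real word segmenter.
function wordFrequencies(string $text, array $stopWords = []): array
{
    // Split on anything that is not a letter, a digit, or a dot,
    // so that version numbers such as "2.0" stay in one piece.
    $tokens = preg_split('/[^\p{L}\p{N}.]+/u', strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $words = [];
    foreach ($tokens as $token) {
        $token = trim($token, '.'); // remove sentence-ending dots
        if ($token !== '' && !in_array($token, $stopWords, true)) {
            $words[] = $token;
        }
    }

    $freq = array_count_values($words); // word => number of occurrences
    arsort($freq);                      // most frequent first
    return $freq;
}

$freq = wordFrequencies(
    'ThinkPHP 2.0 is the version before ThinkPHP 2.1.',
    ['is', 'the'] // assumed stop-word list
);
print_r($freq);
```

Here "thinkphp" comes out on top with a count of 2, and the stop words "is" and "the" never reach the result. A real lexicon would replace the two-word list.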
The most widely circulated implementation on the web is the DedeCMS source code. I tested it and found it rather dumb; the results are poor. It first sets a keyword length, which determines how many keywords can be returned, and then segments the text. It assumes that every word found in the title is a wanted keyword, and then reads further words from the body only until the configured length is reached; that is the final keyword list. Meaningless words such as "we" are not removed, and because their frequency is so high they end up listed as keywords. Sometimes even HTML space entities are extracted as keywords. It urgently needs improvement, although as an auxiliary feature it is already good enough. Discuz does slightly better, but Discuz does not provide the source code, only an online API.
The Dede segmenter also exists in several versions. The latest should be the best, since it includes word frequencies and so on. Below is a comparison between the results of dede5.7 and those of the Discuz API.
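The behaviour described above can be re-created in a few lines. This sketch is not DedeCMS source code; the function name `lengthLimitedKeywords`, the one-character separator allowance, and the assumption that the content words arrive already sorted by frequency are all mine.

```php
<?php
// Hypothetical re-creation of the described behaviour, not DedeCMS code.
// $contentWords is assumed to be sorted by descending frequency already.
function lengthLimitedKeywords(array $titleWords, array $contentWords, int $maxLength): array
{
    $keywords = [];
    $used = 0;

    // Every title word is accepted first, then content words fill the rest.
    foreach (array_merge($titleWords, $contentWords) as $word) {
        if (in_array($word, $keywords, true)) {
            continue; // already taken from the title
        }
        $cost = strlen($word) + 1; // bytes plus one separator
        if ($used + $cost > $maxLength) {
            break; // configured keyword length reached
        }
        $keywords[] = $word;
        $used += $cost;
    }
    return $keywords;
}

$keywords = lengthLimitedKeywords(
    ['thinkphp', 'official'],
    ['version', 'of', 'thinkphp'],
    30
);
print_r($keywords); // thinkphp, official, version, of
```

The stop word "of" ends up in the result simply because it is frequent and short, which is the weakness noted above.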
Test Example:
$title = "ThinkPHP will officially stop supporting version 2.0 soon";
$body = "To better focus on developing, maintaining, and supporting the ThinkPHP framework, the team has officially announced that maintenance and support for version 2.0 and earlier will stop from May 1, 2012. For energy-saving and low-carbon reasons, the corresponding versions and documentation will also be removed from the official website's downloads.
Here we remember the ThinkPHP versions we developed with over the years!
About ThinkPHP 2.0
ThinkPHP was born in 2006 and is dedicated to rapid web application development. Version 2.0 was released on October 1, 2009. It was a refactoring of and a leap beyond the earlier 1.* versions, a landmark release that laid the foundation for later versions and built up a large base of users and websites. With the framework's rapid updates and the release of versions 2.1, 2.2, and 3.0, the 3.0 era of ThinkPHP has arrived and the 2.0 life cycle has ended. Most 2.0 features have been carried over or refined in 2.1, and upgrading from 2.0 to 2.1 or 2.2 is relatively easy. Version 2.2 is the final release of the 2.* line; it will receive no new features, only bug fixes.";
First, DedeCMS word segmentation
The sorted results are as follows
Title Array (
[thinkphp] => 1
[Official] => 1
[Forthcoming] => 1
[Stop] => 1
[To] => 1
[2.0] => 1
[Version] => 1
[of] => 1
[Support] => 1
)
Content Array (
[Version] => 12
[of] => 12
[And] => 8
[Thinkphp] => 5
[2.0] => 5
[Also] => 3
[2.2] => 3
[2.1] => 3
[Development] => 3
[3.0] => 2
[Yes] => 2
[Quick] => 2
[To] => 2
[Publish] => 2
[Maintenance] => 2
[Before] => 2
[The] => 2
[New Edition] => 2
[Support] => 2
[Frame] => 2
[Meanwhile] => 2
[From] => 2
...
)
How do we get the final keywords out of this? My initial idea is to remove words such as "of" and "some", then walk through the content words in sorted order and check whether each one also appears in the title. The words that do are the ones we want, so we can take a fixed number of them as the final keywords. Doing this, we get
Version thinkphp 2.0 support Stop
Five keywords. The result seems acceptable.
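The selection rule just described can be sketched like this. It is a minimal illustration: the function name `pickKeywords`, the case-insensitive matching, and the stop-word list are my assumptions, and the content array is an excerpt of the Dede output above, already sorted by frequency.

```php
<?php
// Hypothetical helper: keep content words (assumed sorted by descending
// frequency) that also occur in the title and are not stop words.
function pickKeywords(array $titleFreq, array $contentFreq, array $stopWords, int $limit = 5): array
{
    $normalize = fn($word) => strtolower((string) $word);
    $titleSet = array_flip(array_map($normalize, array_keys($titleFreq)));
    $stopSet = array_flip(array_map($normalize, $stopWords));

    $picked = [];
    foreach (array_keys($contentFreq) as $word) {
        $key = $normalize($word);
        if (isset($stopSet[$key]) || !isset($titleSet[$key]) || in_array($key, $picked, true)) {
            continue; // stop word, not in the title, or already picked
        }
        $picked[] = $key;
        if (count($picked) >= $limit) {
            break;
        }
    }
    return $picked;
}

$titleFreq = ['thinkphp' => 1, 'Official' => 1, 'Forthcoming' => 1, 'Stop' => 1,
              'To' => 1, '2.0' => 1, 'Version' => 1, 'of' => 1, 'Support' => 1];
$contentFreq = ['Version' => 12, 'of' => 12, 'And' => 8, 'Thinkphp' => 5, '2.0' => 5,
                'Also' => 3, '2.2' => 3, 'Development' => 3, 'To' => 2, 'Support' => 2];

$keywords = pickKeywords($titleFreq, $contentFreq, ['of', 'and', 'to', 'the']);
print_r($keywords); // version, thinkphp, 2.0, support
```

With only the content words listed above, the sketch returns four keywords. The fifth word in the result above, "Stop", does not occur in the content list as shown, so it cannot be reproduced from this excerpt.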
Second, Discuz. Its API returns an XML document; after parsing it, the keywords are
the, rapid, version upgrade, development, user
Five words, and the first one is "the" ...
Comparing the two approaches, the first one (Dede plus post-processing) stays closer to the content of the document and should be slightly better. Discuz drifts away from the topic of the article, although the words it returns are reasonably popular ones.
