An in-depth look at automatically generating article topic keywords in PHP

Last Update:2017-01-19 Source: Internet

Author: User

Tags php class

The programs I wrote in the past always sidestepped this problem: tags had to be typed in by the user. For lazy users, and for a better user experience, it would be nice to generate article keywords and tags automatically. A new project needed exactly this, so I spent a night tinkering with it.
Automatic keyword extraction can be divided into three steps:
1. Use a word segmentation algorithm to segment the title and the content separately, extracting the words and their frequencies. The two main approaches at present are ICTCLAS, from the Chinese Academy of Sciences, and hidden Markov models. Both are fairly advanced, have a real learning curve, and only support C++/Java. Two PHP-based options are worth recommending: PSCWS and HTTPCWS. SCWS released its official 1.0.0 version on 2008-03-08, and the latest version is now 1.0.4; PSCWS is its PHP version. HTTPCWS, formerly called PHPCWS, was developed by Zhang Yan. PHPCWS first calls the API of the "ICTCLAS 3.0 shared edition Chinese word segmentation algorithm" for an initial segmentation pass, then applies a "reverse maximum matching algorithm" to split and merge words, and adds punctuation filtering to produce the final segmentation result. Unfortunately, it currently supports only Linux and has not been ported to Windows.
2. Compare the extracted words against an existing word list (lexicon) and remove useless words, leaving the keywords that best fit the rules. This step depends mainly on the lexicon: you can define your own or use an existing mature one. The Sina and NetEase blogs, for example, have this feature. As large sites they presumably have good lexicons. As a small-time programmer I have no access to any authoritative lexicon, so I can only look at the ones shipped with existing open-source programs.
3. From the processed results, select the most suitable words as the final keywords, that is, the ones that best match the current content. At this stage everything comes down to case-by-case analysis, and no approach can match human judgment. Current PHP CMSs each have their own keyword extraction system.
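To make step 1 more concrete, here is a minimal sketch of reverse maximum matching followed by a frequency count. It is a toy illustration of the general technique, not the PHPCWS or SCWS implementation: the function names, the four-word dictionary, and the `$maxLen` parameter are all assumptions made for this example.

```php
<?php
// Sketch only: a toy dictionary stands in for a real segmentation lexicon.

/** Split a UTF-8 string into characters without requiring the mbstring extension. */
function utf8Chars(string $text): array
{
    return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

/**
 * Reverse maximum matching: scan from the end of the text and, at each
 * position, take the longest dictionary word that ends there.
 * Characters not covered by any dictionary word are emitted on their own.
 */
function reverseMaxMatch(string $text, array $dictionary, int $maxLen = 4): array
{
    $chars  = utf8Chars($text);
    $lookup = array_flip($dictionary);
    $words  = [];
    $end    = count($chars);

    while ($end > 0) {
        $matched = 1; // fall back to a single character
        for ($len = min($maxLen, $end); $len > 1; $len--) {
            $candidate = implode('', array_slice($chars, $end - $len, $len));
            if (isset($lookup[$candidate])) {
                $matched = $len;
                break;
            }
        }
        // Prepend, because the scan runs right to left.
        array_unshift($words, implode('', array_slice($chars, $end - $matched, $matched)));
        $end -= $matched;
    }

    return $words;
}

$dictionary = ['研究', '研究生', '生命', '起源'];
$words = reverseMaxMatch('研究生命起源', $dictionary);
// $words is ['研究', '生命', '起源']; a forward scan would have taken '研究生' first.

// Word => frequency, most frequent first.
$frequency = array_count_values($words);
arsort($frequency);
```

Scanning from the right is what lets the example avoid the greedy '研究生' match. A real segmenter would add a much larger lexicon and handling for numbers, Latin words, and punctuation.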
The most widely circulated code on the web right now is the DedeCMS source. I tested it and found it rather crude, with poor results. It first sets a keyword length, which determines how many keywords are returned, and then picks words: it assumes that every word segmented from the title is a wanted keyword, then tops up with keywords read from the body until the set length is reached, and that is the final result. Meaningless words such as "we" are not removed either and end up listed as keywords because of their high frequency, and sometimes even HTML whitespace is extracted as a keyword. It badly needs improvement, although as an auxiliary feature it is good enough. Discuz is slightly better, but Discuz does not provide the source code, only an online API.
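One way to keep HTML whitespace out of the keyword list is to clean the text before it reaches the segmenter. The following is a minimal sketch under that assumption; `cleanForSegmentation` is a hypothetical helper written for this article, not part of DedeCMS, and the regex tag stripper is only adequate for simple markup.

```php
<?php
// Sketch only: hypothetical pre-processing step, not DedeCMS code.
function cleanForSegmentation(string $html): string
{
    // Replace tags with a space so words in adjacent elements do not run together.
    $text = preg_replace('/<[^>]*>/', ' ', $html);
    // Turn entities such as &nbsp; and &amp; into real characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse every run of whitespace, including the no-break space, into one space.
    $text = preg_replace('/[\s\x{00A0}]+/u', ' ', $text);
    return trim($text);
}

echo cleanForSegmentation('<p>thinkphp&nbsp;2.0</p><p>is&nbsp;&nbsp;retired</p>');
// prints "thinkphp 2.0 is retired"
```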
Dede's segmenter also comes in several versions. The latest should be the best, and it includes word frequencies. The following compares the results of dede5.7 with those of the Discuz API.
Test Example:
$title = "thinkphp official is about to stop supporting the 2.0 version";
$body = "Better thinkphp framework development, maintenance and support work, officially announced from May 1, 2012 to 2.0 and the previous version of the maintenance and support, in order to energy-saving low-carbon considerations, but also cancel the official website of the corresponding version and document download.
In this memory of those years, once developed together thinkphp version!
about the thinkphp 2.0 version
Thinkphp was born in 2006, dedicated to the rapid development of Web applications, its 2.0 release on October 1, 2009, in the previous 1.* version of the new refactoring and Leap, was a landmark version, for the new edition laid the foundation, but also accumulated a lot of user groups and Web sites, With the rapid updating of the framework and the release of the new version 2.1, 2.2 and 3.0, it heralds the arrival of the 3.0 era of thinkphp and the end of the 2.0 life cycle. But most of the 2.0 features have been extended or refined to 2.1, and upgrading from 2.0 to 2.1 and 2.2 is also relatively easy. The 2.2 version is the final version of the 2.* version and is no longer updated and only bug fixes. ";
First, Dede word segmentation
The sorted results are as follows
Title Array (
[thinkphp] => 1
[Official] => 1
[Forthcoming] => 1
[Stop] => 1
[To] => 1
[2.0] => 1
[Version] => 1
[of] => 1
[Support] => 1
)
Content Array (
[Version] => 12
[of] => 12
[And] => 8
[Thinkphp] => 5
[2.0] => 5
[Also] => 3
[2.2] => 3
[2.1] => 3
[Development] => 3
[3.0] => 2
[Yes] => 2
[Quick] => 2
[To] => 2
[Publish] => 2
[Maintenance] => 2
[Before] => 2
[The] => 2
[New Edition] => 2
[Support] => 2
[Frame] => 2
[Meanwhile] => 2
[From] => 2
)
How do we get from these lists to the final keywords? My initial idea is to remove stop words such as "of" and "some", then walk through the content words in sorted order and check whether each one also appears in the title. Those that do are the ones we want, so a fixed number of words can be taken as the final keywords. As a result, we get
Version thinkphp 2.0 support Stop
Five keywords. The result seems acceptable.
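The post-processing idea above can be sketched roughly as follows. The function name `pickKeywords`, the stop-word list, and the case-insensitive comparison are assumptions made for this illustration; the input arrays are a subset of the Dede counts printed earlier.

```php
<?php
// Sketch only: hypothetical helper illustrating the "in body order, must appear in title" rule.
function pickKeywords(array $titleCounts, array $contentCounts, array $stopWords, int $limit = 5): array
{
    // Compare case-insensitively so "Thinkphp" in the body matches "thinkphp" in the title.
    $title   = array_change_key_case($titleCounts, CASE_LOWER);
    $content = array_change_key_case($contentCounts, CASE_LOWER);
    $stop    = array_flip(array_map('strtolower', $stopWords));

    arsort($content); // most frequent body words first

    $keywords = [];
    foreach (array_keys($content) as $word) {
        if (count($keywords) >= $limit) {
            break;
        }
        $word = (string) $word; // purely numeric keys come back as integers
        if (isset($stop[$word]) || !isset($title[$word])) {
            continue; // skip stop words and words that are absent from the title
        }
        $keywords[] = $word;
    }

    return $keywords;
}

$title   = ['thinkphp' => 1, 'Official' => 1, 'Stop' => 1, 'To' => 1,
            '2.0' => 1, 'Version' => 1, 'of' => 1, 'Support' => 1];
$content = ['Version' => 12, 'of' => 12, 'And' => 8, 'Thinkphp' => 5,
            '2.0' => 5, 'Also' => 3, 'To' => 2, 'Support' => 2];

$keywords = pickKeywords($title, $content, ['of', 'to', 'and', 'the']);
```

With this input only four words qualify: version, thinkphp, 2.0, and support. "Stop" does not appear in the printed body counts, so a fuller version would need a fallback, for example topping up from the remaining title words.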
Second, Discuz. Its API returns an XML document, and after parsing, the keywords are
the, rapid, version upgrade, development, user
Five words, and the first one is "the"...
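Parsing such a response and dropping empty entries could look like the sketch below. The article does not show the actual response format, so the XML shape and the `kw` element name are invented for illustration, and `keywordsFromXml` is a hypothetical helper. It relies on PHP's SimpleXML extension.

```php
<?php
// Sketch only: the element name 'kw' and the sample XML are assumptions, not the real Discuz format.
function keywordsFromXml(string $xml, string $element = 'kw'): array
{
    libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $doc = simplexml_load_string($xml, 'SimpleXMLElement', LIBXML_NOCDATA);
    if ($doc === false) {
        return [];
    }

    $nodes = $doc->xpath('//' . $element);
    if (!is_array($nodes)) {
        return [];
    }

    $keywords = [];
    foreach ($nodes as $node) {
        $word = trim((string) $node);
        if ($word !== '') { // skip blank keywords
            $keywords[] = $word;
        }
    }

    return $keywords;
}

$sample = '<response><kw></kw><kw>rapid</kw><kw><![CDATA[version upgrade]]></kw></response>';
$keywords = keywordsFromXml($sample);
// $keywords is ['rapid', 'version upgrade']
```

A stop-word filter like the one used for the Dede results could be applied to this list as well, which would remove entries such as "the".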
Comparing the two approaches, the first one (Dede plus post-processing) sticks more closely to the content of the document and should be slightly better, whereas Discuz drifts away from the theme of the article, although the words it returns do have a certain popularity.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.