Automatically generating keywords in PHP

Source: Internet
Author: User
The programs I wrote previously all avoided this problem: tags had to be typed in by the user. For lazy users, and for the sake of the user experience, it would be nice to generate article keywords automatically, something like automatic tag extraction. A new project needed exactly this, so I spent an evening tinkering with the feature.
Automatically extracting keywords can be divided into roughly three steps:
1. Segment the title and the content separately with a word segmentation algorithm, then extract the candidate keywords and their frequencies. The two main approaches at present are ICTCLAS, from the Chinese Academy of Sciences, and hidden Markov models. Both are rather high-end, have a certain barrier to entry, and only support C++/Java. Two PHP-oriented options are worth recommending: PSCWS and HTTPCWS. SCWS released its official 1.0.0 version on 2008-03-08, and the latest version has now reached 1.0.4; PSCWS is its PHP version. HTTPCWS was developed by Zhang Yan and was previously called PHPCWS. PHPCWS first uses the "ICTCLAS 3.0 shared edition Chinese word segmentation algorithm" API for an initial segmentation pass, then applies a self-written "reverse maximum matching algorithm" to segment and merge words, and adds punctuation filtering to produce the final segmentation result. Unfortunately, it currently supports only Linux and has not yet been ported to Windows.
2. Compare the extracted results against an existing thesaurus and remove useless words, to obtain the keywords that best fit the rules. The key here is the thesaurus: we can define our own or use an existing, mature one. The Sina and NetEase blogs, for example, both have this feature. As large sites they presumably have good thesauri; as a small-time programmer I cannot get hold of any authoritative thesaurus, so I can only start from existing open source programs and look at theirs.
3. From the processed results, select suitable words as the final keywords, the ones that best match the current content. This stage calls for case-by-case analysis, and in any case it cannot match human judgment. Most current PHP CMSs have their own keyword extraction system.
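The first two steps can be sketched roughly as follows. This is only a minimal illustration, not the code of any of the segmenters mentioned above: the tokenizer is a stand-in that splits on non-letter characters, which works for space-separated text but not for Chinese, where a real segmenter such as PSCWS would take its place. The function names `tokenize` and `wordFrequencies` and the stop-word list are my own assumptions.

```php
<?php
// Stand-in tokenizer (assumption): a real Chinese pipeline would call a
// word segmenter here instead of splitting on non-letter characters.
function tokenize(string $text): array
{
    // Drop HTML tags and decode entities so "&nbsp;" never becomes a token.
    $text = html_entity_decode(strip_tags($text), ENT_QUOTES, 'UTF-8');
    $text = strtolower($text);
    // Keep letters, digits and dots together so "2.0" stays one token.
    $tokens = preg_split('/[^\p{L}\p{N}.]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // Trim sentence-ending dots: "2.1." becomes "2.1", a lone "." vanishes.
    $tokens = array_map(fn($t) => trim($t, '.'), $tokens);
    return array_values(array_filter($tokens, fn($t) => $t !== ''));
}

// Step 1 and 2: count word frequencies, then drop the stop words.
function wordFrequencies(string $text, array $stopWords): array
{
    $counts = array_count_values(tokenize($text));
    foreach ($stopWords as $word) {
        unset($counts[$word]);
    }
    arsort($counts); // most frequent first
    return $counts;
}
```

Calling `wordFrequencies($body, ['the', 'and', 'of'])` returns an array of word => count pairs sorted by frequency, which is the shape of the result lists shown further down.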
The most widely circulated code on the net is the DedeCMS segmentation source. I tested it and found it rather dumb; the results are poor. It first sets a keyword length, which determines how many keywords to return, and then picks words: it assumes that the words segmented from the title are the required keywords, then reads further keywords from the body only until the set length is reached, and that is the final keyword list. In addition, meaningless words such as "we" are not removed and end up listed as keywords because of their high frequency, and sometimes even the HTML space entity is treated as a keyword. It urgently needs improvement, though as an auxiliary feature it is good enough. Discuz is slightly better, but Discuz does not provide the source code, only an online API.
The DedeCMS segmenter also exists in several versions; the best is presumably the latest, which includes word frequencies and so on. Below is a comparison of the results from the DedeCMS 5.7 segmenter and the Discuz API.
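To make the length-budget behaviour described above concrete, here is a small sketch. It is my own reconstruction from the observed behaviour, not the DedeCMS source; the function name `pickByLengthBudget` and the choice to measure the budget in bytes, with one extra byte per separator, are assumptions.

```php
<?php
// Sketch of the behaviour described above (not DedeCMS code): title words
// are taken first, then body words by frequency, until the total keyword
// length would exceed the budget. No stop-word filtering is done.
function pickByLengthBudget(array $titleWords, array $bodyCounts, int $maxLength): array
{
    arsort($bodyCounts); // most frequent body words first
    $candidates = array_merge($titleWords, array_map('strval', array_keys($bodyCounts)));

    $picked = [];
    $used = 0;
    foreach ($candidates as $word) {
        if (in_array($word, $picked, true)) {
            continue; // already taken from the title
        }
        $length = strlen($word) + 1; // +1 for the separator (assumption)
        if ($used + $length > $maxLength) {
            break; // budget reached, stop reading words
        }
        $picked[] = $word;
        $used += $length;
    }
    return $picked;
}
```

With a generous budget, a frequent but meaningless word such as "the" slips into the list, which is exactly the weakness noted above.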
Test Examples:

    $title="thinkphp official is about to stop support for version 2.0";
    $body="To better develop, maintain and support the thinkphp framework, the official announces the maintenance and support for the 2.0 and previous versions from May 1, 2012 onwards, to save energy and low carbon, and also to cancel the corresponding version of the official website and document download.
    This is the memory of those years, the development of the thinkphp version of it!
    About thinkphp version 2.0
    Thinkphp was born in 2006, dedicated to the rapid development of Web applications, its 2.0 release on October 1, 2009, in the previous 1.* version of the new refactoring and Leap, was an epoch-making version, the foundation for the new edition, Also accumulated a lot of user groups and sites, with the rapid updating of the framework, and the release of the new version 2.1, 2.2 and 3.0, heralding the arrival of the thinkphp of the 3.0 era, 2.0 of the life cycle of the end. But basically 2.0 of many features are extended or perfected to version 2.1, and upgrading from 2.0 to 2.1 and 2.2 is also relatively easy. Version 2.2 is the final version of the 2.* version, no longer features updated, only bug fixes.";

First, the DedeCMS segmenter.
The sorted results are as follows:

    title Array
    (
        [thinkphp] => 1
        [Official] => 1
        [Upcoming] => 1
        [Stop] => 1
        [to] => 1
        [2.0] => 1
        [Version] => 1
        [of] => 1
        [Support] => 1
    )
    content Array
    (
        [Version] => A
        [the] => +
        [and] => 8
        [thinkphp] => 5
        [2.0] => 5
        [also] => 3
        [2.2] => 3
        [2.1] => 3
        [Development] => 3
        [3.0] => 2
        [Yes] => 2
        [Quick] => 2
        [to] => 2
        [Release] => 2
        [Maintenance] => 2
        [Previous] => 2
        [up] => 2
        [New] => 2
        [Support] => 2
        [frame] => 2
        [at the same time] => 2
        [from] => 2
        *******

How do we pick out the keywords we finally need? My initial idea is to remove meaningless function words (such as "some"), then walk through the content words in their sorted order and check whether each one also appears in the title; those that do are the required ones. This way a fixed number of final keywords can be taken. As a result, we get

    version thinkphp 2.0 support stop

That is five keywords. The result looks acceptable.
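The selection idea just described can be sketched like this. It is a minimal illustration under my own assumptions: the function name `selectKeywords`, the stop-word list, and the counts in the usage example are made up for the demonstration and are not the figures from the test above.

```php
<?php
// Walk the content words from most to least frequent and keep those that
// are not stop words and also occur in the title, up to $limit keywords.
function selectKeywords(array $titleWords, array $contentCounts, array $stopWords, int $limit = 5): array
{
    $title = array_flip(array_map('strtolower', $titleWords));
    $stop = array_flip(array_map('strtolower', $stopWords));
    arsort($contentCounts); // most frequent first

    $keywords = [];
    foreach ($contentCounts as $word => $count) {
        $word = strtolower((string) $word);
        if (isset($stop[$word]) || !isset($title[$word])) {
            continue; // meaningless word, or not present in the title
        }
        if (!in_array($word, $keywords, true)) {
            $keywords[] = $word;
        }
        if (count($keywords) >= $limit) {
            break;
        }
    }
    return $keywords;
}

// Illustrative data only; the counts are placeholders.
$title = ['thinkphp', 'official', 'stop', 'to', '2.0', 'version', 'of', 'support'];
$content = ['version' => 12, 'the' => 10, 'and' => 8, 'thinkphp' => 5,
            '2.0' => 4, 'also' => 3, 'to' => 2, 'support' => 1];
$result = selectKeywords($title, $content, ['the', 'and', 'to', 'of', 'also']);
// $result is ['version', 'thinkphp', '2.0', 'support']
```

Requiring a word to appear in both the title and the content is a cheap relevance filter: frequent body words that the title never mentions are dropped even when they are not in the stop-word list.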
Second, Discuz. Calling the API returns an XML document; the keywords parsed from it are

    the, rapid, version upgrade, development, user

Five words, and the first one is "the"...
Comparing the two approaches, the first (DedeCMS plus post-processing) stays closer to the content of the document and should be slightly better. Discuz drifts away from the subject of the article, although the words it returns do have a certain popularity.
