Most CMSs today come with a content collection function. The title and the content are relatively easy to process, but in most cases the keywords are difficult to extract. As a result, automatic keyword extraction has become a "classic problem" for today's PHP CMSs. How can keywords be obtained automatically? The main steps are as follows:
1. Segment the title and the content with a word segmentation algorithm, and extract the candidate keywords together with their frequencies. The two mainstream approaches at present are ICTCLAS, from the Chinese Academy of Sciences, and hidden Markov models. However, both are sophisticated, have a certain barrier to entry, and support only C++/Java. For PHP, two tools are currently recommended: PSCWS and HTTPCWS. SCWS has released its official version 1.0.0, and the latest version is now 1.0.4; PSCWS is its PHP version. HTTPCWS, developed by Zhang Yan, was previously called PHPCWS. PHPCWS first calls the "ICTCLAS 3.0 shared Chinese word segmentation algorithm" API for an initial segmentation, then applies its own "reverse maximum matching algorithm" to merge the segmented words, and finally adds punctuation filtering to produce the segmentation result. It currently supports only Linux/Unix systems.
2. Compare the extracted results with an existing dictionary to obtain the best-matching keywords. The dictionary is what matters most here: we can define our own or use an existing, mature dictionary.
3. Then compare the two sets of keywords to obtain the keywords that best match the current content. This stage has to be handled case by case. Most PHP CMSs currently have their own keyword extraction system. Among them, the word segmentation source code of DEDECMS circulates widely on the Internet. I have tested it in my own POPCMS and the results are very good. However, meaningless words such as "we" are frequently extracted and listed as keywords, and sometimes HTML with spaces is also extracted as keywords, which needs to be improved. As an auxiliary function, though, it is quite good. In addition, the automatic keyword extraction of PHPCMS and DISCUZ is also very powerful.
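
The three steps above can be sketched in a few lines. The sketch below is only an illustration written in Python, not the code of PSCWS, HTTPCWS or DEDECMS: a plain reverse maximum matching segmenter stands in for a real segmenter, and the dictionary, the stop-word list, the title weight and all function names are assumptions made up for the example. It also strips HTML and filters stop words, the two weaknesses mentioned in step 3.

```python
import html
import re
from collections import Counter

# Hypothetical toy data: a real system would load a mature dictionary
# and a much larger stop-word list.
DICTIONARY = {"我们", "研究", "关键词", "提取", "重要"}
STOP_WORDS = {"我们"}  # "we": frequent but meaningless as a keyword
TITLE_WEIGHT = 3  # assumed boost for words that also appear in the title


def strip_html(text):
    """Drop tags and decode entities such as &nbsp; before segmentation."""
    text = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(text).replace("\xa0", " ")


def rmm_segment(run, dictionary, max_len=4):
    """Reverse maximum matching over one run of Chinese characters."""
    words = []
    end = len(run)
    while end > 0:
        # Try the longest candidate ending at `end` first;
        # fall back to a single character if nothing matches.
        for size in range(min(max_len, end), 0, -1):
            piece = run[end - size:end]
            if size == 1 or piece in dictionary:
                words.append(piece)
                end -= size
                break
    words.reverse()
    return words


def word_counts(text):
    """Step 1: segment the text and count how often each word occurs."""
    counts = Counter()
    for run in re.findall(r"[\u4e00-\u9fff]+", strip_html(text)):
        counts.update(rmm_segment(run, DICTIONARY))
    return counts


def extract_keywords(title, content, top_n=5):
    title_counts = word_counts(title)
    content_counts = word_counts(content)
    scores = Counter()
    for word, count in content_counts.items():
        # Step 2: keep only dictionary words that are not stop words.
        if word in DICTIONARY and word not in STOP_WORDS:
            # Step 3: combine with the title set by boosting shared words.
            scores[word] = count + TITLE_WEIGHT * title_counts[word]
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]
```

For example, calling extract_keywords("关键词提取", "<p>我们研究关键词提取,&nbsp;关键词提取很重要</p>", top_n=3) ranks the words shared by the title and the content first and leaves out "我们" ("we"). In practice the stand-in segmenter would be replaced by a real one, and the title weight would be tuned for the site.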