http://blog.ourren.com/2014/09/24/chinese_token_and_frequency/
Big data has been genuinely hot for the past couple of years, and the most direct visual impression it gives us is the use of charts and tables to reveal the content hidden inside large data sets, which really is intuitive. The tag cloud in a technical blog's sidebar is a primitive prototype of this, except that its tags are added manually by the author. This article shows how to automatically extract keywords from blog post titles and then display them through a plug-in. The core techniques are Chinese word segmentation and word-frequency statistics.
About Chinese Word Segmentation
Word segmentation for Chinese is quite different from English. Chinese characters can combine into many different words, and words can also be abbreviated; for example, Suzhou and Hangzhou together can be shortened to "Su-Hang". English is comparatively fixed: one word is one word. No wonder Word's spelling-check feature cannot fully catch mistakes in Chinese writing, while it can flag the spelling errors in English writing.
At present there are several popular libraries for Chinese word segmentation: Jieba, developed in Python, was the first open-source library I came across; it is solid overall and has the most followers on GitHub. Yaha is also Python-based and similar to Jieba, though its features appear to differ somewhat. NLPIR is developed in C/C++/C#/Java. Other segmenters can be found in reference link 1.
Each of these segmenters has its own strengths; just pick a library that suits you. I chose Jieba, mainly because it is developed in Python and is very popular for Chinese segmentation in China (V2EX's original topic tags were based on Jieba).
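As a quick taste of the library, here is a minimal sketch that segments one short Chinese sentence; the sample sentence is my own illustration and the exact output depends on Jieba's dictionary.

#encoding=utf-8
import jieba

# Illustrative sentence: "I am learning Chinese word segmentation"
sentence = u"我正在学习中文分词"
print "/".join(jieba.cut(sentence))   # e.g. 我/正在/学习/中文/分词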
Extract All Titles
Export the wp_posts table from the database to CSV format, then open it in Excel and pull the title column out into a txt file, so that all the titles end up in one text file.
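If you prefer to skip the Excel step, a small sketch like the following can pull the title column straight out of the exported CSV. The file name wp_posts.csv and the column name post_title are assumptions based on the standard WordPress schema; adjust them to match your own export.

#encoding=utf-8
import csv

# Assumed names: wp_posts.csv (the CSV export) and post_title (WordPress's
# standard title column); change them to match the actual export.
with open('wp_posts.csv', 'rb') as src, open('title', 'w') as dst:
    for row in csv.DictReader(src):
        title = row.get('post_title', '').strip()
        if title:
            dst.write(title + '\n')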
Title Segmentation
Use Jieba to segment the titles. You could treat all of the titles as one big block of text and segment it in one go, but for better accuracy I segment one title at a time. Straight to the code, which is very brief:

#encoding=utf-8
import jieba

wordsall = {}  # define return dict
postfile = open('title', 'r')
ptitle = postfile.readlines()
for ititle in ptitle:
    ititle = ititle.replace('\n', '')  # clean \n
    seg_list = jieba.cut(ititle, cut_all=False)
    print " ".join(seg_list)
This prints the segmentation of every title directly. Jieba actually offers three segmentation modes (see link 2); the code above uses the precise mode. Precise mode tries to cut the sentence as accurately as possible and is suited to text analysis; full mode scans out every word that could possibly be formed, which is very fast but cannot resolve ambiguity; search-engine mode further splits long words on top of the precise mode to improve recall, and is suited to segmentation for search engines.
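To make the difference between the modes concrete, here is a small sketch that runs one title through all three of them; the sample title is my own and the exact splits depend on Jieba's dictionary.

#encoding=utf-8
import jieba

title = u"中文分词与词频统计"   # illustrative title, not taken from the blog

print "precise:", "/".join(jieba.cut(title, cut_all=False))
print "full:", "/".join(jieba.cut(title, cut_all=True))
print "search:", "/".join(jieba.cut_for_search(title))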
In fact, the code can be written a bit better, collecting the keywords and counting how many times each word appears:

#encoding=utf-8
import jieba

wordsall = {}  # define return dict: word -> count
postfile = open('title', 'r')
ptitle = postfile.readlines()
for ititle in ptitle:
    ititle = ititle.replace('\n', '')  # clean \n
    seg_list = jieba.cut(ititle, cut_all=False)
    rowlist = " ".join(seg_list)
    words = rowlist.split(" ")
    for word in words:
        if word != "":
            if word in wordsall:
                wordsall[word] += 1
            else:
                wordsall[word] = 1
wordsall = sorted(wordsall.items(), key=lambda d: d[1], reverse=True)
for (word, cnt) in wordsall:
    print "%s:" % word, cnt
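As a design note, the manual dictionary bookkeeping above can also be done with collections.Counter from the standard library (Python 2.7+). This is just an equivalent sketch of the same counting step, not the code used in the post.

#encoding=utf-8
import jieba
from collections import Counter

# Count every non-empty token produced by precise-mode segmentation.
counts = Counter()
for line in open('title', 'r'):
    counts.update(w for w in jieba.cut(line.strip(), cut_all=False) if w.strip())

# most_common() already returns the words sorted by frequency.
for word, cnt in counts.most_common():
    print "%s: %d" % (word, cnt)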
Word Cloud Display
Although the titles have now been segmented, most people still could not tell at a glance what this blog mainly writes about, and the word clouds that are so popular right now make it much easier for ordinary readers to grasp that information quickly. Here an online service is used to draw the word cloud: the platform only needs the keywords in order to render it, and this time the WordItOut online service is used.
Copy the words generated above into the WordItOut text box to produce the word cloud shown at the top of this post. Comparing the two, the effect of this word cloud is similar to the tag cloud I had tagged by hand, which suggests that automatic tag extraction by segmentation is basically mature and manual tagging is no longer necessary.
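WordItOut sizes words by how often they occur in the pasted text, so one simple way to carry the computed counts over is to repeat each word by its count before pasting. The sketch below does that under this assumption; cloud_input.txt is an illustrative file name of my own.

#encoding=utf-8
import codecs
import jieba
from collections import Counter

# Recount the title words, then write each word repeated by its count so a
# frequency-based word-cloud service weighs it accordingly.
counts = Counter()
for line in open('title', 'r'):
    counts.update(w for w in jieba.cut(line.strip(), cut_all=False) if w.strip())

out = codecs.open('cloud_input.txt', 'w', encoding='utf-8')
for word, cnt in counts.most_common():
    out.write((word + u' ') * cnt)
out.close()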
There are in fact many other online word-cloud generation services as well; you can refer to the article linked here.