A statistical example of Chinese word segmentation and word frequency (NLP)


http://blog.ourren.com/2014/09/24/chinese_token_and_frequency/

Big data has been genuinely hot for the last two years or so, and its most direct visual payoff is using charts and diagrams to reveal the content hidden inside large data sets, which is both truthful and intuitive. The sidebar tag cloud on technical blogs is a primitive prototype of this, except that its tags are added by the author by hand. This article automatically extracts keywords from blog post titles and then displays them through a plug-in. The core techniques are Chinese word segmentation and word-frequency statistics.

About Chinese Word Segmentation

Word segmentation works quite differently for Chinese and English. Chinese characters can combine into many different words, and words are often abbreviated; for example, the place names Suzhou and Hangzhou can be merged into a single abbreviation (Su-Hang). English is comparatively fixed: one token is one word. No wonder Word's error-checking feature cannot fully catch mistakes in Chinese writing, while it catches essentially all spelling errors in English.

At present, several libraries for Chinese word segmentation are popular:

  • Jieba: developed in Python; the first library I came across, generally solid, and the most starred on GitHub;

  • Yaha: also Python-based and similar to Jieba, with some apparent differences in functionality;

  • NLPIR: developed in C/C++/C#/Java;

  • Other segmenters: see reference link 1.

Each of these segmenters has its own strengths; just pick the library that suits you. I chose Jieba, mainly because it is Python-based and highly popular domestically (V2EX's original topic tags were built on Jieba).

Extract All Titles

Export the data in the wp_posts table of the database to CSV format, then open it with Excel and extract the title column into a TXT file, so that all the titles end up in one text file.
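As an alternative to the Excel step, the title column can be pulled straight out of the CSV export with Python's standard csv module. A minimal sketch; the file name wp_posts.csv and the column name post_title are assumptions about how the export is laid out:

```python
import csv

def extract_titles(csv_path, out_path, column="post_title"):
    """Copy one column of a CSV export into a text file, one title per line."""
    with open(csv_path, newline='', encoding='utf-8') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        for row in csv.DictReader(src):
            title = (row.get(column) or '').strip()
            if title:  # skip rows with an empty title
                dst.write(title + '\n')

# usage (assuming the export sits next to the script):
# extract_titles('wp_posts.csv', 'title')
```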

Segmenting the Titles

Use Jieba to segment the titles. We could feed all the titles in as one big string, but for better accuracy each title is segmented on its own. Straight to the code, which is very brief:

```python
# encoding=utf-8
import jieba

postfile = open('title', 'r')
ptitle = postfile.readlines()
for ititle in ptitle:
    ititle = ititle.replace('\n', '')  # clean \n
    seg_list = jieba.cut(ititle, cut_all=False)
    print(" ".join(seg_list))
```

This segments and prints every title directly. Jieba actually provides three segmentation modes (see reference link 2); the code above uses the accurate mode:

  • Accurate mode: tries to cut the sentence as precisely as possible; suitable for text analysis;

  • Full mode: scans out every character sequence that could form a word; very fast, but cannot resolve ambiguity;

  • Search-engine mode: on top of the accurate mode, re-segments long words to improve recall; suitable for search-engine indexing.

The code can be improved a little to collect the keywords and count how many times each word occurs:

```python
# encoding=utf-8
import jieba

wordsall = {}  # word -> occurrence count
postfile = open('title', 'r')
ptitle = postfile.readlines()
for ititle in ptitle:
    ititle = ititle.replace('\n', '')  # clean \n
    seg_list = jieba.cut(ititle, cut_all=False)
    rowlist = " ".join(seg_list)
    words = rowlist.split(" ")
    for word in words:
        if word != "":
            if word in wordsall:
                wordsall[word] += 1
            else:
                wordsall[word] = 1

wordsall = sorted(wordsall.items(), key=lambda d: d[1], reverse=True)
for (word, cnt) in wordsall:
    print("%s: %d" % (word, cnt))
```
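The manual dictionary bookkeeping above can also be written more idiomatically with collections.Counter from the standard library. A minimal sketch that assumes the titles have already been segmented into space-separated tokens; the sample lines are made up for illustration:

```python
from collections import Counter

def count_words(segmented_lines):
    """Tally word frequencies over lines of space-separated tokens."""
    counts = Counter()
    for line in segmented_lines:
        counts.update(w for w in line.split() if w)  # skip empty tokens
    return counts

# hypothetical, already-segmented titles
lines = ["python 爬虫 入门", "python 数据 分析", "数据 可视化"]
for word, cnt in count_words(lines).most_common():
    print("%s: %d" % (word, cnt))
```

Counter.most_common() already returns the entries sorted by count in descending order, replacing the explicit sorted(...) call.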

Word Cloud Display

Even with the titles segmented, ordinary readers still cannot tell what the blog mainly writes about, whereas the currently popular word cloud makes it much easier for them to grasp that information quickly. Drawing the word cloud is handled by an online service: you only need to supply the keywords and the platform renders the image. Here the WordItOut online service is used.

Copy the words generated above into the WordItOut text box and generate the image. Comparing the resulting word cloud with the tag cloud I had tagged by hand, the effect and persuasiveness are similar. In other words, automatic tag extraction is basically mature, and manual tagging is no longer necessary.

There are many other online word-cloud generation services; you can refer to the article linked here.
