Python3 uses openpyxl and jieba to extract keywords from posts -- frequency statistics of the extracted keywords
20180413 Study Notes
I. Work
The day before yesterday, while storing the keywords extracted from the posts, I found a problem. I had put each keyword into a separate cell, which makes the final overall frequency count hard to process. So the previous layout:
does not work; the keywords should be placed in the same column (or row), assembled into a list or a tuple, and then counted.
1. Read the output file "t1.xlsx"
from openpyxl import load_workbook

wr2 = load_workbook('t1.xlsx')
cursheet = wr2.active  # the current worksheet
2. Read all the data in the table into a list L
The previous 10-post sample "biao.xlsx" has been replaced with "biao2.xlsx", which holds more than 300 posts.
L = []
for row in cursheet.rows:
    for cell in row:
        L.append(cell.value)
The output looks like this (screenshot truncated):

The overall effect looks good.
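For reference, openpyxl can also yield cell values directly through iter_rows(); a minimal equivalent sketch (the values_only flag assumes openpyxl 2.6 or newer):

# Equivalent read: values_only=True yields the cells' values
# instead of Cell objects (available in openpyxl >= 2.6).
L = [value
     for row in cursheet.iter_rows(values_only=True)
     for value in row]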
3. Word frequency statistics
Use Counter to compute the word frequencies over L.
Counter needs to be imported first:

from collections import Counter
# Open a new xlsx file and write the word frequency results into it
from openpyxl import Workbook

ww2 = Workbook()
sheet3 = ww2.active
sheet3.title = "statis"
# Counter returns a dict-like mapping from word to count
LC = Counter(L)
The output looks like this:
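A tiny made-up example of what Counter produces (my own illustration, not data from the actual posts):

from collections import Counter

# Counter is a dict subclass mapping each element to its occurrence count
print(Counter(['北京', '上海', '北京']))
# prints: Counter({'北京': 2, '上海': 1})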
# Stored like this, the result is hard to use, so it needs to be sorted.
# sorted() returns a two-dimensional list (a list of (word, count) tuples).
LC2 = sorted(LC.items(), key=lambda d: d[1], reverse=True)
The output looks like this:
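Incidentally, Counter already provides most_common(), which performs essentially the same descending sort:

# most_common() returns the (word, count) pairs sorted by count,
# descending, so it can replace the sorted(...) call above.
LC2 = LC.most_common()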
To store each element of this two-dimensional list across two columns of the Excel sheet, the elements need to be unpacked. So I looked up the relevant Python 3 documentation:
https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions
To unpack one layer of it:
# Use n as a counter: when n is odd, store the leading string of the
# (str, num) tuple; when n is even, store the trailing occurrence count.
# This feels clumsy... but I can't yet use Python as fluently as C.
c1 = 1
for k in LC2:
    n = 1
    for v in k:
        if n % 2 == 1:
            sheet3["A%d" % c1].value = v
            n = n + 1
        else:
            sheet3["B%d" % c1].value = v
    c1 = c1 + 1
ww2.save('sta.xlsx')
Let's take a look at the effect:
So overall, the job is done.
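For reference, the odd/even counter can be avoided entirely: each item of LC2 is already a (word, count) pair, so tuple unpacking together with openpyxl's Worksheet.append() writes the two columns directly. A minimal sketch of the same step, as a replacement for the loop above:

# Same two-column output without the odd/even bookkeeping:
# append() places the tuple's values into columns A and B of the next row.
for word, count in LC2:
    sheet3.append((word, count))
ww2.save('sta.xlsx')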
II. Summary and Reflection
Obviously, although the word frequency statistics are complete, the result feels mediocre. In particular, some meaningless tokens got counted in with the real vocabulary: an inexplicable "http", and other English strings that are probably netizens' usernames picked up by the crawler; they are quoted so often in replies to related posts that their frequencies come out far too high. There are also some whitespace tokens. Why jieba treats whitespace characters as words is strange...
III. Next Task
The next task is to analyze the counted terms and clean them up. Using regular expressions, we can filter out just the Chinese words we need and then count them again, as sketched below.
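As a rough sketch of that filtering step (the character range and the names cjk_word and filtered are my own assumptions, not code from this post):

import re

# Keep only tokens made up entirely of common CJK characters;
# this drops "http", English usernames, whitespace tokens, etc.
cjk_word = re.compile(r'^[\u4e00-\u9fff]+$')
filtered = [w for w in L if isinstance(w, str) and cjk_word.match(w)]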
After that comes the machine learning part.