Python3 uses openpyxl and jieba to extract keywords from posts -- frequency statistics of the extracted keywords

Source: Internet
Author: User


20180413 Study Notes

I. The Work

The day before yesterday, while storing the keywords extracted from the posts, I found a problem: I had put each keyword into a separate cell, which makes the final total-frequency count awkward to handle. So the previous style:

is not usable; the keywords should be placed in the same column (or row) and assembled into a list or tuple before doing the word-frequency count.

1. Read the output file "t1.xlsx"

from openpyxl import load_workbook

wr2 = load_workbook('t1.xlsx')
cursheet = wr2.active  # the current sheet
2. Read all the data in the table into L[]

Replaced the earlier "biao.xlsx", which held only 10 posts, with "biao2.xlsx", which holds more than 300.

L = []
for row in cursheet.rows:
    for cell in row:
        L.append(cell.value)

The output looks like this:

Not all of it is shown, but the overall effect looks good.

3. Word frequency statistics

Use the Counter class to do word-frequency statistics on L.

Counter needs to be imported first:

from collections import Counter

from openpyxl import Workbook

# open a new xlsx file and write the word-frequency results into it
ww2 = Workbook()
sheet3 = ww2.active
sheet3.title = "statis"

# Counter returns a dict
LC = Counter(L)

The output looks like this:

# Stored like this the result cannot be used directly, so it needs sorting;
# sorted returns a two-dimensional list (a list of (word, count) tuples)
LC2 = sorted(LC.items(), key=lambda d: d[1], reverse=True)

The output looks like this:
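As an aside, Counter also has a built-in most_common() method that produces the same list of (word, count) pairs sorted by descending count, so the sorted(...) call above could be replaced by it. A minimal sketch with made-up sample data (not the post data):

```python
from collections import Counter

# sample list standing in for L (the real data comes from the spreadsheet)
L = ["apple", "pear", "apple", "plum", "apple", "pear"]

LC = Counter(L)
# most_common() returns (element, count) pairs sorted by descending count
LC2 = LC.most_common()
print(LC2)  # [('apple', 3), ('pear', 2), ('plum', 1)]
```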

To store each element of this two-dimensional list into two columns of the Excel sheet, the elements need to be taken apart. So I looked up the relevant Python documentation:

https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions

To unpack one layer of it:

# Use n as a counter: when n is odd, store the string from (str, num);
# when even, store the occurrence count. This feels clumsy...
# but I can't yet use Python as fluently as I use C.
c1 = 1
for k in LC2:
    n = 1
    for v in k:
        if n % 2 == 1:
            sheet3["A%d" % c1].value = v
            n = n + 1
        else:
            sheet3["B%d" % c1].value = v
    c1 = c1 + 1
ww2.save('sta.xlsx')  # save once, after the loop finishes

Let's take a look at the effect:

So the overall task is done.
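Incidentally, the odd/even counter used above isn't needed: each element of LC2 is a (word, count) tuple, which Python can unpack directly in the for statement, and enumerate supplies the row number. A minimal sketch with sample data (the sheet3 writes are shown as comments so the snippet runs without openpyxl):

```python
# sample data standing in for the real LC2 from the post
LC2 = [("apple", 3), ("pear", 2), ("plum", 1)]

# enumerate(..., start=1) yields a 1-based row number;
# (word, count) unpacks each tuple in one step
for c1, (word, count) in enumerate(LC2, start=1):
    # with openpyxl this would be:
    #   sheet3["A%d" % c1].value = word
    #   sheet3["B%d" % c1].value = count
    print("A%d=%s  B%d=%d" % (c1, word, c1, count))
```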

II. Summary and Reflection

Although the word-frequency statistics are complete, the results feel mediocre. In particular, some meaningless tokens were counted in: an inexplicable "http"; other English strings, probably crawled netizen usernames, whose counts are inflated because related posts quote them so often; and some whitespace tokens. Why jieba treats whitespace characters as words is strange...

III. The Next Task

The next task is to analyze the counted terms and process them: use regular expressions to filter out the Chinese words we need, then run the statistics again.
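As a preview of that step, here is a hedged sketch of such a filter: keep only tokens made entirely of CJK ideographs, using the common \u4e00-\u9fff range (the exact pattern is my assumption, not from the original post):

```python
import re

# matches tokens consisting only of common CJK ideographs
chinese = re.compile(r'^[\u4e00-\u9fff]+$')

# sample tokens standing in for the jieba output
tokens = ["http", " ", "你好", "abc", "世界"]
filtered = [t for t in tokens if chinese.match(t)]
print(filtered)  # ['你好', '世界']
```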

The rest is the machine learning content.
