Python3 uses openpyxl and jieba to extract keywords from posts -- frequency statistics of the extracted keywords
20180413 Study Notes
I. Work
The day before yesterday, while storing the keywords extracted from the posts, I found a problem. I had put each keyword into a separate cell, which makes the final overall frequency count hard to process. So the previous layout:
does not work; the keywords should be placed in the same column (or row), assembled into a list or a tuple, and then counted.
1. Read the output file "t1.xlsx"
from openpyxl import load_workbook

wr2 = load_workbook('t1.xlsx')
cursheet = wr2.active  # the current worksheet
2. Read all the data in the table into a list L
The previous 10-post sample "biao.xlsx" has been replaced with "biao2.xlsx", which holds more than 300 posts.
L = []
for row in cursheet.rows:
    for cell in row:
        L.append(cell.value)
The output looks like this (screenshot truncated):

The overall effect looks good.
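For reference, openpyxl can also yield cell values directly through iter_rows(); a minimal equivalent sketch (the values_only flag assumes openpyxl 2.6 or newer):

# Equivalent read: values_only=True yields the cells' values
# instead of Cell objects (available in openpyxl >= 2.6).
L = [value
     for row in cursheet.iter_rows(values_only=True)
     for value in row]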
3. Word frequency statistics
Use Counter to compute the word frequencies over L.
Counter needs to be imported first:

from collections import Counter
# Open a new xlsx file and write the word frequency results into it
from openpyxl import Workbook

ww2 = Workbook()
sheet3 = ww2.active
sheet3.title = "statis"
# Counter returns a dict-like mapping from word to count
LC = Counter(L)
The output looks like this:
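A tiny made-up example of what Counter produces (my own illustration, not data from the actual posts):

from collections import Counter

# Counter is a dict subclass mapping each element to its occurrence count
print(Counter(['北京', '上海', '北京']))
# prints: Counter({'北京': 2, '上海': 1})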
# Stored like this, the result is hard to use, so it needs to be sorted.
# sorted() returns a two-dimensional list (a list of (word, count) tuples).
LC2 = sorted(LC.items(), key=lambda d: d[1], reverse=True)
The output looks like this:
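Incidentally, Counter already provides most_common(), which performs essentially the same descending sort:

# most_common() returns the (word, count) pairs sorted by count,
# descending, so it can replace the sorted(...) call above.
LC2 = LC.most_common()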
To store each element of this two-dimensional list across two columns of the Excel sheet, the elements need to be unpacked. So I looked up the relevant Python 3 documentation:
https://docs.python.org/2/tutorial/datastructures.html#list-comprehensions
To unpack one layer of it:
# Use n as a counter: when n is odd, store the leading string of the
# (str, num) tuple; when n is even, store the trailing occurrence count.
# This feels clumsy... but I can't yet use Python as fluently as C.
c1 = 1
for k in LC2:
    n = 1
    for v in k:
        if n % 2 == 1:
            sheet3["A%d" % c1].value = v
            n = n + 1
        else:
            sheet3["B%d" % c1].value = v
    c1 = c1 + 1
ww2.save('sta.xlsx')
Let's take a look at the effect:
So overall, the job is done.
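For reference, the odd/even counter can be avoided entirely: each item of LC2 is already a (word, count) pair, so tuple unpacking together with openpyxl's Worksheet.append() writes the two columns directly. A minimal sketch of the same step, as a replacement for the loop above:

# Same two-column output without the odd/even bookkeeping:
# append() places the tuple's values into columns A and B of the next row.
for word, count in LC2:
    sheet3.append((word, count))
ww2.save('sta.xlsx')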
II. Summary and Reflection
Obviously, although the word frequency statistics are complete, the result feels mediocre. In particular, some meaningless tokens got counted in with the real vocabulary: an inexplicable "http", and other English strings that are probably netizens' usernames picked up by the crawler; they are quoted so often in replies to related posts that their frequencies come out far too high. There are also some whitespace tokens. Why jieba treats whitespace characters as words is strange...
III. Next Task
The next task is to analyze the counted terms and clean them up. Using regular expressions, we can filter out just the Chinese words we need and then count them again, as sketched below.
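As a rough sketch of that filtering step (the character range and the names cjk_word and filtered are my own assumptions, not code from this post):

import re

# Keep only tokens made up entirely of common CJK characters;
# this drops "http", English usernames, whitespace tokens, etc.
cjk_word = re.compile(r'^[\u4e00-\u9fff]+$')
filtered = [w for w in L if isinstance(w, str) and cjk_word.match(w)]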
After that comes the machine learning part.