1. English Document Frequency Statistics
This section computes word frequencies for an English document, taking the English original of Alice in Wonderland as an example: it counts how often each word occurs in the whole novel and sorts the words by frequency from highest to lowest. Because the book contains many distinct words, only words that occur more than ten times are displayed.
The code looks like this:
# -*- coding: utf-8 -*-
"""
Created on Thu June 21:13:17 2017
@author: Zch
"""
import string

# Read the English original of Alice in Wonderland
path = 'e:/python/data/nlp/alice.txt'
with open(path, 'r', encoding='utf-8') as text:
    # Strip surrounding punctuation and convert every word to lowercase
    words = [raw_word.strip(string.punctuation).lower() for raw_word in text.read().split()]

# Convert to a set to get the distinct words
words_index = set(words)
# Count the frequency of each distinct word with a dictionary
counts_dict = {index: words.count(index) for index in words_index}

# Sort by frequency from highest to lowest
for word in sorted(counts_dict, key=lambda x: counts_dict[x], reverse=True):
    # Only show words that occur more than ten times
    if counts_dict[word] > 10:
        print('{}--{} times'.format(word, counts_dict[word]))
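As a side note, calling words.count() once per distinct word rescans the whole word list each time, which gets slow on a full novel. A minimal sketch of the same counting idea with collections.Counter from the standard library is given here. It is not the author's method; the function name word_frequencies and the short sample string are hypothetical stand-ins for illustration, and unlike the script above it also drops tokens that become empty after punctuation is stripped.

```python
import string
from collections import Counter


def word_frequencies(text):
    """Return a Counter of lowercased words with surrounding punctuation stripped."""
    words = (raw.strip(string.punctuation).lower() for raw in text.split())
    # Drop tokens that were pure punctuation and are now empty strings.
    return Counter(word for word in words if word)


# Hypothetical sample text standing in for the contents of alice.txt.
sample = "Alice was beginning to get very tired. 'And what is the use of a book,' thought Alice."
counts = word_frequencies(sample)

# most_common() yields (word, count) pairs sorted from most to least frequent.
for word, count in counts.most_common():
    if count > 1:  # threshold lowered from 10 because the sample is tiny
        print('{}--{} times'.format(word, count))
```

Counter makes a single pass over the words, and most_common() replaces the explicit sorted() call with a lambda key. For the real book, the threshold would go back to 10 and the text would be read from the file as in the script above.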
Running the script on alice.txt produces the output shown in the following illustration:
You can see that the ten most frequent words in the book are: "the", "and", "to", "a", "she", "it", "of", "said", "i", and "alice".

2. Chinese Document Frequency Statistics