Counting Word Frequencies with R: Postgraduate English (II) Real Exam Papers
English exams are nothing new to any of us, and the first step is always memorizing vocabulary, the so-called high-frequency words. A vocabulary book that thick is daunting just to look at. Having recently picked up R, and preparing for the postgraduate entrance exam at the end of the year, I wanted to count how frequently words appear in recent years' Postgraduate English (II) real exam papers.
Overall idea:
Collect data → organize data → statistical analysis → output results
Tools used:
RStudio, a text editor, CSV
Packages involved: `jiebaR` (a Chinese word-segmentation package) and `plyr`
Step 1: Collect the data
Search the web for the 2013-2018 Postgraduate English (II) real exam papers and save each one as a .txt file.
Step 2: Organize the data
Do a simple cleanup of each file and remove text that is not needed, for example headings such as "2017 National Postgraduate Entrance Examination English", the word "answer", and any garbled characters. This was done by hand.
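If you would rather script this than clean every file by hand, a rough sketch along these lines could drop lines containing obvious markers. The marker strings below are only illustrative, not taken from the real files; adjust them to match your own documents:

```r
# hypothetical helper: drop any line containing a known header or answer-key
# marker, then overwrite the file with the cleaned text
# (crude: any line containing a marker is removed, so choose markers carefully)
clean_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8")
  lines <- lines[!grepl("Postgraduate Entrance Examination|answer key",
                        lines, ignore.case = TRUE)]
  writeLines(lines, path)
}
clean_file("2017.txt")
```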
Step 3: Statistical analysis
3.1 Open R and install the required packages
```r
install.packages("jiebaRD")  # jiebaRD must be installed before jiebaR
install.packages("jiebaR")
install.packages("plyr")

# load the packages
library(jiebaRD)
library(jiebaR)
library(plyr)

# view the packages that have been attached
search()
#>  [1] ".GlobalEnv"           "package:xlsx"
#>  [3] "package:xlsxjars"     "package:rJava"
#>  [5] "package:wordcloud"    "package:RColorBrewer"
#>  [7] "package:plyr"         "package:jiebaR"
#>  [9] "package:jiebaRD"      "tools:rstudio"
#> [11] "package:stats"        "package:graphics"
#> [13] "package:grDevices"    "package:utils"
#> [15] "package:datasets"     "package:methods"
#> [17] "Autoloads"            "package:base"
```
3.2 Load the files and analyze them
```r
setwd("d:/r")  # set the working directory

# load the files; the encoding is "UTF-8"
test_file_2018 <- readLines("2018.txt", encoding = "UTF-8")
test_file_2017 <- readLines("2017.txt", encoding = "UTF-8")
test_file_2016 <- readLines("2016.txt", encoding = "UTF-8")
test_file_2015 <- readLines("2015.txt", encoding = "UTF-8")
test_file_2014 <- readLines("2014.txt", encoding = "UTF-8")
test_file_2013 <- readLines("2013.txt", encoding = "UTF-8")

# merge the files; c() combines multiple elements into a single vector
test_file <- c(test_file_2018, test_file_2017, test_file_2016,
               test_file_2015, test_file_2014, test_file_2013)

test_file <- tolower(test_file)         # convert all characters to lowercase
cutter <- worker()                      # set up the word-segmentation engine
segwords <- segment(test_file, cutter)  # segment the text into words
```

Next, set the stop words. These are really filter words, one word per line: words you consider too simple to be worth counting, such as the answer options a, b, c, d, or the, and, an, and so on. You can also skip this step for now and add words later as needed, once you have seen the frequency counts. Create a file named "stopword.txt" in the same directory.

```r
# read the stop words and filter them out of the segmented text
f <- readLines("stopword.txt")
stopwords <- c(NULL)
for (i in 1:length(f)) {   # this loop simply copies f into the stopwords vector
  stopwords[i] <- f[i]
}
segwords <- filter_segment(segwords, stopwords)  # filter_segment(source text, words to filter)

# remove digits and punctuation: 0-9 matches digits, [:punct:] matches the
# special characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
segwords <- gsub("[0-9[:punct:]]+?", "", segwords)

tableword <- count(segwords)  # count the word frequencies
View(tableword)
```
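One way to see what the segmenter does with English text (jiebaR is primarily a Chinese segmenter, but it should split English on whitespace and punctuation) is a quick check on a single sentence:

```r
# quick check that the segmenter splits an English sentence into individual words
cutter <- worker()
segment(tolower("The quick brown fox jumps over the lazy dog."), cutter)
```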
Example stop-word file, stopword.txt:
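A minimal stopword.txt, using only the words already mentioned above, could be created straight from R (one word per line); extend the list as needed:

```r
# write a minimal example stop-word list, one word per line;
# these are only the words mentioned above, add your own as needed
writeLines(c("a", "b", "c", "d", "the", "and", "an"), "stopword.txt")
```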
Step 4: Output the results
```r
write.csv(tableword, "tableword.csv", fileEncoding = "UTF-8")  # write the results to a CSV file
```
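Since the whole point is high-frequency vocabulary, it can also help to sort the table before looking at it. This is a small optional follow-up sketch; `count()` from plyr returns a data frame with the words in column `x` and their counts in column `freq`:

```r
# sort by descending frequency and look at the most common words
tableword_sorted <- tableword[order(tableword$freq, decreasing = TRUE), ]
head(tableword_sorted, 20)  # the 20 most frequent words
```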
Reference Source: 46730801