Using R to count word frequency in the postgraduate English (II) exam

Source: Internet
Author: User


Anyone who has taken an English exam knows that the first job is memorizing vocabulary, the so-called high-frequency words. A thick vocabulary book is enough to give you a headache. I have recently been learning R and am preparing for the postgraduate entrance exam at the end of the year, so I decided to count how often each word appears in recent postgraduate English (II) exam papers.

Overall idea:

Collect data -> organize data -> statistical analysis -> output results

Tools used:

`RStudio, a text editor, CSV`

Packages used: "jiebaR" (a Chinese word segmenter) and "plyr".

The first step: collect the data

Search the web for the 2013-2018 postgraduate English (II) exam papers and save them in txt format.

The second step: organize the data

Do a simple cleanup of each file, removing unneeded text such as "2017 National Postgraduate Entrance Examination English", "answer", or garbled characters. I did this by hand.
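The article does this cleanup by hand. As a rough sketch of how part of it might be automated, the base R snippet below drops blank lines and lines matching a noise pattern. The sample lines, the `noise_pattern` regex, and the `clean_lines` helper are all hypothetical and would need adjusting to the actual files.

```r
# Hypothetical raw lines standing in for one downloaded exam file
raw_lines <- c(
  "2017 National Postgraduate Entrance Examination English",
  "Section I Use of English",
  "answer",
  "",
  "  Reading widely builds vocabulary.  "
)

# Hypothetical pattern for lines to drop; extend it to match your own noise
noise_pattern <- "national postgraduate entrance examination|^answer$"

# Trim whitespace, then keep only non-empty lines that do not match the pattern
clean_lines <- function(lines) {
  lines <- trimws(lines)
  keep <- nzchar(lines) & !grepl(noise_pattern, lines, ignore.case = TRUE)
  lines[keep]
}

cleaned <- clean_lines(raw_lines)
print(cleaned)
```

Garbled characters are harder to catch with a fixed pattern, so a manual pass is probably still needed.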

The third step: statistical analysis

3.1 Open R and install the required packages

    install.packages("jiebaRD")  # jiebaRD must be installed before jiebaR
    install.packages("jiebaR")
    install.packages("plyr")
    # load the packages
    library(jiebaRD)
    library(jiebaR)
    library(plyr)
    # view the packages that have been attached
    search()
     [1] ".GlobalEnv"           "package:xlsx"
     [3] "package:xlsxjars"     "package:rJava"
     [5] "package:wordcloud"    "package:RColorBrewer"
     [7] "package:plyr"         "package:jiebaR"
     [9] "package:jiebaRD"      "tools:rstudio"
    [11] "package:stats"        "package:graphics"
    [13] "package:grDevices"    "package:utils"
    [15] "package:datasets"     "package:methods"
    [17] "Autoloads"            "package:base"

3.2 Load the files and analyze them

    setwd("d:/r")  # set the working directory
    # load the files; the encoding is "UTF-8"
    test_file_2018 <- readLines("2018.txt", encoding = "UTF-8")
    test_file_2017 <- readLines("2017.txt", encoding = "UTF-8")
    test_file_2016 <- readLines("2016.txt", encoding = "UTF-8")
    test_file_2015 <- readLines("2015.txt", encoding = "UTF-8")
    test_file_2014 <- readLines("2014.txt", encoding = "UTF-8")
    test_file_2013 <- readLines("2013.txt", encoding = "UTF-8")
    # merge the files; c() combines several vectors into one
    test_file <- c(test_file_2018, test_file_2017, test_file_2016,
                   test_file_2015, test_file_2014, test_file_2013)
    test_file <- tolower(test_file)         # convert all characters to lowercase
    cutter <- worker()                      # set up the word segmentation engine
    segwords <- segment(test_file, cutter)  # segment the text into words

Next, set the stop words. Here they are really filter words, one word per line: words I consider very simple, such as the option letters a, b, c, d, or the, and, an, and so on. You can skip this step at first and add words as needed once you have seen the frequency counts. Create a file "stopword.txt" in the same directory.

    f <- readLines("stopword.txt")
    stopwords <- c(NULL)
    for (i in seq_along(f)) {
      stopwords[i] <- f[i]
    }
    # filter the words: filter_segment(source text, filter words)
    segwords <- filter_segment(segwords, stopwords)
    # remove digits and punctuation: 0-9 matches the digits, [:punct:] matches
    # the special characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
    segwords <- gsub("[0-9[:punct:]]+?", "", segwords)
    tableword <- count(segwords)  # count the frequency of each word
    View(tableword)
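To make the counting step easier to follow, here is a minimal sketch of the same idea in base R only, assuming plain English text. It swaps in `strsplit()` and `table()` for jiebaR's segmenter and plyr's `count()`, so it is not the method used above; the sample sentences and the tiny stop word list are hypothetical.

```r
# Hypothetical sample text in place of the exam files
text_lines <- c("The cat sat on the mat.", "The dog saw the cat in 2018!")

text_lines <- tolower(text_lines)
# Replace digits and punctuation with spaces, then split on whitespace
no_punct <- gsub("[0-9[:punct:]]+", " ", text_lines)
words <- unlist(strsplit(no_punct, "[[:space:]]+"))
words <- words[nzchar(words)]

# Drop filter words (a tiny hypothetical stop word list)
stop_words <- c("the", "on", "in")
words <- words[!words %in% stop_words]

# table() counts occurrences; convert to a data frame sorted by frequency
freq <- as.data.frame(table(words), stringsAsFactors = FALSE)
names(freq) <- c("word", "freq")
freq <- freq[order(-freq$freq, freq$word), ]
print(freq)
```

Splitting on whitespace is enough for English text, but it would not handle Chinese, which is where a segmenter such as jiebaR is needed.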

Example stop word file, stopword.txt:
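As a minimal sketch of such a file, the snippet below writes the example words mentioned earlier (the option letters and a few very common words), one per line, and reads them back. The temporary directory is an assumption made so the example is self-contained; the article keeps stopword.txt in the working directory.

```r
# Example filter words taken from the text: option letters and very common words
stop_words <- c("a", "b", "c", "d", "the", "and", "an")

# Written to a temporary directory here (an assumption for this sketch)
stop_file <- file.path(tempdir(), "stopword.txt")
writeLines(stop_words, stop_file)

# readLines() returns one vector element per line of the file
f <- readLines(stop_file)
print(f)
```

Because `readLines()` already returns a character vector, its result can be passed to the filtering step directly.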

The fourth step: output the results

    write.csv(tableword, "tableword.csv", fileEncoding = "UTF-8")
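For a vocabulary list it may help to sort the table by frequency before writing it out. The sketch below assumes a data frame shaped like the output of plyr's `count()` (columns `x` and `freq`); the sample rows are hypothetical, and a temporary file stands in for "tableword.csv".

```r
# Hypothetical frequency table shaped like the output of plyr::count()
tableword <- data.frame(
  x = c("policy", "work", "economy", "people"),
  freq = c(3L, 12L, 5L, 9L),
  stringsAsFactors = FALSE
)

# Sort from most to least frequent
tableword_sorted <- tableword[order(tableword$freq, decreasing = TRUE), ]

# Write to a temporary file here; use "tableword.csv" in practice
out_file <- tempfile(fileext = ".csv")
write.csv(tableword_sorted, out_file, row.names = FALSE, fileEncoding = "UTF-8")

# Read the file back to confirm the round trip
check <- read.csv(out_file, fileEncoding = "UTF-8", stringsAsFactors = FALSE)
print(head(check, 3))
```

Setting `row.names = FALSE` keeps the CSV to two columns, which is easier to open in a spreadsheet.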

Reference Source: 46730801


