Purpose of writing
Recently, for research purposes, I have been doing topic discovery on text using the R language. The following is a record of the specific process.
Step one: Read the text and preprocess it
In this experiment, we mainly analyze index records on big data downloaded from the SCI citation database. The file is named Download_2.txt and lives in the directory C:\data\. The specific code is:
    # file path
    textfile <- "C:\\data\\download_1.txt"
    # read the file line by line into the variable bigdata
    bigdata <- readLines(textfile)
    # use a regular expression to pick out the abstract lines (field tag "AB") from the records;
    # value = TRUE returns the matching lines themselves rather than their indices
    doc <- grep("^AB", bigdata, value = TRUE)
    # delete the leading "AB" tag to get the list of abstracts
    doc <- sub("^AB", "", doc)
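To see what the grep/sub pair does, here is a minimal sketch using a toy character vector in place of the real file. The record lines are invented stand-ins for the SCI export format, in which each line starts with a two-letter field tag (here "TI" for title and "AB" for abstract):

```r
# Toy stand-in for a few lines of an SCI export file (hypothetical content)
bigdata <- c("TI Big data analytics in science",
             "AB This paper surveys big data methods.",
             "TI Another record",
             "AB We study large-scale text mining.")

# keep only the abstract lines; value = TRUE returns the text, not the indices
doc <- grep("^AB", bigdata, value = TRUE)
# strip the leading field tag and the space after it
doc <- sub("^AB ", "", doc)
doc
# → "This paper surveys big data methods." "We study large-scale text mining."
```

Note that without value = TRUE, grep would return the positions of the matching lines, and the subsequent sub call would operate on numbers instead of text.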
Step two: Build a TermDocumentMatrix using the tm package
After reading the abstract information into the doc variable, you then use the tm package to process the text.
    # load the tm package
    library(tm)
    # build the corpus
    doc.vec <- VectorSource(doc)
    doc.corpus <- Corpus(doc.vec)
    # preprocessing: lowercase, then remove punctuation, numbers, and stopwords
    doc.corpus <- tm_map(doc.corpus, tolower)
    doc.corpus <- tm_map(doc.corpus, removePunctuation)
    doc.corpus <- tm_map(doc.corpus, removeNumbers)
    doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
    # load the SnowballC package for stemming
    library(SnowballC)
    # continue preprocessing: stem the words and strip extra whitespace
    doc.corpus <- tm_map(doc.corpus, stemDocument)
    doc.corpus <- tm_map(doc.corpus, stripWhitespace)
    # build the TermDocumentMatrix
    tdm <- TermDocumentMatrix(doc.corpus)
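The resulting term-document matrix has one row per term and one column per document, with each cell holding the term's count in that document. As a sketch of the structure (not the tm implementation), the same matrix can be built by hand in base R from two toy documents:

```r
# Two toy documents standing in for the preprocessed abstracts (hypothetical)
docs <- c("big data analytics", "data mining of big data")

# tokenize each document on spaces and lowercase the tokens
tokens <- lapply(strsplit(docs, " "), tolower)
# the vocabulary: every distinct term, in sorted order
terms <- sort(unique(unlist(tokens)))
# count each term in each document: rows = terms, columns = documents
tdm <- sapply(tokens, function(tok) sapply(terms, function(t) sum(tok == t)))
rownames(tdm) <- terms

tdm["data", ]
# → 1 2  ("data" occurs once in document 1 and twice in document 2)
```

tm's TermDocumentMatrix additionally applies its own tokenizer and, by default, drops very short terms, but the row/column layout is the same.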
Step three: Use wordcloud to view the distribution of words
    library(wordcloud)
    m <- as.matrix(tdm)
    v <- sort(rowSums(m), decreasing = TRUE)
    d <- data.frame(word = names(v), freq = v)
    wordcloud(d$word, d$freq, c(8, .3), 2)
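The wordcloud is drawn from a simple frequency table: rowSums adds up each term's counts across all documents, and sort orders the terms from most to least frequent. The same table can be inspected directly, here with a small invented matrix in place of the real tdm:

```r
# A toy term-document matrix (hypothetical counts): rows = terms, cols = documents
m <- matrix(c(1, 1, 1, 0,
              1, 0, 2, 1), nrow = 4,
            dimnames = list(c("analytics", "big", "data", "mining"), NULL))

# total frequency of each term across the corpus, most frequent first
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

head(d, 2)
# → "data" (freq 3) and "analytics" (freq 2) are the top terms
```

In the wordcloud call itself, c(8, .3) sets the range of font sizes and the final 2 is the minimum frequency a term needs to appear in the plot.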
After the above steps, you obtain the word cloud of the corpus shown below.
Step four: Topic mining of the text using the topicmodels package
To be Continued ...
Topic discovery using the R language (i)