First, build the document-term matrix with R
(I'm using R x64 3.2.2 here.)
(Here I analyze a single year of NIPS papers, 207 documents in total. The text of each document has been trimmed to exclude the author block at the beginning and the reference list at the end.)
## 1. Data import: imports 3084 nipstxt documents
library("tm") # load the tm package
stopwords <- unlist(read.table("E:\\allcode\\r\\stopwords.txt", stringsAsFactors = FALSE))
dir <- "E:\\newtext (No including Authors and References) \\2004" # path to the NIPS text documents
nips <- Corpus(DirSource(dir), readerControl = list(language = "en"))
## 2. Transformations
nips <- tm_map(nips, stripWhitespace) # remove extra whitespace
nips <- tm_map(nips, content_transformer(tolower)) # convert to lowercase
nips <- tm_map(nips, removeWords, stopwords) # remove stop words
library("SnowballC")
nips <- tm_map(nips, stemDocument) # extract stems with Porter's stemming algorithm
## 3. Creating term-document matrices
# Tokenize the processed corpus and build the term-frequency weight matrix (a sparse matrix), also called the document-term matrix.
dtm <- DocumentTermMatrix(nips)
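To make the shape of a document-term matrix concrete, here is a minimal sketch in base R, assuming a tiny made-up corpus. The names `docs`, `tokens`, `vocab` and `toy_dtm` are hypothetical and not part of the NIPS workflow. It only counts whitespace-separated tokens and is not how tm builds its sparse matrix internally.

```r
# Toy corpus (hypothetical documents, not the NIPS data)
docs <- c(d1 = "kernel learning kernel",
          d2 = "bayesian learning",
          d3 = "kernel bayesian model")

# Split each document on whitespace to get its tokens
tokens <- strsplit(docs, "\\s+")

# Vocabulary: sorted unique terms across all documents
vocab <- sort(unique(unlist(tokens)))

# Count every vocabulary term in every document;
# after transposing, rows are documents and columns are terms
toy_dtm <- t(vapply(tokens,
                    function(tok) as.integer(table(factor(tok, levels = vocab))),
                    integer(length(vocab))))
colnames(toy_dtm) <- vocab

print(toy_dtm)
```

Each row is one document and each column one term, which is the layout the clustering steps later rely on.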
## 4. Reducing dimensions
# Because the resulting matrix is sparse, reduce its dimensions first and then convert it to a standard data frame
# We can drop terms that appear too rarely.
dtm1 <- removeSparseTerms(dtm, sparse = 0.6) # drop sparse terms, i.e. those appearing in fewer than about 40% of the documents
data <- as.data.frame(as.matrix(dtm1)) # as.matrix() converts without printing, unlike inspect()
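The effect of the sparsity threshold can be sketched in base R on a small made-up count matrix. This assumes, as the tm documentation describes, that a term is kept only when its sparsity (the share of documents in which it does not occur) is below `sparse`. The matrix `m` and its term names are hypothetical.

```r
# Toy document-term counts (hypothetical): 5 documents x 3 terms
m <- matrix(c(1, 2, 0, 1, 3,    # "common": present in 4 of 5 documents
              0, 0, 1, 0, 0,    # "rare":   present in 1 of 5 documents
              1, 0, 0, 2, 1),   # "medium": present in 3 of 5 documents
            nrow = 5,
            dimnames = list(paste0("doc", 1:5),
                            c("common", "rare", "medium")))

# Sparsity of a term = share of documents in which it does not occur
sparsity <- colMeans(m == 0)

# Keep only the terms whose sparsity is below the threshold
kept <- m[, sparsity < 0.6, drop = FALSE]

print(sparsity)
print(colnames(kept))
```

With a threshold of 0.6, the term missing from 80% of the documents is dropped and the other two are kept.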
Second, word clouds
library(wordcloud)
tdm <- TermDocumentMatrix(nips)
tdm_matrix <- as.matrix(tdm)
v <- sort(rowSums(tdm_matrix), decreasing = TRUE) # total frequency of each term, highest first
d <- data.frame(word = names(v), freq = v)
wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2)
png(paste("D://wb//sample_comparison", ".png", sep = ""), width = , height = 1500)
comparison.cloud(tdm_matrix, colors = rainbow(ncol(tdm_matrix))) # slightly modified because of a color problem
title(main = "Sample comparison")
dev.off()
Third, cluster analysis on the document matrix
The plot produced by hierarchical clustering is as follows (it is not very legible):
## 5. Clustering
# The data frame can now be studied with any tool in R; below we try hierarchical clustering
# First standardize the data, then compute the distance matrix, then run hierarchical clustering
data.scale <- scale(data)
d <- dist(data.scale, method = "euclidean")
fit <- hclust(d, method = "ward.D")
plot(fit, main = "Document cluster analysis")
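When the dendrogram is hard to read, one option is to cut the tree into a fixed number of groups and inspect the group labels instead. The sketch below uses base R on a small made-up matrix. The names `toy`, `toy_fit` and `groups`, and the choice of `k = 2`, are assumptions for illustration only.

```r
# Toy data (hypothetical): 6 "documents" described by 2 term counts,
# forming two clearly separated groups
toy <- rbind(c(1, 0), c(2, 1), c(1, 1),
             c(9, 8), c(8, 9), c(9, 9))
rownames(toy) <- paste0("doc", 1:6)

# Same pipeline as above: standardize, compute distances, cluster
toy_fit <- hclust(dist(scale(toy), method = "euclidean"), method = "ward.D")

# Cut the tree into k groups; k = 2 is an arbitrary choice for this toy data
groups <- cutree(toy_fit, k = 2)

print(groups)
print(table(groups))
```

On the real data, `cutree(fit, k = ...)` would return one label per document, which can be tabulated or joined back to the file names.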
Of course, you can also use k-means clustering:
## 5. Clustering
# Use k-means cluster analysis below
km <- kmeans(dtm1, centers = 3)
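Because k-means starts from random centers, results can change between runs. A minimal sketch on made-up data shows how to fix the seed and read the fitted object. The names `toy` and `toy_km`, the seed value and `nstart = 10` are assumptions, not settings from the original analysis.

```r
set.seed(42)  # arbitrary seed so that repeated runs give the same result

# Toy data (hypothetical): 6 "documents" described by 2 term counts
toy <- rbind(c(1, 0), c(2, 1), c(1, 1),
             c(9, 8), c(8, 9), c(9, 9))
rownames(toy) <- paste0("doc", 1:6)

# nstart = 10 tries several random starts and keeps the best solution
toy_km <- kmeans(toy, centers = 2, nstart = 10)

print(toy_km$cluster)  # cluster label for each document
print(toy_km$size)     # number of documents in each cluster
```

For the NIPS data, the same fields (`km$cluster`, `km$size`) describe which documents fall into each of the three clusters.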
Cluster analysis of NIPS conference documents using R language