Cluster analysis of NIPS conference documents using R language

Last Update:2015-11-19 Source: Internet

Author: User


First, build the document matrix with R

(I'm using R x64 3.2.2 here.)

(Here I analyze the NIPS documents of a single year, 207 documents in total. The author names at the beginning of each document and the references at the end have been filtered out.)

## 1. Data import: imports 3084 nipstxt documents

library("tm") # load the tm package

stopwords <- unlist(read.table("E:\\allcode\\r\\stopwords.txt", stringsAsFactors = FALSE)) # custom stop word list

dir <- "E:\\newtext (No including Authors and References) \\2004" # path to the NIPS text documents

Nips <- Corpus(DirSource(dir), readerControl = list(language = "en"))

## 2. Transformations

Nips <- tm_map(Nips, stripWhitespace) # remove extra whitespace

Nips <- tm_map(Nips, content_transformer(tolower)) # convert to lowercase

Nips <- tm_map(Nips, removeWords, stopwords) # remove stop words

library("SnowballC")

Nips <- tm_map(Nips, stemDocument) # extract stems with Porter's stemming algorithm
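
To make the cleaning steps above concrete, here is a minimal base-R sketch of the same idea applied to one string. It is only an illustration under stated assumptions: `toy_text` and `toy_stopwords` are made-up values, the tm functions are not called, and stemming is left out because it needs SnowballC.

```r
# Toy illustration of the cleaning steps in base R (no tm required).
# `toy_text` and `toy_stopwords` are hypothetical examples, not the NIPS data.
toy_text <- "The   Kernel methods  ARE used in the   Learning of models"
toy_stopwords <- c("the", "are", "in", "of")

# 1. Convert to lowercase.
clean <- tolower(toy_text)

# 2. Split into words on runs of whitespace.
words <- strsplit(clean, "[[:space:]]+")[[1]]

# 3. Drop the stop words.
words <- words[!(words %in% toy_stopwords)]

# 4. Re-join with single spaces, which also collapses the extra whitespace.
clean <- paste(words, collapse = " ")
print(clean)
```

The order differs slightly from the tm pipeline (whitespace is collapsed last here), but the cleaned text contains the same kind of content: lowercase words with stop words removed.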

## 3. Creating term-document matrices

# The processed corpus is tokenized to produce a word-frequency weight matrix (a sparse matrix), also called the document-term matrix.

DTM <- DocumentTermMatrix(Nips)
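
As a rough picture of what a document-term matrix holds, the sketch below builds one by hand in base R. The three documents are invented for illustration, and the result is a plain dense matrix rather than the sparse object tm returns.

```r
# Three made-up, already-cleaned "documents" (hypothetical, not NIPS text).
docs <- c(doc1 = "kernel learning kernel",
          doc2 = "neural network learning",
          doc3 = "kernel network")

# Split each document into words.
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: all distinct terms, sorted alphabetically.
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term occurs in each document.
# vapply() returns one column per document, so transpose to get documents in rows.
toy_dtm <- t(vapply(tokens,
                    function(w) as.integer(table(factor(w, levels = vocab))),
                    integer(length(vocab))))
colnames(toy_dtm) <- vocab
print(toy_dtm)
```

Each row is a document and each column a term, so `toy_dtm["doc1", "kernel"]` is the number of times "kernel" occurs in the first document.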

## 4. Reducing dimensions

# Because the resulting matrix is sparse, we first reduce its dimensionality and then convert it to a standard data frame.

# We can drop terms that occur in too few documents.

dtm1 <- removeSparseTerms(DTM, sparse = 0.6) # remove terms whose sparsity exceeds 0.6, i.e. terms that appear in fewer than roughly 40% of the documents

data <- as.data.frame(as.matrix(dtm1)) # as.matrix() replaces inspect(), which is intended for printing
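
The effect of the sparsity threshold can be sketched on a small dense matrix in base R. This is an illustration of the idea only, assuming that a term's sparsity is the fraction of documents in which it does not occur; the counts and the name `threshold` are made up, and tm's own function is not called.

```r
# Toy document-term counts (rows = documents, columns = terms); made-up numbers.
toy_dtm <- matrix(c(2, 0, 1, 0, 1,
                    1, 0, 0, 0, 3,
                    4, 1, 0, 0, 0,
                    1, 0, 0, 2, 1,
                    3, 0, 0, 0, 2),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(paste0("doc", 1:5),
                                  c("learning", "bayes", "kernel", "graph", "model")))

# Fraction of documents in which each term does NOT occur (its sparsity).
sparsity <- colMeans(toy_dtm == 0)

# Keep only terms whose sparsity is below the threshold (0.6, as in the text).
threshold <- 0.6
reduced <- toy_dtm[, sparsity < threshold, drop = FALSE]

print(sparsity)
print(colnames(reduced))
```

Here "bayes", "kernel" and "graph" each appear in only one of five documents, so they are dropped, while "learning" and "model" survive.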

Second, word cloud

library(wordcloud)

tdm <- TermDocumentMatrix(Nips)

tdm_matrix <- as.matrix(tdm)

v <- sort(rowSums(tdm_matrix), decreasing = TRUE) # total frequency of each term, most frequent first

d <- data.frame(word = names(v), freq = v)

wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2)

png(paste("D://wb//sample_comparison", ".png", sep = ""), width = , height = 1500) # the width value is missing in the source

comparison.cloud(tdm_matrix, colors = rainbow(ncol(tdm_matrix))) # slightly modified because of a color issue

title(main = "Sample comparison")

dev.off()
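
The frequency table that feeds the word cloud is just the row sums of the term-document matrix, sorted. A small base-R sketch with an invented matrix (`toy_tdm` is hypothetical and no plotting package is needed):

```r
# Toy term-document matrix (rows = terms, columns = documents); made-up counts.
toy_tdm <- matrix(c(2, 0, 1,
                    1, 2, 1,
                    0, 1, 1,
                    0, 1, 0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(c("kernel", "learning", "network", "neural"),
                                  paste0("doc", 1:3)))

# Total frequency of each term over all documents, most frequent first.
v <- sort(rowSums(toy_tdm), decreasing = TRUE)

# Two-column table of words and frequencies, the shape passed to wordcloud().
d <- data.frame(word = names(v), freq = v, stringsAsFactors = FALSE)
print(d)
```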

Third, cluster analysis of the document matrix

[Figure: result of hierarchical clustering (unclear)]

## 5. Clustering

# The data frame can now be studied with any of R's tools; hierarchical clustering is tried below.

# First standardize the data, then compute the distance matrix, and then apply hierarchical clustering.

data.scale <- scale(data)

d <- dist(data.scale, method = "euclidean")

fit <- hclust(d, method = "ward.D")
plot(fit, main = "File cluster analysis")
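
With 207 documents the dendrogram can be hard to read, so one possible follow-up is to cut the tree into a fixed number of groups with `cutree()`. The sketch below runs the same scale, distance and Ward steps on six invented documents; the matrix `toy` and the choice of `k = 2` are assumptions for illustration only.

```r
# Made-up term counts for six documents that form two obvious groups.
toy <- rbind(doc1 = c(5, 4, 0, 0),
             doc2 = c(4, 5, 0, 1),
             doc3 = c(5, 5, 1, 0),
             doc4 = c(0, 1, 5, 4),
             doc5 = c(1, 0, 4, 5),
             doc6 = c(0, 0, 5, 5))
colnames(toy) <- c("kernel", "svm", "neuron", "spike")

toy.scale <- scale(toy)                             # centre and scale each term column
toy.dist <- dist(toy.scale, method = "euclidean")   # pairwise document distances
toy.fit <- hclust(toy.dist, method = "ward.D")      # Ward hierarchical clustering

# Cut the tree into two groups and show which document went where.
groups <- cutree(toy.fit, k = 2)
print(groups)
```

The returned vector maps each document name to a group label, which is easier to inspect than a crowded plot.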

Of course, you can also use k-means clustering:

## 5. Clustering

# Use k-means cluster analysis below

km <- kmeans(dtm1, centers = 3)
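
The object returned by `kmeans()` can then be inspected to see the assignments. A minimal sketch on invented data follows; the matrix `toy`, the seed, `centers = 2` and `nstart = 10` are all assumptions chosen for the illustration, not settings from the article.

```r
# Made-up term counts for six documents that form two obvious groups.
toy <- rbind(doc1 = c(5, 4, 0, 0),
             doc2 = c(4, 5, 0, 1),
             doc3 = c(5, 5, 1, 0),
             doc4 = c(0, 1, 5, 4),
             doc5 = c(1, 0, 4, 5),
             doc6 = c(0, 0, 5, 5))
colnames(toy) <- c("kernel", "svm", "neuron", "spike")

# k-means starts from random centres, so fix the seed for repeatable output
# and try several random starts to reduce the chance of a poor solution.
set.seed(42)
km_toy <- kmeans(toy, centers = 2, nstart = 10)

print(km_toy$cluster)  # cluster label of each document
print(km_toy$size)     # number of documents in each cluster
```

Because the starting centres are random, the cluster labels themselves are arbitrary; only the grouping of documents is meaningful.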
