Cluster analysis of NIPS conference documents using R language

Source: Internet
Author: User

First, build the document matrix with R

(I'm using R x64 3.2.2 here.)

(Here I analyze one year of NIPS papers, 207 documents in total; the author names at the beginning and the references at the end of each document have been filtered out.)

##1. Data import: import the 3084 nipstxt documents

library("tm") # load the tm package

stopwords <- unlist(read.table("E:\\allcode\\r\\stopwords.txt", stringsAsFactors = FALSE)) # custom stop-word list

dir <- "E:\\newtext (No including Authors and References) \\2004" # path to the NIPS text documents

nips <- Corpus(DirSource(dir), readerControl = list(language = "en"))

##2. Transformations

nips <- tm_map(nips, stripWhitespace) # remove extra whitespace

nips <- tm_map(nips, content_transformer(tolower)) # convert to lowercase

nips <- tm_map(nips, removeWords, stopwords) # remove stop words

library("SnowballC")

nips <- tm_map(nips, stemDocument) # extract word stems with Porter's stemming algorithm
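
To make the effect of these transformations concrete, here is a minimal base R sketch that applies the same three cleaning steps (lowercasing, whitespace collapsing, stop-word removal) to two toy strings. The documents and the stop-word list are hypothetical and only for illustration; stemming is left out because it needs the SnowballC package, and this is not how tm implements the steps internally.

```r
# Toy documents and stop-word list (both hypothetical, for illustration only)
docs <- c(doc1 = "The  Neural   Network learns",
          doc2 = "A kernel   method AND the network")
stop_list <- c("the", "a", "and")

clean <- tolower(docs)                        # convert to lowercase
clean <- gsub("\\s+", " ", clean)             # collapse runs of whitespace
tokens <- strsplit(clean, " ", fixed = TRUE)  # split each document into words
tokens <- lapply(tokens, function(w) w[!(w %in% stop_list)])  # drop stop words

print(tokens)
```

Each element of tokens now holds the remaining lowercase words of one document, which is roughly the state of the corpus before stemming.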

##3. Creating term-document matrices

# Tokenize the processed corpus and build the term-frequency weight matrix (a sparse matrix), also called the document-term matrix.

dtm <- DocumentTermMatrix(nips)
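
As a rough illustration of what a document-term matrix contains, the following base R sketch counts terms by hand for two toy documents. The token lists and all variable names are assumptions made up for this example; tm stores the same kind of counts in a sparse format rather than a dense matrix.

```r
# Toy tokenized documents (hypothetical, for illustration only)
tokens <- list(doc1 = c("neural", "network", "learns", "network"),
               doc2 = c("kernel", "method", "network"))

vocab <- sort(unique(unlist(tokens)))   # every distinct term in the corpus

# Count how often each vocabulary term occurs in each document
counts <- vapply(tokens,
                 function(w) as.integer(table(factor(w, levels = vocab))),
                 integer(length(vocab)))
rownames(counts) <- vocab

tdm_toy <- counts      # term-document layout: terms in rows
dtm_toy <- t(counts)   # document-term layout: documents in rows

print(dtm_toy)
```

The two layouts hold the same counts; the document-term form is convenient for clustering documents, while the term-document form is convenient for summing frequencies per term.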

##4. Reducing dimensions

# The resulting matrix is sparse, so reduce its dimensions first and then convert it to a standard data frame

# We can drop terms that appear in too few documents.

dtm1 <- removeSparseTerms(dtm, sparse = 0.6) # drop sparse terms, i.e. terms that appear in fewer than about 40% of the documents

data <- as.data.frame(as.matrix(dtm1)) # as.matrix() returns every row and column of the reduced matrix
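
The meaning of the sparsity threshold can be checked by hand. The sketch below is a base R approximation, assuming the threshold is compared against the fraction of documents in which a term does not occur; the toy matrix and its term names are hypothetical, and the exact behavior at the boundary value should be checked against the tm documentation.

```r
# Toy document-term matrix: 5 documents, 4 terms (hypothetical counts)
m <- matrix(c(1, 0, 2, 0,
              1, 1, 0, 0,
              3, 1, 0, 0,
              1, 0, 0, 1,
              2, 1, 0, 0),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5),
                            c("model", "kernel", "bayes", "rare")))

# Sparsity of a term = share of documents in which it is absent
sparsity <- colMeans(m == 0)
print(sparsity)

# Keep only the terms whose sparsity is below the 0.6 threshold
m_reduced <- m[, sparsity < 0.6, drop = FALSE]
print(colnames(m_reduced))
```

Here "model" occurs in every document and "kernel" in three of five, so both survive, while "bayes" and "rare" each occur in a single document and are dropped.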

Second, word clouds

library(wordcloud)

tdm <- TermDocumentMatrix(nips) # terms in rows, documents in columns

tdm_matrix <- as.matrix(tdm)

v <- sort(rowSums(tdm_matrix), decreasing = TRUE) # total frequency of each term

d <- data.frame(word = names(v), freq = v)

wordcloud(d$word, d$freq, scale = c(8, .3), min.freq = 2)

png(paste("D://wb//sample_comparison", ".png", sep = ""), width = , height = 1500) # width is blank in the source

comparison.cloud(tdm_matrix, colors = rainbow(ncol(tdm_matrix))) # slightly modified because of a color problem

title(main = "Sample comparison")

dev.off()


Third, cluster analysis of the document matrix

The plot produced by hierarchical clustering is as follows (it is not very clear):

##5. Clustering

# From here you can analyze the data with any tool in R; hierarchical clustering is tried below

# Standardize the data first, then compute the distance matrix, then run hierarchical clustering

data.scale <- scale(data)

d <- dist(data.scale, method = "euclidean")

fit <- hclust(d, method = "ward.D")

plot(fit, main = "File cluster analysis")
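
When the dendrogram is hard to read, the tree can also be cut into a fixed number of groups with cutree() from base R. The sketch below repeats the scale, dist and hclust steps on a small toy matrix; the data, the variable names and the choice of k = 2 are assumptions for illustration, not values from the original analysis.

```r
# Toy document-term counts: 4 documents, 3 terms (hypothetical)
toy <- rbind(doc1 = c(5, 0, 1),
             doc2 = c(4, 1, 0),
             doc3 = c(0, 6, 5),
             doc4 = c(1, 5, 6))
colnames(toy) <- c("kernel", "bayes", "graph")

toy_scaled <- scale(toy)                           # standardize each term column
toy_dist <- dist(toy_scaled, method = "euclidean") # pairwise document distances
toy_fit <- hclust(toy_dist, method = "ward.D")     # hierarchical clustering

groups <- cutree(toy_fit, k = 2)  # cut the tree into 2 clusters
print(groups)
```

The returned vector gives one cluster label per document, which is easier to inspect than a crowded plot when there are a couple of hundred documents.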

Of course, you can also use k-means clustering:

##5. Clustering

# Use k-means cluster analysis below

km <- kmeans(dtm1, centers = 3)
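
Because k-means starts from random centers, results can change between runs. The following sketch shows one way to make a run repeatable and to inspect the fitted object on toy data; the matrix, the seed, nstart = 10 and the use of 2 centers are all assumptions for this example (the analysis above uses 3 centers on the real matrix).

```r
# Toy document-term counts: 6 documents in two obvious groups (hypothetical)
toy <- rbind(doc1 = c(5, 0, 1), doc2 = c(4, 1, 0), doc3 = c(5, 1, 1),
             doc4 = c(0, 6, 5), doc5 = c(1, 5, 6), doc6 = c(0, 5, 5))
colnames(toy) <- c("kernel", "bayes", "graph")

set.seed(42)                                     # fix the random starting centers
toy_km <- kmeans(toy, centers = 2, nstart = 10)  # try 10 random starts, keep the best

print(toy_km$cluster)  # cluster label of each document
print(toy_km$size)     # number of documents per cluster
print(toy_km$centers)  # mean term counts of each cluster
```

The cluster numbers themselves are arbitrary; what matters is which documents share a label.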
