R Language: Text Mining with Topic Models and Text Classification

Source: Internet
Author: User

    #### Several R packages are needed first; if you already have them, skip the installation step.
    #install.packages("Rwordseg")
    #install.packages("tm")
    #install.packages("wordcloud")
    #install.packages("topicmodels")

The data used in the example

The data come from the Sogou Labs corpus: http://download.labs.sogou.com/dl/sogoulabdown/SogouC.mini.20061102.tar.gz

File structure:

    └─Sample
        ├─C000007 Auto
        ├─C000008 Finance
        ├─C000010 IT
        ├─C000013 Health
        ├─C000014 Sports
        ├─C000016 Tour
        ├─C000020 Education
        ├─C000022 Recruitment
        ├─C000023 Culture
        └─C000024 Military

A Python script preprocesses the data into a single train.csv file, collapsing the text of each source file into one row.


Preprocessing Python script (attachment): combinesample.zip (720 bytes)

The data you need (attachment): train.zip (130.2 KB)
You can also use R to transform the raw data into train.csv directly; a sketch follows the attachment list below.

Stopword list used by the article (attachment): stopwords.zip (2.96 KB)
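
If you prefer to stay in R, a minimal sketch along these lines can build train.csv from the unpacked Sample directory. The folder-to-label mapping, paths, and GBK encoding are assumptions, not part of the original script:

    # Hypothetical mapping from Sogou folder codes to the English labels used below
    labels <- c(C000007 = "auto", C000008 = "finance", C000010 = "it",
                C000013 = "health", C000014 = "sports", C000016 = "tour",
                C000020 = "education", C000022 = "recruitment",
                C000023 = "culture", C000024 = "military")
    dirs <- list.dirs("D://wb//Sample", recursive = FALSE)
    rows <- lapply(dirs, function(d) {
        files <- list.files(d, full.names = TRUE)
        # collapse each source file into one row of text
        text <- sapply(files, function(f)
            paste(readLines(f, encoding = "GBK", warn = FALSE), collapse = " "))
        data.frame(type = labels[basename(d)], text = text, stringsAsFactors = FALSE)
    })
    write.csv(do.call(rbind, rows), "D://wb//train.csv", row.names = FALSE)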

1. Read the data

    csv <- read.csv("D://wb//train.csv", header = T, stringsAsFactors = F)
    mystopwords <- unlist(read.table("D://wb//stopwords.txt", stringsAsFactors = F))
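
A quick look at what was loaded helps confirm the expected two-column layout (illustrative only; the column names follow the csv$type and csv$text used below):

    str(csv)          # expect a type column and a text column
    table(csv$type)   # number of documents per category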

2. Data preprocessing (Chinese word segmentation and stopword removal)

    library(tm)
    # Remove digits (the character class covers both ASCII and full-width digits)
    removeNumbers = function(x) gsub("[0-9０１２３４５６７８９]", "", x)
    sample.words <- lapply(csv$text, removeNumbers)
    # Chinese word segmentation, using the Rwordseg package
    wordsegment <- function(x) {
        library(Rwordseg)
        segmentCN(x)
    }
    sample.words <- lapply(sample.words, wordsegment)
    ### Stopword removal
    ### Segment first, then remove stopwords, to avoid losing information through global replacement
    removeStopWords = function(x, words) {
        ret = character(0)
        index <- 1
        it_max <- length(x)
        while (index <= it_max) {
            # keep the token only if it does not appear in the stopword list
            if (length(words[words == x[index]]) < 1) ret <- c(ret, x[index])
            index <- index + 1
        }
        ret
    }
    sample.words <- lapply(sample.words, removeStopWords, mystopwords)
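
The while loop above is easy to follow but slow in R. A vectorized equivalent, offered here as an alternative sketch rather than the author's code, does the same membership test in one step:

    removeStopWords2 <- function(x, words) x[!(x %in% words)]   # keep tokens not in the stopword list
    sample.words <- lapply(sample.words, removeStopWords2, mystopwords)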

3. Wordcloud Display

    # Build the corpus
    corpus = Corpus(VectorSource(sample.words))
    meta(corpus, "cluster") <- csv$type
    unique_type <- unique(csv$type)
    # Build the document-term matrix
    (sample.dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(2, Inf))))
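
Before plotting, it can be worth poking at the matrix. The calls below are standard tm functions, shown only as an illustration:

    dim(sample.dtm)                  # documents x terms
    findFreqTerms(sample.dtm, 50)    # terms occurring at least 50 times overall
    inspect(sample.dtm[1:5, 1:8])    # a small corner of the matrix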
    #install.packages("wordcloud")  ## the wordcloud package is required
    library(wordcloud)
    # Comparison wordcloud across individual documents
    sample.tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(2, Inf)))
    tdm_matrix <- as.matrix(sample.tdm)
    png(paste("D://wb//sample_comparison", ".png", sep = ""), width = 1500, height = 1500)
    comparison.cloud(tdm_matrix, colors = rainbow(ncol(tdm_matrix)))  #### slightly modified because of a color issue
    title(main = "Sample comparison")
    dev.off()
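
As a complementary view (not in the original post), the wordcloud package also offers commonality.cloud, which plots the words shared across documents rather than the ones that set them apart:

    commonality.cloud(tdm_matrix, max.words = 100)   # words common to the documents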

    # Comparison wordcloud aggregated by category
    n <- nrow(csv)
    zz1 = 1:n
    cluster_matrix <- sapply(unique_type, function(type) apply(tdm_matrix[, zz1[csv$type == type]], 1, sum))
    png(paste("D://wb//sample_cluster_comparison", ".png", sep = ""), width = 800, height = 800)
    comparison.cloud(cluster_matrix, colors = brewer.pal(ncol(cluster_matrix), "Paired"))  ## slightly modified because the palette has too few colors
    title(main = "Sample cluster comparison")
    dev.off()


The cluster comparison cloud shows that the data distribution is not uniform: categories such as Culture and Auto contain very little data.
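
The imbalance can also be checked numerically rather than visually (illustrative one-liner):

    sort(table(csv$type))   # document counts per category, smallest first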

    # Draw a wordcloud for each category
    sample.cloud <- function(cluster, maxwords = 100) {
        words <- sample.words[which(csv$type == cluster)]
        allwords <- unlist(words)
        wordsfreq <- sort(table(allwords), decreasing = TRUE)
        wordsname <- names(wordsfreq)
        png(paste("D://wb//sample_", cluster, ".png", sep = ""), width = 800, height = 600)
        wordcloud(wordsname, wordsfreq, scale = c(6, 1.5), min.freq = 2,
                  max.words = maxwords, colors = rainbow(100))
        title(main = paste("cluster:", cluster))
        dev.off()
    }
    lapply(unique_type, sample.cloud)  # unique(csv$type)


4. Topic model analysis

    library(slam)
    summary(col_sums(sample.dtm))
    # Keep only terms whose mean tf-idf is at least 0.1
    term_tfidf <- tapply(sample.dtm$v / row_sums(sample.dtm)[sample.dtm$i], sample.dtm$j, mean) *
        log2(nDocs(sample.dtm) / col_sums(sample.dtm > 0))
    summary(term_tfidf)
    sample.dtm <- sample.dtm[, term_tfidf >= 0.1]
    sample.dtm <- sample.dtm[row_sums(sample.dtm) > 0, ]
    library(topicmodels)
    k <- 30
    SEED <- 2010
    sample_tm <- list(
        VEM = LDA(sample.dtm, k = k, control = list(seed = SEED)),
        VEM_fixed = LDA(sample.dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
        Gibbs = LDA(sample.dtm, k = k, method = "Gibbs",
                    control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)),
        CTM = CTM(sample.dtm, k = k,
                  control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3)))
    )
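
Once the four fits exist, a rough numerical comparison is possible. The calls below are standard topicmodels accessors, shown as a sketch; the perplexity is computed on the training matrix itself, so it only orders the models loosely:

    sapply(sample_tm[1:2], logLik)               # log-likelihoods of the two VEM fits
    perplexity(sample_tm[["VEM"]], sample.dtm)   # lower is better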


    sapply(sample_tm[1:2], slot, "alpha")
    sapply(sample_tm, function(x) mean(apply(posterior(x)$topics, 1, function(z) -sum(z * log(z)))))


The estimated alpha is significantly smaller than the default value, which indicates that the Dirichlet distribution is concentrated: each document draws on only a few of the topics. The second line computes the mean entropy of each document's posterior topic distribution; higher values indicate a more uniform spread over topics.
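
To make the entropy statement concrete, here is a toy calculation (not from the original post):

    entropy <- function(z) -sum(z * log(z))
    entropy(c(0.97, 0.01, 0.01, 0.01))   # concentrated on one topic: low entropy
    entropy(rep(0.25, 4))                # uniform over four topics: log(4), the maximum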

    # Most likely topic for each document
    Topic <- topics(sample_tm[["VEM"]], 1)
    table(Topic)
    # Top 5 terms for each topic
    Terms <- terms(sample_tm[["VEM"]], 5)
    Terms[, 1:10]
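
As a further check, not in the original post, the inferred topics can be cross-tabulated against the known categories. This sketch assumes no documents were dropped by the row_sums filter above; if any empty documents were removed, the indices no longer align with csv:

    table(csv$type, Topic)   # rows: true category, columns: most likely topic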


    ######### Most likely topic of each article in the auto category
    (topics_auto <- topics(sample_tm[["VEM"]])[grep("auto", csv[[1]])])
    most_frequent_auto <- which.max(tabulate(topics_auto))
    ######### The 10 terms most associated with the dominant auto topic
    terms(sample_tm[["VEM"]], 10)[, most_frequent_auto]
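
The same lookup generalizes to any category label. top_terms_for below is a hypothetical helper, not from the original post, and carries the same row-alignment caveat noted earlier:

    top_terms_for <- function(label, n = 10) {
        idx <- grep(label, csv[[1]])             # documents of this category (assumes aligned rows)
        tp <- topics(sample_tm[["VEM"]])[idx]    # their most likely topics
        terms(sample_tm[["VEM"]], n)[, which.max(tabulate(tp))]
    }
    top_terms_for("sports")                      # hypothetical label from train.csv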


