Learning R: Text Mining with the tm Package

Source: Internet
Author: User

 

First, install and load the tm package: run install.packages("tm") once, then library(tm) in each session.

 

1. Read the text (each line of the file becomes one document)

x = readLines("222.txt")

2. Build a corpus

> r = Corpus(VectorSource(x))
> r
A corpus with 7012 text documents

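3. Write the corpus to disk

By default, writeCorpus() writes one plain-text file per document into the current working directory, so with 7012 documents it is worth passing a dedicated directory through the path argument.

> writeCorpus(r)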

 

4. View the corpus

> print(r)
A corpus with 7012 text documents
> summary(r)
A corpus with 7012 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

> inspect(r[2])
A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

[[1]]
Female; genstmneoplasms, female/* therapy; humans

> r[[2]]
Female; genstmneoplasms, female/* therapy; humans

5. Create a document-term matrix

> dtm = DocumentTermMatrix(r)
> head(dtm)
A document-term matrix (6 documents, 16381 terms)

Non-/sparse entries: 110/98176
Sparsity           : 100%
Maximal term length: 81
Weighting          : term frequency (tf)
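To show what a document-term matrix contains, here is a minimal base-R sketch that counts terms by hand. The three documents and the simple letters-only tokenizer are assumptions made for illustration; tm's own tokenizer and defaults may differ.

```r
# Hypothetical toy documents (not taken from 222.txt).
docs <- c(doc1 = "stem cells and cancer",
          doc2 = "cancer therapy trial",
          doc3 = "stem cell therapy")

# Lowercase, then split on anything that is not a letter.
tokens <- strsplit(tolower(docs), "[^a-z]+")
names(tokens) <- names(docs)

# The vocabulary is the sorted set of all distinct terms.
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term, entries are term counts.
dtm_toy <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
print(dtm_toy)
```

Most entries of such a matrix are zero, which is why the real dtm above reports a sparsity close to 100%.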

6. View the document-term matrix (first two documents, first four terms)

> inspect(dtm[1:2,1:4])

7. Find terms that appear at least 200 times

> findFreqTerms(dtm, 200)
 [1] "acute"          "adjuvant"       "advanced"       "after"
 [5] "and"            "breast"         "cancer"         "cancer:"
 [9] "carcinoma"      "cell"           "chemotherapy"   "clinical"
[13] "colorectal"     "factor"         "for"            "from"
[17] "group"          "growth"         "iii"            "leukemia"
[21] "lung"           "lymphoma"       "metastatic"     "non-small-cell"
[25] "oncology"       "patients"       "phase"          "plus"
[29] "prostate"       "randomized"     "receptor"       "response"
[33] "results"        "risk"           "study"          "survival"
[37] "the"            "therapy"        "treatment"      "trial"
[41] "tumor"          "with"
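The frequency threshold can be illustrated with a small base-R sketch. The count matrix m is hypothetical, and the rule shown (total count over all documents is at least the threshold) is my reading of the lowfreq argument, not a copy of tm's implementation.

```r
# Hypothetical toy counts: rows are documents, columns are terms.
m <- matrix(c(2, 0, 1,
              1, 1, 0,
              3, 0, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("cancer", "stem", "trial")))

# Total occurrences of each term across the whole collection.
term_totals <- colSums(m)

# Keep the terms whose total count reaches the threshold.
lowfreq <- 2
frequent_terms <- names(term_totals)[term_totals >= lowfreq]
print(frequent_terms)
```

Note that the unfiltered result on the real corpus includes function words such as "and", "the" and "with", so removing stop words before building the matrix is usually worthwhile.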

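8. Remove sparse terms

removeSparseTerms(dtm, 0.4) keeps only the terms whose sparsity is below 0.4, that is, terms that occur in more than 60% of the documents; everything rarer is dropped.

> inspect(removeSparseTerms(dtm, 0.4))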
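The effect of the sparsity threshold is easy to misread, so here is a base-R sketch of the rule on a hypothetical matrix. It assumes a term is kept when it occurs in more than (1 - sparse) of the documents; treat it as an illustration rather than tm's exact code.

```r
# Hypothetical toy counts: 5 documents, 3 terms.
m <- matrix(c(1, 0, 2,
              1, 0, 0,
              2, 1, 1,
              1, 0, 1,
              0, 0, 1),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("doc", 1:5),
                            c("the", "stem", "cancer")))

sparse <- 0.4

# Document frequency: in how many documents does each term occur?
doc_freq <- colSums(m > 0)

# Keep terms present in more than (1 - sparse) of the documents,
# i.e. terms whose share of zero entries is below `sparse`.
keep <- doc_freq > nrow(m) * (1 - sparse)
m_dense <- m[, keep, drop = FALSE]
print(colnames(m_dense))
```

A threshold as low as 0.4 is very strict: on a corpus of short documents it tends to leave only a handful of very common terms, whereas values close to 1 (for example 0.99) remove only the rarest ones.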

9. Find terms whose correlation with "stem" is at least 0.5

> findAssocs(dtm, "stem", 0.5)
 stem cells
 1.00  0.61
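The association score is a correlation between term count vectors across documents. The base-R sketch below reproduces that idea on a hypothetical matrix, assuming Pearson correlation and a lower limit of 0.5; the numbers are made up and unrelated to the corpus above.

```r
# Hypothetical toy counts: 4 documents, 3 terms.
m <- matrix(c(1, 1, 0,
              2, 2, 1,
              0, 0, 3,
              1, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4),
                            c("stem", "cells", "trial")))

# Pearson correlation of every term's count vector with that of "stem".
assoc <- cor(m)[, "stem"]
assoc <- assoc[names(assoc) != "stem"]  # drop the self-correlation

# Keep only associations at or above the limit, strongest first.
corlimit <- 0.5
strong <- sort(round(assoc[assoc >= corlimit], 2), decreasing = TRUE)
print(strong)
```

Here "cells" tends to occur in the same documents as "stem" and passes the limit, while "trial" is negatively correlated and is filtered out.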

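10. Compute document dissimilarity (cosine distance)

dissimilarity() is only available in tm versions before 0.6. In later versions, where it was removed, proxy::dist(as.matrix(dtm), method = "cosine") is a common replacement.

> dist_dtm <- dissimilarity(dtm, method = 'cosine')
> head(dist_dtm)
[1] 1.0000000 0.7958759 0.8567770 0.9183503 0.9139337 0.9309934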
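For readers who want to see what the cosine measure computes, here is a base-R sketch on a hypothetical count matrix. It defines dissimilarity as 1 minus the cosine of the angle between two document vectors, which is an assumption about the convention used and needs no extra packages.

```r
# Hypothetical toy counts: 3 documents, 3 terms.
m <- matrix(c(1, 1, 0,
              2, 2, 0,
              0, 0, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3),
                            c("stem", "cells", "trial")))

# Scale each row to unit Euclidean length; dot products are then cosines.
row_norms <- sqrt(rowSums(m^2))
unit_rows <- m / row_norms
cosine_sim <- unit_rows %*% t(unit_rows)

# Cosine dissimilarity as a "dist" object, the input type hclust() expects.
dist_m <- as.dist(1 - cosine_sim)
print(round(dist_m, 3))
```

doc1 and doc2 point in the same direction (doc2 is doc1 doubled), so their dissimilarity is 0 regardless of length, while doc3 shares no terms with either and sits at the maximum of 1.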

11. Hierarchical clustering

> hc <- hclust(dist_dtm, method = 'average')
> plot(hc, xlab = '')
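A dendrogram of several thousand documents is hard to read, so it often helps to cut the tree into a fixed number of groups. The sketch below does this on a hypothetical four-document matrix with base R only; the choice of k = 2 is an assumption for the toy data, not a recommendation for the real corpus.

```r
# Hypothetical toy counts: doc1/doc2 share terms, doc3/doc4 share others.
m <- matrix(c(2, 1, 0, 0,
              1, 2, 0, 0,
              0, 0, 1, 3,
              0, 0, 2, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), paste0("term", 1:4)))

# Cosine dissimilarity between the document rows.
unit_rows <- m / sqrt(rowSums(m^2))
d <- as.dist(1 - unit_rows %*% t(unit_rows))

# Average-linkage hierarchical clustering.
hc <- hclust(d, method = "average")

# Cut the tree into two groups and show the group of each document.
groups <- cutree(hc, k = 2)
print(groups)
```

Calling plot(hc) on the toy tree draws the same kind of dendrogram as in the step above, with doc1/doc2 and doc3/doc4 merging first.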

 

 
