First, install and load the tm package.
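A minimal sketch of the install-and-load step, assuming the package is fetched from CRAN and that a CRAN mirror is already configured; the `requireNamespace` guard is an addition here that simply avoids reinstalling on every run.

```r
# Install tm only if it is not already available, then attach it.
if (!requireNamespace("tm", quietly = TRUE)) {
  install.packages("tm")
}
library(tm)
```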
1. Read text
> x = readLines("222.txt")
2. Build a corpus
> r = Corpus(VectorSource(x))
> r
A corpus with 7012 text documents
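Steps 1 and 2 can be tried on a small scale before running them on the real file. The sketch below assumes three made-up lines written to a temporary file in place of "222.txt"; the sample sentences and the `path` variable are hypothetical. It shows that `VectorSource` treats each element of the character vector (here, each line) as one document.

```r
library(tm)

# Hypothetical sample data standing in for "222.txt": one document per line.
path <- tempfile(fileext = ".txt")
writeLines(c("stem cells therapy",
             "stem cells transplant",
             "breast cancer therapy"), path)

x <- readLines(path)          # character vector, one element per line
r <- Corpus(VectorSource(x))  # each element becomes one document

length(r)                     # number of documents in the corpus
```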
3. Write the corpus to disk
> writeCorpus(r)
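Called with only the corpus, `writeCorpus` writes into the current working directory, one text file per document. A sketch of sending the files elsewhere, assuming a temporary output directory and made-up file names (`out_dir` and the `doc1.txt` naming scheme are illustrative, not from the original):

```r
library(tm)

# Hypothetical two-document corpus.
docs <- c("stem cells therapy", "breast cancer therapy")
r <- VCorpus(VectorSource(docs))

# The target directory must exist before writing.
out_dir <- file.path(tempdir(), "corpus_out")
dir.create(out_dir, showWarnings = FALSE)

# One file per document, named explicitly through `filenames`.
writeCorpus(r, path = out_dir,
            filenames = paste0("doc", seq_along(r), ".txt"))

list.files(out_dir)
```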
4. View the corpus
> print(r)
A corpus with 7012 text documents
> summary(r)
A corpus with 7012 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID
> inspect(r[2])
A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

[[1]]
Female; genstmneoplasms, female/* therapy; humans
> r[[2]]
Female; genstmneoplasms, female/* therapy; humans
5. Create a document-term matrix
> dtm = DocumentTermMatrix(r)
> head(dtm)
A document-term matrix (6 documents, 16381 terms)

Non-/sparse entries: 110/98176
Sparsity           : 100%
Maximal term length: 81
Weighting          : term frequency (tf)
6. View the document-term matrix
> inspect(dtm[1:2,1:4])
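To make the structure concrete: a document-term matrix has one row per document and one column per term, and each cell holds how often that term occurs in that document. The sketch below uses three made-up documents (hypothetical data, not the corpus above) and converts the result to an ordinary matrix, which is only practical for small examples.

```r
library(tm)

# Hypothetical toy corpus: lowercase words, no punctuation.
docs <- c("stem cells therapy",
          "stem cells transplant",
          "breast cancer therapy")
r <- Corpus(VectorSource(docs))

dtm <- DocumentTermMatrix(r)
dim(dtm)                        # documents x terms

# Dense view; fine for a toy example, too large for thousands of documents.
m <- as.matrix(dtm)
m[, c("stem", "therapy")]       # counts of two terms in every document
```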
7. Find terms that appear at least 200 times
> findFreqTerms(dtm,200)
 [1] "acute"          "adjuvant"       "advanced"       "after"
 [5] "and"            "breast"         "cancer"         "cancer:"
 [9] "carcinoma"      "cell"           "chemotherapy"   "clinical"
[13] "colorectal"     "factor"         "for"            "from"
[17] "group"          "growth"         "iii"            "leukemia"
[21] "lung"           "lymphoma"       "metastatic"     "non-small-cell"
[25] "oncology"       "patients"       "phase"          "plus"
[29] "prostate"       "randomized"     "receptor"       "response"
[33] "results"        "risk"           "study"          "survival"
[37] "the"            "therapy"        "treatment"      "trial"
[41] "tumor"          "with"
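The second argument of `findFreqTerms` is a lower bound on the total count of a term across all documents, and the bound is inclusive. A small sketch on made-up data (the three documents and the threshold of 2 are hypothetical), with the same selection recomputed by hand from column sums for comparison:

```r
library(tm)

# Hypothetical toy corpus.
docs <- c("stem cells therapy",
          "stem cells transplant",
          "breast cancer therapy")
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))

# Terms whose total count over all documents is at least 2.
freq <- findFreqTerms(dtm, 2)

# The same selection by hand: sum each term column, keep sums >= 2.
totals <- colSums(as.matrix(dtm))
by_hand <- names(totals)[totals >= 2]

sort(freq)
sort(by_hand)
```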
8. Remove sparse terms (with 0.4, only terms present in more than 60% of the documents are kept)
> inspect(removeSparseTerms(dtm, 0.4))
9. Find terms whose correlation with "stem" is at least 0.5
> findAssocs(dtm, "stem", 0.5)
 stem cells
 1.00  0.61
10. Calculate document dissimilarity (cosine distance)
> dist_dtm <- dissimilarity(dtm, method = 'cosine')
> head(dist_dtm)
[1] 1.0000000 0.7958759 0.8567770 0.9183503 0.9139337 0.9309934
11. Hierarchical clustering
> hc <- hclust(dist_dtm, method = 'ave')
> plot(hc, xlab='')
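The cosine distance and clustering steps can also be sketched in base R alone, which may be useful because `dissimilarity()` may not be available in newer tm releases. The example below is an illustration under assumptions, not the original method: it uses a hand-made count matrix `m` with made-up documents and terms, and defines the distance as 1 minus the cosine similarity of the row vectors.

```r
# Hypothetical document-term counts: rows are documents, columns are terms.
m <- rbind(doc1 = c(stem = 1, cells = 1, therapy = 1, cancer = 0),
           doc2 = c(stem = 1, cells = 1, therapy = 0, cancer = 0),
           doc3 = c(stem = 0, cells = 0, therapy = 1, cancer = 1))

# Cosine similarity: dot product of two rows divided by the product of their norms.
norms <- sqrt(rowSums(m^2))
cos_sim <- (m %*% t(m)) / (norms %o% norms)

# Cosine distance as a "dist" object, the input format hclust expects.
dist_m <- as.dist(1 - cos_sim)

# Average-linkage hierarchical clustering and its dendrogram.
hc <- hclust(dist_m, method = "average")
plot(hc, xlab = "")
```

With real data, `m` could be replaced by `as.matrix(dtm)`, provided the matrix fits in memory and contains no empty documents (an all-zero row has norm 0 and would produce a division by zero).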