Text Mining with R, Part 4


Part 4: Text Classification


Part 3 covered text clustering and briefly noted the difference between classification and clustering: classification requires labeled examples, clustering does not. So to do classification we need to prepare a training set, i.e., texts that already have clear category labels; a test set, which can be split off from the training set; and a prediction set, the unlabeled texts to which the classification method is ultimately applied.

1. Data preparation

Preparing the training set is a tedious job; I have not found any labor-saving way to do it, so the texts must be sorted manually by content. Here I again use the official Weibo data of a certain brand. Based on the content of each post, I grouped the posts into six categories: promotions (promotion), product publicity (product), public welfare (publicwelfare), lifestyle/"chicken soup" posts (life), fashion news (fashionnews), and film and entertainment (showbiz). Each category has 20-50 posts; the per-category counts in the training set can be seen below. Naming the categories in Chinese would also work fine.

The training set is hlzj.train; it is also used later as the test set.

The prediction set is hlzj from Part 2.
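The layout of hlzj_train.csv is not shown in the original; judging from length(hlzj.train) returning 2 below and the later use of hlzj.train$type and hlzj.train$text, it is assumed to have two columns, a category label and the post text. A tiny stand-in built directly in R (the texts are invented placeholders):

example.train <- data.frame(
  type = c("product", "promotion"),   # category label
  text = c("新品上市……", "限时优惠……"),  # Weibo post content (placeholders)
  stringsAsFactors = FALSE
)
str(example.train)  # two columns: type, text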

> hlzj.train <- read.csv("hlzj_train.csv", header=T, stringsAsFactors=F)
> length(hlzj.train)
[1] 2
> table(hlzj.train$type)
  fashionnews          life       product
           27            34            38
    promotion publicwelfare       showbiz
           45            22            36
> length(hlzj)
[1] 1639

2. Word segmentation

The training set, the test set, and the prediction set all need word segmentation before any classification can be done. The process is the same as described in Part 2, so it is not repeated in detail here. The segmented training set is hlzjTrainTemp; hlzjTemp is the result of segmenting the hlzj file earlier. Stop words are then removed from hlzjTrainTemp and hlzjTemp respectively.

> library(Rwordseg)
Loading required package: rJava
# Version: 0.2-1
> hlzjTrainTemp <- gsub("[0-9０１２３４５６７８９ < > ~]", "", hlzj.train$text)
> hlzjTrainTemp <- segmentCN(hlzjTrainTemp)
> hlzjTrainTemp2 <- lapply(hlzjTrainTemp, removeStopWords, stopwords)
> hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)
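The removeStopWords helper and the stopwords vector come from Part 2 of this series and are not redefined above. In case Part 2 is not at hand, here is a minimal equivalent sketch (my own, assuming stopwords is simply a character vector of words to drop):

# Drop every segmented word that appears in the stop-word list.
removeStopWords <- function(x, words) {
  x[!(x %in% words)]
}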

3. Get the matrix

As in Part 3, the texts must be converted to a matrix before clustering, and classification needs the same step; the tm package is used again. First merge the segmented training set and prediction set into hlzjAll, remembering that the first 202 entries (1:202) are the training set and the following 1639 (203:1841) are the prediction set. Build a corpus from hlzjAll, compute the document-term matrix, and convert it to an ordinary matrix.

> hlzjAll <- character(0)
> hlzjAll[1:202] <- hlzjTrainTemp2
> hlzjAll[203:1841] <- hlzjTemp2
> length(hlzjAll)
[1] 1841
> library(tm)  # provides Corpus() and DocumentTermMatrix()
> corpusAll <- Corpus(VectorSource(hlzjAll))
> (hlzjAll.dtm <- DocumentTermMatrix(corpusAll, control=list(wordLengths=c(2,Inf))))
<<DocumentTermMatrix (documents: 1841, terms: 10973)>>
Non-/sparse entries: 33663/20167630
Sparsity           : 100%
Maximal term length: 47
Weighting          : term frequency (tf)
> dtmAll_matrix <- as.matrix(hlzjAll.dtm)

4. Classification

Classification is done with the kNN algorithm (k-nearest neighbors), provided by the class package. The first 202 rows of the matrix are the training set and already labeled; the following 1639 rows are unlabeled, and the model built from the training set is used to classify them. Putting the predicted labels back together with the original posts and browsing the result with fix() shows that the classification works quite visibly well.

> rownames(dtmAll_matrix)[1:202] <- hlzj.train$type
> rownames(dtmAll_matrix)[203:1841] <- c("")
> train <- dtmAll_matrix[1:202, ]
> predict <- dtmAll_matrix[203:1841, ]
> trainClass <- as.factor(rownames(train))
> library(class)
> hlzj_knnClassify <- knn(train, predict, trainClass)
> length(hlzj_knnClassify)
[1] 1639
> hlzj_knnClassify[1:10]
 [1] product     product     product     promotion   product     fashionnews life
 [8] product     product     fashionnews
Levels: fashionnews life product promotion publicwelfare showbiz
> table(hlzj_knnClassify)
hlzj_knnClassify
  fashionnews          life       product     promotion publicwelfare       showbiz
           40           869            88           535            28            79
> hlzj.knnResult <- list(type=hlzj_knnClassify, text=hlzj)
> hlzj.knnResult <- as.data.frame(hlzj.knnResult)
> fix(hlzj.knnResult)

kNN is the simplest classification algorithm. I also tried a neural network (nnet()), a support vector machine (svm()), and a random forest (randomForest()), but ran into insufficient memory: my computer has 4 GB, and the memory monitor showed usage peaking at 3.92 GB. It looks like I will have to borrow a machine with more memory. ╮(╯▽╰)╭ A sketch of what those calls would look like follows below.
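For reference, a minimal sketch of the SVM and random forest calls on the same data (my assumption; the original only names the functions and does not show this code). svm() comes from the e1071 package and randomForest() from the randomForest package; both accept a feature matrix x and a label factor y. nnet() is omitted here because its multi-class setup needs extra arguments. Note that the 1639-row matrix is named predict in the code above, which is why it appears as an argument below:

library(e1071)         # provides svm()
library(randomForest)  # provides randomForest()

# Train on the 202 labeled rows, then classify the 1639 unlabeled rows.
hlzj_svm <- svm(x = train, y = trainClass)
svmClass <- predict(hlzj_svm, predict)

hlzj_rf  <- randomForest(x = train, y = trainClass)
rfClass  <- predict(hlzj_rf, predict)

On a dense 1841 x 10973 matrix these models are what exhausted the 4 GB of memory described above, so expect to need more RAM or a smaller term set.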

When the hardware is up to it, these classifiers should run without problems. The documentation for any of these algorithms can be viewed with ? followed by the function name, e.g. ?knn.

5. Classification performance

The testing step was not shown above. In the example, if knn() is called with train as both its first and second argument, the same data is used for training and testing, so the reported accuracy trivially reaches 100%. With a larger training set, randomly split it 7:3 or 8:2 into two parts, train on the former and test on the latter; this is not elaborated further here, but a brief sketch follows below.
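A minimal sketch of such a split-and-test run (my own, assuming the 202-row labeled matrix train and its labels trainClass from section 4):

set.seed(1)                       # make the random split reproducible
n   <- nrow(train)
idx <- sample(n, round(0.7 * n))  # 70% of rows for training

testPred <- knn(train[idx, ], train[-idx, ], trainClass[idx])
accuracy <- mean(testPred == trainClass[-idx])
accuracy                          # share of held-out posts classified correctly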

If the classification results are not ideal, the remedy is to enrich the training set so that each category's features become as distinctive as possible. In real problems this is a tedious step, but not one that can be skimped on.


Corrections and improvements are welcome. Please credit the source when reposting. Thank you!
