Part 4: Text classification
Part 3 covered text clustering; the basic difference between classification and clustering is that classification requires a training set, i.e., text whose categories are already known. A test set is used to check the model (here the training set will double as the test set), and the prediction set is the unclassified text to which the classification method is ultimately applied.
1. Data preparation
Preparing the training set is very tedious work; I have not found any labor-saving shortcut, so the texts have to be organized manually by content. Here I again use the Weibo data of a brand's official account. Based on the content of each post, I grouped the posts into: promotional information (promotion), product promotion (product), public welfare information (publicwelfare), chicken-soup life posts (life), and fashion information (fashionnews), film and TV entertainment (showbiz), with 20-50 items per category. The number of texts per category in the training set can be seen below; naming the categories in Chinese would also work fine.
The training set is hlzj.train, and it will later double as the test set. The prediction set is hlzj from Part 2.
> hlzj.train <- read.csv("hlzj_train.csv", header=T, stringsAsFactors=F)
> length(hlzj.train)
[1] 2
> table(hlzj.train$type)
  fashionnews          life       product 
           27            34            38 
    promotion publicwelfare       showbiz 
           45            22            36 
> length(hlzj)
[1] 1639
2. Word segmentation
The training set, the test set, and the prediction set all need to be segmented into words before the subsequent classification steps. The process is the same as described in Part 2, so it is not repeated in detail here. The segmented training set is hlzjTrainTemp; hlzjTemp is the result of segmenting the hlzj file earlier. Stopwords are then removed from hlzjTrainTemp and hlzjTemp respectively.
> library(Rwordseg)
Loading required package: rJava
# Version: 0.2-1
> hlzjTrainTemp <- gsub("[0-9０１２３４５６７８９ < > ~]", "", hlzj.train$text)
> hlzjTrainTemp <- segmentCN(hlzjTrainTemp)
> hlzjTrainTemp2 <- lapply(hlzjTrainTemp, removeStopWords, stopwords)
> hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)
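The removeStopWords() used above is the helper function defined back in Part 2. Since its body is not repeated in this part, the following is only a minimal sketch of what such a helper typically looks like, assuming stopwords is a character vector of stopwords (my reconstruction, not necessarily the original definition):

# Sketch of a removeStopWords helper: for one segmented document,
# drop empty strings and any word that appears in the stopword list.
removeStopWords <- function(words, stopwords) {
    words <- words[words != ""]
    words[!(words %in% stopwords)]
}
# Usage matches the lapply() calls above, e.g.:
# hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)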
3. Getting the matrix
As in Part 3, the text has to be converted into a matrix before clustering, and classification needs the same step, again using the tm package. First merge the segmentation results of the training set and the prediction set into hlzjAll, remembering that the first 202 elements (1:202) are the training set and the remaining 1639 (203:1841) are the prediction set. Build a corpus from hlzjAll, compute the document-term matrix (wordLengths = c(2, Inf) keeps only terms of at least two characters), and convert it to an ordinary matrix.
> hlzjAll <- character(0)
> hlzjAll[1:202] <- hlzjTrainTemp2
> hlzjAll[203:1841] <- hlzjTemp2
> length(hlzjAll)
[1] 1841
> corpusAll <- Corpus(VectorSource(hlzjAll))
> (hlzjAll.dtm <- DocumentTermMatrix(corpusAll, control = list(wordLengths = c(2, Inf))))
<<DocumentTermMatrix (documents: 1841, terms: 10973)>>
Non-/sparse entries: 33663/20167630
Sparsity           : 100%
Maximal term length: 47
Weighting          : term frequency (tf)
> dtmAll_matrix <- as.matrix(hlzjAll.dtm)
4. Classification
The kNN algorithm (k-nearest neighbors) is used here; it is provided by the class package. The first 202 rows of the matrix are the training set and already carry labels; the remaining 1639 rows are unlabeled and are classified according to the model obtained from the training set. After binding the classification results back to the original Weibo text and viewing them with fix(), the classification results are clearly visible and look quite convincing.
> rownames(dtmAll_matrix)[1:202] <- hlzj.train$type
> rownames(dtmAll_matrix)[203:1841] <- c("")
> train <- dtmAll_matrix[1:202, ]
> predict <- dtmAll_matrix[203:1841, ]
> trainClass <- as.factor(rownames(train))
> library(class)
> hlzj_knnClassify <- knn(train, predict, trainClass)
> length(hlzj_knnClassify)
[1] 1639
> hlzj_knnClassify[1:10]
 [1] product     product     product     promotion   product     fashionnews life
 [8] product     product     fashionnews
Levels: fashionnews life product promotion publicwelfare showbiz
> table(hlzj_knnClassify)
hlzj_knnClassify
  fashionnews          life       product     promotion publicwelfare       showbiz 
           40           869            88           535            28            79 
> hlzj.knnResult <- list(type = hlzj_knnClassify, text = hlzj)
> hlzj.knnResult <- as.data.frame(hlzj.knnResult)
> fix(hlzj.knnResult)
kNN is the simplest of the classification algorithms. I also tried the neural network algorithm (nnet()), the support vector machine algorithm (svm()), and the random forest algorithm (randomForest()), but ran into insufficient memory; my computer has 4 GB, and the memory monitor showed usage peaking at 3.92 GB. Looks like I will have to borrow a computer with more memory. ╮(╯▽╰)╭
Once the hardware allows, implementing those classifiers should pose no problem. For any of these algorithms you can view its documentation with ?methodName (for example, ?knn); a sketch of how they might be called follows.
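The original post does not show the calls for those alternative classifiers, so here is only a minimal sketch of how they might be invoked on the same matrices, assuming the e1071 and randomForest packages are installed and reusing the train, predict, and trainClass objects built above (the package choices and arguments are my assumptions, not the author's code):

# Sketch: alternative classifiers on the same document-term matrix,
# memory permitting. 'predict' below is the unlabeled matrix built
# earlier; its name shadows predict() but R resolves the call correctly.
library(e1071)          # provides svm()
library(randomForest)   # provides randomForest()

svmModel <- svm(train, trainClass)        # train a support vector machine
svmPred  <- predict(svmModel, predict)    # classify the prediction set

rfModel <- randomForest(train, trainClass, ntree = 100)
rfPred  <- predict(rfModel, predict)
table(rfPred)                             # distribution of predicted classes

# nnet::nnet() could be tried similarly, but with ~11,000 input features
# it is extremely memory-hungry, consistent with the problem noted above.

One way to ease the memory pressure (again my assumption, not in the original) is to call removeSparseTerms() from the tm package on hlzjAll.dtm to drop very rare terms before converting it with as.matrix().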
5. Classification performance
The testing step was not shown above. In this example, the first two arguments of knn() would both be train; since the very same data set is used for training and testing, the reported accuracy can reach 100%, which says little. When more training data is available, it can be randomly split 7:3 or 8:2 into two parts, training on the former and testing on the latter (see the sketch below). This is not elaborated further.
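As a concrete illustration of such a split (a minimal sketch of my own, not code from the original post), reusing the train matrix and trainClass factor built above:

# Sketch: random 7:3 split of the labeled data, train on 70%, test on 30%.
library(class)
set.seed(1)                               # for reproducibility
n   <- nrow(train)
idx <- sample(n, size = round(0.7 * n))   # indices of the training portion
pred <- knn(train[idx, ], train[-idx, ], trainClass[idx])
mean(pred == trainClass[-idx])            # accuracy on the held-out 30%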
If the classification performance is unsatisfactory, the training set needs to be enriched so that the features of each category become as distinctive as possible. In practical problems this is a tedious step, but it must not be done perfunctorily.
Corrections are welcome wherever there is room for improvement. Please credit the source when reposting. Thank you!