Text Mining in R, Part 4: Text Classification

Last Update: 2015-09-10 Source: Internet

Author: User


Part 4: Text classification

Part 3, on text clustering, briefly mentioned how clustering differs from classification: classification needs text whose categories are already known.

So we need to be clear about three data sets. The training set is text with known categories. The test set can simply be replaced by the training set here. The prediction set is the unclassified text, and it is where the classification method is finally applied.

1. Data preparation

Preparing the training set is tedious. I have not yet found a labor-saving way to do it, so I labeled the text manually according to its content. The data again comes from a brand's official Weibo account. I divided the main content of its microblog posts into six categories: promotional information (promotion), product recommendations (product), public welfare information (publicwelfare), inspirational life quotes (life), fashion news (fashionnews), and film and television entertainment (showbiz). Each category has 20 to 50 records. The number of texts in each category of the training set is shown below. Category names may also be in Chinese.

The training set is hlzj.train; it will also serve as the test set later.

The prediction set is the hlzj data from Part 2.


> hlzj.train <- read.csv("hlzj_train.csv", header = TRUE, stringsAsFactors = FALSE)
> length(hlzj.train)
[1] 2
> table(hlzj.train$type)

  fashionnews          life       product
           27            34            38
    promotion publicwelfare       showbiz
           45            22            36
> length(hlzj)
[1] 1639
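
To make the calls above concrete, here is a minimal sketch on a made-up four-row data set. The rows and the name toy.train are invented for illustration; only the type and text column names come from the transcript. It also shows why length() returns 2 there: on a data frame, length() counts columns, not rows.

```r
# Hypothetical stand-in for hlzj_train.csv: a "type" label column and a "text" column.
csv_text <- "type,text
promotion,Buy one get one free this weekend
promotion,20 percent off all lipsticks
product,Our new moisturizing cream is here
life,Start every morning with a smile
"
toy.train <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

length(toy.train)        # number of columns in the data frame
nrow(toy.train)          # number of labeled posts
table(toy.train$type)    # number of posts per category
```

Use nrow() rather than length() when you want the number of labeled posts.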

2. Word segmentation

The training set, test set, and prediction set must all be segmented into words before classification.

The details are not repeated here; the process is the same as described in Part 2.

After segmentation, the training set is hlzjTrainTemp. The segmented version of the hlzj data produced earlier is hlzjTemp.

Stop words are then removed from hlzjTrainTemp and hlzjTemp respectively.

> library(Rwordseg)
Loading required package: rJava
# Version: 0.2-1
> hlzjTrainTemp <- gsub("[0-9０１２３４５６７８９ < > ~]", "", hlzj.train$text)
> hlzjTrainTemp <- segmentCN(hlzjTrainTemp)
> hlzjTrainTemp2 <- lapply(hlzjTrainTemp, removeStopWords, stopwords)
> hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)
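
The stop-word helper and the stop-word list come from Part 2 and are not shown here. As a rough sketch of what such a helper might look like, assuming it simply drops every token found in the list (this definition, the toy tokens, and the toy stop words are all assumptions for illustration):

```r
# Hypothetical helper: keep only the tokens that are not in the stop-word list.
removeStopWords <- function(x, words) x[!(x %in% words)]

# Made-up, already-segmented posts: one character vector of tokens per post,
# which is the shape a word segmenter returns for a vector of texts.
segmented <- list(
  c("the", "new", "lipstick", "is", "on", "sale"),
  c("a", "charity", "walk", "is", "planned")
)
stopwords <- c("the", "a", "is", "on")

# lapply() passes `stopwords` as the second argument of removeStopWords().
cleaned <- lapply(segmented, removeStopWords, stopwords)
cleaned[[1]]

# Stripping ASCII digits and a few symbols before segmentation.
stripped <- gsub("[0-9<>~]", "", "Sale~ 50 items <today>")
```

The result is still a list with one token vector per post, so it can be handled the same way as the segmenter's output.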

3. Build the matrix

As with clustering in Part 3, the text must first be converted to a matrix before it can be classified, again using the tm package. The training set and the prediction set, both with stop words removed, are merged into hlzjall. Remember that the first 202 entries (1:202) are the training set and the remaining 1639 (203:1841) are the prediction set. Build a corpus from hlzjall, get the document-term matrix, and convert it to an ordinary matrix.

> hlzjall <- character(0)
> hlzjall[1:202] <- hlzjTrainTemp2
> hlzjall[203:1841] <- hlzjTemp2
> length(hlzjall)
[1] 1841
> corpusall <- Corpus(VectorSource(hlzjall))
> (hlzjall.dtm <- DocumentTermMatrix(corpusall, control = list(wordLengths = c(2, Inf))))
<<DocumentTermMatrix (documents: 1841, terms: 10973)>>
Non-/sparse entries: 33663/20167630
Sparsity           : 100%
Maximal term length: 47
Weighting          : term frequency (tf)
> dtmall_matrix <- as.matrix(hlzjall.dtm)
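
For readers who have not seen a document-term matrix before, the sketch below builds a tiny one by hand in base R. It is not how the tm package does it internally, only an illustration of the resulting structure: one row per document, one column per term, and each cell holding a term count. The toy documents are invented, and the nchar() filter is my approximation of wordLengths = c(2, Inf).

```r
# Made-up, already-segmented documents.
docs <- list(
  c("new", "lipstick", "sale"),
  c("lipstick", "new", "new", "a"),
  c("charity", "walk")
)

# Vocabulary: every distinct token with at least 2 characters.
vocab <- sort(unique(unlist(docs)))
vocab <- vocab[nchar(vocab) >= 2]

# Count each vocabulary term per document; tokens outside the vocabulary are ignored.
count_terms <- function(tokens) as.integer(table(factor(tokens, levels = vocab)))
dtm <- t(vapply(docs, count_terms, integer(length(vocab))))
dimnames(dtm) <- list(paste0("doc", seq_along(docs)), vocab)

dtm
# Share of zero cells; the real matrix above is far sparser.
mean(dtm == 0)
```

In the real transcript only 33663 of the 1841 x 10973 cells are non-zero, which is why the sparsity is reported as 100% after rounding.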

4. Classification

The kNN (k-nearest neighbors) algorithm is used here. It is provided by the class package.

The first 202 rows of the matrix are the training set, which is already labeled, and the following 1639 rows are unlabeled. The classification model is derived from the training set and then used to predict a category for each row of the prediction set.

Combine the classification results with the original Weibo posts. Viewing them with fix() shows the result for each post, and the effect is quite clear.

> rownames(dtmall_matrix)[1:202] <- hlzj.train$type
> rownames(dtmall_matrix)[203:1841] <- c("")
> train <- dtmall_matrix[1:202, ]
> predict <- dtmall_matrix[203:1841, ]
> trainclass <- as.factor(rownames(train))
> library(class)
> hlzj_knnclassify <- knn(train, predict, trainclass)
> length(hlzj_knnclassify)
[1] 1639
> hlzj_knnclassify[1:10]
 [1] product     product     product     promotion   product     fashionnews life
 [8] product     product     fashionnews
Levels: fashionnews life product promotion publicwelfare showbiz
> table(hlzj_knnclassify)
hlzj_knnclassify
  fashionnews          life       product     promotion publicwelfare       showbiz
           40           869            88           535            28            79
> hlzj.knnresult <- list(type = hlzj_knnclassify, text = hlzj)
> hlzj.knnresult <- as.data.frame(hlzj.knnresult)
> fix(hlzj.knnresult)
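
The same steps can be tried on a toy matrix small enough to check by hand. This is only a sketch: the term counts, labels, and post texts below are invented, and k = 1 is written out explicitly (it is also the default of knn()).

```r
library(class)

# Made-up term counts for four labeled posts (rows) over three terms (columns).
train <- rbind(c(2, 0, 0), c(3, 1, 0), c(0, 0, 2), c(0, 1, 3))
colnames(train) <- c("sale", "lipstick", "charity")
trainclass <- factor(c("promotion", "promotion", "publicwelfare", "publicwelfare"))

# Two unlabeled posts described by the same three terms.
new_posts <- rbind(c(1, 0, 0), c(0, 0, 1))
colnames(new_posts) <- colnames(train)

# Each new post gets the label of its nearest training row (Euclidean distance).
pred <- knn(train, new_posts, trainclass, k = 1)

# Put the predicted label next to the original text, as in the transcript.
result <- data.frame(type = pred, text = c("big sale today", "join our charity run"))
result
```

The first post is closest to the training row (2, 0, 0) and the second to (0, 0, 2), so they are labeled promotion and publicwelfare respectively.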

kNN is the simplest of the classification algorithms. When I later tried a neural network (nnet()), a support vector machine (svm()), and a random forest (randomForest()), I ran out of memory. My computer has 4 GB of RAM, and the memory monitor showed usage peaking at 3.92 GB.

Looks like it is time to switch to a more powerful computer. ╮(╯▽╰)╭

With adequate hardware, these classifiers should run without problems. To read the documentation for any of these algorithms, use ??methodname.

5. Classification performance

The testing process was not covered above. For this example, testing means passing train as both of the first two arguments to knn(). Because the same data set is used for both, and k defaults to 1 so each document's nearest neighbor is itself, the accuracy comes out at 100%. When the training set is larger, it can be split randomly 7:3 or 8:2 into two parts, the former for training and the latter for testing. This is not described in detail here.
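
A minimal sketch of both test procedures, on invented data rather than the Weibo set: the term counts, the seed, and all variable names are assumptions for illustration, and the two toy categories are deliberately easy to separate.

```r
library(class)

# Made-up term counts: 10 "promotion" posts and 10 "publicwelfare" posts.
sale_docs    <- cbind(sale = 10:19, charity = 0, lipstick = 1)
charity_docs <- cbind(sale = 0, charity = 10:19, lipstick = 1)
x <- rbind(sale_docs, charity_docs)
labels <- factor(rep(c("promotion", "publicwelfare"), each = 10))

# Testing on the training data itself: with k = 1 every row is its own
# nearest neighbor, so this accuracy says nothing about new posts.
self_pred <- knn(x, x, labels, k = 1)
self_accuracy <- mean(self_pred == labels)

# Random 7:3 split: 70% of the rows for training, the rest held out for testing.
set.seed(42)
train_idx <- sample(nrow(x), size = round(0.7 * nrow(x)))
test_pred <- knn(x[train_idx, ], x[-train_idx, ], labels[train_idx], k = 1)
test_accuracy <- mean(test_pred == labels[-train_idx])

# Confusion table of predicted versus actual labels on the held-out rows.
table(predicted = test_pred, actual = labels[-train_idx])
```

On real text the held-out accuracy will usually be lower than the self-test accuracy, and it is the held-out figure that reflects how the classifier behaves on the prediction set.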

When classification performance is unsatisfactory, improving it requires enriching the training set and making the features of each category as distinct as possible. In real problems this is a tedious process, but not one that can be done carelessly.

Corrections and suggestions for improvement are welcome. If you repost this article, please credit the source. Thank you!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
