PART4 Text classification
Part 3 covered text clustering; here I briefly note how classification differs from clustering. Classification needs three things to be sorted out: a training set, texts whose categories are already clearly known; a test set, which can be taken from the training set, used to check the classifier; and a prediction set, the unclassified text to which the classification method is finally applied.
1. Data preparation
Preparing the training set is a very tedious job; I have not found any way to avoid organizing the texts by hand according to their content. Here I again use the official Weibo data of a certain brand and sort the posts by content into six categories: promotional information (promotion), product introductions (product), public-welfare information (publicwelfare), inspirational "chicken soup" posts (life), fashion news (fashionnews), and film and TV entertainment (showbiz). Each category contains 20-50 posts. The counts per category in the training set can be seen below; the category names are in English, while the post content itself is Chinese.
The training set is hlzj.train, and it will also serve as the test set later. The prediction set is the hlzj object from Part 2.
> hlzj.train <- read.csv("hlzj_train.csv", header = T, stringsAsFactors = F)
> length(hlzj.train)
[1] 2
> table(hlzj.train$type)
fashionnews        life     product
         27          34          38
  promotion publicwelfare  showbiz
         45          22          36
> length(hlzj)
[1] 1639
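(Note that length() on a data frame returns its number of columns, so the [1] 2 above simply confirms that hlzj.train has two columns, type and text.) For orientation, here is a minimal, hypothetical sketch of the format hlzj_train.csv is assumed to take; the column names type and text come from the code in this post, but the sample rows are invented:

# hypothetical illustration of the expected two-column format
demo <- data.frame(
    type = c("promotion", "fashionnews"),            # category label
    text = c("限时优惠，全场八折……", "本季流行色抢先看……"),  # raw Weibo text
    stringsAsFactors = FALSE
)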
2. Word segmentation
The training set, test set, and prediction set all need to be segmented into words before classification can be done. I will not go into the details again; the process is the same as described in Part 2. The segmented training set is hlzjTrainTemp, and the hlzj file segmented earlier is hlzjTemp. Stop words are then removed from hlzjTrainTemp and hlzjTemp respectively.
> library(Rwordseg)
Loading required package: rJava
# Version: 0.2-1
> hlzjTrainTemp <- gsub("[0-9０１２３４５６７８９<>~]", "", hlzj.train$text)
> hlzjTrainTemp <- segmentCN(hlzjTrainTemp)
> hlzjTrainTemp2 <- lapply(hlzjTrainTemp, removeStopWords, stopwords)
> hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)
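removeStopWords is not a built-in function; it is the small helper defined back in Part 2. A minimal sketch of what it is assumed to look like (x is one segmented text, words is the stop-word list):

removeStopWords <- function(x, words) {
    # keep only the terms that do not appear in the stop-word list
    ret <- character(0)
    for (w in x) {
        if (!(w %in% words)) ret <- c(ret, w)
    }
    ret
}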
3. Get the matrix
As with the clustering in Part 3, the texts must first be converted into a matrix before classification, again using the tm package. The stop-word-filtered training set and prediction set are merged into hlzjAll; remember that the first 202 elements (1:202) are the training set and the remaining 1639 (203:1841) are the prediction set. Build a corpus from hlzjAll, compute the document-term matrix, and then convert it into an ordinary matrix.
> hlzjAll <- character(0)
> hlzjAll[1:202] <- hlzjTrainTemp2
> hlzjAll[203:1841] <- hlzjTemp2
> length(hlzjAll)
[1] 1841
> corpusAll <- Corpus(VectorSource(hlzjAll))
> (hlzjAll.dtm <- DocumentTermMatrix(corpusAll, control = list(wordLengths = c(2, Inf))))
<<DocumentTermMatrix (documents: 1841, terms: 10973)>>
Non-/sparse entries: 33663/20167630
Sparsity           : 100%
Maximal term length: 47
Weighting          : term frequency (tf)
> dtmAll_matrix <- as.matrix(hlzjAll.dtm)
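The weighting above is plain term frequency. The tm package also provides a tf-idf weighting function, weightTfIdf, which down-weights terms that occur in almost every document; whether it helps on this data is untested, so treat this as an optional variation rather than part of the original workflow:

> hlzjAll.dtm2 <- DocumentTermMatrix(corpusAll,
+       control = list(wordLengths = c(2, Inf), weighting = weightTfIdf))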
4. Classification
Classification is done with the kNN algorithm (k-nearest neighbours), which is provided by the class package. The first 202 rows of the matrix are the training set and already carry category labels; the remaining 1639 rows are unclassified, and their categories are predicted from the training set. Putting the predicted categories back together with the original Weibo posts and browsing the result with fix() shows that the classification effect is quite noticeable.
> rownames(dtmAll_matrix)[1:202] <- hlzj.train$type
> rownames(dtmAll_matrix)[203:1841] <- c("")
> train <- dtmAll_matrix[1:202, ]
> predict <- dtmAll_matrix[203:1841, ]
> trainClass <- as.factor(rownames(train))
> library(class)
> hlzj_knnClassify <- knn(train, predict, trainClass)
> length(hlzj_knnClassify)
[1] 1639
> hlzj_knnClassify[1:10]
 [1] product     product     product     promotion   product     fashionnews life
 [8] product     product     fashionnews
Levels: fashionnews life product promotion publicwelfare showbiz
> table(hlzj_knnClassify)
hlzj_knnClassify
  fashionnews          life       product     promotion publicwelfare       showbiz
           40           869            88           535            28            79
> hlzj.knnResult <- list(type = hlzj_knnClassify, text = hlzj)
> hlzj.knnResult <- as.data.frame(hlzj.knnResult)
> fix(hlzj.knnResult)
kNN is the simplest of the classification algorithms. When I tried a neural network (nnet()), a support vector machine (svm()), and a random forest (randomForest()) instead, I ran into out-of-memory problems: my computer has 4 GB of RAM, and the memory monitor showed usage peaking at 3.92 GB. Looks like I'll have to borrow a better-equipped machine. ╮(╯▽╰)╭ With adequate hardware, the classification should run without problems. Each algorithm's documentation can be viewed with ? followed by the function name, e.g. ?knn.
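As an illustration only (my machine's memory did not allow me to verify this on the data above), calling the SVM and random forest alternatives would look roughly as follows; e1071 and randomForest are the usual packages providing svm() and randomForest():

> library(e1071)
> hlzj_svm <- svm(train, trainClass)        # may exhaust memory on a large dtm
> svmPred  <- predict(hlzj_svm, predict)
> library(randomForest)
> hlzj_rf  <- randomForest(x = train, y = trainClass, ntree = 100)
> rfPred   <- predict(hlzj_rf, predict)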
5. Classification performance
The testing step was not shown above. In this example, testing would mean calling knn() with train as both its first and second argument; since those are the same data set, the accuracy trivially reaches 100% and says nothing about real performance. With a larger training set, a better approach is to split it randomly 7:3 or 8:2, train on the larger part, and test on the smaller. I will not spell that out in detail, but a sketch follows.
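A minimal sketch of such a split, assuming the train matrix and trainClass factor from section 4 (the 0.7 ratio and the seed are arbitrary choices of mine, not from the original post):

> set.seed(1234)                             # arbitrary seed, for reproducibility
> idx     <- sample(1:202, size = round(0.7 * 202))
> knnTest <- knn(train[idx, ], train[-idx, ], trainClass[idx])
> mean(knnTest == trainClass[-idx])          # proportion classified correctly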
When the classification performance is not ideal, improving it comes down to enriching the training set and making the features of each category as distinctive as possible. In real problems this is a very tedious process, but not one that can be done perfunctorily.
Corrections and suggestions for improvement are welcome. Please credit the source when reposting. Thank you!