Part 4: Text classification
Part 3 covered text clustering; the basic difference between classification and clustering is that classification requires a training set, i.e., text whose categories are already known. A test set is used to check the model (here the training set will double as the test set), and the prediction set is the unclassified text to which the classification method is ultimately applied.
1. Data preparation
Preparing the training set is very tedious work; I have not found any labor-saving shortcut, so the texts have to be organized manually by content. Here I again use the Weibo data of a brand's official account. Based on the content of each post, I grouped the posts into: promotional information (promotion), product promotion (product), public welfare information (publicwelfare), chicken-soup life posts (life), and fashion information (fashionnews), film and TV entertainment (showbiz), with 20-50 items per category. The number of texts per category in the training set can be seen below; naming the categories in Chinese would also work fine.
The training set is hlzj.train, and it will later double as the test set. The prediction set is hlzj from Part 2.
> hlzj.train <- read.csv("hlzj_train.csv", header=T, stringsAsFactors=F)
> length(hlzj.train)
[1] 2
> table(hlzj.train$type)
  fashionnews          life       product 
           27            34            38 
    promotion publicwelfare       showbiz 
           45            22            36 
> length(hlzj)
[1] 1639
2. Word segmentation
The training set, the test set, and the prediction set all need to be segmented into words before the subsequent classification steps. The process is the same as described in Part 2, so it is not repeated in detail here. The segmented training set is hlzjTrainTemp; hlzjTemp is the result of segmenting the hlzj file earlier. Stopwords are then removed from hlzjTrainTemp and hlzjTemp respectively.
> library(Rwordseg)
Loading required package: rJava
# Version: 0.2-1
> hlzjTrainTemp <- gsub("[0-9０１２３４５６７８９ < > ~]", "", hlzj.train$text)
> hlzjTrainTemp <- segmentCN(hlzjTrainTemp)
> hlzjTrainTemp2 <- lapply(hlzjTrainTemp, removeStopWords, stopwords)
> hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)
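The removeStopWords() used above is the helper function defined back in Part 2. Since its body is not repeated in this part, the following is only a minimal sketch of what such a helper typically looks like, assuming stopwords is a character vector of stopwords (my reconstruction, not necessarily the original definition):

# Sketch of a removeStopWords helper: for one segmented document,
# drop empty strings and any word that appears in the stopword list.
removeStopWords <- function(words, stopwords) {
    words <- words[words != ""]
    words[!(words %in% stopwords)]
}
# Usage matches the lapply() calls above, e.g.:
# hlzjTemp2 <- lapply(hlzjTemp, removeStopWords, stopwords)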
3. Getting the matrix
As in Part 3, the text has to be converted into a matrix before clustering, and classification needs the same step, again using the tm package. First merge the segmentation results of the training set and the prediction set into hlzjAll, remembering that the first 202 elements (1:202) are the training set and the remaining 1639 (203:1841) are the prediction set. Build a corpus from hlzjAll, compute the document-term matrix (wordLengths = c(2, Inf) keeps only terms of at least two characters), and convert it to an ordinary matrix.
> hlzjAll <- character(0)
> hlzjAll[1:202] <- hlzjTrainTemp2
> hlzjAll[203:1841] <- hlzjTemp2
> length(hlzjAll)
[1] 1841
> corpusAll <- Corpus(VectorSource(hlzjAll))
> (hlzjAll.dtm <- DocumentTermMatrix(corpusAll, control = list(wordLengths = c(2, Inf))))
<<DocumentTermMatrix (documents: 1841, terms: 10973)>>
Non-/sparse entries: 33663/20167630
Sparsity           : 100%
Maximal term length: 47
Weighting          : term frequency (tf)
> dtmAll_matrix <- as.matrix(hlzjAll.dtm)
4. Classification
The kNN algorithm (k-nearest neighbors) is used here; it is provided by the class package. The first 202 rows of the matrix are the training set and already carry labels; the remaining 1639 rows are unlabeled and are classified according to the model obtained from the training set. After binding the classification results back to the original Weibo text and viewing them with fix(), the classification results are clearly visible and look quite convincing.
> rownames(dtmAll_matrix)[1:202] <- hlzj.train$type
> rownames(dtmAll_matrix)[203:1841] <- c("")
> train <- dtmAll_matrix[1:202, ]
> predict <- dtmAll_matrix[203:1841, ]
> trainClass <- as.factor(rownames(train))
> library(class)
> hlzj_knnClassify <- knn(train, predict, trainClass)
> length(hlzj_knnClassify)
[1] 1639
> hlzj_knnClassify[1:10]
 [1] product     product     product     promotion   product     fashionnews life
 [8] product     product     fashionnews
Levels: fashionnews life product promotion publicwelfare showbiz
> table(hlzj_knnClassify)
hlzj_knnClassify
  fashionnews          life       product     promotion publicwelfare       showbiz 
           40           869            88           535            28            79 
> hlzj.knnResult <- list(type = hlzj_knnClassify, text = hlzj)
> hlzj.knnResult <- as.data.frame(hlzj.knnResult)
> fix(hlzj.knnResult)
kNN is the simplest of the classification algorithms. I also tried the neural network algorithm (nnet()), the support vector machine algorithm (svm()), and the random forest algorithm (randomForest()), but ran into insufficient memory; my computer has 4 GB, and the memory monitor showed usage peaking at 3.92 GB. Looks like I will have to borrow a computer with more memory. ╮(╯▽╰)╭
Once the hardware allows, implementing those classifiers should pose no problem. For any of these algorithms you can view its documentation with ?methodName (for example, ?knn); a sketch of how they might be called follows.
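The original post does not show the calls for those alternative classifiers, so here is only a minimal sketch of how they might be invoked on the same matrices, assuming the e1071 and randomForest packages are installed and reusing the train, predict, and trainClass objects built above (the package choices and arguments are my assumptions, not the author's code):

# Sketch: alternative classifiers on the same document-term matrix,
# memory permitting. 'predict' below is the unlabeled matrix built
# earlier; its name shadows predict() but R resolves the call correctly.
library(e1071)          # provides svm()
library(randomForest)   # provides randomForest()

svmModel <- svm(train, trainClass)        # train a support vector machine
svmPred  <- predict(svmModel, predict)    # classify the prediction set

rfModel <- randomForest(train, trainClass, ntree = 100)
rfPred  <- predict(rfModel, predict)
table(rfPred)                             # distribution of predicted classes

# nnet::nnet() could be tried similarly, but with ~11,000 input features
# it is extremely memory-hungry, consistent with the problem noted above.

One way to ease the memory pressure (again my assumption, not in the original) is to call removeSparseTerms() from the tm package on hlzjAll.dtm to drop very rare terms before converting it with as.matrix().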
5. Classification performance
The testing step was not shown above. In this example, the first two arguments of knn() would both be train; since the very same data set is used for training and testing, the reported accuracy can reach 100%, which says little. When more training data is available, it can be randomly split 7:3 or 8:2 into two parts, training on the former and testing on the latter (see the sketch below). This is not elaborated further.
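As a concrete illustration of such a split (a minimal sketch of my own, not code from the original post), reusing the train matrix and trainClass factor built above:

# Sketch: random 7:3 split of the labeled data, train on 70%, test on 30%.
library(class)
set.seed(1)                               # for reproducibility
n   <- nrow(train)
idx <- sample(n, size = round(0.7 * n))   # indices of the training portion
pred <- knn(train[idx, ], train[-idx, ], trainClass[idx])
mean(pred == trainClass[-idx])            # accuracy on the held-out 30%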
If the classification performance is unsatisfactory, the training set needs to be enriched so that the features of each category become as distinctive as possible. In practical problems this is a tedious step, but it must not be done perfunctorily.
Corrections are welcome wherever there is room for improvement. Please credit the source when reposting. Thank you!