An introduction to classification algorithms in R

Tags: svm

Keywords: classification algorithms, outlier detection, machine learning

Brief introduction:

This article applies several common algorithms, k-means, decision trees, random forests, SVM (support vector machine), and artificial neural networks (ANN), to the same data set:

spam. We compare the prediction error rate (or accuracy) of each method, chasing predictive accuracy and a sense of how practical each method is. The theory behind them, with its pile of mathematical formulas, is not discussed (limited ability here; I don't fully understand the math...).

PS: I spent two days reading through the code, and found a lot of fun methods and websites along the way, haha.

First, a few tips and R blogs to share:

a) How to quickly find the R package you want: go to "Available CRAN Packages by Date of Publication" and Ctrl+F for it. Very fast.

b) How to find the R packages for a particular subfield: take randomForest as an example. On that package's CRAN page you will see "In views: Environmetrics, MachineLearning"; follow the MachineLearning task view and you can sweep up all of R's machine-learning packages in one go. Neat, and no more hunting around.

c) A very good R blog, http://mkseo.pe.kr/stats/, with articles written in both Korean and English.

d) And one more recommended blog (the link did not survive the repost).

Practical application:

(1) K-means

#----------------------------

Let's look at k-means on the spam data set:

library(kernlab)
library(magrittr)
data(spam)
set.seed(124)
res <- kmeans(spam[, -58] %>% sapply(scale), 2)
table(spam$type, res$cluster)
#            1    2
# nonspam 2754   34
# spam    1813    0
# error rate: 1 - 2754/nrow(spam) = 0.4014345
# assuming spam and nonspam split roughly 50/50, even guessing completely blindly would give about a 50% error rate...

Summary:

Categorical variables cannot go into k-means directly, and simply coding the categories as numbers does not work either. If a distance can be defined between the categories, then a k-means-style method can be used.

Each column must be scaled first.
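As an aside on the categorical-variable point above: one option (my addition, not from the original article) is to compute Gower distances with cluster::daisy(), which copes with mixed numeric/categorical data, and then cluster with the k-medoids function pam(), which accepts a dissimilarity matrix. A minimal sketch on a made-up data frame:

library(cluster)
# hypothetical mixed-type data: one numeric column, one categorical column
df <- data.frame(
    income = c(30, 45, 52, 80, 75),
    region = factor(c("north", "north", "south", "south", "south"))
)
d <- daisy(df, metric = "gower")   # Gower distance handles mixed types
fit <- pam(d, k = 2)               # k-medoids clustering on the dissimilarities
fit$clustering                     # cluster assignment for each row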

#----------------------------

(2) Decision tree

library(tree)
library(kernlab)
library(dplyr)
library(magrittr)
library(caret)    # for confusionMatrix() below
data(spam)
# create train and test data sets
set.seed(1859)
train <- sample(nrow(spam), nrow(spam) * 0.7, replace = FALSE)
df.train <- spam[train, ]
df.test <- spam[-train, ]
# modeling
tree.fit <- tree(type ~ ., data = df.train)
summary(tree.fit)
# plot the decision tree
plot(tree.fit, type = "uniform")
text(tree.fit, pretty = 1, all = TRUE, cex = 0.7)
# prediction
pred <- predict(tree.fit, df.test, type = "class")
# inspect the prediction results
confusionMatrix(pred, df.test[, 58])
# Confusion Matrix and Statistics
#           Reference
# Prediction nonspam spam
#    nonspam     819   84
#    spam         34  444
# the decision tree selects the split variables for you automatically; the prediction error rate is about 0.08

(Figure: plot of the fitted decision tree.)

Summary:

To quote one sentence from the tree package's help document: "The left-hand-side (response) should be either a numerical vector when a regression tree will be fitted or a factor, when a classification tree is produced." So it can do classification as well as regression!

For classification, factor predictor variables can have up to 32 levels.

Will the decision tree overfit? As far as I can tell, no: using tree.fit to predict on the test set, the accuracy is still very high.

The help document says: "The split which maximizes the reduction in impurity is chosen, the data set split and the process repeated. Splitting continues until the terminal nodes are too small or too few to be split."

The output below also suggests there is no overfitting:

> summary(tree.fit)
Classification tree:
tree(formula = type ~ ., data = df.train)
Variables actually used in tree construction:
 [1] "charDollar"      "remove"          "charExclamation"
 [4] "george"          "hp"              "capitalLong"
 [7] "edu"             "num650"          "capitalTotal"
[10] "free"            "capitalAve"
Number of terminal nodes:  ...
Residual mean deviance:  ...

An article worth reading: "Decision Tree - Overfitting".
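If overfitting ever did become an issue, the tree package itself supports pruning by cross-validation; a minimal sketch using cv.tree() and prune.misclass() (my addition, not something the original article ran):

# cross-validate tree size against misclassification error
cv.fit <- cv.tree(tree.fit, FUN = prune.misclass)
# keep the size with the lowest cross-validated error, then prune to it
best.size <- cv.fit$size[which.min(cv.fit$dev)]
pruned.fit <- prune.misclass(tree.fit, best = best.size)
pred.pruned <- predict(pruned.fit, df.test, type = "class")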

#---------------------------

(3) Random forest

library(randomForest)
library(magrittr)
library(dplyr)
data(spam)
train <- sample(nrow(spam), nrow(spam) * 0.7, replace = FALSE)
df.train <- spam[train, ]
df.test <- spam[-train, ]
# set the random seed first so the random forest result is reproducible
set.seed(189)
spam.rf <- randomForest(type ~ ., data = df.train, mtry = 3, do.trace = 100,
                        ntree = 500, importance = TRUE, proximity = TRUE)
# do.trace = 100 prints the OOB error rate every 100 trees as the forest grows:
# ntree      OOB      1      2
#   100:   5.56%  2.95%  9.49%
#   200:   5.59%  2.69%  9.96%
#   300:   5.53%  2.79%  9.65%
#   400:   5.40%  2.84%  9.26%
#   500:   5.28%  2.69%  9.18%
spam.rf
pred <- predict(spam.rf, df.test[, -58], type = "class")
confusionMatrix(pred, df.test[, 58])
# Confusion Matrix and Statistics
#           Reference
# Prediction nonspam spam
#    nonspam     833   41
#    spam         20  487
# accuracy: 0.9558, so the error rate is low
# the random forest picks out the key variables itself, and there is no overfitting problem
# the following command shows which variables are important
varImpPlot(spam.rf)

varImpPlot(spam.rf) produces a chart; the charExclamation variable turns out to be critical for predicting whether a message is spam.
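The mtry = 3 above was set by hand. For what it's worth, the randomForest package also ships a helper, tuneRF(), that searches for an mtry value minimizing the OOB error; a quick sketch (my addition, not from the original):

# start from the default mtry, step by a factor of 1.5, and stop once the
# OOB error improves by less than 1%
set.seed(189)
tuned <- tuneRF(df.train[, -58], df.train[, 58],
                stepFactor = 1.5, improve = 0.01, ntreeTry = 200)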

Summary:

Will a random forest overfit? See "Random forest: how to handle overfitting".

Breiman claims that RF does not overfit: stat.berkeley.edu/~breiman/randomforests/cc_home.htm

Judging from the prediction results, it does not overfit. Besides, the package's developers claim that it will not overfit.

Random forests can likewise do both classification and regression.

Here is an article by the package's author: http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf

#---------------------------

(4) SVM (support vector machine)

First, a look at the SVM in the e1071 package.

library(e1071)
library(rpart)
set.seed(1871)
train <- sample(nrow(spam), nrow(spam) * 0.7, replace = FALSE)
df.train <- spam[train, ]
df.test <- spam[-train, ]
model <- svm(df.train[, -58], df.train[, 58])
print(model)
summary(model)
pred <- predict(model, df.test[, -58])
confusionMatrix(pred, df.test[, 58])
# Confusion Matrix and Statistics
#           Reference
# Prediction nonspam spam
#    nonspam     789   59
#    spam         37  496
# accuracy: 0.9305
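The svm() call above runs with the default cost and gamma. e1071 also provides tune.svm() for a cross-validated grid search over those hyperparameters; a sketch (my addition, grid values chosen arbitrarily):

set.seed(1871)
tuned <- tune.svm(df.train[, -58], df.train[, 58],
                  gamma = c(0.001, 0.01, 0.1), cost = c(1, 10, 100))
summary(tuned)      # cross-validated error for each (gamma, cost) pair
tuned$best.model    # the model refit with the best pair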

Now another way, using the caret package, following the article "Computational Prediction".

library(caret)
library(doMC)
data(spam)
set.seed(1)  # the original seed value was garbled in the repost; 1 is an arbitrary stand-in
train <- sample(nrow(spam), nrow(spam) * 0.7, replace = FALSE)
df.train <- spam[train, ]
df.test <- spam[-train, ]
# registerDoMC() is from the doMC package and enables multi-threading; watching the
# monitor, CPU usage did indeed shoot up to 90%+ at once.
# (to look into later: how this differs from the parallel package)
registerDoMC(cores = 4)
model <- train(df.train[, -58], df.train[, 58], method = "svmRadial")
predict(model, df.test[, -58]) %>% confusionMatrix(df.test[, 58])
# accuracy is decent here as well
# Confusion Matrix and Statistics
#           Reference
# Prediction nonspam spam
#    nonspam     816   70
#    spam         42  453

The "Computational Prediction" article also has a fun bit: you can use a small function to artificially punch some missing values into the data and then fill them back in with bagged-tree imputation. I have not yet looked into the logic it uses to impute the missing values; an article for reference first:

Bagged tree imputation for missing values using caret.

# randomly blank out one of the first four columns in about 10% of the rows
fillInNA <- function(d) {
    naCount <- nrow(d) * 0.1
    for (i in sample(nrow(d), naCount)) {
        d[i, sample(4, 1)] <- NA
    }
    return(d)
}
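For completeness, the imputation step itself might look like this with caret's preProcess() and method = "bagImpute" (a sketch from my reading of the caret docs, not code from the original article):

# punch NAs into a copy of the predictors, then impute them back
spam.na <- fillInNA(spam[, -58])
pre <- preProcess(spam.na, method = "bagImpute")  # fits a bagged tree per column
spam.imputed <- predict(pre, spam.na)             # fills in the NAs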

#---------------------------

(5) Artificial neural networks (ANN)

Reference: "The strongest neural network package in R, RSNNS" and the RSNNS help documentation. Note that the line confusionMatrix(model) in the help documentation is wrong; it needs encodeClassLabels. The correct form is: confusionMatrix(encodeClassLabels(iris$targetsTrain), encodeClassLabels(fitted.values(model))).

library(RSNNS)
library(doMC)
data(spam)
set.seed(199)
spam <- spam[sample(1:nrow(spam), nrow(spam)), 1:ncol(spam)]
spamValues <- spam[, -58]
spamTargets <- spam[, 58]
spamDecTargets <- decodeClassLabels(spamTargets)
spam <- splitForTrainingAndTest(spamValues, spamDecTargets, ratio = 0.3)
spam <- normTrainingAndTestSet(spam)
# the model is then built with:
# registerDoMC(cores = 4)  # watched the activity monitor: CPU usage stayed around 30%, so this is useless here
# the following call is the time-consuming part
model <- mlp(spam$inputsTrain, spam$targetsTrain, size = 5,
             learnFuncParams = c(0.1),
             maxit = 50,  # the original maxit value was garbled in the repost; 50 is a stand-in
             inputsTest = spam$inputsTest, targetsTest = spam$targetsTest)
predictions <- predict(model, spam$inputsTest)
confusionMatrix(encodeClassLabels(spam$targetsTest),
                encodeClassLabels(predictions))
# Confusion Matrix and Statistics
#           Reference
# Prediction   1   2
#          1 805  42
#          2  34 500
#                Accuracy : 0.945
#                  95% CI : (0.9316, 0.9564)
#     No Information Rate : 0.6075
#     P-Value [Acc > NIR] : < 2e-16
#                   Kappa : 0.8843
#  Mcnemar's Test P-Value : 0.422
#             Sensitivity : 0.9595
#             Specificity : 0.9225
#          Pos Pred Value : 0.9504
#          Neg Pred Value : 0.9363
#              Prevalence : 0.6075
#          Detection Rate : 0.5829
#    Detection Prevalence : 0.6133
#       Balanced Accuracy : 0.9410
#        'Positive' Class : 1

The prediction accuracy is slightly lower than randomForest's; among the several algorithms mentioned above, randomForest fits this data best. But the neural network computation is very time-consuming; I went back, re-ran it, and timed it:

   user  system elapsed
248.671  26.796 283.194
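Timings like these come from wrapping the call in system.time(); for example (a sketch, reusing the stand-in maxit from above):

system.time(
    model <- mlp(spam$inputsTrain, spam$targetsTrain, size = 5,
                 learnFuncParams = c(0.1), maxit = 50)
)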

#---------------------------

(6) Bayesian network

See the article "Naive Bayesian classification and Bayesian networks".
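The original stops at that pointer, with no code for this section. For symmetry with the methods above, a minimal naive Bayes attempt on the same train/test split might look like this (my sketch, using e1071::naiveBayes, not from the original article):

library(e1071)
nb.fit <- naiveBayes(type ~ ., data = df.train)   # per-class densities for each predictor
nb.pred <- predict(nb.fit, df.test[, -58])
confusionMatrix(nb.pred, df.test[, 58])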

Finally:

The theoretical side of the methods above needs further study.

My code for some of the methods above surely still has flaws, or outright errors; please allow me to improve it bit by bit, and go easy on me.
