Related keywords: classification algorithms, outlier detection, machine learning
Brief introduction:
This article applies k-means, decision trees, random forest, SVM (support vector machine), and artificial neural networks (ANN) to the same data set, spam, and compares the prediction error rate / accuracy of each method to get a feel for how practical they are. The theoretical foundations behind them, with their large number of mathematical formulas, are not discussed here (limited ability; I don't fully understand the math yet...).
PS: I spent two days going through the code, and found a lot of fun methods and websites along the way, haha.
First, a few tips and R blogs to share:
a) How to quickly find the R package you want: go to "Available CRAN Packages by Date of Publication" and use Ctrl+F to search. Very fast.
b) How to find the R packages related to a particular area: take randomForest as an example. On that package's CRAN page you will see "In views: Environmetrics, MachineLearning"; from those task views you can sweep up all the machine-learning R packages in one go. Handy, no more searching one by one.
c) A very good R blog: http://mkseo.pe.kr/stats/ with articles written in both Korean and English. I don't actually know whether the blogger is a Korean lady... I'm probably overthinking it...
d) One more blog to recommend; the link got mangled when copied over, so poke around the original post for it.
Practical application:
(1) K-means
#----------------------------
Let's look at k-means on the spam data set.
library(kernlab)
library(magrittr)
data(spam)
set.seed(124)
res <- kmeans(spam[, -58] %>% sapply(scale), 2)
table(spam$type, res$cluster)
#              1    2
#   nonspam 2754   34
#   spam    1813    0
# error rate is 1 - 2754/nrow(spam) = 0.4014345
# assuming spam and nonspam were split 50/50, guessing completely blindly
# would give an error rate of 50%...
Summary:
Categorical variables cannot be fed into k-means directly, and you cannot simply encode the categories as numbers. If a distance between the categories can be defined, then k-means can be used.
Each column must be scaled beforehand.
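To see why the scaling matters, here is a minimal sketch (not from the original post; the seed is arbitrary) contrasting k-means on the raw columns with the scaled version above:

# Hypothetical comparison: k-means without scaling, where columns with large
# ranges (e.g. capitalTotal) dominate the Euclidean distance.
library(kernlab)
library(magrittr)
data(spam)
set.seed(124)
res.raw <- kmeans(spam[, -58], 2)                         # no scaling
table(spam$type, res.raw$cluster)
res.scaled <- kmeans(spam[, -58] %>% sapply(scale), 2)    # each column scaled first
table(spam$type, res.scaled$cluster)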
#----------------------------
(2) Decision tree
library(tree)
library(kernlab)
library(dplyr)
library(magrittr)
library(caret)     # for confusionMatrix()
data(spam)
# create train and test datasets
set.seed(1859)
train <- sample(nrow(spam), nrow(spam) * 0.7, replace = FALSE)
df.train <- spam[train, ]
df.test  <- spam[-train, ]
# modeling
tree.fit <- tree(type ~ ., data = df.train)
summary(tree.fit)
# plot the decision tree
plot(tree.fit, type = "uniform")
text(tree.fit, pretty = 1, all = TRUE, cex = 0.7)
# prediction
pred <- predict(tree.fit, df.test, type = "class")
# view the prediction results
confusionMatrix(pred, df.test[, 58])
# Confusion Matrix and Statistics
#           Reference
# Prediction nonspam spam
#    nonspam     819   84
#    spam         34   ...
# (the rest of the output was lost when copied over)
# The decision tree automatically selects which variables to use; the prediction error rate here is about 0.08.
(Figure: the decision tree plotted by the code above.)
Summary:
To quote one sentence from the tree package's help document: "The left-hand-side (response) should be either a numerical vector when a regression tree will be fitted or a factor, when a classification tree is produced." So it can do classification as well as regression!
For classification, factor predictor variables can have up to 32 levels.
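As a quick illustration of the regression side (a sketch, not from the original post; the mtcars data and the response mpg are chosen just for demonstration):

# Hypothetical sketch: a numeric response on the left-hand side gives a regression tree.
library(tree)
reg.fit <- tree(mpg ~ ., data = mtcars)   # mpg is numeric, so a regression tree is fitted
summary(reg.fit)                          # reports residual mean deviance instead of a misclassification rate
plot(reg.fit, type = "uniform")
text(reg.fit, cex = 0.7)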
Will the decision tree overfit? From what I can see, no: predicting on the test set with tree.fit, the accuracy is still very high.
The help document says: "The split which maximizes the reduction in impurity is chosen, the data set split and the process repeated. Splitting continues until the terminal nodes are too small or too few to be split."
The code below backs this up: there is no obvious overfitting.
> summary(tree.fit)
Classification tree:
tree(formula = type ~ ., data = df.train)
Variables actually used in tree construction:
 [1] "charDollar"      "remove"          "charExclamation"
 [4] "george"          "hp"              "capitalLong"
 [7] "edu"             "num650"          "capitalTotal"
[10] "free"            "capitalAve"
Number of terminal nodes: (value lost when copied)
Residual mean deviance: (value lost when copied)
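An additional quick check (a sketch, not from the original post): compare the accuracy on the training set with the accuracy on the test set; if the two are close, the tree is not badly overfitting.

# Hypothetical overfitting check for the tree fitted above.
pred.train <- predict(tree.fit, df.train, type = "class")
pred.test  <- predict(tree.fit, df.test,  type = "class")
mean(pred.train == df.train$type)   # training accuracy
mean(pred.test  == df.test$type)    # test accuracy; a large gap would suggest overfitting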
For reference, an article: Decision Tree - Overfitting.
#---------------------------
(3) Random forest
library(randomForest)
library(magrittr)
library(dplyr)
library(kernlab)
data(spam)
train <- sample(nrow(spam), nrow(spam) * 0.7, replace = FALSE)
df.train <- spam[train, ]
df.test  <- spam[-train, ]
# set the random seed before fitting the random forest, so the result is reproducible
set.seed(189)
spam.rf <- randomForest(type ~ ., data = df.train, mtry = 3, do.trace = 100,
                        ntree = 500, importance = TRUE, proximity = TRUE)
# Below is the trace output (do.trace = 100): the OOB error rate after every 100 trees,
# with the class-wise error rates for nonspam (1) and spam (2).
# ntree      OOB      1      2
#   100:   5.56%  2.95%  9.49%
#   200:   5.59%  2.69%  9.96%
#   300:   5.53%  2.79%  9.65%
#   400:   5.40%  2.84%  9.26%
#   500:   5.28%  2.69%  9.18%
spam.rf
pred <- predict(spam.rf, df.test[, -58], type = "response")   # "response" returns the predicted class
confusionMatrix(pred, df.test[, 58])
# Confusion Matrix and Statistics
#           Reference
# Prediction nonspam spam
#    nonspam     833   41
#    spam         ..  487   (one cell was lost when copied)
#    Accuracy : 0.9558
# The error rate is low. The random forest picks out the key variables by itself,
# and there is no overfitting problem.
# The following command shows which variables are important
varImpPlot(spam.rf)
Running varImpPlot(spam.rf) produces a plot; it shows that the variable charExclamation is critical for predicting whether a message is spam.
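To get the same information numerically instead of as a plot (a sketch, not from the original post), randomForest's importance() function can be used:

# Hypothetical sketch: list the variables by importance rather than plotting them.
imp <- importance(spam.rf)                                        # matrix of importance measures
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ])  # top variables by Gini decrease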
Summary:
Will random forest overfit? See: Random forest - how to handle overfitting.
Breiman claims that RF does not overfit: stat.berkeley.edu/~breiman/randomforests/cc_home.htm
Judging from the prediction results it is not overfitting; besides, the authors of the package claim that it will not overfit.
Random forest, too, can do both classification and regression.
Here is an article by the package's author: http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf
#---------------------------
(4) SVM (Support Vector Machine)
First, look at svm() from the e1071 package.
library(e1071)
library(rpart)
library(kernlab)
library(caret)     # for confusionMatrix()
data(spam)
set.seed(1871)
train <- sample(nrow(spam), nrow(spam) * 0.7, replace = FALSE)
df.train <- spam[train, ]
df.test  <- spam[-train, ]
model <- svm(df.train[, -58], df.train[, 58])
print(model)
summary(model)
pred <- predict(model, df.test[, -58])
confusionMatrix(pred, df.test[, 58])
# Confusion Matrix and Statistics
#           Reference
# Prediction nonspam spam
#    nonspam     789   59
#    spam         37  496
#    Accuracy : 0.9305
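Optionally (a sketch, not in the original post), e1071 also provides tune.svm() for picking the cost and gamma parameters by cross-validation; the small grid below is just an example and takes a while to run on spam.

# Hypothetical tuning sketch: cross-validated search over a small cost/gamma grid.
tuned <- tune.svm(type ~ ., data = df.train,
                  gamma = c(0.01, 0.05), cost = c(1, 10))
summary(tuned)
tuned$best.model    # the SVM refitted with the best parameter combination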
Now look at another implementation, from the caret package; see the article "computational prediction".
library(caret)
library(doMC)
library(kernlab)
library(magrittr)
data(spam)
set.seed(1)      # the original seed value was garbled when copied; any fixed seed will do
train <- sample(nrow(spam), nrow(spam) * 0.7, replace = FALSE)
df.train <- spam[train, ]
df.test  <- spam[-train, ]
# Multi-threading comes from the doMC package; watching the monitor, CPU usage did
# indeed shoot up to 90%+. Still need to look at how this differs from the parallel package.
registerDoMC(cores = 4)
model <- train(df.train[, -58], df.train[, 58], method = "svmRadial")
predict(model, df.test[, -58]) %>% confusionMatrix(df.test[, 58])
# accuracy is decent here as well
# Confusion Matrix and Statistics
#           Reference
# Prediction nonspam spam
#    nonspam     816   70
#    spam         42        (the rest of the output was lost when copied)
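For reference, a sketch (not from the original post) that makes the resampling explicit: caret parallelizes over the resamples on the doMC workers registered above, and trainControl's allowParallel flag switches that on or off. The 5-fold choice here is just an example.

# Hypothetical sketch: explicit cross-validation control; train() runs the folds
# in parallel on the registered doMC workers.
ctrl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
model.cv <- train(df.train[, -58], df.train[, 58],
                  method = "svmRadial", trControl = ctrl)
model.cv$results    # resampled accuracy for each tuning parameter value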
The "computational prediction" article also has a fun part: you can use a small function to artificially create some missing values, and then use bagged tree imputation to fill them back in. I haven't yet studied the logic behind how it imputes the missing values; for now, here is an article for reference:
Bagged tree imputation for missing values using caret.
fillInNa <- function(d) {
  naCount <- nrow(d) * 0.1
  for (i in sample(nrow(d), naCount)) {
    d[i, sample(4, 1)] <- NA     # blank out one of the first four columns of a random row
  }
  return(d)
}
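A minimal usage sketch (not from the original post; iris is used here only because fillInNa blanks out one of four columns): punch holes into the data and let caret's bagged-tree imputation fill them in.

# Hypothetical usage of fillInNa with caret's preProcess(method = "bagImpute").
library(caret)
set.seed(42)
irisNa  <- fillInNa(iris[, 1:4])                      # introduce ~10% missing values
pp      <- preProcess(irisNa, method = "bagImpute")   # fit a bagged-tree imputation model per column
irisImp <- predict(pp, irisNa)                        # impute the missing entries
sum(is.na(irisNa)); sum(is.na(irisImp))               # before vs. after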
#---------------------------
(5) Artificial Neural Networks (ANN)
This refers to the post "the strongest neural network package in R: RSNNS" and the RSNNS help documentation. In the help documentation, the line confusionMatrix(model) is wrong; it must use encodeClassLabels. The correct call is: confusionMatrix(encodeClassLabels(iris$targetsTrain), encodeClassLabels(fitted.values(model)))
library(RSNNS)
library(doMC)
library(kernlab)
data(spam)
set.seed(199)
spam <- spam[sample(1:nrow(spam), nrow(spam)), 1:ncol(spam)]   # shuffle the rows
spamValues     <- spam[, -58]
spamTargets    <- spam[, 58]
spamDecTargets <- decodeClassLabels(spamTargets)
spam <- splitForTrainingAndTest(spamValues, spamDecTargets, ratio = 0.3)
spam <- normTrainingAndTestSet(spam)
# The model is then built with:
registerDoMC(cores = 4)   # watched the activity monitor: CPU usage stayed around 30%,
                          # so the multi-threading is of no real use here
# the following call is the time-consuming part
model <- mlp(spam$inputsTrain, spam$targetsTrain, size = 5,
             learnFuncParams = c(0.1),
             maxit = 100,   # the original maxit value was garbled when copied; 100 is a stand-in
             inputsTest = spam$inputsTest, targetsTest = spam$targetsTest)
predictions <- predict(model, spam$inputsTest)
confusionMatrix(encodeClassLabels(spam$targetsTest),
                encodeClassLabels(predictions))
# Confusion Matrix and Statistics
#           Reference
# Prediction   1    2
#          1 805   42
#          2  34  500
#
#                Accuracy : 0.945
#                  95% CI : (0.9316, 0.9564)
#     No Information Rate : 0.6075
#     P-Value [Acc > NIR] : < 2e-16
#                   Kappa : 0.8843
#  Mcnemar's Test P-Value : 0.422
#             Sensitivity : 0.9595
#             Specificity : 0.9225
#          Pos Pred Value : 0.9504
#          Neg Pred Value : 0.9363
#              Prevalence : 0.6075
#          Detection Rate : 0.5829
#    Detection Prevalence : 0.6133
#       Balanced Accuracy : 0.9410
#        'Positive' Class : 1
The prediction accuracy is slightly lower than randomForest; of the several algorithms mentioned above, randomForest matches this data best. But training the artificial neural network is very time-consuming; I went back, re-ran it, and timed it:
#    user  system elapsed
# 248.671  26.796 283.194
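For reference, a sketch (not from the original post) of how such a timing can be captured; it just wraps the mlp() call from above in system.time(), so the same maxit stand-in applies and the exact numbers depend on your hardware.

# Hypothetical timing wrapper producing user/system/elapsed figures like those above.
timing <- system.time(
  model <- mlp(spam$inputsTrain, spam$targetsTrain, size = 5,
               learnFuncParams = c(0.1), maxit = 100,
               inputsTest = spam$inputsTest, targetsTest = spam$targetsTest)
)
print(timing)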
#---------------------------
(6) Bayesian network
See naive Bayesian classification and Bayesian networks
Finally:
The theoretical side of the methods above still needs further study.
My code for some of the methods above surely still has flaws, or even errors; please allow me to improve it gradually, and feel free to point them out.
Introduction to classification algorithms based on R