1. e1071 Introduction
The e1071 package for the R language provides an interface to LIBSVM. LIBSVM includes the commonly used kernels, such as linear, polynomial, radial basis (RBF) and sigmoid. Multi-class classification is achieved through a one-against-one voting scheme. svm() trains a model, predict() makes predictions with a trained model, plot() visualizes the data, support vectors and decision boundary (if provided), and tune() performs parameter tuning.
The svm() function in the e1071 package produces the same results as LIBSVM. write.svm() can also export a model trained in R in the standard LIBSVM format, for use in other LIBSVM environments. Let's look at the use of the svm() function. Two calling formats are available:
svm(formula, data = NULL, ..., subset, na.action = na.omit, scale = TRUE) or
svm(x, y = NULL, scale = TRUE, type = NULL, kernel = "radial", degree = 3, gamma = if (is.vector(x)) 1 else 1/ncol(x), coef0 = 0, cost = 1, nu = 0.5, class.weights = NULL, cachesize = 40, tolerance = 0.001, epsilon = 0.1, shrinking = TRUE, cross = 0, probability = FALSE, fitted = TRUE, ..., subset, na.action = na.omit)
Main parameter descriptions:
formula: the classification model formula, which can be understood as y ~ x, where y corresponds to the label and x to the features (variables).
data: the data frame containing the variables in the model.
subset: you can specify a portion of the dataset to be used as the training data.
na.action: how missing values are handled; the default is to delete observations with missing data.
scale: normalize the data, centering and scaling it so that each variable has mean 0 and variance 1; this is performed automatically by default.
type: the form of the SVM. Five forms are available: C-classification, nu-classification, one-classification (for novelty detection), eps-regression and nu-regression. The last two are used for regression. The default is C-classification.
kernel: when the data is not linearly separable, we introduce a kernel function. The kernel functions provided in R are as follows:
linear kernel: u'*v
polynomial kernel: (gamma*u'*v + coef0)^degree
Gaussian (radial basis) kernel: exp(-gamma*|u-v|^2)
sigmoid kernel: tanh(gamma*u'*v + coef0)
The default is the Gaussian (radial) kernel. Incidentally, you can define custom kernel functions with the kernlab package.
degree: degree of the polynomial kernel, default 3.
gamma: parameter of every kernel except the linear one; defaults to 1/(data dimension).
coef0: parameter of the polynomial and sigmoid kernels; defaults to 0.
cost: the value of the penalty term C in C-classification.
nu: the value of nu in nu-classification and one-classification.
cross: perform k-fold cross-validation to estimate the classification accuracy.
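As a minimal sketch of how these parameters fit together (assuming e1071 is installed), the call below trains a default-style classifier and exports it in LIBSVM format; the iris dataset, parameter values and file names are illustrative, not taken from the text above:

```r
# Sketch: a C-classification SVM with an RBF kernel, then export to LIBSVM format.
library(e1071)

model <- svm(Species ~ ., data = iris,
             type = "C-classification",  # default for a factor response
             kernel = "radial",          # the Gaussian kernel, the default
             gamma = 1 / 4,              # 1/(number of features) is the default
             cost = 1,                   # penalty term C
             cross = 5)                  # 5-fold cross-validation

summary(model)

# Export the trained model in standard LIBSVM format for use outside R.
write.svm(model, svm.file = "iris.svm", scale.file = "iris.scale")
```

The cross-validation accuracy reported by summary() gives a quick estimate of generalization without a separate test set.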
2. Establish a support vector machine classifier for the kyphosis dataset in the rpart package
2.1 Using the default svm() parameters to build a support vector machine classifier on the kyphosis dataset
> library(e1071)
> library(rpart)
> kyphosis.svmModel <- svm(Kyphosis ~ ., data = kyphosis)
> summary(kyphosis)
Figure 2-1 Creating a support vector machine classifier
The kyphosis dataset comes with rpart; it describes children who underwent corrective spinal surgery and has a total of 4 variables. As the variable names suggest, the dataset records Kyphosis (whether the deformity was present), Age, Number and Start, with a total of 81 rows of data.
Figure 2-2 Description of the SVM classifier
The model uses the default C-classification of svm(); the penalty term defaults to 1, the kernel function is radial, and the kernel parameter gamma defaults to 1/(data dimension), that is 1/3. The number of support vectors (the sample points that support the separating hyperplane) is 39.
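The default gamma of 1/3 can be verified directly from the data (assuming rpart is installed): the data frame has 4 columns, of which 3 are predictors:

```r
# Verify the default gamma for the kyphosis data: 1 / (number of predictors).
library(rpart)                        # provides the kyphosis data frame
data(kyphosis)

n.predictors <- ncol(kyphosis) - 1    # 4 columns minus the Kyphosis label
default.gamma <- 1 / n.predictors
default.gamma                         # 1/3
```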
Total accuracy from 10-fold cross-validation: 77.77778%.
> pre.svm <- predict(kyphosis.svmModel, kyphosis)
> table(pre.svm, kyphosis$Kyphosis)
Figure 2-3
Using the model to predict the training sample, we find that 9 samples were misclassified as "absent".
Combined with the overall accuracy of 77.77778% from 10-fold cross-validation, we find that this model is not good enough.
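The overall accuracy can be read straight off a confusion matrix such as the one produced by table() above. The counts below are hypothetical, chosen only to be consistent with the 9 misclassified samples (out of 81) reported above:

```r
# Sketch: computing overall accuracy from a confusion matrix in base R.
# The counts are hypothetical, consistent with 9 misclassifications out of 81;
# in practice the matrix comes from table(pre.svm, kyphosis$Kyphosis).
cm <- matrix(c(64, 0,    # truth = absent : 64 predicted absent, 0 predicted present
                9, 8),   # truth = present:  9 predicted absent, 8 predicted present
             nrow = 2,
             dimnames = list(predicted = c("absent", "present"),
                             truth     = c("absent", "present")))

accuracy <- sum(diag(cm)) / sum(cm)
accuracy   # 72/81, about 0.889 on the training data
```

Note that this training-set accuracy is optimistic compared with the cross-validated figure, which is exactly why the text relies on 10-fold cross-validation to judge the model.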
2.2 Modification of the model
The penalty factor C determines how heavily misclassified points are penalized. When C is larger, fewer points are misclassified, but overfitting may be more serious; when C is too small, overfitting is less of a concern, but more points may be misclassified and the accuracy of the model decreases.
The penalty cost is changed from its default of 1 to 2.5.
> kyphosis.svmModel <- svm(Kyphosis ~ ., data = kyphosis, cost = 2.5, cross = 10)
> summary(kyphosis.svmModel)
Figure 2-4
Figure 2-5
You can see that although the accuracy of 10-fold cross-validation rises to 81.48148%, 8 samples are still misclassified.
Figure 2-6
Changing gamma from 1/3 to 3 and the cost to 4 gives a new model whose predictions on the training data are 100% accurate.
However, 10-fold cross-validation shows an overall accuracy of 79.01235%, lower than that of the gamma = 1/3, cost = 2.5 model. This indicates that the penalty factor C may be somewhat too large, and overfitting has occurred.
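Rather than adjusting gamma and cost by hand, e1071's tune.svm() can search a parameter grid by cross-validation. A sketch, with an illustrative grid built around the values tried above:

```r
# Sketch: grid search over gamma and cost with 10-fold cross-validation.
# The grid values are illustrative; tune.svm() picks the combination
# with the lowest cross-validated error.
library(e1071)
library(rpart)   # for the kyphosis data

set.seed(1)      # cross-validation folds are random
tuned <- tune.svm(Kyphosis ~ ., data = kyphosis,
                  gamma = c(1/3, 1, 3),
                  cost  = c(1, 2.5, 4),
                  tunecontrol = tune.control(cross = 10))

tuned$best.parameters     # gamma/cost pair with the lowest CV error
best.model <- tuned$best.model
summary(best.model)
```

Because the best parameters are chosen by cross-validated error rather than training accuracy, this search is less prone to the overfitting seen with gamma = 3, cost = 4.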