In R, the e1071 package provides functions for data analysis and data mining tasks based on support vector machines. Install and load the e1071 package before using them. One of the most important functions in the package is svm(), which builds a support vector machine model. The following example demonstrates its usage.
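A minimal sketch of the installation and loading step, assuming a CRAN mirror is configured and reachable. The check with requireNamespace() is only a convenience to avoid reinstalling the package.

```r
# Install e1071 once if it is not already available, then attach it.
if (!requireNamespace("e1071", quietly = TRUE)) {
  install.packages("e1071")
}
library(e1071)

# svm() should now be visible on the search path.
exists("svm")
```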
The data in the following example comes from an important paper published by Fisher in 1936, which used sepal and petal measurements from three iris species (labeled setosa, versicolor, and virginica). The measurements are the length and width of the sepal and the length and width of the petal. Based on these four features, we will build a support vector machine model to classify the three kinds of iris.
The data is available as the iris dataset in the datasets package; the first 5 rows are shown here. After loading the data, it is easy to see that it contains 150 samples (50 each labeled setosa, versicolor, and virginica) and four features, namely Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.
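The dataset can be loaded and inspected along the following lines. This is a sketch that uses only base R functions and the bundled iris data.

```r
# iris ships with the datasets package, which R attaches by default.
data(iris)

head(iris, 5)        # first five rows
dim(iris)            # number of rows and columns
table(iris$Species)  # number of samples per species
```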
Before formal modeling, we can also use a plot to get a first impression of how the data is distributed. The following R code draws the data using only two of the features, Petal.Length and Petal.Width.
> library(lattice)
> xyplot(Petal.Length ~ Petal.Width, data = iris, groups = Species,
+        auto.key = list(corner = c(1, 0)))
The result of the above code is shown in Figure 14-13. It is easy to see that the irises labeled setosa can be separated from the others. However, when only the Petal.Length and Petal.Width features are used, versicolor and virginica are not linearly separable.
The svm() function can be called in two ways when building a support vector machine classification model. The first is to build the model from a given formula, in which case the function takes the following form:
svm(formula, data = NULL, ..., subset, na.action = na.omit, scale = TRUE)
Here formula specifies the form of the model, and data is an optional data frame containing the variables in the model. The parameter na.action specifies what should happen when the sample data contains missing values. The default, na.omit, tells the program to drop samples with missing values. Another possible value is na.fail, which makes the function raise an error when it encounters missing values. The parameter scale is a logical vector that specifies whether the feature data should be standardized (by default, features are scaled to mean 0 and variance 1). The index vector subset specifies which samples are used to train the model.
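The parameters described above might be combined as in the following sketch. The random 100-row training split, the seed, and the names train_idx and model_sub are assumptions made for illustration and are not part of the original example.

```r
library(e1071)
data(iris)

# Hypothetical split: train on a random 100 of the 150 rows.
set.seed(1)
train_idx <- sample(nrow(iris), 100)

model_sub <- svm(Species ~ ., data = iris,
                 subset    = train_idx,  # rows used for training
                 na.action = na.omit,    # drop rows with missing values (the default)
                 scale     = TRUE)       # standardize the features (the default)

# Total number of support vectors found on the training subset.
model_sub$tot.nSV
```

Because iris contains no missing values, na.action has no visible effect here; it is written out only to show where the argument goes.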
For example, we already know that when only the Petal.Length and Petal.Width features are used, the irises labeled setosa and versicolor are linearly separable, so we can build an SVM model with the following code.
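A minimal sketch of such a model, assuming the names subdata and model1 that the plotting call in the text relies on. The way the virginica rows are filtered out here is an assumption, not necessarily the author's exact code.

```r
library(e1071)
data(iris)

# Keep only setosa and versicolor, and drop the now-unused virginica level.
subdata <- iris[iris$Species != "virginica", ]
subdata$Species <- factor(subdata$Species)

# Fit an SVM on the two petal features with the default settings.
model1 <- svm(Species ~ Petal.Length + Petal.Width, data = subdata)
```

Re-creating the factor matters: without it, Species would still carry three levels even though only two appear in subdata.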
We can then use the following code to display the model graphically; the result is shown in Figure 14-14.
> plot(model1, subdata, Petal.Length ~ Petal.Width)
When modeling with the first format, if all feature variables in the data are to be used as model features, the formula can be shortened to Species ~ ., where the "." stands for all feature variables. For example, the following code uses all four features to classify the three types of iris.
> model2 <- svm(Species ~ ., data = iris)
The summary() function is a good way to see how the model was built. Consider the following example code and its output.
The summary function reports information about the model. The SVM-Type entry shows that this is a C-classification model. The SVM-Kernel entry shows that the kernel used is the Gaussian (radial basis) kernel, with the kernel parameter gamma equal to 0.25. The cost entry shows that the constraint violation cost for this model is 1. We can also see that the model found 51 support vectors: 8 in the first class, 22 in the second, and 21 in the third. The last line shows that the three classes in the model are setosa, versicolor, and virginica.
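The values reported by summary() can also be read directly from the fitted object. The sketch below refits model2 with the default settings so that it is self-contained; the component names gamma, cost, tot.nSV, nSV, and levels are those of the object returned by svm() in e1071.

```r
library(e1071)
data(iris)

model2 <- svm(Species ~ ., data = iris)
summary(model2)   # prints SVM-Type, SVM-Kernel, cost, and support vector counts

model2$gamma      # kernel parameter; the default is 1 / (number of features)
model2$cost       # constraint violation cost
model2$tot.nSV    # total number of support vectors
model2$nSV        # number of support vectors per class
model2$levels     # class labels
```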
Because the original text is long, I have split it into two parts; this article is the first part.
Using Support Vector Machines (SVM) for Data Mining in R (Part 1)