Using Support Vector Machines (SVM) for Data Mining in R (Part 1)

Source: Internet
Author: User
Tags: svm

In R, the e1071 package provides a set of functions for data analysis and mining based on support vector machines. Install and load the e1071 package before calling these functions. The most important function in the package is svm(), which builds a support vector machine model. We will use the following example to demonstrate its usage.

The data in the following example come from a well-known paper published by Fisher in 1936. He collected sepal and petal measurements from three species of iris (labeled setosa, versicolor, and virginica): the length and width of the sepal, and the length and width of the petal. Based on these four features, we will build a support vector machine model to classify the three species of iris.
The data are available as the iris dataset in the datasets package. After loading the data, it is easy to see that the dataset contains 150 samples (50 each labeled setosa, versicolor, and virginica) and four sample features, namely Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.
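The loading step can be sketched as follows; head(iris, 5) reproduces the first 5 rows mentioned above:

```r
# The iris dataset ships with R's built-in datasets package
data(iris)
head(iris, 5)        # first 5 rows of the data
str(iris)            # 150 observations, 4 numeric features plus Species
table(iris$Species)  # 50 samples per class
```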


Before formal modeling, we can also use a plot to get a preliminary picture of how the data are distributed. The following R code draws the data (using only the two features Petal.Length and Petal.Width).

> library(lattice)
> xyplot(Petal.Length ~ Petal.Width, data = iris, groups = Species,
+        auto.key = list(corner = c(1, 0)))

The result of the above code is shown in Figure 14-13. It is not difficult to see that the irises labeled setosa are easily separated, but when only the Petal.Length and Petal.Width features are used, versicolor and virginica are not linearly separable.


The svm() function offers two interfaces for building a support vector machine classification model. The first builds the model from a given formula, in which case the function has the format

svm(formula, data = NULL, subset, na.action = na.omit, scale = TRUE)

Here formula specifies the model as a formula, and data is an optional data frame containing the variables in the model. The parameter na.action specifies how the function should handle missing values in the sample data: the default, na.omit, causes samples with missing data to be ignored, while the alternative, na.fail, causes the function to raise an error when it encounters missing data. The parameter scale is a logical vector specifying whether the feature data should be standardized (by default, to mean 0 and variance 1). The index vector subset specifies which samples are used to train the model.
For example, we already know that the irises labeled setosa and versicolor are linearly separable even when only the Petal.Length and Petal.Width features are used, so we can build an SVM model with the following code.
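The code block referred to here appears to have been lost from this copy. A minimal sketch of what it likely looked like, assuming subdata and model1 are the names used by the plot() call later in the text:

```r
library(e1071)

# Keep only the two linearly separable classes, setosa and versicolor,
# and drop the now-unused factor level
subdata <- subset(iris, Species != "virginica")
subdata$Species <- factor(subdata$Species)

# Train an SVM using only the two petal features
model1 <- svm(Species ~ Petal.Length + Petal.Width, data = subdata)
```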


We can then graphically display the model with the following code; the result of its execution is shown in Figure 14-14.

> plot(model1, subdata, Petal.Length ~ Petal.Width)



When you model with the first interface and want to use all of the feature variables in the data as model features, you can abbreviate the formula to Species ~ ., where the "." stands in for all the remaining variables. For example, the following code uses all four features to classify the three species of iris.

> model2 <- svm(Species ~ ., data = iris)

To see how the model was built, the summary() function is a good choice. Consider the following example code and its output.
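The example output referred to here is missing from this copy; a minimal sketch of the call, repeating the model fit so the snippet is self-contained:

```r
library(e1071)

# Fit the model on all four features, as in the previous step
model2 <- svm(Species ~ ., data = iris)

# Print the SVM type, kernel, parameters, and support-vector counts
summary(model2)
```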

Information about the model can be obtained from the summary() output. The SVM-Type entry shows that this is a C-classification model; the SVM-Kernel entry shows that the kernel used is the radial basis (Gaussian) kernel, with the kernel parameter gamma equal to 0.25; and the cost entry shows that the constraint-violation cost determined for this model is 1. We can also see that the model found 51 support vectors: 8 in the first class, 22 in the second class, and 21 in the third class. The last line shows that the three classes in the model are setosa, versicolor, and virginica.
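As a quick sanity check of the fitted model (this step is not in the original text), predict() can be applied to the training data to tabulate a confusion matrix; a sketch assuming model2 from the previous step:

```r
library(e1071)

# Refit so the snippet is self-contained
model2 <- svm(Species ~ ., data = iris)

# Compare the model's predictions against the true labels
pred <- predict(model2, iris)
table(Predicted = pred, Actual = iris$Species)
```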

Because the original article is long, it has been split into two parts; this post is the first part.

Copyright notice: this is the blogger's original article; do not reproduce it without the blogger's permission.

