Stanford Machine Learning Note-8. Support Vector Machines (SVMs) Overview


8. Support Vector Machines (SVMs)
Content

    8. Support Vector Machines (SVMs)

8.1 Optimization Objective

8.2 Large Margin Intuition

8.3 Mathematics Behind Large Margin Classification

8.4 Kernels

8.5 Using an SVM

8.5.1 Multi-Class Classification

8.5.2 Logistic Regression vs. SVMs

8.1 Optimization Objective

The Support Vector Machine (SVM) is a very useful supervised machine learning algorithm. First, review logistic regression: from the properties of the log() function and the sigmoid function, the hypothesis is

    h_θ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x))

so if y = 1 we want h_θ(x) ≈ 1, i.e. θ^T x >> 0, and if y = 0 we want h_θ(x) ≈ 0, i.e. θ^T x << 0.

At the same time, the cost that logistic regression (without regularization) assigns to a single example is:

    cost = -y · log(h_θ(x)) - (1 - y) · log(1 - h_θ(x))

To get the cost function of the SVM, we make the following modifications: the curve -log(h_θ(x)) is replaced by cost_1(θ^T x) and the curve -log(1 - h_θ(x)) is replaced by cost_0(θ^T x), where cost_1 and cost_0 are piecewise-linear approximations that are exactly zero for θ^T x ≥ 1 and θ^T x ≤ -1, respectively.

Therefore, for comparison, the optimization objective of (regularized) logistic regression is:

    min_θ (1/m) Σ_{i=1..m} [ -y^(i) log(h_θ(x^(i))) - (1 - y^(i)) log(1 - h_θ(x^(i))) ] + (λ/2m) Σ_{j=1..n} θ_j²

The optimization objective of the SVM is as follows:

    min_θ C Σ_{i=1..m} [ y^(i) cost_1(θ^T x^(i)) + (1 - y^(i)) cost_0(θ^T x^(i)) ] + (1/2) Σ_{j=1..n} θ_j²

Note 1: In fact, the cost_0 and cost_1 functions in the formula above are a surrogate loss function called the hinge loss; other common surrogate losses include the exponential loss and the logistic loss (see Machine Learning, p. 129, Zhou Zhihua).

Note 2: Note the correspondence between the parameters C and λ: C is positively correlated with (1/λ).
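
To make the objective above concrete, here is a minimal NumPy sketch (not part of the original notes): cost1 and cost0 are one common unit-slope choice for the piecewise-linear surrogates, and svm_objective assumes X carries a leading column of ones so that theta[0] is the unregularized intercept.

    import numpy as np

    def cost1(z):
        # Surrogate for -log(h(z)) when y = 1: zero once z >= 1, linear otherwise.
        return np.maximum(0.0, 1.0 - z)

    def cost0(z):
        # Surrogate for -log(1 - h(z)) when y = 0: zero once z <= -1, linear otherwise.
        return np.maximum(0.0, 1.0 + z)

    def svm_objective(theta, X, y, C):
        # C * sum of surrogate costs + (1/2) * ||theta[1:]||^2 (intercept not regularized).
        z = X @ theta
        hinge = y * cost1(z) + (1 - y) * cost0(z)
        return C * hinge.sum() + 0.5 * np.sum(theta[1:] ** 2)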

8.2 Large Margin Intuition

According to the cost function in section 8.1, minimizing it leads to the following conclusion:

    if y = 1, we want θ^T x ≥ 1 (not merely ≥ 0)
    if y = 0, we want θ^T x ≤ -1 (not merely < 0)

Now suppose that C is very large (e.g. C = 100000). In order to minimize the cost function, we want the first (hinge) sum to be exactly zero, i.e. every training example must satisfy the conditions above.

So the cost function becomes:

    min_θ (1/2) Σ_{j=1..n} θ_j²

So the problem becomes:

    min_θ (1/2) Σ_{j=1..n} θ_j²
    s.t.  θ^T x^(i) ≥ 1   if y^(i) = 1
          θ^T x^(i) ≤ -1  if y^(i) = 0

The final result of this optimization is the dividing hyperplane with the "maximum margin", which is why the support vector machine is also called a large margin classifier. So what is the margin? And why does this optimization find the hyperplane of maximum margin? First, we use the linearly separable two-dimensional 0/1 classification case shown in Figure 8-1 to build intuition.

Figure 8-1 SVM decision boundary: linearly separable case

Intuitively, the dividing hyperplane should lie "in the middle" of the two classes of training samples, i.e. the black line in Figure 8-1 (two-dimensional case), because that hyperplane has the best "tolerance" to local disturbances of the training samples. For example, the pink and green lines in the figure will make incorrect predictions as soon as the input data changes slightly. In other words, the middle hyperplane gives the most robust classification and the strongest generalization to unseen data. The distance between the two blue lines is called the margin. The next section explains, from a mathematical perspective, what the margin is and why this optimization maximizes it.

8.3 Mathematics Behind Large Margin Classification

First, some mathematical preliminaries.

    • 2-norm: also called the length of a vector, it generalizes the length of a two- or three-dimensional geometric vector; the 2-norm of a vector u is written ||u||. For example, for u = [u1, u2, u3, u4], ||u|| = sqrt(u1² + u2² + u3² + u4²).
    • Vector inner product: for vectors a = [a1, a2, ..., an] and b = [b1, b2, ..., bn], the inner product is defined as a · b = a1·b1 + a2·b2 + ... + an·bn. It generalizes the dot product of geometric vectors and can be understood as the length of the projection of a onto b multiplied by the length (norm) of b.

So we have:

    θ^T x^(i) = p^(i) · ||θ||

where p^(i) is the (signed) length of the projection of x^(i) onto the vector θ.
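
A quick NumPy check of this identity (not part of the original notes; the two vectors are arbitrary examples):

    import numpy as np

    theta = np.array([3.0, 4.0])
    x = np.array([2.0, 1.0])

    # Signed length of the projection of x onto theta.
    p = np.dot(x, theta) / np.linalg.norm(theta)

    # The inner product equals the projection length times the norm of theta.
    print(np.dot(theta, x))           # 10.0
    print(p * np.linalg.norm(theta))  # 10.0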

Therefore, the optimization problem obtained in section 8.2 can be converted to the following form:

    min_θ (1/2) ||θ||²
    s.t.  p^(i) · ||θ|| ≥ 1   if y^(i) = 1
          p^(i) · ||θ|| ≤ -1  if y^(i) = 0

It is known that the decision boundary θ^T x = 0 is orthogonal (perpendicular) to the vector θ, and when θ_0 = 0 the boundary passes through the origin (of Euclidean space). To make the objective as small as possible while still satisfying the constraints, the projections p^(i) should be as large as possible, which means the margin should be as large as possible. Figure 8-2 shows this intuitively: on the left the margin is small, so the p^(i) are small, and satisfying the constraints forces ||θ|| to be large, making the objective large; on the right is the maximum-margin case, where the p^(i) are large, so ||θ|| (and hence the objective) can be as small as possible.

Figure 8-2 Two cases with different margins

8.4 Kernels

All of the discussion above assumes linearly separable samples, i.e. there exists a dividing hyperplane that classifies the training samples correctly. The real world, however, contains a large number of complex nonlinear classification problems (such as the XOR/XNOR problem of section 4.4.2). Logistic regression handles nonlinear problems by introducing polynomial features as new features; neural networks solve nonlinear classification by introducing hidden layers and composing them layer by layer; and the SVM solves nonlinear problems by introducing kernel functions. The specific procedure is as follows:

    1. For a given input x, specify a certain number of landmarks, denoted l^(1), l^(2), l^(3), ...;
    2. Use x together with each landmark as the input of the kernel function to obtain a new feature; if the kernel function is written similarity(), then

      f_i = similarity(x, l^(i)),

      so the new features correspond one-to-one with the landmarks;

    3. Replace the original features with the new features f, giving the hypothesis: predict y = 1 if θ^T f = θ_0 + θ_1 f_1 + θ_2 f_2 + ... ≥ 0, and y = 0 otherwise.

Now there are two problems:

    1. How to choose landmarks?
    2. What kernel functions are used?

For the first question, the inputs of the training set can be used directly as the landmarks:

    l^(i) = x^(i),  i = 1, 2, ..., m

So the number of features equals the number of training examples, i.e. n = m, and the SVM with a kernel takes the following form:

    min_θ C Σ_{i=1..m} [ y^(i) cost_1(θ^T f^(i)) + (1 - y^(i)) cost_0(θ^T f^(i)) ] + (1/2) Σ_{j=1..m} θ_j²
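
As an illustration (not from the original notes), here is a minimal NumPy sketch of building the new feature vectors when the training inputs are used as the landmarks. It uses the Gaussian kernel introduced in the next paragraphs as the similarity function, and sigma is an arbitrary example value.

    import numpy as np

    def gaussian_kernel(x, landmark, sigma):
        # similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
        return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2))

    def kernel_features(X, landmarks, sigma):
        # Row i of the result is the new feature vector f^(i) for input x^(i).
        # With landmarks = X (the training inputs), the result is m x m and its
        # diagonal entries are 1 (each example is most similar to itself).
        F = np.empty((X.shape[0], landmarks.shape[0]))
        for i in range(X.shape[0]):
            for j in range(landmarks.shape[0]):
                F[i, j] = gaussian_kernel(X[i], landmarks[j], sigma)
        return F

    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
    F = kernel_features(X, X, sigma=1.0)  # shape (3, 3), so n = m = 3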

For the second problem, commonly used kernel functions include the linear kernel, the Gaussian kernel, the polynomial kernel, the sigmoid kernel, the Laplacian kernel, and so on; the Gaussian kernel is used as the example below.

The Gaussian kernel is defined as

    f_i = similarity(x, l^(i)) = exp( -||x - l^(i)||² / (2σ²) )

and has the following properties:

That is, if x and the landmark are close, the value of the kernel function (the new feature) is close to 1; if x and the landmark are far apart, the value of the kernel function is close to 0.

σ is the parameter of the Gaussian kernel, and its size affects how quickly the kernel value changes. Figure 8-3 gives a concrete two-dimensional example, but the property generalizes: the larger σ is, the more slowly the kernel function changes (falls off); conversely, the smaller σ is, the faster the kernel function changes.
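
A small numeric illustration of this behaviour (not from the original notes; the points and the sigma values are arbitrary):

    import numpy as np

    def gaussian_kernel(x, landmark, sigma):
        return np.exp(-np.sum((x - landmark) ** 2) / (2.0 * sigma ** 2))

    x = np.array([0.0, 0.0])
    landmark = np.array([1.0, 1.0])  # squared distance to x is 2

    for sigma in (0.5, 1.0, 2.0):
        print(sigma, gaussian_kernel(x, landmark, sigma))
    # sigma = 0.5 -> ~0.018 (small sigma: the kernel falls off quickly)
    # sigma = 1.0 -> ~0.368
    # sigma = 2.0 -> ~0.779 (large sigma: the kernel falls off slowly)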

Figure 8-3 Examples of the effect of the parameter σ on the Gaussian kernel

    • How do I select parameters?

The following is a brief analysis of how the SVM parameters affect bias and variance:

    • C: since C is positively correlated with 1/λ, applying the analysis of λ from section 6.4.2: a large C (small λ) means lower bias but higher variance (prone to overfitting), while a small C (large λ) means higher bias but lower variance (prone to underfitting).
    • σ²: a large σ² makes the features f_i vary more smoothly, giving higher bias and lower variance; a small σ² gives lower bias and higher variance.

8.5 Using an SVM

The optimization principle of the SVM and the use of kernel functions were briefly described above. When applying an SVM in practice, we do not need to implement the training algorithm ourselves to obtain the parameters; we usually use existing software packages (such as LIBLINEAR or LIBSVM).

But the following work is what we need to do:

    • Select the value of the parameter C
    • Select and implement a kernel function
      • If the kernel function has parameters, they need to be selected; for example, the Gaussian kernel requires choosing σ²
      • If no kernel is used (i.e. the linear kernel is selected), the result is a linear classifier, which is suitable when n is large and m is small
      • A nonlinear kernel (such as the Gaussian kernel) is suitable when n is small and m is large

Here are some points to note:

    • Perform feature scaling (normalization) before using kernel functions
    • Not all functions are valid kernel functions; they must satisfy Mercer's theorem
    • Selecting the parameter C or the kernel parameters should be done with the training set and the cross-validation set (see section 6.3); a usage sketch follows this list
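
For illustration only, here is a minimal scikit-learn sketch (scikit-learn is not mentioned in the notes; it is simply another widely used package alongside LIBSVM/LIBLINEAR). It combines feature scaling, a Gaussian (RBF) kernel, and cross-validated selection of C and the kernel parameter; note that scikit-learn parameterizes the RBF kernel with gamma, which plays the role of 1/(2σ²), and the toy data is a stand-in for a real dataset.

    import numpy as np
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Toy data: label is 1 outside the unit circle (nonlinear boundary).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Feature scaling followed by a Gaussian (RBF) kernel SVM.
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

    # Select C and gamma (gamma ~ 1 / (2 * sigma^2)) by cross-validation.
    param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1]}
    search = GridSearchCV(model, param_grid, cv=5)
    search.fit(X_train, y_train)

    print(search.best_params_)
    print(search.score(X_test, y_test))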

8.5.1 Multi-Class Classification

As with logistic regression, a problem with K classes can be handled by the one-vs-all method: train K SVMs, each separating one class from the rest, and classify a new input into the class whose classifier gives the largest output. In practice, most SVM software packages already have multi-class classification built in.
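
A minimal one-vs-all sketch (again assuming scikit-learn; OneVsRestClassifier, LinearSVC and the synthetic data are illustrative choices, not from the original notes):

    from sklearn.datasets import make_classification
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    # Toy 3-class problem.
    X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                               n_classes=3, random_state=0)

    # One-vs-all: one linear SVM per class; predict the class with the largest score.
    clf = OneVsRestClassifier(LinearSVC(C=1.0))
    clf.fit(X, y)
    print(clf.predict(X[:5]))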

8.5.2 Logistic Regression vs. SVMs

With n features and m training examples, a rough guideline: if n is large relative to m, use logistic regression or an SVM with a linear kernel; if n is small and m is intermediate, use an SVM with a Gaussian kernel; if n is small and m is very large, create more features and then use logistic regression or a linear-kernel SVM. A well-designed neural network is likely to work well in all of these settings, but may be slower to train.

Reference: "Machine learning" Zhou Zhihua
