In machine learning, the support vector machine (SVM) is a supervised learning model, with associated learning algorithms, that analyzes data for classification and regression. Given a set of training samples, each marked as belonging to one of two classes, an SVM training algorithm builds a model that assigns new samples to one class or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the samples as points in space, mapped so that samples from the different categories are separated by as wide a gap as possible. New samples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall. In addition to performing linear classification, SVMs can efficiently perform non-linear classification by using the kernel trick, implicitly mapping their inputs into a high-dimensional feature space. When the data are unlabeled, supervised learning is not possible and an unsupervised approach is required; such an approach tries to find the natural clustering of the data into groups and then maps new data into those groups. The clustering algorithm that extends SVMs in this way is called support vector clustering (SVC), and it is widely used in industrial applications as a preprocessing step when the data are unlabeled or only partially labeled.
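As a quick illustration of the classification use just described, here is a minimal sketch, assuming scikit-learn is available; the toy points and labels are made up for the example:

```python
# Fit a linear SVM on a tiny two-class dataset and classify a new point.
from sklearn import svm

X = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]]  # toy training samples
y = [-1, -1, 1, 1]                                    # their two-class labels

clf = svm.SVC(kernel="linear")  # linear kernel: a maximum-margin hyperplane
clf.fit(X, y)

print(clf.predict([[2.5, 2.5]]))  # side of the gap the new sample falls on
print(clf.support_vectors_)       # the training points that define the gap
```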
Definition
More formally, a support vector machine constructs a hyperplane, or a set of hyperplanes, in a high-dimensional or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training point of any class; this distance is called the functional margin, since in general the larger the margin, the lower the generalization error of the classifier.
However, the original problem may be stated in a finite-dimensional space, and the data are often not linearly separable in that space. For this reason, the original finite-dimensional space can be mapped into a much higher-dimensional space, in which the separation is presumably easier. To keep the computational load reasonable, the mappings used by SVM schemes are designed so that dot products of pairs of input vectors can be computed easily in terms of the variables in the original space, by defining them through a kernel function $k(x, y)$ selected to suit the problem. A hyperplane in the high-dimensional space is defined as the set of points whose inner product (dot product) with a fixed vector in that space is constant. The vectors defining the hyperplane can be chosen as linear combinations, with parameters $\alpha_i$, of the images of the feature vectors $x_i$ that occur in the database. With this choice, the points $x$ in the feature space that are mapped into the hyperplane are defined by the relation: $$\sum_i \alpha_i k(x_i, x) = \mathrm{constant}$$
Note that if $k(x, y)$ becomes small as $y$ moves away from $x$, each term in the sum measures the closeness of the test point $x$ to the corresponding data base point $x_i$ (training point). In this way, the sum of kernels above can be used to measure the relative proximity of each test point to the data points originating in one or the other of the sets to be separated.
Notice also that the set of points $x$ mapped into any such hyperplane can be quite convoluted, which allows far more complex separations between collections that are not at all convex in the original space.
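The kernel expansion above can be evaluated directly. Below is an illustrative sketch; the Gaussian kernel, the coefficients `alpha`, and the sample points are assumptions made for the example, not values from the text:

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def decision_value(x, train_points, alpha, gamma=0.5):
    """Evaluate sum_i alpha_i * k(x_i, x); comparing this value against a
    constant decides which side of the feature-space hyperplane x lies on."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alpha, train_points))

train_points = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
alpha = np.array([0.7, -1.2, 0.5])  # hypothetical expansion coefficients
print(decision_value(np.array([0.5, 0.5]), train_points, alpha))
```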
Motivation
(Figure: H1 does not separate the classes; H2 does, but only with a small margin; H3 separates them with the maximal margin.) Classifying data is a common task in machine learning. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the SVM setting, a data point is viewed as a $p$-dimensional vector (a list of $p$ numbers), and we want to know whether we can separate such points with a $(p-1)$-dimensional hyperplane; this is the linear classifier. There are many hyperplanes that might classify the data. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane so that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as the maximum-margin classifier; equivalently, the perceptron of optimal stability.
We are given a training dataset of $n$ points of the form $$(x_1, y_1), \ldots, (x_n, y_n),$$ where each $y_i$ is either $1$ or $-1$, indicating the class to which the point $x_i$ belongs, and each $x_i$ is a $p$-dimensional real vector. We want to find the maximum-margin hyperplane that divides the points with $y_i = 1$ from those with $y_i = -1$, so that the distance from the hyperplane to the nearest point on either side is maximized.
Any hyperplane can be written as the set of points $x$ satisfying $$w \cdot x - b = 0,$$
where $w$ is the (not necessarily normalized) normal vector to the hyperplane, and the parameter $\frac{b}{\|w\|}$ determines the offset of the hyperplane from the origin along the normal vector $w$.
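The geometry above can be made concrete with a short sketch; the vector `w` and offset `b` below are arbitrary illustrative values:

```python
import numpy as np

w = np.array([2.0, 1.0])  # normal vector of the hyperplane (not normalized)
b = 3.0

# Offset of the hyperplane w·x - b = 0 from the origin along w: b / ||w||.
print(b / np.linalg.norm(w))

def signed_distance(x, w, b):
    """Signed perpendicular distance of point x from the hyperplane w·x - b = 0."""
    return (np.dot(w, x) - b) / np.linalg.norm(w)

print(signed_distance(np.array([1.0, 2.0]), w, b))
```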
Hard margin
If the training data are linearly separable, we can select two parallel hyperplanes that separate the two classes of data, such that the distance between them is as large as possible. The region bounded by these two hyperplanes is called the "margin", and the maximum-margin hyperplane lies exactly halfway between them. These two hyperplanes can be described by the equations:
$$w \cdot x - b = 1 \quad \text{and} \quad w \cdot x - b = -1$$
Geometrically, the distance between these two hyperplanes is $\frac{2}{\|w\|}$, so to maximize that distance we minimize $\|w\|$. Since we also need to prevent data points from falling into the margin, we add the following constraint:
for each $i$, either $w \cdot x_i - b \ge 1$ if $y_i = 1$, or $w \cdot x_i - b \le -1$ if $y_i = -1$. These constraints state that each data point must lie on the correct side of the margin. They can be rewritten as: $$y_i (w \cdot x_i - b) \ge 1, \quad \text{for all } 1 \le i \le n.$$
Putting this together, we get the optimization problem: $$\min \|w\| \quad \text{subject to } y_i (w \cdot x_i - b) \ge 1, \; i = 1, \ldots, n.$$
The $w$ and $b$ that solve this problem determine our classifier, $x \mapsto \operatorname{sgn}(w \cdot x - b)$.
An obvious but important consequence of this geometric description is that the maximum-margin hyperplane is completely determined by the data points $x_i$ that lie nearest to it. These points are called support vectors.
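A small numerical sketch of these ideas follows; the data, `w`, and `b` are hypothetical values chosen so the hard-margin constraints hold. It checks the constraints $y_i (w \cdot x_i - b) \ge 1$, reports the margin width $2/\|w\|$, and picks out the support vectors, i.e. the points that meet the constraint with equality:

```python
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 4.0], [1.0, 1.0], [0.0, 0.0]])  # toy points
y = np.array([1, 1, -1, -1])                                    # their labels
w = np.array([0.5, 0.5])  # hypothetical maximum-margin parameters
b = 2.0

margins = y * (X @ w - b)             # functional margins y_i (w·x_i - b)
print(np.all(margins >= 1 - 1e-9))    # True: every point is outside the margin
print(2 / np.linalg.norm(w))          # geometric width of the margin, 2 / ||w||
print(X[np.isclose(margins, 1.0)])    # support vectors lie exactly on the margin
```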
Soft margin
To extend the SVM to cases in which the data are not linearly separable, we introduce the hinge loss function: $$\max\left(0,\; 1 - y_i (w \cdot x_i - b)\right)$$
This function is zero if the constraint $y_i (w \cdot x_i - b) \ge 1$ is satisfied, in other words, if $x_i$ lies on the correct side of the margin. For data on the wrong side of the margin, the function's value is proportional to the distance from the margin. We then wish to minimize the expression: $$\left[\frac{1}{n}\sum_{i=1}^{n}\max\left(0,\; 1 - y_i (w \cdot x_i - b)\right)\right] + \lambda \|w\|^2.$$
The parameter $\lambda$ determines the trade-off between increasing the margin size and ensuring that the $x_i$ lie on the correct side of the margin. Thus, for sufficiently small values of $\lambda$, the soft-margin SVM behaves identically to the hard-margin SVM if the input data are linearly separable, but it still learns a viable classification rule if they are not.
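Here is a brief sketch of evaluating this soft-margin objective; the data, `w`, `b`, and the weight `lam` standing in for $\lambda$ are illustrative values:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, lam):
    """(1/n) * sum_i max(0, 1 - y_i (w·x_i - b)) + lam * ||w||^2."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w - b))  # hinge loss per sample
    return hinge.mean() + lam * np.dot(w, w)

X = np.array([[3.0, 3.0], [1.0, 1.0], [2.0, 1.5]])
y = np.array([1, -1, 1])
# The third point falls on the wrong side of the margin, so it contributes a
# positive hinge loss; the other two contribute zero.
print(soft_margin_objective(np.array([0.5, 0.5]), 2.0, X, y, lam=0.01))
```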
Non-linear classification
The original maximum-margin hyperplane algorithm was proposed by Vapnik in 1963 and constructed a linear classifier. However, in 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested using the kernel trick to create non-linear classifiers from maximum-margin hyperplanes. The resulting algorithm is formally similar to the linear one, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be non-linear and the transformed space high-dimensional; although the classifier is a hyperplane in the transformed feature space, it may be non-linear in the original input space.
It is worth noting that working in a higher-dimensional feature space increases the generalization error of the SVM, although given enough sample data the algorithm still performs well.
Some common kernel functions include the homogeneous polynomial kernel, the inhomogeneous polynomial kernel, the Gaussian radial basis function (RBF) kernel, and the hyperbolic tangent kernel.
The kernel is related to the transform $\varphi(x)$ by the equation $k(x, y) = \varphi(x) \cdot \varphi(y)$. The value of $w$ also lives in the transformed space, with $w = \sum_i \alpha_i y_i \varphi(x_i)$, and dot products with $w$ for classification can again be computed by the kernel trick: $$w \cdot \varphi(x) = \sum_i \alpha_i y_i\, k(x_i, x).$$
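The following sketch illustrates the kernels listed above and classification via the kernel trick, using $w \cdot \varphi(x) = \sum_i \alpha_i y_i\, k(x_i, x)$; the coefficients `alpha`, the offset `b`, and all kernel parameters are assumed values for illustration:

```python
import numpy as np

def poly_kernel(x, z, d=3, c=1.0):
    """Polynomial kernel (x·z + c)^d; c = 0 gives the homogeneous case."""
    return (np.dot(x, z) + c) ** d

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian radial basis function kernel exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def tanh_kernel(x, z, kappa=0.1, c=-1.0):
    """Hyperbolic tangent kernel tanh(kappa * x·z + c)."""
    return np.tanh(kappa * np.dot(x, z) + c)

def kernel_predict(x, X_train, y_train, alpha, b, kernel=rbf_kernel):
    """sgn(sum_i alpha_i y_i k(x_i, x) - b): the classifier never forms phi(x)."""
    s = sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y_train, X_train))
    return np.sign(s - b)

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([-1, -1, 1])
alpha = np.array([0.2, 0.8, 1.0])  # hypothetical dual coefficients
print(kernel_predict(np.array([1.8, 1.9]), X_train, y_train, alpha, b=0.0))
```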
Computing the SVM classifier
Computing the (soft-margin) SVM classifier amounts to minimizing the expression $$\left[\frac{1}{n}\sum_{i=1}^{n}\max\left(0,\; 1 - y_i (w \cdot x_i - b)\right)\right] + \lambda \|w\|^2. \tag{2}$$
We focus on the soft-margin classifier since, as noted above, choosing a sufficiently small value for $\lambda$ yields the hard-margin classifier for linearly separable input data. The classical approach, which reduces expression (2) to a quadratic programming problem, is detailed below. Then, more recent approaches such as sub-gradient descent and coordinate descent are discussed.
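As a hedged sketch of the sub-gradient descent idea just mentioned (not the exact procedure discussed later; the step size `eta`, iteration count, and toy data are illustrative assumptions):

```python
import numpy as np

def svm_subgradient_descent(X, y, lam=0.01, eta=0.1, epochs=200):
    """Minimize (1/n) sum_i max(0, 1 - y_i (w·x_i - b)) + lam * ||w||^2
    by stepping along a sub-gradient of the objective."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w - b)
        active = margins < 1  # points whose hinge loss is non-zero
        grad_w = 2 * lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        grad_b = y[active].sum() / n
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

X = np.array([[3.0, 3.0], [4.0, 4.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = svm_subgradient_descent(X, y)
print(np.sign(X @ w - b))  # class predictions on the toy training points
```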
Primal problem
Minimizing expression (2) can be rewritten as a constrained optimization problem with a differentiable objective function in the following way.
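A standard way to do this (reconstructed here as a sketch; the slack variables $\zeta_i$ are introduced for the reformulation) is to set $\zeta_i = \max\left(0,\; 1 - y_i (w \cdot x_i - b)\right)$ and solve $$\min_{w,\, b,\, \zeta}\; \frac{1}{n}\sum_{i=1}^{n}\zeta_i + \lambda \|w\|^2 \quad \text{subject to } y_i (w \cdot x_i - b) \ge 1 - \zeta_i \text{ and } \zeta_i \ge 0, \text{ for all } i,$$ since for each $i$ the optimal $\zeta_i$ is the smallest nonnegative value satisfying its constraint, which is exactly the hinge loss.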
Dual problem