A personal understanding of SVM --- easy to understand

Source: Internet
Author: User
Tags: svm, rbf, kernel

Original: http://blog.csdn.net/arthur503/article/details/19966891

I used to think SVM was very powerful and mysterious. The principle itself is not hard to understand, but, as the saying goes, "the master's skill lies in defining an idea with mathematics and describing it with physics." I felt this deeply in the mathematical part of SVM: least squares, gradient descent, Lagrange multipliers, the duality problem, and so on all left me dizzy. Only after listening to the lecture at Pui Yuen did the whole mathematical derivation, from beginning to end, become clear to me.

1. Why should we study linear classification?

First of all, why must a data set be described as linearly separable or linearly inseparable? Can't we just separate it nonlinearly? A nonlinear separation is of course what we ultimately want, but SVM simply maps the originally linearly inseparable data points into a new space and turns them into linearly separable data in that new space for classification. Mapped back into the original data space, the separation is still nonlinear. So why not separate the data nonlinearly in the original space directly, instead of going to a new space to separate it linearly? First, nonlinear separation is much more complex than linear separation. Linear separation only needs a straight line, a plane, and so on, which are about the simplest forms a decision boundary can take. Nonlinear separation has far more cases: even restricting ourselves to two-dimensional space, there are curves, polylines, hyperbolas, conic sections, wavy lines, and all kinds of irregular curves, and there is no way to handle them uniformly. Even if we manage to obtain a nonlinear classification result for one specific problem, it cannot be extended well to other situations; asking a mathematician to build a custom curve model for every specific problem would be far too troublesome, and nobody has that much time and energy. Therefore, linear classification is used, first, because it is simple and its properties are easy to study thoroughly, and second, because it generalizes well: once the linear case is studied, all the other problems are covered and no other model needs to be built. So although SVM takes the extra step of mapping the raw data into a new space, which seems to increase the workload, and finding that new mapping space is not easy at first glance, overall, once the theory is worked out, it saves far more effort than the alternatives.

2. What is the idea of SVM?

2.1 Hard margin support vector machine

One of the most critical ideas in SVM is the introduction and definition of the concept of the "margin". The concept itself is simple. Taking two-dimensional space as an example, it is just the distance between a point and the classification line. Suppose the line is y = wx + b. The line is the best classification line when the distance from the closest positive points to the line, together with the distance from the closest negative points to the line, is as large as possible; in other words, the margin between the two classes is maximized. In this way, the original problem is transformed into a constrained optimization problem, which can be solved directly. This is called hard margin maximization, and the resulting SVM model is called the hard margin support vector machine.
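To make this concrete, here is a minimal sketch (assuming Python with NumPy and scikit-learn; the toy data points are made up for illustration). Setting a very large C makes the soft margin solver behave essentially like the hard margin case on separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Two tiny, linearly separable point clouds (made-up toy data).
X = np.array([[1.0, 2.0], [2.0, 3.0], [2.0, 1.5],   # positive class
              [6.0, 5.0], [7.0, 7.0], [6.5, 6.0]])  # negative class
y = np.array([1, 1, 1, -1, -1, -1])

# A huge C effectively forbids margin violations, approximating a hard margin.
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, y)

print("w =", clf.coef_[0])           # normal vector of the separating line
print("b =", clf.intercept_[0])      # offset
print("support vectors:", clf.support_vectors_)
```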

2.2 Soft margin support vector machine

But a new problem arises. In practical applications, the data we get are not always perfectly linearly separable; there may be a few noise points that get mistakenly mixed into the other class. If these particular noise points were removed, the data could easily be separated linearly. However, we do not know which points in the data set are noise, and if we solve the problem with the previous method, it simply cannot be separated linearly. Is there no way out? Suppose the line y = x + 1 divides the data into two classes, and each class contains a few noise points from the other class. To a human eye, the data can still be divided into two classes, because the human brain tolerates a certain amount of error: we still use the line y = x + 1 and accept the classification with the smallest error as the best one. In the same way, we introduce the concept of error into SVM and call it the "slack variable". After adding slack variables, the error they contribute must be added to the original distance objective, so the final optimization objective consists of two parts: the margin term and the slack-variable error term. These two parts are not equally important; their balance depends on the specific problem, so we multiply the slack-variable error term by a weight parameter C, and by adjusting C we can trade the two off against each other. If we can tolerate noise, we make C small, lowering the weight of the error term so that it becomes unimportant; conversely, if we need a model that is very strict about noise, we make C larger, raising the weight so the error term becomes more important. By adjusting the parameter C, the model can be controlled. This is called soft margin maximization, and the resulting SVM is called a soft margin support vector machine.
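As a rough illustration (again assuming scikit-learn; the noisy toy data and the two C values are my own choices), a small C tolerates a few misclassified points to keep a wide margin, while a large C tries hard to classify every training point correctly:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs, so a few points land on the wrong side (toy data).
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([3, 3], 1.0, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: training accuracy={clf.score(X, y):.2f}, "
          f"support vectors={len(clf.support_)}")
```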

2.3 Nonlinear support vector machine

The previous hard margin and soft margin support vector machines solve problems on linear or approximately linear data sets. But what if there is a lot of noise, or the data simply is not linearly separable at all? The most common example is in a two-dimensional Cartesian coordinate system: draw a circle of radius 1 centered at the origin (0, 0). The points inside the circle and the points outside the circle certainly cannot be separated linearly in two-dimensional space. However, anyone who has learned junior-middle-school geometry knows that the points inside the circle (including the circle itself) satisfy x^2 + y^2 ≤ 1, while the points outside satisfy x^2 + y^2 > 1. Suppose we add a third dimension z = x^2 + y^2. Then in the three-dimensional space we can decide whether a point is inside or outside the circle simply by whether z is greater than 1. In this way, data that is not linearly separable in two-dimensional space can easily be separated in the three-dimensional space. This is the nonlinear support vector machine.
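A minimal sketch of that circle example (assuming NumPy and scikit-learn; the sampled points are made up): the data is not linearly separable in (x, y), but after adding the feature z = x^2 + y^2 a plain linear classifier separates it.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Random points in the square [-2, 2]^2; label by whether they fall inside the unit circle.
X2d = rng.uniform(-2, 2, (200, 2))
y = np.where(X2d[:, 0] ** 2 + X2d[:, 1] ** 2 <= 1, 1, -1)

# Explicitly add the third dimension z = x^2 + y^2.
z = (X2d ** 2).sum(axis=1, keepdims=True)
X3d = np.hstack([X2d, z])

print("linear accuracy in 2D:", SVC(kernel="linear").fit(X2d, y).score(X2d, y))
print("linear accuracy in 3D:", SVC(kernel="linear").fit(X3d, y).score(X3d, y))
```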

This is a very important idea in SVM. For data that is not linearly separable in an n-dimensional space, there is a greater chance of it becoming linearly separable in an (n+1)-dimensional or higher space (it is not guaranteed to be linearly separable in n+1 dimensions; the higher the dimension, the more likely linear separability becomes, but there is no full guarantee). Therefore, for linearly inseparable data, we can map it into a new space where it is linearly separable, and then solve it with the hard margin or soft margin support vector machine just described. In this way, the original problem becomes: how should the original data be mapped so that it is linearly separable in the new space? In the example above, the mapping can be found by observing the equation of the circle, but real data is certainly not that simple. If we could spot the rule by eye, there would be no need to have a machine run SVM at all.

In practice, it is very difficult to find a suitable mapping function into such a space for a real problem. Fortunately, during the computation we only ever need the inner products of pairs of vectors in the new mapped space, and we do not need to know what the mapping function itself is. This is not easy to accept at first; some people will ask: if we do not know the mapping function, how can we possibly know the inner products in the new space? The answer is that it is indeed possible, and this is where the concept of the kernel function comes in. A kernel function is a function of the following kind. Again taking two-dimensional space as an example, suppose that for vectors x and y the mapping into the new space is φ; in the new space they become φ(x) and φ(y), and their inner product is <φ(x), φ(y)>. We define the function kernel(x, y) = <φ(x), φ(y)>. As you can see, kernel(x, y) is a function of x and y alone, and it can be evaluated without ever writing φ down explicitly! What a wonderful property. We no longer have to care what mapping φ actually is; we only need to compute kernel(x, y) to get the inner product in the high-dimensional space, and that value can be plugged directly into the earlier support vector machine computation. Mom really no longer has to worry about my studies.
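A small numerical check of that claim (a sketch under my own assumptions, using the degree-2 polynomial kernel as the example): the kernel value (x·y)^2 equals the inner product of the explicitly mapped vectors φ(x) = (x1^2, √2·x1·x2, x2^2), so the mapping never has to be materialized.

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def kernel(x, y):
    """Kernel function: the same inner product, computed without using phi."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])   # made-up example vectors
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))  # inner product in the mapped 3D space
print(kernel(x, y))            # identical value, computed directly in 2D
```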

After getting this delightful function, we need to calm down and ask: where does this kernel function come from? How is it obtained? Can it really solve every problem of mapping to a high-dimensional space?

I will try to answer these questions as far as I understand them. Kernel functions are not easy to find; they are generally derived or pieced together by mathematicians. The ones known today include the polynomial kernel, the Gaussian kernel, string kernels, and so on. The support vector machine corresponding to the Gaussian kernel is the Gaussian radial basis function (RBF) SVM, and the RBF kernel is the most commonly used kernel function.

The RBF kernel can extend the dimension to an infinite-dimensional space, so in theory it can satisfy any mapping need. Why infinite-dimensional? I did not understand this well before. Later the teacher explained that the RBF kernel corresponds to a Taylor series expansion: in a Taylor series, a function is decomposed into an infinite sum of terms, and each term can be regarded as one dimension, so the original function can be considered to be mapped into an infinite-dimensional space. For this reason, in practical applications the RBF kernel is usually the best default choice. Of course, if you have studied the problem, you can choose other kernel functions, which may work better on some problems. However, the RBF kernel is the one that works well on the widest range of problems when you know nothing about the problem, so it is also the most widely used.
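As a rough sketch (assuming scikit-learn; gamma = 0.5 and the toy data are my own choices), the RBF kernel K(x, y) = exp(-gamma·||x - y||^2) can be handed to the same solver and separates the circle data from section 2.3 without building any explicit mapping:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
# The circle data from section 2.3 again (made-up toy data).
X = rng.uniform(-2, 2, (200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 <= 1, 1, -1)

# No explicit mapping is built; the solver only ever evaluates the kernel.
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)
print("training accuracy with the RBF kernel:", clf.score(X, y))
```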

In this way, even linearly inseparable data can be mapped by the RBF kernel into a high-dimensional, even infinite-dimensional, space, and the problem can then be solved by maximizing the margin and using slack variables. Of course, in the actual solution there are mathematical tricks that simplify the computation; for example, the Lagrange multiplier method is used to convert the original problem into its dual problem, which is easier to compute. These were not needed in my experiments, and the mathematics is somewhat difficult, so I will set it aside for now.
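For reference, the soft margin problem and its dual, in the standard textbook form (the notation here is mine, not taken from the original post), look like this; the kernel K simply replaces the inner product in the dual:

```latex
% Soft margin primal problem
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i(w\cdot x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

% Dual problem obtained via Lagrange multipliers \alpha_i
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i
  - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j K(x_i, x_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_{i=1}^{n}\alpha_i y_i = 0
```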

==========

Original: http://blog.csdn.net/viewcode/article/details/12840405

There are many articles introducing SVM, and many of them are very detailed, but I could never understand them, and I never got to the heart of the mystery.

This time, I will use my own superficial language to lift the veil between me and SVM.

1. What problem does SVM solve?

Before, I rushed straight into SVM's applications, introductions, optimization methods, and so on, and never really thought about what problem SVM is actually trying to solve.

A commonly used pair of diagrams is usually shown to explain what SVM asks for. The most basic application of SVM is classification: solve for the optimal classification surface, then use it to classify.

Definition of the optimal classification surface: for SVM, it is the classification surface for which the minimum distance from the two point sets to the surface is maximal, that is, the distance from the edge points of the two point sets to the surface is as large as possible.

Intuitively, the surface on the left of the usual diagram is certainly not the optimal classification surface, while the one on the right makes the distance visibly larger and uses more support points; such a surface, determined by at least three support points, should be the optimal classification surface.

So does an optimal classification surface always need two, or at least three, points to determine it? That depends on the actual situation. For example, in the left figure the optimal classification surface is determined by three points: two points of different classes determine a centre point, and two points of the same class determine the direction vector, so this optimal surface needs three points. In the right figure, however, the optimal classification surface is simply the plane perpendicular to, and bisecting, the segment joining two points of different classes, so it needs only two points.

This kind of case analysis makes the approach to solving for the optimal classification surface complicated, with many patterns to handle. With a brute-force method, at least the following procedure would be needed: first take two points of different classes and solve for the perpendicular bisecting plane of the segment joining them, as in the figure above; then compute the distance from every other point to that plane, and if some distance is smaller (or negative, i.e. a classification error), switch to the three-point pattern of the left figure; then enumerate over all the points. Handled in this most direct way, the computational complexity is about m·n·n (taking n > m). And this is before any high-dimensional mapping is involved; with high-dimensional mapping added, the algorithm would presumably be even more complex. So the brute-force method is not realistic.

2. From intuition to mathematical inference

From intuition to a formula: intuitively, there exists an optimal hyperplane. So let us assume the formula of this optimal surface is

w · x + b = 0.

Then for the whole point set X there are boundary surfaces, parallel to the optimal hyperplane, on which the edge points lie:

w · x_i + b ≥ 1 or w · x_i + b ≤ −1, where y_i is +1 or −1.

We want to maximize the distance between these two parallel hyperplanes, that is, maximize 2/‖w‖, or equivalently minimize w, i.e. min ‖w‖.

The other condition is the constraint w · x_i + b ≥ 1 or w · x_i + b ≤ −1.

This goes a bit beyond the usual calculus approach (if you have not studied optimization theory), because it is an extremum problem with inequality constraints. It is the typical QP (quadratic programming) problem. Calculus does have a theory of constrained extrema using the Lagrange multiplier method, but there the constraints are equations. Therefore the inequalities must be converted into an equation form. The method is to introduce variables: each point is given a coefficient α_i, which is greater than 0 if the point is a boundary point and 0 otherwise, so that

α_i · (y_i · (w · x_i + b) − 1) = 0.
On the other hand, the α_i can also be regarded as Lagrange coefficients, and the Lagrange multiplier method can then be used to find the extremum. Since the α_i are themselves unknown, we also need to solve for them.

That is, we solve min (max L). The inner max over L appears because each hyperplane constraint above, taken with a minus sign, is ≤ 0, so the sum of those terms attains its maximum value 0, and at that maximum L reduces to the original objective.
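Written out explicitly (standard textbook form; the original blog had no formula editor and pasted screenshots instead), the min–max problem referred to here, together with the relations obtained by differentiating with respect to w and b, is:

```latex
L(w, b, \alpha) = \frac{1}{2}\|w\|^2
  - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right],
\qquad \min_{w,\,b}\ \max_{\alpha \ge 0}\ L(w, b, \alpha)

\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0
```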

First, for the min part, find the extremum by differentiating L with respect to w and b; this yields the relations written out above and finally leads to a simpler formula. Going from min to max is also a dual conversion process, known as the dual maximum extremum problem; it has only one equality constraint, and its disadvantage is that the number of unknown variables increases. The next step is to use an optimization method to find the extremum: give the unknown variables initial values, then use the points in the data set one by one to train, until the unknown variables converge.

3. Solving SVM with SMO

We have gone from the simple boundary-classification intuition to the complicated Lagrange solution. In fact, for the quadratic programming problem there are classical optimization methods such as steepest descent and Newton's method. SMO is an optimization algorithm designed for SVM; it sidesteps the classical quadratic programming machinery by eliminating w and converting the problem into one over the α_i. It is a more efficient solution that uses the KKT conditions plus a string of deductions to arrive at its final formulas. So many formulas and terms really give me a headache; for now I can only memorize them and digest them slowly later.

Understanding the principle: the objective still contains products α_i·α_j·..., so it is still a multi-dimensional programming problem, and we therefore make some simplifying assumptions:

1. Suppose all the α except α1 are held fixed. Then, because of the constraint ∑_{i=1}^{n} α_i·y_i = 0, α1 is determined directly and cannot be optimized on its own.
2. Suppose α1 and α2 are variables and the others are constants. Then α2 can be expressed in terms of α1; substituting into the objective function gives a quadratic function of a single variable, whose extremum is easy to find. The constraints still have to be respected, namely 0 ≤ α_i ≤ C. In short, finding the extremum this way is convenient and feasible.

Using this method, we pick different pairs α_i, α_j, find the extremum for each, and then select the best. SMO uses exactly this principle, but it does not pick the α in sequence or at random: it uses a heuristic to select the best two coordinates each time. John C. Platt's paper "Fast Training of Support Vector Machines Using Sequential Minimal Optimization" gives both the principles and pseudo-code for reference. The introduction at http://blog.pluskid.org/?page_id=683 is also relatively easy to follow.

4. Types of SVM, application scenarios, and complexity

Space complexity of SVM: the memory required grows with the square of the number of training samples. According to "A Tutorial on Support Vector Machines for Pattern Recognition" (1998, Kluwer Academic Publishers, Boston), the training computational complexity is O(N_sv^3 + L·N_sv^2 + d·L·N_sv) and O(d·L^2), where N_sv is the number of support vectors, L is the number of training samples, and d is the dimension of each sample (the original dimension, not the dimension after mapping to the high-dimensional space).

Overall, depending on the application scenario, the complexity of SVM's SMO algorithm scales between roughly ~n and ~n^2.2, while chunking scales between roughly ~n^1.2 and ~n^3.4; in general SMO has about a one-order advantage over the chunking algorithm. According to the original paper, SMO is up to about 1000 times faster for linear SVMs and about 15 times faster for nonlinear ones. The memory requirement of the SMO algorithm is linear, which makes it possible to train on larger data sets.

Even so, if the amount of data is large, SVM training time will be long. For spam classification and detection, for example, people often do not use an SVM classifier but a simple naive Bayes classifier, or a logistic regression model.

--------------------- Other viewpoints: SVM can achieve much better results than other algorithms on small training sets. The support vector machine is one of the most commonly used and best-performing classifiers, thanks to its excellent generalization ability: its optimization goal is structural risk minimization rather than empirical risk minimization, so through the concept of the margin it obtains a structured description of the data distribution, thereby lowering the requirements on data size and data distribution. On the other hand, SVM is not better than other algorithms in every scenario, and it is best to try several algorithms on each application and then evaluate the results; for example, in mail classification SVM is not as effective as logistic regression, KNN, or Bayes.

What do the SVM parameters mean? sigma: the parameter of the RBF kernel function, used in generating the high-dimensional features; several kernel functions are in common use, such as the radial basis kernel and the linear kernel, and the choice is usually made from experience. C: the penalty factor; in the optimization objective it is the penalty coefficient on outliers and reflects how much weight is given to them. It too is chosen by experience and experimentation.
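As a rough example of that "experience and experimentation" (assuming scikit-learn; the parameter grid and toy data are made up for illustration), a cross-validated grid search is a common way to pick C and the RBF width gamma, where gamma plays the role of 1/(2·sigma^2):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy two-class data (made up); replace with the real problem's features and labels.
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([2, 2], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

param_grid = {
    "C": [1e-2, 1e-1, 1, 10, 100],   # penalty factor
    "gamma": [1e-2, 1e-1, 1, 10],    # RBF kernel width parameter
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best cross-validation accuracy:", round(search.best_score_, 3))
```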
SVM types:

C-SVM: classification SVM; the parameters to tune are the penalty factor C and the kernel parameters. C is typically swept over values such as 10^-4, 10^-3, 10^-2, ... up to 1, 5, ... and larger.

nu-SVM: classification SVM that is to some extent equivalent to C-SVM, with the penalty factor C replaced by the factor nu; its optimized objective is slightly different. nu takes values between 0 and 1 and is generally chosen between 0.1 and 0.8; a value near 0 allows only a small fraction of samples to fall inside the margin, while a value near 1 allows many samples to fall inside it. In the exact words of the wiki: "The main motivation for the nu versions of SVM is that it has a more meaningful interpretation. This is because nu represents an upper bound on the fraction of training samples which are errors (badly predicted) and a lower bound on the fraction of samples which are support vectors. Some users feel nu is more intuitive to use than C or epsilon."

C-SVR: the SVM model for regression. nu-SVR: as above, the nu version for regression. (A small sketch comparing the C and nu classifiers appears at the end of this article.)

---------------------

5. Other related concepts

VC dimension: to classify n points into, say, two classes, there are 2^n possible labelings, which can be understood as 2^n learning problems. If there is a hypothesis class H that can correctly classify all 2^n of them, then this number of points, n, is the VC dimension of H. The definition is rather stiff and for now can only be memorized. One instance: the VC dimension for linear separation of 3 points in the plane is 3. In the plane the VC dimension is not 4, because there is no set of 4 sample points that can be split in all 2^4 = 16 ways: the two diagonal pairs of points cannot be separated into two classes by a line. More generally, in r-dimensional space the VC dimension of linear decision surfaces is r+1.

Confidence risk: the error the classifier makes when classifying unknown samples, also called the expected risk. Empirical risk: the error a trained classifier makes when re-classifying the training samples, i.e. the sample error. Structural risk: a combination of [confidence risk, empirical risk], for example (confidence risk + empirical risk)/2.

The factors that determine the confidence risk are the number of training samples and the VC dimension of the classification function. The more training samples there are, the smaller the confidence risk can be; the larger the VC dimension, the more possible solutions there are, the worse the generalization ability, and the larger the confidence risk. Therefore, to reduce the confidence risk we want to increase the number of samples and reduce the VC dimension. For a general classification function, on the other hand, we tend to raise the VC dimension, that is, the feature dimension of the samples, in order to reduce the empirical risk, as with polynomial classification functions; as a result the confidence risk becomes higher, and the structural risk rises correspondingly. Over-fitting during learning is exactly the case where the confidence risk grows too high.

Structural risk minimization, SRM (structural risk minimization), considers both the empirical risk and the confidence risk, so that a good classification effect can be obtained in the small-sample case. While guaranteeing classification accuracy (empirical risk), reducing the VC dimension of the learning machine keeps its expected risk over the whole sample set under control; this should be the principle behind SRM.

When the training samples are given, the larger the classification margin, the smaller the VC dimension of the corresponding set of classification hyperplanes.
(This is how the requirement on the classification margin affects the VC dimension.) The former guarantees that the empirical risk is minimal (empirical risk and expected risk depend on the choice of the learning machine's function family), while the latter, by maximizing the classification margin, makes the VC dimension smallest, and thus in fact makes the confidence range in the generalization bound smallest, so that the smallest true risk is achieved.

When the training samples are linearly separable, all the samples can be classified correctly (isn't this exactly the legendary condition y_i·(w·x_i + b) ≥ 1?), i.e. the empirical risk R_emp is 0, and by maximizing the classification margin (yes, this is the φ(w) = (1/2)·w·w term) the classifier obtains the best generalization performance.

For the linearly inseparable case, wrongly classified points are allowed; that is, the classification margin is reduced for outliers. The farther an outlier lies from the original classification surface, the more serious it is. That distance can be expressed with a value, the slack variable, and only outliers have slack variables. Of course this value has to be limited, so a penalty term is added to the function being minimized, with a penalty coefficient C that can be set by hand. When C is infinitely large, the problem degenerates into the hard margin problem, no outliers are allowed, and the problem may have no solution; when C = 0, outliers are ignored entirely. Sometimes several values of C have to be tried before a reasonably good one is found. There is a lot of analysis behind this, to be studied later.

Kernel function: converts a completely inseparable problem into a separable or approximately separable one. Slack variables: handle the approximately separable problem.
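To close, a small sketch (assuming scikit-learn; the toy data and the particular C and nu values are arbitrary choices of mine) comparing the C and nu classification formulations mentioned above:

```python
import numpy as np
from sklearn.svm import SVC, NuSVC

rng = np.random.default_rng(0)
# Two slightly overlapping blobs (made-up toy data).
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([2.5, 2.5], 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

c_svm = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)      # C-SVM
nu_svm = NuSVC(kernel="rbf", nu=0.3, gamma=0.5).fit(X, y)  # nu-SVM

print("C-SVM  support vectors:", len(c_svm.support_), "accuracy:", c_svm.score(X, y))
print("nu-SVM support vectors:", len(nu_svm.support_), "accuracy:", nu_svm.score(X, y))
```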
