Preface:
This article is based on my notes on the SVM part of Ng's machine learning course. I had studied some SVM theory and used libsvm before, but this time I learned a lot from Ng's material, and I can now roughly see the path from the logistic regression model to the SVM model.
Basic Content:
When using a linear model for classification, you can regard the parameter vector as the variable. If the cost function is the mean squared error (MSE), its graph is roughly a parabola (with this new variable as the independent variable); if the cost function is the cross-entropy (log-loss) formula, its curve is as follows:
In logistic regression, the cost function can be expressed as follows:
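The original figure is missing here; it should be the standard regularized logistic regression cost from the course:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2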
The first term is the error on the training samples, and the second term is the penalty on the weight coefficients; λ is the weight-penalty coefficient, and m is the number of training samples. Our goal is to find the θ that minimizes the cost. Clearly, this optimization is unaffected by the constant factor involving m, so we can remove m; we can also move the trade-off between prediction error and weight penalty from a coefficient λ on the second term to a coefficient C on the first term. With these changes, the cost function becomes the SVM cost function:
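In the course's notation, this should be:

J(\theta) = C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \, \mathrm{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2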
The functions cost1 and cost0 are the per-sample costs used when a training sample's label is 1 and 0, respectively (piecewise-linear approximations of the corresponding logistic loss terms).
The SVM classifier is also called a large-margin classifier. Its goal is not merely to find a decision boundary that separates the training samples, but to maximize the distance between the boundary and the samples it separates; intuitively, such a classifier has better generalization ability. The plain linear model and logistic regression classifiers above do not have this property.
If C in the SVM cost function is very large (say, in the hundreds of thousands), the accuracy on the training samples can be very high, and the first (error) term can even be driven to 0. In that case the cost-function optimization can be converted into a constrained form:
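Based on the description below, the constrained form should be:

\min_\theta \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^2 \quad \text{s.t.} \quad \theta^T x^{(i)} \ge 1 \;\text{if}\; y^{(i)} = 1, \qquad \theta^T x^{(i)} \le -1 \;\text{if}\; y^{(i)} = 0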
However, a very large C is not necessarily a good thing: it forces the model to take every outlier seriously, which hurts generalization. Of course, these are only high-level intuitions; for precise tuning we still cannot do without the mathematics.
While introducing the SVM's mathematical theory, Ng reviewed a basic fact: the inner product of vectors u and v can be computed as u'v = p * ||u||, where p is the signed length of the projection of v onto u (p can be positive or negative).
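A quick numeric check of this identity, as a minimal sketch with made-up vectors (NumPy assumed):

import numpy as np

u = np.array([3.0, 4.0])               # ||u|| = 5
v = np.array([2.0, 1.0])

inner = u @ v                          # direct inner product: 3*2 + 4*1 = 10
p = inner / np.linalg.norm(u)          # signed projection of v onto u: 10 / 5 = 2
print(inner, p * np.linalg.norm(u))    # both print 10.0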
When C is very large, the objective reduces to half the sum of squares of the parameters. We want this objective to be as small as possible while the inner product between the parameter vector and each sample stays large enough in magnitude (at least 1 for positive samples, at most -1 for negative samples). By the fact above, θ'x = p * ||θ||, so keeping ||θ|| small while satisfying the constraints forces the projection p of each sample onto θ to be large. The θ that meets these requirements is the normal vector of the corresponding decision boundary, and that boundary is the one we want; this is why the SVM classifier is usually called a large-margin classifier.
Next, let's look at the kernel method. First, consider a non-linear classification problem, as shown in the following figure:
In general, we would build a higher-dimensional parametric model to fit the curved decision boundary: we can treat x1, x2, x1*x2, x1^2, x2^2, ... as the features of the original sample (the features here are no longer simply x1 or x2, but also products and powers of them), and θ remains the model parameter to be learned. In the SVM framework, however, we replace x1, x2, x1*x2, x1^2, x2^2, ... with new features f1, f2, f3, .... Each new feature f is the similarity between the original input feature vector x and a certain landmark l (lower-case L), for example:
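The similarity in the course is presumably the Gaussian function:

f_1 = \mathrm{similarity}(x, l^{(1)}) = \exp\left( -\frac{\lVert x - l^{(1)} \rVert^2}{2\sigma^2} \right)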
Therefore, with a trained SVM model, a sample is predicted to have label 1 when the following condition holds:
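In the course's notation, the condition should be:

\theta_0 + \theta_1 f_1 + \theta_2 f_2 + \theta_3 f_3 + \cdots \ge 0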
The remaining question is how to determine the landmarks l, that is, where they come from.
When using an SVM classifier, the usual practice is to treat each training sample as a landmark l; each input x and each l then yield one feature f. So if there are m training samples, each sample x being n-dimensional, the original n-dimensional feature x is replaced by a new m-dimensional feature f. The SVM cost function therefore becomes:
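In the course's notation, it should be (writing f^(i) for the new m-dimensional feature vector of sample i):

J(\theta) = C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1(\theta^T f^{(i)}) + (1 - y^{(i)}) \, \mathrm{cost}_0(\theta^T f^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{m} \theta_j^2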
There are many SVM variants; their cost functions differ mainly in the form of the weight-penalty (regularization) term.
Many parameters need to be tuned when using an SVM library. Taking the Gaussian kernel as an example, at least the C and σ parameters must be considered. C is the penalty coefficient on the training-sample error term: when it is too large, the model fits the training samples well (low bias) but performs poorly on test samples (high variance), i.e., overfitting, and vice versa. σ is the bandwidth (variance) parameter of the Gaussian function: when it is too large, the learned new features f vary smoothly, which is an underfitting phenomenon, and vice versa.
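As a minimal sketch of how these two knobs interact, using scikit-learn's SVC (which wraps libsvm); note that scikit-learn parameterizes the Gaussian kernel by gamma, corresponding to 1/(2σ^2), so a large σ means a small gamma, and the data here is made up purely for illustration:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A toy non-linear problem, just to exercise the two knobs.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Large C and large gamma (small sigma): low bias, high variance (overfitting risk).
# Small C and small gamma (large sigma): smoother boundary, underfitting risk.
for C, gamma in [(0.1, 0.1), (1.0, 1.0), (100.0, 10.0)]:
    clf = SVC(kernel='rbf', C=C, gamma=gamma).fit(X_train, y_train)
    print(C, gamma, clf.score(X_train, y_train), clf.score(X_test, y_test))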
Using a linear kernel in an SVM (i.e., no kernel at all) is generally suitable when the feature dimension n of the input samples is large and the number of training samples m is small. Conversely, when n is small and m is large, a non-linear kernel is usually used, most commonly the Gaussian kernel.
Of course, other kernels can be used, but they must satisfy Mercer's theorem; otherwise the optimization in the SVM library may fail to converge. Common kernels include the polynomial kernel, whose expression is (x' * l + constant)^degree and which has two parameters to choose; the string kernel (mostly used for text processing); the chi-square kernel; and the histogram intersection kernel. However, according to Ng himself, he has essentially only used the Gaussian kernel in his own work and almost never any other kernel, because for ordinary classification problems different kernels do not differ much (special problems may be different).
Finally, note that the training sample data must be normalized (feature-scaled) before using an SVM.
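For example, with scikit-learn (one common way to do it; a sketch, not the only option):

from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Scale each feature to zero mean and unit variance before the SVM,
# so no single feature dominates the kernel's distance computation.
X, y = load_iris(return_X_y=True)
model = make_pipeline(StandardScaler(), SVC(kernel='rbf')).fit(X, y)
print(model.score(X, y))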