-s parameter selection in Liblinear: Primal and Dual
- In short: Liblinear's solvers fall into two families, those that solve the primal problem (with the TRON trust-region Newton method) and those that solve the dual problem (with coordinate descent). Both are efficient, but each has its own sweet spot: with few samples and very high-dimensional, sparse features (e.g. text classification), the dual solvers are usually preferable, while with many samples and moderate dimensionality the primal solvers are. The reposted article below, and the summary at its end, explain the reasoning.
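For a concrete starting point, here is a minimal sketch of that choice using scikit-learn's LinearSVC, which wraps liblinear; the use of scikit-learn (rather than liblinear's own train command) and the toy data are my assumptions, not part of the original post.

```python
# Minimal sketch (assumes scikit-learn's liblinear wrapper, not the liblinear CLI).
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Toy data: few samples, many features, as in text classification.
X, y = make_classification(n_samples=500, n_features=5000, n_informative=20,
                           random_state=0)

# dual=True  -> solve the dual with coordinate descent (good when samples << features).
clf_dual = LinearSVC(dual=True, C=1.0, max_iter=10000).fit(X, y)

# dual=False -> solve the primal with TRON (good when samples >> features).
clf_primal = LinearSVC(dual=False, C=1.0, max_iter=10000).fit(X, y)
```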
The following is a more detailed discussion of Liblinear and LIBSVM:
Reposted from: http://blog.chinaunix.net/uid-20761674-id-4840100.html
LIBSVM and Liblinear
Model and optimization
LIBSVM and Liblinear each provide a variety of models for the user to choose from, and different models suit different scenarios. The models provided by the two packages are described below.
LIBSVM
The following excerpt from the LIBSVM help text lists the five models that LIBSVM supports. Types 0 and 1 are SVM classification models, type 2 is the one-class classifier (only one class of samples needs to be labeled), and types 3 and 4 are SVM regression models.
    -s svm_type : set type of SVM (default 0)
        0 -- C-SVC (multi-class classification)
        1 -- nu-SVC (multi-class classification)
        2 -- one-class SVM
        3 -- epsilon-SVR (regression)
        4 -- nu-SVR (regression)
First, look at the most basic C-SVC model. Its training problem can be written as the following optimization objective (the derivation is not covered in detail here):
$$\min_{w,b,\xi}\ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \qquad \text{subject to}\quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ i = 1, \dots, l$$
When the model uses a linear kernel, i.e. φ(x) = x, the problem above is a standard convex quadratic optimization problem, and it is easy to take derivatives with respect to each variable. There are many fast optimization methods for problems of this kind, and these are the methods used in Liblinear. But once a kernel is introduced, the situation is very different: in many cases we can obtain neither the explicit form of φ nor the representation of the features in the kernel-induced space, so the solution strategy used for linear SVM no longer applies. To handle this, the standard SVM approach first transforms the primal problem into its dual, giving the following objective (for the derivation see any reference on SVM):
$$\min_{\alpha}\ f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \qquad \text{subject to}\quad 0 \le \alpha_i \le C,\ i = 1, \dots, l,\quad y^T \alpha = 0$$
Through this dualization, the objective becomes a quadratic form in the variable α. The most important object in the objective above is clearly the matrix Q, the kernel matrix over the training samples, with $Q_{ij} = y_i y_j \phi(x_i)^T \phi(x_j)$. On the one hand, the definition of a kernel function guarantees that Q is positive semi-definite; in other words, the objective above is a convex function, so optimization is guaranteed to converge to the global optimum, which is one of the important advantages of SVM. On the other hand, a problem comes with it: with the commonly used kernel functions, almost any pair of vectors produces a nonzero kernel value, which means Q is a very dense matrix. Once the training set is large enough, storing and computing Q becomes a serious problem, and this is the biggest challenge in SVM optimization.
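A rough back-of-the-envelope illustration (mine, not from the original article) of why a dense kernel matrix quickly becomes unmanageable:

```python
# Why storing the full, dense kernel matrix Q (l x l) is infeasible for large l.
import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    # K_ij = exp(-gamma * ||x_i - x_j||^2): nonzero for every pair of samples.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-gamma * d2)

l, d = 2000, 50
K = rbf_kernel_matrix(np.random.randn(l, d))
print(K.shape, K.nbytes / 1e6, "MB")   # ~32 MB at l=2000; l=10^6 would need ~8 TB
```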
Because Q is too large, it is hard to optimize all of α at once. The common approach is therefore to break Q into pieces, select a part of Q at a time, and update only the components of α related to that part. The most famous such algorithm is SMO, proposed by John C. Platt, and LIBSVM's optimization procedure is based on SMO. Each SMO iteration chooses the smallest possible optimization unit: all other components of α are fixed, and only two of them are selected for optimization. The reason one component is not enough is the constraint $y^T \alpha = 0$: at least two coordinates of α must change together for an update to remain feasible. Since the purpose of this article is to introduce LIBSVM, the details of SMO are not discussed further; the concrete implementation is described in detail in LIBSVM's official paper. The key issues are summarized below:
- Working set selection, i.e. choosing which components of α to optimize at each step
- The stopping criterion for the iterations
- The α update rule, i.e. how the small subproblem at each step is solved
- Shrinking, i.e. temporarily removing components of α that already satisfy the optimality conditions, to speed up convergence
- Caching: because Q is too large to store in full, the parts of it that are needed must be cached.
None of the questions above is easy to handle, though as a user you may not need to be familiar with every detail. I think the two points most worth recognizing are: 1) The SVM objective looks like a standard optimization problem, but actually solving it is far more complicated. To get acceptable training speed, the algorithm itself must be refined and a lot of engineering work must be done. Simply following the textbook method, or directly calling a generic optimization toolkit, yields at most a demo; writing an SVM tool that is genuinely efficient, stable and able to handle large-scale data is not easy, so using LIBSVM is much simpler than implementing the algorithm yourself. 2) SVM needs all this optimization precisely because the problem itself is expensive to compute and store. Even with all these optimizations, training is still not cheap, so attention must be paid to procedural details to keep it efficient. Moreover, when the sample size is very large, sometimes one has to reluctantly give up the kernel altogether in order to make full use of the data.
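Several of the issues listed above surface directly as user-facing parameters. As an illustration only, here is how they appear in scikit-learn's SVC, which is built on LIBSVM (the wrapper and its parameter names are my assumption; the LIBSVM command-line tool exposes the same knobs under different flags):

```python
# The engineering issues above correspond to tunable parameters in LIBSVM wrappers.
from sklearn.svm import SVC

clf = SVC(
    C=1.0,
    kernel="rbf",
    tol=1e-3,         # stopping tolerance for the SMO iterations
    shrinking=True,   # shrink away alphas that appear to have converged
    cache_size=500,   # size (MB) of the cache for columns of the kernel matrix Q
)
```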
ν-SVM: besides the standard C-SVM, LIBSVM also supports ν-SVM. The algorithm and the applicable scenarios are essentially the same; the only difference is that the original parameter C is replaced by the parameter ν. C-SVM's parameter C ranges over (0, +∞), whereas the corresponding parameter ν of ν-SVM ranges over (0, 1], which makes ν easier to interpret and sometimes more convenient to set. But there is no intrinsic difference between ν-SVM and C-SVM: with appropriate parameter adjustment the two can achieve exactly the same effect. So when using LIBSVM for classification, either method is fine; just follow your own habit.
One-class SVM is another classification method supported by LIBSVM. As the name implies, with one-class SVM you only need to provide samples of a single class, and the algorithm learns a small enclosing surface (roughly a small sphere in feature space) that wraps all the training samples. One-class SVM looks tempting, because we often encounter problems where only one class of samples is available for learning a classifier. In practice, however, the positive samples we obtain often carry a large sampling bias, so the learned one-class classifier does not necessarily cover all positive cases; and for most problems there are still many ways to construct artificial negative samples. In my experience, an ordinary SVM usually works better than one-class SVM, and one-class SVM is not used much in real scenarios. It is therefore worth studying the problem more deeply before adopting this method.
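For completeness, a minimal sketch of fitting such a model through scikit-learn's OneClassSVM wrapper around LIBSVM; the synthetic data and parameter values are illustrative assumptions only:

```python
# One-class SVM: train on a single class, then flag test points as inliers/outliers.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.randn(200, 2)                           # only "positive" samples
oc = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(X_train)

X_test = np.vstack([rng.randn(5, 2),                  # similar to the training data
                    rng.uniform(4, 6, size=(5, 2))])  # far away from it
print(oc.predict(X_test))                             # +1 = inlier, -1 = outlier
```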
Finally, LIBSVM supports SVM-based regression models, i.e. SVR. As with the classification models, SVR comes in two flavors, ε-SVR and ν-SVR. The objective function of SVR differs slightly from that of the classification SVM: because in regression the prediction can deviate from the target value in either direction, SVR uses two slack variables to characterize the error bounds of the prediction. Apart from this difference, the basic idea and the optimization algorithm remain essentially the same.
In the LIBSVM implementation, all five models above, namely C-SVC, ν-SVC, one-class SVM, ε-SVR and ν-SVR, are eventually transformed into one more general optimization framework and then solved with the same strategy; this is the main functionality LIBSVM implements. In practice the most commonly used model is C-SVC, the most traditional SVM classification model.
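If you access LIBSVM through scikit-learn rather than the svm-train command line, the same five models map onto separate estimator classes; the mapping below is my own summary, not part of the original article:

```python
# Assumed mapping between LIBSVM's -s option and scikit-learn estimator classes.
from sklearn.svm import SVC, NuSVC, OneClassSVM, SVR, NuSVR

models = {
    "C-SVC (-s 0)":         SVC(C=1.0, kernel="rbf"),
    "nu-SVC (-s 1)":        NuSVC(nu=0.5, kernel="rbf"),
    "one-class SVM (-s 2)": OneClassSVM(nu=0.5, kernel="rbf"),
    "epsilon-SVR (-s 3)":   SVR(C=1.0, epsilon=0.1),
    "nu-SVR (-s 4)":        NuSVR(C=1.0, nu=0.5),
}
```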
Liblinear
Liblinear was developed after LIBSVM had already been popular for many years. The problem it solves is also simpler than LIBSVM's, and its advantages lie mainly in efficiency and scalability. These advantages exist because a linear SVM is much simpler to solve than a kernel SVM.
Looking at the dual problem above, the solution is constrained to a large extent by $y^T \alpha = 0$, which is why a pair of α components has to be selected for each update. If we trace where this constraint comes from, we find it is obtained by setting the derivative with respect to the model's constant term b to 0. In a linear model we can use a simple trick: augment x to [x, 1] and w to [w, b], so that the constant term no longer appears in the model. Of course, this trick can only be applied to linear models. Without the above constraint, the optimization objective becomes:
$$\min_{\alpha}\ f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \qquad \text{subject to}\quad 0 \le \alpha_i \le C,\ i = 1, \dots, l$$
Now only one component of α needs to be selected at a time; each round traverses all dimensions of α, and after multiple iterations the process converges. This kind of optimization algorithm is called coordinate descent. By exploiting the special structure of the linear function, the vector w can be maintained directly from α, which greatly improves the efficiency of the algorithm. The specific optimization algorithm is described in the paper "A Dual Coordinate Descent Method for Large-scale Linear SVM".
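To make the per-coordinate update concrete, here is a compressed sketch of the basic step described in that paper, written by me and deliberately simplified: hinge loss, no bias term, no shrinking, labels in {-1, +1}:

```python
# Dual coordinate descent for a linear L1-loss SVM (simplified sketch, no shrinking).
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, n_epochs=20):
    l, n = X.shape                         # y must be in {-1, +1}
    alpha = np.zeros(l)
    w = np.zeros(n)                        # w is maintained incrementally from alpha
    Qii = np.einsum("ij,ij->i", X, X)      # diagonal of Q: x_i . x_i
    for _ in range(n_epochs):
        for i in np.random.permutation(l):
            if Qii[i] == 0.0:
                continue
            G = y[i] * (w @ X[i]) - 1.0                        # partial derivative w.r.t. alpha_i
            new_ai = min(max(alpha[i] - G / Qii[i], 0.0), C)   # clip to the box [0, C]
            w += (new_ai - alpha[i]) * y[i] * X[i]             # cheap O(n) update of w
            alpha[i] = new_ai
    return w
```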
Changing the angle of view, the objective function of linear SVM can also be written in the following form:
$$\min_{w}\ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\left(0,\, 1 - y_i w^T x_i\right)$$
Abstracting a step further, a whole class of classification problems can be written in the following form:
$$\min_{w}\ \Omega(w) + C \sum_{i=1}^{l} \ell\left(y_i,\, w^T x_i\right)$$
Here ℓ is a loss function that measures the discrepancy between the predicted value and the target value. For the linear SVM above,
$$\ell\left(y_i, w^T x_i\right) = \max\left(0,\, 1 - y_i w^T x_i\right)$$
This is called the hinge loss.
In logistic regression, the loss function is defined as
$$\ell\left(y_i, w^T x_i\right) = \log\left(1 + e^{-y_i w^T x_i}\right)$$
Ω(w) is commonly called the regularizer. The most commonly used one is the ℓ2-norm used above, written $w^T w$ or $\|w\|_2^2$, i.e. the sum of squares of all elements of w. Besides the 2-norm, the 1-norm is also frequently used as a regularizer and brings some special effects (discussed later). A large number of supervised learning models can be written in the form loss function + regularizer, with the parameter C controlling the relative weight of the two in the final objective. The choice of loss function and regularizer, and the trade-off between them, is one of the important topics in machine learning.
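As a small numeric illustration of this loss-plus-regularizer template (my own sketch, not from the article), the three losses discussed in this section differ only in the data term:

```python
# Regularized empirical risk: Omega(w) + C * sum of per-sample losses.
import numpy as np

def regularized_risk(w, X, y, C=1.0, loss="hinge"):
    margins = y * (X @ w)                              # y in {-1, +1}
    if loss == "hinge":                                # linear SVM (L1-loss)
        data_term = np.maximum(0.0, 1.0 - margins).sum()
    elif loss == "squared_hinge":                      # L2-loss SVM
        data_term = (np.maximum(0.0, 1.0 - margins) ** 2).sum()
    elif loss == "logistic":                           # logistic regression
        data_term = np.log1p(np.exp(-margins)).sum()
    else:
        raise ValueError(loss)
    return 0.5 * (w @ w) + C * data_term               # Omega(w) = (1/2) w^T w, as in the SVM objective
```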
For problems of the above form there are many mature algorithms, such as gradient descent and Newton's method, and for very large sample sizes stochastic gradient methods can also be used for training. In general, because it uses second-derivative information, Newton's method needs fewer iterations than first-order methods. But Newton's method itself has many limitations in per-iteration cost and convergence, so it is rarely used directly; instead, improved variants built on it are used, the most common being BFGS and L-BFGS. In the Liblinear package, the author uses the Trust Region Newton (TRON) method, an improvement on the classical Newton method, which has been shown to train faster than L-BFGS.
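To show what "using second-derivative information" means here, the sketch below implements a plain Newton iteration for L2-regularized logistic regression. It is my own simplification, with no line search and no trust region; liblinear's TRON adds a trust region and conjugate-gradient inner solves on top of this idea.

```python
# Plain Newton iterations for L2-regularized logistic regression (illustrative only).
import numpy as np

def newton_logreg(X, y, C=1.0, n_iter=10):
    l, n = X.shape                                    # y in {-1, +1}
    w = np.zeros(n)
    for _ in range(n_iter):
        z = y * (X @ w)
        sigma = 1.0 / (1.0 + np.exp(-z))
        grad = w + C * (X.T @ ((sigma - 1.0) * y))    # gradient of 0.5 w^T w + C * sum log(1+e^{-z})
        D = sigma * (1.0 - sigma)                     # per-sample curvature
        H = np.eye(n) + C * (X.T * D) @ X             # Hessian: I + C * X^T diag(D) X
        w -= np.linalg.solve(H, grad)                 # full Newton step (no damping)
    return w
```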
Liblinear implements TRON-based training for the L2-loss SVM and for the logistic regression model. The L2-loss SVM is a variant of the standard SVM whose loss function becomes:
$$\ell\left(y_i, w^T x_i\right) = \left(\max\left(0,\, 1 - y_i w^T x_i\right)\right)^2$$
In terms of practical results, the L2-loss SVM does not differ much from the standard L1-loss SVM. Computationally, however, the derivative of the former has a simpler form, which makes gradient computation and optimization easier. Liblinear does not provide a Trust Region Newton implementation for the standard L1-loss SVM, on the one hand because differentiating the hinge loss directly requires a more involved case analysis, and on the other hand because the L2-loss SVM can usually serve as a direct replacement for the L1-loss SVM. Some other packages, such as SVMlin, do solve the primal problem of the L1-loss SVM, but the optimization algorithm used there is L-BFGS rather than TRON.
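In liblinear's command-line interface this choice is made through the -s option; if you use scikit-learn's wrapper instead, it appears as the loss parameter of LinearSVC, as in the sketch below (the wrapper and parameter names are an assumption of mine, not something stated in the article):

```python
# Choosing between L2-loss (squared hinge) and standard L1-loss (hinge) linear SVM.
from sklearn.svm import LinearSVC

l2_loss_primal = LinearSVC(loss="squared_hinge", dual=False)  # L2-loss SVM, primal, TRON
l1_loss_dual   = LinearSVC(loss="hinge", dual=True)           # L1-loss SVM, dual coordinate descent
```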
Summary
This article has introduced the optimization algorithms of LIBSVM and Liblinear; the application scenarios of the different algorithms are briefly summarized as follows:
- For all linear problems, use Liblinear instead of LIBSVM.
- Among LIBSVM's different algorithms, C-SVM and ν-SVM have no essential difference in either the model or the solution method; they are related by a parameter transformation, so choose whichever you are more used to.
- The optimization algorithms of Liblinear fall into two main categories: those that solve the primal problem and those that solve the dual problem. The primal problem is solved with the TRON optimization algorithm, and the dual problem with coordinate descent. Overall both are highly efficient, but each has its strengths. For scenarios with few samples but particularly high dimensionality, such as text classification, solving the dual is more suitable: because the sample size is small, computing the kernel matrix is cheap and the subsequent optimization is convenient. If the primal were solved instead, the high-dimensional feature matrix would have to be handled repeatedly when computing derivatives, and if the features are sparse many meaningless computations would be performed, hurting the efficiency of the optimization. Conversely, when the number of samples is very large and the feature dimension is not high, solving the dual is inconvenient because the kernel matrix becomes too large; in that case it is easier to solve the primal problem.
Multi-class classification
Both LIBSVM and Liblinear support multi-class classification. A multi-class problem is one in which there are more than 2 possible class labels, but the final prediction for each sample is exactly one category. For example, in the classic handwritten digit recognition problem, the input is an image and the output is one of the 10 digits 0-9.
LIBSVM and Liblinear both support this, but their implementations are completely different. LIBSVM adopts the one-vs-one strategy, training a classifier between every pair of classes, so with k classes it in principle needs to train k(k-1)/2 classifiers. (In practice LIBSVM also applies some optimization at this step to reduce redundant work.) LIBSVM therefore has to train many classifiers for a multi-class problem. However, recalling the optimization method described earlier, LIBSVM's training cost grows much faster than linearly with the number of samples, so the one-vs-one strategy keeps the sample size of each sub-problem small, which in fact makes training the whole model easier.
Liblinear, on the other hand, takes a different strategy: one-vs-all (one-vs-rest). Each class gets its own classifier, whose negative samples are all the samples from the other classes. This is more efficient than one-vs-one in Liblinear's setting, since liblinear can, and needs to, handle training sets of a much larger scale than LIBSVM. In addition, Liblinear implements a multi-class SVM algorithm based on the method of Crammer and Singer, which learns the classifiers for all classes within a single unified objective function.
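As a quick illustration (mine, assuming the scikit-learn wrappers rather than the original command-line tools), the three strategies mentioned in this section correspond to the following estimators and parameters:

```python
# One-vs-one (LIBSVM), one-vs-rest and Crammer-Singer (liblinear) multi-class strategies.
from sklearn.svm import SVC, LinearSVC

ovo = SVC(kernel="rbf")                        # LIBSVM-style: k(k-1)/2 pairwise classifiers
ovr = LinearSVC(multi_class="ovr")             # liblinear default: one classifier per class
cs  = LinearSVC(multi_class="crammer_singer")  # single joint objective over all classes
```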
Output file
In general, when using Liblinear or LIBSVM we can directly call the training and prediction functions and never need to touch the trained model files. But sometimes we may need to implement the prediction step on our own platform, and then parsing the model files becomes unavoidable.
Because Liblinear and LIBSVM train different models, their model file formats also differ. The format of Liblinear's training result is relatively simple, for example:
    solver_type L2R_L2LOSS_SVC_DUAL
    nr_class 3
    label 0 1 2
    nr_feature 5
    bias -1
    w
    -0.4021097293855418 0.1002472498884907 -0.1619908595357437
    0.008699468444669581 0.2310109611908343 -0.2295723940247394
    -0.6814324057724231 0.4263611607497726 -0.4190714505083906
    -0.1505088594898125 0.2709227166451816 -0.1929294695905781
    2.14656708009991 -0.007495770268046003 -0.1880325536062815
Here solver_type gives the solver used, and the block under w gives the learned model weights: each column corresponds to one class's classifier, and each row corresponds to one feature dimension. nr_class is the number of classes, nr_feature is the feature dimension, and bias is the model's bias, which can be set manually. The part that is easy to misunderstand is the label field: it records which user label each column of w corresponds to. In the model above, for example, the classifier for the user's label 0 corresponds to the first column of w. But such a correspondence cannot be taken for granted. In a binary classification scenario, for instance, you might label positive samples 1 and negative samples 0, yet during training Liblinear numbers the classes according to its own internal scheme, and the negative class may come first. In that case you must use the label line (e.g. "label 1 0") to match Liblinear's internal ordering to the real user labels. Later versions of Liblinear and LIBSVM added an optimization: in binary classification, if the labels are -1 and +1, the positive class is guaranteed to appear in the first column of w. But this mechanism is not completely reliable; for example, the Spark implementation of Liblinear does not implement this feature, which once caused me a lot of grief. So you need to be very careful here.
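Based on the layout shown above, a small parser might look like the sketch below (my own code, handling only the simple multi-column case illustrated here; real model files have more variants, e.g. binary models store a single weight column and bias >= 0 adds an extra row to w):

```python
# Minimal parser for a liblinear model file of the form shown above.
def load_liblinear_model(path):
    header, w_rows, in_w = {}, [], False
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if in_w:
                w_rows.append([float(v) for v in parts])
            elif parts[0] == "w":
                in_w = True
            else:
                header[parts[0]] = parts[1:]
    labels = [int(v) for v in header["label"]]   # column j of w belongs to labels[j]
    return header, labels, w_rows

# Decision value of a sparse sample x = {feature_index: value} for class column j
# (liblinear feature indices start at 1, hence the "idx - 1").
def decision_value(w_rows, x, j):
    return sum(w_rows[idx - 1][j] * val for idx, val in x.items())
```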
The LIBSVM training result format is more complex, for example:
    kernel_type rbf
    gamma 0.0769231
    nr_class 3
    total_sv 140
    rho -1.04496 0.315784 1.03037
    label 1 0 -1
    nr_sv 2 2 1
    SV
    0 1 1:0.583333 2:-1 3:0.333333 4:-0.603774 5:1 6:-1 7:1 8:0.358779 9:-1 10:-0.483871 12:-1 13:1
    0 0.6416468628860974 1:0.125 2:1 3:0.333333 4:-0.320755 5:-0.406393 6:1 7:1 8:0.0839695 9:1 10:-0.806452 12:-0.333333 13:0.5
    0 1 1:0.333333 2:1 3:-1 4:-0.245283 5:-0.506849 6:-1 7:-1 8:0.129771 9:-1 10:-0.16129 12:0.333333 13:-1
    0.2685466895842373 0 1:0.583333 2:1 3:1 4:-0.509434 5:-0.52968 6:-1 7:1 8:-0.114504 9:1 10:-0.16129 12:0.333333 13:1
    0 1 1:0.208333 2:1 3:0.333333 4:-0.660377 5:-0.525114 6:-1 7:1 8:0.435115 9:-1 10:-0.1935 12:-0.333333 13:1
The meaning of the parameters above is fairly direct. Note that what follows the SV line is the trained model parameters, stored in the form of support vectors. nr_sv gives the number of support vectors belonging to each class: for example "2 2 1" means the first two SV lines are support vectors from the class labeled 1, the next two from the class labeled 0, and the last one from the class labeled -1. As for each support vector line: since this model has three classes, each support vector may participate in two of the pairwise classifiers, so the first two columns are its α coefficients in those two classifiers, followed by the support vector itself in sparse index:value format.