Systematic Study of Machine Learning: SVM (III) -- LIBLINEAR and LIBSVM usage notes and summary

Source: Internet
Author: User
Tags: constant, svm, rbf kernel

1. Differences between LIBSVM and LIBLINEAR, with a brief source analysis

http://blog.csdn.net/zhzhl202/article/details/7438160

http://blog.csdn.net/zhzhl202/article/details/7438313
LIBSVM is a library that integrates support vector classification (C-SVC, nu-SVC), regression, and distribution estimation (one-class SVM), and supports multi-class classification.

LIBLINEAR is a linear classifier designed mainly for data with millions of samples and features.

Both are used for classification. LIBSVM has the wider range of applications, while LIBLINEAR is designed mainly for the training process on large data volumes. In which cases should you choose LIBLINEAR instead of LIBSVM? The author gives several suggestions:
1. When you face massive amounts of data, typically on the order of millions. "Massive" applies on two levels: the number of samples and the number of features.
2. When training the model with a linear mapping and with a nonlinear mapping gives similar results.
3. When the time efficiency of model training matters.

In such cases, it is recommended to use LIBLINEAR instead of LIBSVM.

2. Using LIBLINEAR (Java version)

http://www.cnblogs.com/tec-vegetables/p/4046437.html

3. Using LIBLINEAR (translation of the official documentation)

http://blog.csdn.net/zouxy09/article/details/10947323/

http://blog.csdn.net/zouxy09/article/details/10947411

4. The following is a well-written article, reproduced from: http://blog.chinaunix.net/uid-20761674-id-4840097.html


For the past ten-plus years, support vector machines (SVMs) have been among the most influential algorithms in machine learning. Among the various implementations of the SVM algorithm, LIBSVM, the toolkit developed by Professor Chih-Jen Lin of National Taiwan University, is undoubtedly the most influential. In 2011, LIBSVM's system paper "LIBSVM: A Library for Support Vector Machines" was published in the journal ACM TIST (ACM Transactions on Intelligent Systems and Technology). At that time the journal's impact factor was below 1, but by 2014 its impact factor had reached 9.39, far surpassing even TPAMI. The biggest contribution was of course the LIBSVM paper itself: on Google Scholar this article has nearly 20,000 citations, which is genuinely astonishing. On reflection it is not surprising: in all kinds of research work, whenever classification is involved, most people either use the SVM algorithm or compare against it, and LIBSVM is usually the tool of choice. And not only in academia: LIBSVM also has very wide application in industry. This is due on the one hand to the stability and efficiency of its implementation, and on the other hand to the rich interfaces and flexible usage modes that LIBSVM provides. Some of the best-known machine learning toolkits, such as the Java-based Weka and the Python-based scikit-learn, provide SVM algorithms that are implemented on top of LIBSVM.

A few years ago I attended a machine learning lecture given by Eric Xing of CMU. During the question session, while everyone was discussing rather advanced theoretical issues, a girl suddenly raised her hand and asked the presenter which tool he would recommend for using SVM. The question seemed a bit off-topic at the time, and Eric Xing replied that at CMU they never used other people's libraries; all algorithm code was implemented by themselves. I remember almost nothing else from that lecture, but that question and answer left the deepest impression. Indeed, if you want to truly learn an algorithm, implementing it yourself is of great benefit both to theoretical understanding and to engineering skill. But for most people, a self-implemented SVM will still fall well short of LIBSVM in efficiency and scalability; after all, that code has been thoroughly tempered in production. If the goal is simply to use SVM, a good tool that everyone recognizes is a good choice. However, having a good tool does not mean we can hand everything over to it without understanding the algorithm and its implementation. On the contrary, if we understand some of the background of the algorithms and implementations, we can use these tools better. In addition, LIBSVM and its sister tool LIBLINEAR provide a wealth of optimization and parameter options, and choosing the right ones can greatly improve productivity. So below I discuss some important questions about LIBSVM and LIBLINEAR that I have encountered in my own work, in the hope of helping readers use these tools better.

The relationship between LIBSVM and LIBLINEAR

This is a question that easily confuses people who have just started using SVM. Simply put, LIBSVM is a complete implementation of the SVM model family: users can train nonlinear classifiers with the kernel functions provided in LIBSVM, or fall back to the basic linear SVM. LIBLINEAR is a toolkit designed specifically for linear classification scenarios; besides linear SVM it also supports models such as linear logistic regression, but it cannot build nonlinear classifiers through kernel functions. Because it supports kernel functions, LIBSVM in principle has stronger classification ability than LIBLINEAR and can handle more complicated problems. However, many people use only LIBSVM, training and predicting even the simplest linear classifiers with it, which is not advisable. LIBLINEAR is designed precisely to improve the efficiency of linear classification, and its optimization algorithms differ fundamentally from those in LIBSVM. Although LIBSVM and LIBLINEAR achieve similar results on linear classification, LIBLINEAR is much more efficient than LIBSVM in both training and prediction. In addition, limited by its algorithm, LIBSVM becomes quite slow once the sample size grows large; another order of magnitude beyond that and an ordinary machine can no longer handle it. With LIBLINEAR there is no such worry: even millions of samples are handled easily, because LIBLINEAR is designed precisely for model training on large-scale samples.
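To make this division of labor concrete, here is a minimal sketch using scikit-learn, whose SVC class wraps LIBSVM and whose LinearSVC class wraps LIBLINEAR (as noted above). The synthetic dataset and parameter values are placeholders, not recommendations.

```python
# Minimal sketch: the same linear classification task solved with the
# LIBSVM-backed SVC and the LIBLINEAR-backed LinearSVC (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# LIBSVM backend: supports kernels, but training slows sharply as samples grow.
libsvm_model = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)

# LIBLINEAR backend: linear models only, but far faster on large data.
liblinear_model = LinearSVC(C=1.0).fit(X_tr, y_tr)

print("SVC (LIBSVM) accuracy:      ", libsvm_model.score(X_te, y_te))
print("LinearSVC (LIBLINEAR) accuracy:", liblinear_model.score(X_te, y_te))
```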

Although I had figured out the main difference between the two, I was still puzzled when I first encountered these tools, wondering why two such closely related methods evolved into two separate toolkits. What puzzled me more is that LIBSVM was released as early as 2000, while LIBLINEAR did not release its first version until 2007. Common sense suggests that a simple tool should come first and then gradually be improved, yet the more powerful LIBSVM was released long before LIBLINEAR. To answer this question, we have to start from the history of machine learning and SVM.


Early machine learning classification algorithms can be traced back to the perceptron. The basic idea of the perceptron is similar to logistic regression, except that a linear classifier is trained by online learning. Much of the data used for machine learning research from the 1980s to the early 1990s can still be seen in the UCI datasets. Many of these problems are quite complex, such as image or speech recognition. On the other hand, constrained by the data acquisition, computation, and storage capabilities of the time, these datasets were typically small, with only thousands or even hundreds of samples. For such relatively complex problems, classifying the raw features directly with a simple linear classifier certainly could not give good results. At this point a milestone in machine learning appeared: the multilayer neural network. Multilayer neural networks introduce hidden layers, which greatly enhance the expressive power of the model and allow all kinds of complex classifiers to be trained. However, neural networks also have a fatal weakness: because of the nature of the model they overfit very easily, especially when training samples are scarce. SVM, when it emerged, addressed this problem nicely. On the one hand, the objective function of SVM is convex, which guarantees a globally optimal solution and avoids the trouble of neural network optimization frequently falling into local optima. On the other hand, behind SVM there is a theory of structural risk minimization: given the training samples and training parameters, bounds on the model's error on real data can in principle be computed. In SVM, the trade-off between model variance and training error can be controlled by adjusting parameters and choosing the sample size. In addition, SVM can construct nonlinear classifiers by defining different kernel functions, obtaining classification ability roughly equivalent to neural network methods, and thus adapting to different problems. Therefore, from the end of the last century onward, SVM swept through all kinds of classification applications and became the most popular machine learning algorithm.

However, SVM also has limitations. First, the kernel-based SVM solver is relatively complex: it needs to store a dense kernel matrix over the samples, and when the sample size is very large the storage cost is considerable. To this day there is no truly effective parallel training method for kernel SVM that fundamentally improves its training. More than ten years ago, when sample sizes were at most in the tens of thousands, this was not a big issue. But with the explosion of the Internet, the training set for any model can now easily reach hundreds of millions of samples, and SVM's weakness on big-data training becomes obvious. Second, the power of SVM comes mainly from the introduction of nonlinear kernel functions. But new problems keep arising, involving different domain knowledge and business scenarios, and relying on only a few generic kernel functions often cannot solve them. SVM depends heavily on the kernel function, and kernel functions have many limitations; their flexibility is certainly inferior to manually constructed features. On the other hand, as data volumes grow, even if the samples are not directly labeled for supervised training, many machine learning methods can automatically learn features from large numbers of samples: for example, the manifold learning of earlier years, topic models for text, and sparse coding and dictionary learning for images. The features learned by these nonlinear methods are often high-level semantic representations of the samples; with sufficient data, a relatively simple linear classifier on top of them can already achieve good results. At this point the main requirement became a classifier able to handle large enough samples, while the classifier itself can remain simple. This is when LIBLINEAR was born.

In the era when LIBSVM was born, the nonlinear kernel model was one of SVM's main advantages, and sample size was not yet a bottleneck. Therefore the whole framework of LIBSVM is designed around training kernel SVM models. But if you only need to train a linear SVM model, the algorithm can be much simpler and far more efficient. LIBLINEAR therefore adopts new training algorithms to support linear SVM and logistic regression while keeping essentially the same interface and calling conventions. In many later talks, Chih-Jen Lin, the author of LIBSVM and LIBLINEAR, has vigorously promoted LIBLINEAR and given many practical examples demonstrating that hand-crafted features plus a linear model can match or even exceed the performance of kernel SVM, while greatly reducing training time and resources.

In fact, the situation has changed again in recent years. The approach of constructed features plus a linear classifier has hit bottlenecks on many problems. At the same time, the amount of available data has grown much larger, and computing power has increased rapidly. Neural networks, long pressed down by SVM, have regained vitality. Compared with SVM, the advantage of neural network models is that flexible classifiers of all kinds can be designed by controlling the number of layers and the type of function in each layer. Moreover, neural network optimization algorithms parallelize more readily than kernel SVM. The main obstacles that once held back neural networks were limited computing resources and overfitting caused by small sample sizes, and both restrictions have almost disappeared. GPU-based parallel computing is now mature and supports high-speed parallel training. Overfitting arises when the training sample is far smaller than the true population and therefore fails to reflect the actual data distribution; but if we take essentially all the available samples as training data, the training set is already close to the real sample set, and overfitting in that sense largely disappears. Although neural networks still have theoretical shortcomings, these shortcomings matter much less thanks to the increase in computing power and data. For these reasons, the hot spot of machine learning has in recent years shifted back to neural networks.

In 1995, Vapnik, the inventor of SVM, made two bets with Larry Jackel, his boss at Bell Labs, with Yann LeCun, now famous for deep learning and then also at Bell Labs, as the witness. The bets concerned the state of neural networks and learning theory around the year 2000 (the original post reproduced an image of the written bets, which is not included here). Regardless of who won or lost, one can feel how quickly this field changes. When neural networks were at their height, it was hard to imagine that an algorithm like SVM would come along and overwhelm them; and when SVM was preeminent, most people would not have believed that neural networks would one day strike back.

Looking carefully at each algorithmic innovation, the real driving force is always the needs of concrete problems and the technical conditions of the time. It therefore does not make much sense to compare algorithms in isolation from the application scenario. For the user, the algorithm that best fits the problem at hand is the best algorithm. Specifically for LIBSVM and LIBLINEAR, I try to summarize the following principles:

For any scenario where you have decided to use a linear classifier, always use LIBLINEAR rather than LIBSVM.
If the sample size is large, say 100,000 or more, LIBSVM becomes very hard to use. If a linear classifier does not work well enough, the practical options are hand-crafted features plus LIBLINEAR, or other classifiers such as neural networks and random forests.
For high-dimensional sparse data, such as the typical vector-space representation of text, a linear classifier is generally used (see the sketch after this list).
For problems where neither the sample size nor the dimensionality is too large, and prediction efficiency is not critical, LIBSVM can be used to try a kernel SVM classifier; in many cases a kernel SVM achieves higher accuracy than a plain linear SVM.
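As an illustration of the third principle (high-dimensional sparse text with a linear classifier), here is a minimal sketch assuming scikit-learn; the toy corpus, labels, and C value are placeholders.

```python
# Sketch: bag-of-words text features fed to the LIBLINEAR-backed LinearSVC.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["cheap meds buy now", "meeting agenda attached",
        "win money fast", "quarterly report draft"]
labels = [1, 0, 1, 0]                      # 1 = spam, 0 = normal mail (toy labels)

vec = TfidfVectorizer()
X = vec.fit_transform(docs)                # sparse, high-dimensional matrix
clf = LinearSVC(C=1.0).fit(X, labels)      # LIBLINEAR handles sparse input directly

print(clf.predict(vec.transform(["free money now"])))
```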


Model and optimization

LIBSVM and LIBLINEAR provide a variety of models for users to choose from, and different models have their own applicable scenarios. The models provided by each are described below.

LIBSVM

The following excerpt from the LIBSVM help shows the five model types supported by LIBSVM. Types 0 and 1 correspond to SVM classification models, type 2 is the one-class classifier (only one class of samples needs to be provided), and types 3 and 4 correspond to SVM regression models.

    -s svm_type : set type of SVM (default 0)
        0 -- C-SVC (multi-class classification)
        1 -- nu-SVC (multi-class classification)
        2 -- one-class SVM
        3 -- epsilon-SVR (regression)
        4 -- nu-SVR (regression)
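For readers who reach these models through scikit-learn rather than the command line, the five -s options correspond roughly to the following estimator classes. This mapping is offered only as an orientation, since default parameters differ between the two interfaces.

```python
# Rough correspondence between LIBSVM's -s options and scikit-learn's
# LIBSVM-backed estimators (defaults differ between the two interfaces).
from sklearn.svm import SVC, NuSVC, OneClassSVM, SVR, NuSVR

models = {
    0: SVC(),          # -s 0 : C-SVC          (classification)
    1: NuSVC(),        # -s 1 : nu-SVC         (classification)
    2: OneClassSVM(),  # -s 2 : one-class SVM  (distribution estimation)
    3: SVR(),          # -s 3 : epsilon-SVR    (regression)
    4: NuSVR(),        # -s 4 : nu-SVR         (regression)
}
```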

First, look at the most basic C-SVC model. It can be written as the following optimization objective (the derivation is not described in detail here):

$$\min_{w,b,\xi} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i\bigl(w^T \phi(x_i) + b\bigr) \ge 1 - \xi_i, \quad \xi_i \ge 0, \ i = 1, \dots, l$$

When the model uses a linear kernel, that is φ(x) = x, the problem above is a standard convex quadratic optimization problem, and derivatives with respect to each variable can be computed conveniently. There are many fast optimization methods for this kind of problem, and they are used in LIBLINEAR. But once a kernel is introduced, the situation is very different, because in many cases we can obtain neither the explicit form of the feature mapping φ nor the representation of the features in the kernel-induced space. The solution strategy used for linear SVM then no longer works. To solve the problem, the standard SVM approach is to first convert the primal problem into its dual, giving the following objective (the derivation can be found in any material on SVM):

$$\min_{\alpha} \ f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \dots, l, \quad y^T \alpha = 0$$

Through the dual transformation, the objective above becomes a quadratic form in the variable α. The most important object in it is the matrix Q, the kernel matrix over the training samples, with Q_{ij} = φ(x_i)^T φ(x_j). On the one hand, the definition of a kernel function guarantees that Q is positive semidefinite, so the objective above is also convex, the solution found by the optimization is guaranteed to be the global optimum, and this is one of the important advantages of SVM. On the other hand, a problem comes with it: with the commonly used kernel functions, any two vectors almost always produce a nonzero kernel value. This means Q is a very dense matrix, and once there are enough training samples, storing and computing Q becomes a big problem. This is the biggest challenge in SVM optimization.
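To get a feel for the storage problem, the sketch below materializes a dense RBF kernel matrix with NumPy. The sizes are placeholders; the point is only that memory grows quadratically with the number of samples l.

```python
# Sketch: why the kernel (Q) matrix becomes a storage problem.
import numpy as np

l, d, gamma = 2000, 100, 0.1            # small placeholder sizes
X = np.random.rand(l, d)

# Pairwise squared Euclidean distances, then the dense l x l RBF kernel matrix.
sq = (X ** 2).sum(axis=1)
K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

print("dense kernel matrix for l=%d: %.0f MB" % (l, K.nbytes / 1e6))
# The same matrix for l = 1,000,000 samples would need on the order of 8 TB,
# which is why LIBSVM caches only part of Q and why LIBLINEAR avoids Q entirely.
```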

Because the matrix Q is too large, it is difficult to optimize over the whole α at once. The usual approach is therefore to break Q into pieces, select one part of Q at a time, and update only the α values related to that part. The most famous such algorithm is the SMO algorithm proposed by John C. Platt, and LIBSVM's optimization is based on SMO. In each iteration, SMO chooses the smallest possible optimization unit: it fixes all other α values and selects only two of them to optimize. The reason two are chosen rather than one is the constraint y^T α = 0; at least two coordinates of α must change together for an update to remain feasible. The main purpose of this article is to introduce LIBSVM, so the details of SMO are not discussed here. The concrete implementation in LIBSVM is described in detail in the official LIBSVM paper; the key issues are summarized below:
- Working set selection, i.e. choosing the part of α to optimize in each iteration
- Setting the stopping condition of the iterations
- The α update algorithm, i.e. how to solve the small subproblem at each step
- Shrinking, i.e. removing α values that already satisfy the optimality conditions, to speed up convergence
- Caching: since the Q matrix is too large to store fully, the parts that are needed must be cached

None of the issues above is easy to handle. As a user you may not need to be familiar with all the details, but I think two points are worth recognizing. (1) The objective function of SVM looks like a standard optimization problem, but the actual solution is much more complicated. Making the solver fast requires both algorithmic optimization and engineering improvement. If you simply follow the textbook method, or directly call a generic optimization toolkit to implement SVM, you will at best get a demo. Writing an SVM tool that is truly efficient, stable, and able to handle large-scale data is not easy, so using LIBSVM is much simpler than implementing the algorithm yourself. (2) The reason SVM needs all this optimization is that the problem itself is expensive to compute and store. Even with so much optimization, the efficiency of the algorithm is still limited, so attention to procedural details is needed to improve running efficiency. And when the sample size is too large, sometimes, in order to make full use of the data, we have to reluctantly abandon the kernel.

Besides the standard C-SVM, LIBSVM also provides several other SVM methods. The ν-SVM has essentially the same algorithm and application scenarios as C-SVM; the only difference is that the parameter C is replaced by the parameter ν. The parameter C of C-SVM is adjusted over [0, +∞), while the corresponding parameter ν of ν-SVM ranges over (0, 1]. This setting makes ν-SVM more interpretable and sometimes more convenient for parameter selection. But there is no essential difference between ν-SVM and C-SVM: by adjusting the parameters, the two can achieve exactly the same effect. So when using LIBSVM for classification, either method is fine; just follow your own habit.

One-class SVM is another classification method supported by LIBSVM. As the name implies, with one-class SVM you only need to provide samples from a single class, and the algorithm learns a small enclosing boundary (a small hypersphere) that wraps all the training samples. One-class SVM looks tempting, because we often encounter situations where only one class of samples is available for learning a classifier. But in practice, on the one hand, the positive samples we obtain often carry substantial sampling bias, so the learned one-class classifier does not necessarily cover all positive cases; on the other hand, for most problems there are still many ways to construct artificial negative samples. In my experience, an ordinary SVM usually works better than one-class SVM, and one-class SVM is not used much in real scenarios. So study the problem more deeply before choosing this method.

Finally, LIBSVM supports SVM-based regression models, i.e. SVR. Similar to the classification models, SVR comes in two flavors, ε-SVR and ν-SVR. The objective function of SVR differs slightly from that of SVM classification: since in regression the prediction may deviate from the target value on either side, SVR uses two slack variables to characterize the upper and lower bounds of the prediction error. Despite this difference, the basic ideas and optimization algorithms of the two remain essentially the same.

In the LIBSVM implementation, the five models above, namely C-SVM, ν-SVM, one-class SVM, ε-SVR, and ν-SVR, are ultimately transformed into one more general optimization framework and then solved with the same strategy; that is the main work LIBSVM implements. In actual use, the most common method is C-SVM, the most traditional SVM classification model.

LIBLINEAR

LIBLINEAR was developed after LIBSVM had already been popular for many years. The problem it solves is simpler than LIBSVM's, and its main advantages are efficiency and scalability. This advantage exists because linear SVM is much simpler to solve than kernel SVM.

From the dual problem above, the difficulty of solving SVM comes to a large extent from the constraint y^T α = 0, which is why a set of (at least two) α values must be selected for optimization at each step. If we trace the source of this constraint, we find it is obtained by setting the derivative with respect to the model's constant term b to 0. In a linear model we can use a simple trick, letting x = [x, 1] and w = [w, b], so that the constant term no longer appears in the model. Of course, this trick can only be applied to linear models. Without the above constraint, the optimization objective becomes:

$$\min_{\alpha} \ f(\alpha) = \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to} \quad 0 \le \alpha_i \le C, \ i = 1, \dots, l$$

Now only one α_i at a time needs to be selected for optimization; each round traverses all coordinates of α, and after multiple iterations the process converges. Such an optimization algorithm is called coordinate descent. By exploiting the special structure of the linear model, the vector w can be maintained directly from α, which greatly improves the efficiency of the algorithm. The concrete optimization algorithm can be found in the paper "A Dual Coordinate Descent Method for Large-scale Linear SVM".
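A heavily simplified sketch of this dual coordinate descent idea is given below (no shrinking, no stopping criterion other than a fixed number of epochs, random placeholder data). It follows the update rule described in the cited paper, but it is not LIBLINEAR's actual implementation.

```python
# Sketch of dual coordinate descent for the linear SVM dual above:
# keep w = sum_i alpha_i * y_i * x_i in sync and update one alpha_i at a time.
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, n_epochs=20):
    l, d = X.shape
    alpha = np.zeros(l)
    w = np.zeros(d)                          # maintained from alpha
    Qii = (X ** 2).sum(axis=1)               # diagonal of Q for the linear kernel
    for _ in range(n_epochs):
        for i in np.random.permutation(l):
            G = y[i] * (w @ X[i]) - 1.0      # partial derivative of f(alpha) in coordinate i
            new_alpha = min(max(alpha[i] - G / Qii[i], 0.0), C)  # clip to [0, C]
            w += (new_alpha - alpha[i]) * y[i] * X[i]            # cheap update of w
            alpha[i] = new_alpha
    return w

X = np.random.randn(200, 5)
y = np.sign(X[:, 0] + 0.1 * np.random.randn(200))
w = dual_cd_linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w) == y))
```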

Changing perspective, the objective function of linear SVM can also be written in the following form:

$$\min_{w} \ \frac{1}{2} w^T w + C \sum_{i=1}^{l} \max\bigl(0,\ 1 - y_i w^T x_i\bigr)$$

Abstracting further, this whole class of classification problems can be written in the form:

$$\min_{w} \ \Omega(w) + C \sum_{i=1}^{l} \ell\bigl(y_i,\ w^T x_i\bigr)$$

Here ℓ is the loss function, which measures the discrepancy between the predicted value and the target value. In the linear SVM case above,

$$\ell(y_i, w^T x_i) = \max\bigl(0,\ 1 - y_i w^T x_i\bigr)$$

which is called the hinge loss.

In logistic regression, the loss function is defined as

$$\ell(y_i, w^T x_i) = \log\bigl(1 + e^{-y_i w^T x_i}\bigr)$$

Ω(w) is usually called the regularizer. The most commonly used one is the L2-norm seen earlier, written w^T w or ∥w∥²₂, i.e. the sum of squares of all elements of the vector w. Besides the L2-norm, the L1-norm is also often used as a regularizer and brings some special properties (discussed later). A large number of supervised learning models can be written in the form loss function + regularizer, with the parameter C controlling the relative weight of the two in the final objective. Choosing the loss function, the regularizer, and the balance between them is one of the most important topics in machine learning.
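A small sketch of this loss-plus-regularizer template follows, evaluating the hinge loss and the logistic loss under an L2 regularizer; the data, weights, and the 1/2 factor in front of the regularizer are illustrative choices.

```python
# Sketch: the common "regularizer + C * sum of losses" objective,
# instantiated with the hinge loss (linear SVM) and the logistic loss.
import numpy as np

def hinge_loss(y, score):
    return np.maximum(0.0, 1.0 - y * score)

def logistic_loss(y, score):
    return np.log1p(np.exp(-y * score))

def objective(w, X, y, C, loss):
    return 0.5 * w @ w + C * loss(y, X @ w).sum()   # Omega(w) = (1/2)||w||^2

X = np.random.randn(100, 4)
y = np.sign(np.random.randn(100))
w = np.random.randn(4)
print("hinge-loss objective   :", objective(w, X, y, 1.0, hinge_loss))
print("logistic-loss objective:", objective(w, X, y, 1.0, logistic_loss))
```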

For the problem above there are many mature algorithms, such as gradient descent and Newton's method; for very large sample sizes, stochastic gradient methods can also be used. In general, because it uses second-derivative information, Newton's method converges in fewer iterations than first-order methods. But Newton's method itself has many limitations in computational cost and convergence, so it is seldom used directly; instead, improvements built on it are used, the most common being BFGS and L-BFGS. In the LIBLINEAR package specifically, the author uses the trust region Newton (TRON) method to improve the traditional Newton method, which has been shown to train more efficiently than L-BFGS.

LIBLINEAR implements TRON-based training for the L2-loss SVM and logistic regression models. The L2-loss SVM is a variant of the standard SVM whose loss function becomes:

$$\ell(y_i, w^T x_i) = \bigl(\max(0,\ 1 - y_i w^T x_i)\bigr)^2$$

In terms of actual effect, the L2-loss SVM is not very different from the standard L1-loss (hinge loss) SVM. But computationally, the derivative of the former has a simpler form, which makes gradient computation and optimization easier. LIBLINEAR does not implement a trust region Newton solver for the primal of the standard L1-loss SVM, partly because differentiating the hinge loss directly requires a more involved case analysis, and partly because the L2-loss SVM can directly substitute for the L1-loss SVM. Some other packages, such as SVMlin, do solve the primal problem of the L1-loss SVM, but the optimization algorithm used there is L-BFGS rather than TRON.

Summary

This section introduced the optimization algorithms of LIBSVM and LIBLINEAR; the application scenarios of the different algorithms are briefly summarized below:
- For all linear problems, use LIBLINEAR instead of LIBSVM.
- Among the different algorithms in LIBSVM, C-SVM and nu-SVM have no essential difference in model or solution; they are related by a parameter transformation, so choose whichever you are used to.
- The optimization algorithms in LIBLINEAR fall into two main categories: those that solve the primal problem and those that solve the dual problem. The TRON algorithm is used for the primal, and coordinate descent for the dual. Overall both are highly efficient, but each has scenarios where it excels. For problems with few samples but very high dimensionality, such as text classification, solving the dual is usually more suitable: with a small sample count the Gram matrix is not large and the subsequent optimization is convenient. If the primal were solved instead, the high-dimensional feature matrix would have to be used repeatedly when computing derivatives, and if the features are sparse, many meaningless computations would be performed, hurting optimization efficiency. Conversely, when the number of samples is very large and the feature dimension is not high, solving the dual is inconvenient because the Gram matrix becomes too large, and it is easier to solve the primal.

Multi-class classification

Both LIBSVM and LIBLINEAR support multi-class classification. In a multi-class problem the category label of a sample can take more than two values, but the final prediction is still a single category. For example, in the classic handwritten digit recognition problem, the input is an image and the output is one of the ten digits 0-9.

LIBSVM and LIBLINEAR implement this in completely different ways. LIBSVM adopts the one-vs-one strategy, i.e. it trains a classifier between every pair of classes. If there are k classes, in theory k(k-1)/2 classifiers need to be trained; in practice LIBSVM applies some optimization at this step, exploiting the existing relationships between classes to reduce the number of classifiers somewhat. Still, LIBSVM has to train many classifiers for a multi-class problem. On the other hand, given the optimization method described earlier, LIBSVM's training complexity grows faster than linearly with the number of samples, so the one-vs-one strategy keeps the sample size of each sub-problem moderate, which actually makes training the whole model easier.

LIBLINEAR, however, adopts another strategy, one-vs-all: each class gets one classifier, whose negative samples are all the samples of the other classes. This is more efficient than one-vs-one, which matters because LIBLINEAR both can and needs to handle training sets of a much larger scale than LIBSVM. In addition, LIBLINEAR implements an SVM multi-class algorithm based on the method of Crammer and Singer, which learns the classifiers for all classes within a single unified objective function.
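The classifier counts implied by the two strategies, k(k-1)/2 for one-vs-one and k for one-vs-rest, can be checked directly with scikit-learn's generic wrappers, as in this hedged sketch (base estimator and dataset are placeholders).

```python
# Sketch: one-vs-one (LIBSVM's strategy) vs one-vs-rest (LIBLINEAR's default)
# on a k = 4 class toy problem.
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, n_classes=4,
                           n_informative=5, random_state=0)

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

print("one-vs-one  classifiers:", len(ovo.estimators_))   # k(k-1)/2 = 6
print("one-vs-rest classifiers:", len(ovr.estimators_))   # k = 4
```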

Output file

In general, when using LIBLINEAR or LIBSVM we can call the provided training and prediction functions directly and never need to touch the model files produced by training. But sometimes we need to implement the prediction step on our own platform, and then parsing the model file becomes unavoidable.

Because LIBLINEAR and LIBSVM train different models, their model file formats also differ. The format of a LIBLINEAR training result is relatively simple, for example:

    solver_type L2R_L2LOSS_SVC_DUAL
    nr_class 3
    label 0 1 2
    nr_feature 5
    bias -1
    w
    -0.4021097293855418 0.1002472498884907 -0.1619908595357437
    0.008699468444669581 0.2310109611908343 -0.2295723940247394
    -0.6814324057724231 0.4263611607497726 -0.4190714505083906
    -0.1505088594898125 0.2709227166451816 -0.1929294695905781
    2.14656708009991 -0.007495770268046003 -0.1880325536062815

Here solver_type indicates the solver used, and the rows under w are the learned model weights: each column corresponds to one class's classifier and each row corresponds to one feature dimension. nr_class is the number of classes, nr_feature is the feature dimensionality, and bias is the model's bias term, which can be set manually. The field that is easy to misunderstand is label: it records which user-specified label each column of w corresponds to. In the model above, for instance, the classifier for the user's label 0 corresponds to the first column of w. But such a correspondence is not guaranteed in general. For example, in a binary classification task with positive samples labeled 1 and negative samples labeled 0, LIBLINEAR trains according to its own internal numbering, and the negative class may come first and the positive class second. In that case the mapping between LIBLINEAR's internal numbering and the real user labels must be recovered from the label line. Later versions of LIBLINEAR and LIBSVM added an optimization: in binary classification, if the positive and negative labels are +1 and -1, the positive class is guaranteed to appear in the first column of w. But this mechanism is not completely reliable; for example, the Spark implementation of LIBLINEAR does not implement this feature, and I was once badly bitten by it. So be very careful here.
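If you do need to parse such a model file yourself, something along the lines of the sketch below is usually enough. It assumes the plain-text layout shown in the example above (header fields, then a "w" line followed by nr_feature rows); the function name and file path are hypothetical.

```python
# Sketch: parsing a LIBLINEAR model file of the form shown above.
def load_liblinear_model(path):
    header, weights = {}, []
    with open(path) as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue
            if tokens[0] == "w":                  # everything after "w" is the weight matrix
                for wline in f:
                    weights.append([float(v) for v in wline.split()])
                break
            header[tokens[0]] = tokens[1:]        # e.g. nr_class, label, nr_feature, bias
    labels = [int(v) for v in header["label"]]
    return header, labels, weights                # weights[j][k]: feature j, class labels[k]

# Usage (path is a placeholder):
# header, labels, W = load_liblinear_model("model.txt")
# print(header["solver_type"], labels, len(W), "feature rows")
```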

The LIBSVM training result format is more complex, for example:

    kernel_type rbf
    gamma 0.0769231
    nr_class 3
    total_sv 5
    rho -1.04496 0.315784 1.03037
    label 1 0 -1
    nr_sv 2 2 1
    SV
    0 1 1:0.583333 2:-1 3:0.333333 4:-0.603774 5:1 6:-1 7:1 8:0.358779 9:-1 10:-0.483871 12:-1 13:1
    0 0.6416468628860974 1:0.125 2:1 3:0.333333 4:-0.320755 5:-0.406393 6:1 7:1 8:0.0839695 9:1 10:-0.806452 12:-0.333333 13:0.5
    0 1 1:0.333333 2:1 3:-1 4:-0.245283 5:-0.506849 6:-1 7:-1 8:0.129771 9:-1 10:-0.16129 12:0.333333 13:-1
    0.2685466895842373 0 1:0.583333 2:1 3:1 4:-0.509434 5:-0.52968 6:-1 7:1 8:-0.114504 9:1 10:-0.16129 12:0.333333 13:1
    0 1 1:0.208333 2:1 3:0.333333 4:-0.660377 5:-0.525114 6:-1 7:1 8:0.435115 9:-1 10:-0.193548 12:-0.333333 13:1

The meaning of the parameters above is fairly direct. Note that what follows SV are the trained model parameters, stored as support vectors. nr_sv gives the number of support vectors for each class: "2 2 1" means that the first two rows are support vectors of the class labeled 1, the next two rows are support vectors of the class labeled 0, and the last row is the support vector of the class labeled -1. In each support vector row, since this model has three classes, each support vector may participate in two binary classifiers, so the first two columns are the α coefficients of this support vector in those two classifiers, followed by the support vector itself in sparse index:value format.

Parameter tuning

The LIBSVM and LIBLINEAR toolkits contain many parameters that need tuning, and tuning requires both patience and some skill. It also requires understanding what the parameters mean and how they affect the model. Below we discuss some of the parameters with the greatest influence on the model.

Parameter C

The parameter C appears in the solution of both LIBLINEAR and LIBSVM. All the models mentioned earlier can be written in the unified form:



$$\min_{w} \ \Omega(w) + C \sum_{i=1}^{l} \ell\bigl(y_i,\ w^T \phi(x_i)\bigr) \qquad (1)$$

The term on the right is the model's loss, whose size indicates how well the classifier fits the training samples. The term on the left is an artificially added penalty, independent of the training samples, called the regularization term (regularizer); it reflects additional constraints imposed on the trained model. The parameter C adjusts the weight between the two: the larger C is, the more the model is required to fit the training data; the smaller C is, the more the model is required to satisfy the regularization constraint. Taking LIBLINEAR as an example, we first discuss the L2-norm regularization case:



$$\min_{w} \ \|w\|_2^2 + C \sum_{i=1}^{l} \ell\bigl(y_i,\ w^T x_i\bigr) \qquad (2)$$

The reason for adding regularization is that, when designing the model, we do not have full confidence in the quality of the samples or in the generalization ability of the model: without additional constraints, the trained model would accommodate the existing training data too closely and fail to perform equally well on new data. We therefore add some prior knowledge to the model, such as the L2-norm constraint on w above. If the loss function in the formula above corresponds to a regression problem, the resulting problem is called ridge regression.

We can also understand the L2-norm regularization term from a different angle. If learning the classification function is regarded as estimating the parameter w, then the objective function without regularization corresponds to maximum likelihood estimation of w. To bring the estimate of w closer to the truth, we can impose a prior distribution on w based on experience. If we assume the prior of w is a multivariate Gaussian, that different dimensions are uncorrelated (the off-diagonal entries of the covariance matrix are 0), and that every dimension has the same fixed variance, then the maximum a posteriori estimate is exactly the regularized objective above, and C is related to the variance of the prior on w. A larger C means weaker regularization, a wider admissible range for w, and a larger prior variance; a smaller C means stronger regularization, a smaller range for w, and a smaller prior variance. By decreasing C we keep the fluctuations of w from becoming too large, so that the model is not dragged around by a few data points and does not overfit.
There is also a more intuitive explanation: by the KKT conditions, the regularized objective above can be converted into the constrained form:



$$\min_{w} \ \sum_{i=1}^{l} \ell\bigl(y_i,\ w^T x_i\bigr) \quad \text{s.t.} \quad \|w\|_2^2 \le s^2 \qquad (3)$$

The size of w is limited by the parameter s, and s is positively correlated with C. Therefore, when C is small, w is also confined to a very small range. The figure referenced below gives a very intuitive illustration:

[Figure: the L2-norm constraint region (a circle) and the contours of the loss function]

Because of the restriction on w, there are two cases. The first is when s is not large enough: if we searched along the gradient descent direction for the global optimum of the loss, we would end up outside the circle and violate the constraint. In that case we can only find the best solution that satisfies the constraint; by the KKT conditions, this optimum must lie where a contour line of the objective is tangent to the circle, as shown on the left of the figure. If the contour map is viewed as the contours of a mountain, such an optimum sits on a convex part of a contour, something like a ridge on a hillside, which is also the origin of the name ridge regression. The other case is when s is large enough that the global optimum lies inside the region delineated by s; then the constraint effectively plays no role, as shown on the right of the figure.

Therefore, during parameter tuning, if the amount of data is small, or we have limited confidence in its quality, we should decrease C and increase the importance of the prior; conversely, we can increase C and let the data itself play a larger role. In the optimization process, the larger C is, the larger the range over which w must be searched and the higher the computational cost. As the analysis above shows, once C has been increased to a certain point the model can already reach the global optimum, and increasing C further no longer improves performance. The practical recommendation for LIBLINEAR is therefore to increase C from small to large; once increasing C no longer changes the results, C has no further effect on the model and there is no need to keep tuning it. Choosing a smaller C also improves the convergence speed of training.
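The advice to increase C from small to large until it stops mattering can be scripted directly; below is a hedged sketch using cross-validated accuracy as the stopping signal, with the C grid and tolerance as placeholders.

```python
# Sketch: sweep C from small to large and stop once the cross-validated
# score no longer improves noticeably (grid and tolerance are placeholders).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

best_c, best_score = None, -1.0
for C in [0.01, 0.1, 1, 10, 100]:
    score = cross_val_score(LinearSVC(C=C), X, y, cv=5).mean()
    print("C=%-6g cv accuracy=%.4f" % (C, score))
    if score <= best_score + 1e-3:        # no meaningful gain: keep the smaller C
        break
    best_c, best_score = C, score

print("chosen C:", best_c)
```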

Use of the L1-norm

Besides the L2-norm regularization discussed above, LIBLINEAR also provides an L1-norm option. The L1-norm is usually written ∥w∥₁, the sum of the absolute values of all elements of w. Like the L2-norm, the L1-norm constrains the size of w so that no dimension of w becomes too large and causes overfitting; in statistics this property is referred to as shrinkage. In addition, the L1-norm has another very useful property: it makes the learned w sparse, i.e. containing many zero entries, and the number of zero entries can be controlled through C. As C decreases, the number of nonzero entries of w decreases, and as C approaches 0, w can become entirely zero, because then there is no pressure at all to fit the loss.

Why does the L1-norm have this property? That is a long story, perhaps for another post. In short, it is determined by the special nature of the L1-norm. The absolute value function underlying it is continuous but not differentiable everywhere: at 0 the left and right derivatives differ. The subgradient is therefore used to describe the derivative at this special point; there the derivative is not a single value but the whole range between the left and right derivatives. Because the derivative at 0 is so flexible, it is easy for the optimum to land exactly at such points during solving, which makes the learned w as sparse as possible.

The linear regression model with L1 regularization is also called the Lasso. The basic motivation for this penalty is the belief that only a few features are truly relevant to the classification result while most are irrelevant. There is some truth in this assumption, but in terms of classification accuracy the L1 regularizer has no absolute advantage over the L2 regularizer: although L2 does not directly remove the features irrelevant to classification, the learned model gives those features very small weights, so they do not make much difference anyway. One real benefit of the L1-norm, however, is that it can serve as a feature selection tool, picking out from a high-dimensional feature space only the few features most relevant to classification. This is very helpful when dealing with very high-dimensional data or real-time prediction, as it can greatly reduce storage and improve computational efficiency.
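A quick way to observe the sparsity effect is to train the same linear model with L2 and L1 penalties and count the nonzero weights; the sketch below does this through scikit-learn's LIBLINEAR-backed LinearSVC on placeholder data.

```python
# Sketch: L2 vs L1 regularization and the sparsity of the learned w.
# With penalty="l1", LinearSVC requires dual=False (it solves the primal problem).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=200,
                           n_informative=10, random_state=0)

w_l2 = LinearSVC(penalty="l2", C=1.0).fit(X, y).coef_.ravel()
w_l1 = LinearSVC(penalty="l1", dual=False, C=1.0).fit(X, y).coef_.ravel()

print("nonzero weights with L2:", np.count_nonzero(np.abs(w_l2) > 1e-6))
print("nonzero weights with L1:", np.count_nonzero(np.abs(w_l1) > 1e-6))
```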

Kernel-related parameters

If you use LIBSVM, parameter tuning is more complicated. The first issue is the choice of kernel. In general the RBF kernel is recommended as the first choice; if the RBF kernel cannot be tuned to a satisfactory result, switching to the polynomial or linear kernel will rarely help. In some special scenarios you may consider a custom kernel, which LIBSVM also supports.

RBF stands for radial basis function. The RBF generally used in SVM is the Gaussian RBF, i.e. what we usually call the Gaussian kernel. Given two points x1 and x2, the Gaussian RBF is defined as:



$$K(x_1, x_2) = \exp\bigl(-\gamma \|x_1 - x_2\|^2\bigr) \qquad (4)$$

The Gaussian kernel is thus a transformation of the Euclidean distance between two points, controlled by the parameter γ. How should we understand the meaning of the Gaussian kernel and the role of γ? Expanding the kernel gives:



$$
\begin{aligned}
K(x_1, x_2) &= \exp\bigl(-\gamma \|x_1 - x_2\|^2\bigr) \\
&= \exp\bigl(-\gamma (\|x_1\|^2 - 2\, x_1^T x_2 + \|x_2\|^2)\bigr) \\
&= \exp\bigl(-\gamma \|x_1\|^2\bigr)\, \exp\bigl(-\gamma \|x_2\|^2\bigr)\, \exp\bigl(2\gamma\, x_1^T x_2\bigr) \\
&= \exp\bigl(-\gamma \|x_1\|^2\bigr)\, \exp\bigl(-\gamma \|x_2\|^2\bigr) \sum_{n=0}^{\infty} \frac{(2\gamma\, x_1^T x_2)^n}{n!} \qquad (5)
\end{aligned}
$$

When γ is small, or in the extreme tends to 0, we have ∑_{n=0}^∞ (2γ x₁ᵀx₂)ⁿ/n! ≈ 1 + 2γ x₁ᵀx₂, because the terms with n > 1 are much smaller than the n = 1 term. In this regime the RBF kernel behaves essentially like the linear kernel.

Conversely, when γ increases, the terms with n > 1 also come into play; the basic idea is then similar to the polynomial kernel, except that the RBF kernel raises the polynomial from finite degree to infinite degree. When γ grows without bound, K(x1, x2) approaches 0 unless ∥x1 - x2∥² is 0. In other words, as γ approaches infinity, every data point has kernel value 0 with every point other than itself. In this case the trained model simply keeps all the points as support vectors, and accuracy on the training data can reach 100%. Such a model can be seen as a special case of kNN with k = 1.

The above covers the two extreme cases. When γ is vanishingly small, the RBF-kernel SVM behaves like a linear SVM, so the model complexity (or VC dimension) is low and overfitting is unlikely. When γ grows without bound, all points become support vectors, the model complexity (VC dimension) is highest, and overfitting is most likely. In general γ takes an intermediate value, combining both behaviors: compared with the linear model, more support vectors can be chosen, increasing the model's complexity and fitting ability; compared with the 1-NN model, complexity is reduced and the risk of overfitting is avoided. It also follows from the discussion above that, by adjusting its parameters, the RBF kernel can achieve nearly the same effect as the linear and polynomial kernels. Therefore, if computational efficiency is not a concern, tuning only the RBF model's parameters is enough to reach the optimum.

The discussion above assumed the parameter C was fixed. In practice, however, C and γ must be tuned simultaneously, which further increases the complexity of parameter selection. There is a good deal of theoretical analysis of the relationship between the two; it is not discussed here, but see the paper "Asymptotic Behaviors of Support Vector Machines with Gaussian Kernel".

In practical applications we can use cross-validation to search over the parameters and choose the best combination. LIBSVM also provides the grid.py tool for SVM parameter selection. For ordinary applications it is enough to set the range of each parameter, traverse the parameter combinations on the training data, and pick the best one.
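grid.py performs exactly this kind of exhaustive search from the command line; the same idea expressed with scikit-learn looks roughly like the sketch below, where the parameter grid values are placeholders.

```python
# Sketch: cross-validated grid search over C and gamma for an RBF-kernel SVM,
# the same idea as LIBSVM's grid.py (grid values are placeholders).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)

print("best parameters:", search.best_params_)
print("best cv accuracy: %.4f" % search.best_score_)
```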

Summary

Finally, a summary of parameter tuning and kernel selection:

- Linear models: choose LIBLINEAR and mainly tune the parameter C, from small to large; once increasing C barely changes the results, there is no need to continue, and a smaller C is preferable.
- Kernel selection: if a kernel is needed, for ordinary problems try the RBF kernel first. LIBSVM also provides polynomial and tanh (sigmoid) kernel functions, but they have some limitations; in general the RBF kernel is the most convenient to use and gives the most stable results.
- LIBSVM parameter tuning: with the RBF kernel, both C and γ must be tuned, which makes the problem more complicated; the best approach is to traverse the parameter combinations automatically, for example with grid.py.




