Comparison of SVM and LR

Source: Internet
Author: User

Both methods are common classification algorithms. From the point of view of the objective function, the difference is that logistic regression (LR) uses the logistic loss while SVM uses the hinge loss. Both loss functions aim to give more weight to the data points that matter most for the classification and less weight to the points that matter little. SVM does this by considering only the support vectors, the few points most relevant to the classification, when learning the classifier. Logistic regression, through a nonlinear mapping, greatly reduces the weight of points far from the classification plane and so relatively increases the weight of the points most relevant to the classification. The fundamental purpose of the two is the same. In addition, either method can take different regularization terms as needed, such as L1 or L2. As a result, in many experiments the two algorithms give very similar results.
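To make the two loss functions concrete, here is a minimal sketch that evaluates each one as a function of the signed margin m = y * f(x), with labels y in {-1, +1}. The function names are illustrative and not taken from any library.

```python
import math

def logistic_loss(margin):
    """Log-loss for a signed margin m = y * f(x), with y in {-1, +1}."""
    # log(1 + exp(-m)), written so that exp() never receives a large positive argument
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def hinge_loss(margin):
    """Hinge loss: exactly zero once the point is beyond the margin (m >= 1)."""
    return max(0.0, 1.0 - margin)

for m in (-2.0, 0.0, 1.0, 3.0):
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.4f}  logistic={logistic_loss(m):.4f}")
```

The hinge loss is exactly zero for every point with margin at least 1, while the logistic loss stays positive and only shrinks toward zero.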

However, logistic regression is relatively simple, easy to understand, and easy to implement, and it is especially convenient for large-scale linear classification. SVM is comparatively harder to understand and to optimize. On the other hand, the theoretical basis of SVM is more solid: it rests on the theory of structural risk minimization, although most practitioners do not pay attention to this. It is also important that, once the SVM is converted to its dual problem, classification only requires a computation against a small number of support vectors (their distance, or kernel value, to the new point). This is a clear advantage when the kernel function is expensive to compute, and it can greatly simplify the model and reduce the amount of computation.
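As a rough illustration of the dual form, the sketch below evaluates a kernel SVM decision function using only the support vectors. The support vectors, coefficients, bias, and `gamma` are made-up values standing in for a trained model, not the output of a real solver.

```python
import math

def gaussian_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def decision_function(x, support_vectors, dual_coefs, bias, kernel=gaussian_kernel):
    """f(x) = sum_i (alpha_i * y_i) * K(sv_i, x) + b, summed over support vectors only."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias

# Hypothetical trained model: two support vectors with coefficients alpha_i * y_i.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
dual_coefs = [-1.0, 1.0]
bias = 0.0

score = decision_function((1.8, 2.1), support_vectors, dual_coefs, bias)
label = 1 if score >= 0 else -1
```

The cost of one prediction is proportional to the number of support vectors, not to the size of the training set.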

From Andrew Ng's class:
1. If the number of features is very large, comparable to the number of samples, use LR or an SVM with a linear kernel.
2. If the number of features is small and the number of samples is moderate (neither large nor small), use an SVM with a Gaussian kernel.
3. If the number of features is small and the number of samples is large, manually add features so that the problem becomes the first case.
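The three rules can be sketched as a small helper. The function name and the numeric thresholds (`many_features`, `many_samples`) are assumptions chosen for illustration; the rules themselves only speak of "large", "small", and "moderate".

```python
def suggest_model(n_features, n_samples, many_features=10_000, many_samples=50_000):
    """Rule-of-thumb model choice; the thresholds are illustrative assumptions."""
    # Case 1: many features, comparable to (or exceeding) the number of samples.
    if n_features >= many_features or n_features >= n_samples:
        return "LR or linear-kernel SVM"
    # Case 3: few features but a very large number of samples.
    if n_samples >= many_samples:
        return "add features, then LR or linear-kernel SVM"
    # Case 2: few features, moderate number of samples.
    return "SVM with Gaussian kernel"
```

For example, `suggest_model(100, 5_000)` falls into the second case, while `suggest_model(100, 1_000_000)` falls into the third.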

What are the similarities and differences between Linear SVM and LR?

Both are linear classifiers: the model solves for a hyperplane (assuming a two-class problem). The rest of this section discusses the differences.

Intuitively, a linear SVM trades off two quantities:
1) A large margin, that is, a gap as wide as possible between the two classes. Say that positive samples should lie at least gap/2 to the left of the decision plane (call this the positive boundary) and negative samples at least gap/2 to the right of the decision plane (call this the negative boundary).
2) An L1 error penalty: every point that does not satisfy the condition above is penalized in proportion to its violation (an L1 penalty).
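This trade-off is usually written as a single objective: a term that is small when the gap is wide, plus a penalty coefficient C times the total violation. The sketch below evaluates that objective for a given hyperplane, assuming labels in {-1, +1} and the usual scaling in which the two boundaries sit at scores +1 and -1 (so the gap is 2/||w||). The data and parameters are made up.

```python
def svm_objective(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i + b)), with y_i in {-1, +1}."""
    margin_term = 0.5 * sum(wj * wj for wj in w)       # small when the gap 2/||w|| is wide
    total_violation = 0.0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        total_violation += max(0.0, 1.0 - yi * score)  # zero for points outside the gap
    return margin_term + C * total_violation

# Hypothetical data: the third point lies inside the gap and is penalized.
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, 1]
print(svm_objective((1.0, 0.0), 0.0, X, y, C=1.0))
```

A larger C makes violations more expensive relative to a wide gap.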

It can be seen that, given a dataset, once the linear SVM has been solved, all data points fall into two types:
1) Points that fall outside the corresponding boundary and are correctly classified, such as a positive sample to the left of the positive boundary or a negative sample to the right of the negative boundary.
2) Points that fall inside the gap or are misclassified.
Assuming a dataset has already been solved by a linear SVM, adding points of the first type to the dataset, or removing them from it, does not change the linear SVM hyperplane. This is what distinguishes it from LR, which we look at next.


It is worth noting that when an LR model is solved, every data point has an effect on the classification plane, and the influence of a point decays exponentially with its distance from the classification plane. In other words, the LR solution is affected by the distribution of the data itself. In practice, if the data is high-dimensional, LR is usually combined with L1 regularization of the parameters.
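One way to see the contrast between the two models is to look at how strongly a single point pulls on the solution, measured by the magnitude of the loss gradient with respect to that point's score. This is a minimal sketch with illustrative function names, using the signed margin m = y * f(x).

```python
import math

def hinge_weight(margin):
    """Magnitude of the hinge-loss subgradient w.r.t. the score: 1 inside the margin, else 0."""
    return 1.0 if margin < 1.0 else 0.0

def logistic_weight(margin):
    """Magnitude of the log-loss gradient w.r.t. the score: 1 / (1 + exp(margin))."""
    return 1.0 / (1.0 + math.exp(margin))

for m in (0.5, 1.0, 2.0, 5.0, 10.0):
    print(f"margin={m:4.1f}  hinge={hinge_weight(m):.1f}  logistic={logistic_weight(m):.6f}")
```

For large margins the logistic weight behaves like exp(-margin): it is never exactly zero, but it shrinks exponentially, whereas the hinge weight drops to exactly zero for points outside the gap.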

If one has to name the essential difference, it is that the two models are sensitive to the data and to the parameters in different ways. Linear SVM depends more on the penalty coefficient and on the distance measure of the space in which the data is expressed, while LR (with a regularization term) depends more on the coefficient of the L1 regularization applied to the parameters. Because both are linear classifiers, their capacity to overfit low-dimensional data is fairly limited. On high-dimensional data, by comparison, LR tends to be more stable. Why?

Because a linear SVM relies on the distance measure of the data representation when it computes the "width" of the margin. If this measure is poor (badly scaled, which is especially noticeable in high-dimensional data), the resulting large margin is meaningless, and the problem cannot be fully avoided even with the kernel trick (for example, with a Gaussian kernel). So before using a linear SVM it is generally necessary to normalize the data first, whereas for LR (without regularization) normalization is not necessary, or the result is not sensitive to it.
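One common form of normalization is standardization, which rescales each feature to zero mean and unit standard deviation so that no single feature dominates the distance computation. The sketch below is a plain-Python illustration on made-up data; `standardize` is a hypothetical helper name, not a library function.

```python
import math

def standardize(X):
    """Rescale each column to zero mean and unit standard deviation."""
    n = len(X)
    columns = list(zip(*X))
    means = [sum(col) / n for col in columns]
    # A constant column has zero spread; divide by 1.0 instead to avoid dividing by zero.
    stds = [math.sqrt(sum((v - mu) ** 2 for v in col) / n) or 1.0
            for col, mu in zip(columns, means)]
    return [[(v - mu) / sd for v, mu, sd in zip(row, means, stds)] for row in X]

# Hypothetical data: the second feature is on a much larger scale than the first.
X = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
X_scaled = standardize(X)
```

After scaling, both features span the same range. In a real pipeline the means and standard deviations would be computed on the training data only and reused for new data.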


In the NLP world, LR goes by another name, the maximum entropy model. I will not take the time to explain this here; interested readers can see, for example,
http://www.win-vector.com/dfiles/Logisticregressionmaxent.pdf
If you understand the essence of the maximum entropy model, it should be easy to see that LR does not depend on the distance measure of the data.

To summarize:

    • Linear SVM and LR are both linear classifiers.

    • Linear SVM does not depend directly on the data distribution, and its classification plane is not affected by points of the first type (correctly classified points outside the gap). LR is affected by all data points, so if the classes are strongly imbalanced, the data generally needs to be balanced first.

    • Linear SVM relies on the distance measure of the data representation, so the data needs to be normalized first; LR is not affected by it.

    • Linear SVM relies on the penalty coefficient, which needs to be chosen by validation in experiments.

    • The performance of both Linear SVM and LR is affected by outliers; as for which of the two is more sensitive, it is hard to draw a clear conclusion.


Note: For LR without regularization, the purpose of normalization is to make it easier to choose initial values for the optimization; it does not mean that the performance of the final solution depends on normalization. Seen through the maximum entropy model, the optimization objective is independent of the distance measure, and its linear constraints can be rescaled (both sides of each equation can be multiplied by a factor), so normalization only makes it easier to choose initial values when solving the optimization problem. Beginners easily confuse building a model with solving it.
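The rescaling argument can be checked numerically: if a feature is multiplied by a factor and the corresponding weight is divided by the same factor, the predicted probability is unchanged, so every unregularized LR solution on the original data has an equivalent one on the rescaled data. The parameters below are made up for illustration.

```python
import math

def predict_proba(w, b, x):
    """Logistic regression probability P(y = 1 | x) = sigmoid(w . x + b)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

w, b = [0.8, -0.002], 0.1      # hypothetical fitted parameters
x = [1.5, 400.0]
scale = [1.0, 0.001]           # rescale the second feature

x_rescaled = [xj * s for xj, s in zip(x, scale)]
w_rescaled = [wj / s for wj, s in zip(w, scale)]

p_original = predict_proba(w, b, x)
p_rescaled = predict_proba(w_rescaled, b, x_rescaled)
```

With a regularization term this equivalence breaks, because the penalty on the rescaled weights changes; that is why the note is restricted to LR without regularization.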
Note 2: Looking at the performance of Linear SVM and LR on the UCI datasets, Linear SVM is slightly better than LR on small datasets, but the difference is not particularly large. In addition, the computational complexity of Linear SVM is limited by the data volume, so LR is more widely used on massive data. See "Do we need hundreds of classifiers to solve real world classification problems?"

In fact, the two classifiers are very similar: both try to maximize the distance between the two classes of points. LR, however, takes all points into account in the model, whereas SVM looks only at the support vectors, the points closest to the classification plane. So the advantage of SVM is that, by ignoring points that are already classified correctly, the trained model is more robust and less sensitive to outliers.

In terms of the loss function, LR uses log-loss and SVM uses hinge loss. The two are similar in that both assign a large loss to misclassified points. For correctly classified points beyond the margin, however, hinge loss is exactly zero, while log-loss still takes them into account. For badly misclassified points, log-loss, like hinge loss, grows roughly linearly with the distance on the wrong side of the boundary, so occasional mislabeled points affect both models; LR may be somewhat more affected because every point, however far from the boundary, contributes to its solution.

Another point is that predicting probabilities with an SVM makes little sense, because this family of models is not based on probability. LR is based on the log-likelihood ratio (the log-odds), so it readily gives probabilities and extends more directly to the multi-class case. (It is not that SVM cannot do multi-class, but the objective function becomes messy; in practice one-vs-all is generally used directly.)
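A minimal sketch of one-vs-all prediction: one binary scorer per class, and the class with the highest score wins. The three linear scorers below are made-up stand-ins for trained binary models (SVM or LR).

```python
def one_vs_all_predict(x, scorers):
    """Pick the class whose binary scorer gives the highest score for x."""
    return max(scorers, key=lambda label: scorers[label](x))

# Hypothetical linear scorers (w . x + b), one per class.
scorers = {
    "a": lambda x: 1.0 * x[0] - 1.0 * x[1],
    "b": lambda x: -1.0 * x[0] + 1.0 * x[1],
    "c": lambda x: -1.0 * x[0] - 1.0 * x[1] + 1.0,
}

print(one_vs_all_predict((2.0, 0.0), scorers))
```

Training then reduces to fitting one ordinary binary classifier per class, each separating that class from all the others.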

Regularization makes no difference here: both L1 and L2 can be used with either model, with similar effect. For class imbalance, SVM generally handles it with class weights, while LR, because it predicts probabilities, lets you adjust the final result directly by choosing a different threshold to get the desired effect.
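Both approaches to class imbalance can be sketched in a few lines. The probabilities are made-up LR outputs, and the weighting formula (number of samples divided by number of classes times class count) is one common heuristic, assumed here for illustration.

```python
def classify(probabilities, threshold=0.5):
    """Turn LR probabilities into 0/1 labels using an adjustable decision threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), a common heuristic."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n, k = len(labels), len(counts)
    return {label: n / (k * count) for label, count in counts.items()}

probs = [0.05, 0.20, 0.35, 0.70]                  # hypothetical LR outputs
default_labels = classify(probs)                  # threshold 0.5
lowered_labels = classify(probs, threshold=0.3)   # favor the rare positive class
weights = balanced_class_weights([0] * 8 + [1] * 2)
```

Lowering the threshold flags more points as positive without retraining, while class weights change the training objective itself by making errors on the rare class more expensive.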

In practice, LR is noticeably faster, and when the dimension is low it has small bias and does not overfit easily. Kernel SVM, by contrast, is largely impractical on large datasets, but SVM generally performs better when the dataset itself is small and the dimension is high.
