Solving overfitting (I): regularization

Source: Internet
Author: User
I. Overfitting

A supervised machine learning problem boils down to "minimize your error while regularizing your parameters": minimize the training error while constraining the parameters. Minimizing the error makes the model fit the training data; regularizing the parameters keeps the model from overfitting that data, because too many (or too large) parameters drive up model complexity, making overfitting easy and the training error deceptively small.
Overfitting means fitting the training set too perfectly: the model loses generality and can no longer predict new samples effectively. This is also called high variance. It is typically caused by having too many features or an overly complex model function. But a small training error is not the ultimate goal; the goal is a small test error, i.e., accurate predictions on new samples.
Therefore, beyond minimizing the training error, we need to keep the model "simple" so that the learned parameters generalize well (that is, so the test error is small).
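To make the train/test gap concrete, here is an illustrative sketch using only NumPy (the sine target, sample sizes, and polynomial degrees are my own choices, not from the original article): a high-degree polynomial interpolates the noisy training points almost perfectly, yet its error on held-out points is much larger than its training error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of an underlying sine curve.
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
x_test = np.linspace(0.02, 0.98, 50)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 50)

def fit_and_errors(degree):
    # Least-squares polynomial fit of the given degree.
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

simple_train, simple_test = fit_and_errors(3)
complex_train, complex_test = fit_and_errors(9)

# Degree 9 on 10 points interpolates: near-zero training error,
# but the test error stays far above it -- the signature of overfitting.
```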

The usual ways to address overfitting are:
1. Reduce the number of features:
   - manually filter features, or
   - use a feature-selection algorithm.
2. Regularization: keep all the features, but make each parameter θj as small as possible.
Regularization is especially useful when there are many feature variables, each contributing only a small amount to the target value.

II. Regularization

Regularization is a technique for combating overfitting in machine learning by introducing additional information, usually in the form of a penalty on model complexity.
Regularization keeps the model simple: the regularization term constrains the model's characteristics. One theoretical view is that regularization implements Occam's razor: among all models that explain the data, prefer the simplest one. From a Bayesian point of view, regularization introduces a prior distribution on the model parameters; for example, the L2 penalty can be derived as the negative logarithm of a zero-mean Gaussian prior on the weights w (see PRML). In this way, prior knowledge can be incorporated into learning, forcing the learned model to have desired properties such as sparsity, low rank, or smoothness. From the standpoint of structural versus empirical risk, Li Hang's "Statistical Learning Methods" describes regularization as implementing the strategy of structural risk minimization: adding a regularizer (penalty term) to the empirical risk.
For supervised learning problems where a small number of samples leaves us with more features than samples, the matrix XᵀX is singular (degenerate, non-invertible), so the normal equation cannot be used to solve for θ. Fortunately, regularization also solves this problem: as long as the regularization parameter λ is strictly greater than 0, one can prove that

XᵀX + λI

is invertible. Using a regularizer therefore also takes care of any case where XᵀX is not invertible, guaranteeing that the matrix is non-singular.
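A quick numerical check of this claim (the dimensions and λ here are arbitrary choices for illustration): with more features than samples, XᵀX is rank-deficient, but adding λI makes it full-rank and solvable.

```python
import numpy as np

rng = np.random.default_rng(1)

# More features (10) than samples (5): X^T X is 10x10 with rank <= 5,
# hence singular, and the plain normal equation cannot be solved.
X = rng.normal(size=(5, 10))
y = rng.normal(size=5)
gram = X.T @ X
assert np.linalg.matrix_rank(gram) <= 5  # degenerate

# Adding lambda * I (lambda > 0) shifts every eigenvalue up by lambda,
# so the regularized normal equation always has a unique solution.
lam = 0.1
theta = np.linalg.solve(gram + lam * np.eye(10), X.T @ y)
```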
In general, supervised learning can be seen as minimizing an objective function of the form

w* = arg min over w of  Σᵢ L(yᵢ, f(xᵢ; w)) + λ Ω(w)

The first term, L(yᵢ, f(xᵢ; w)), is the empirical risk: it measures the error between the model's prediction f(xᵢ; w) and the true label yᵢ for the i-th sample. Since the model should fit the training samples, we want this term to be small. The second term is the regularization term: a function Ω(w) of the parameters w that constrains the model to stay as simple as possible.
Concretely, different choices of the loss function L give different learners: squared loss gives least squares; hinge loss gives the SVM; exponential loss gives boosting; log loss gives logistic regression; and so on. Different loss functions have different fitting characteristics.
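The four losses mentioned above can be written down in a few lines of NumPy (labels in {-1, +1} and raw scores f(x), a common convention; the function names are mine):

```python
import numpy as np

# Labels y in {-1, +1}, model scores f(x); each loss penalizes
# disagreement between them in a different way.
def square_loss(y, f):    # least squares / linear regression
    return (y - f) ** 2

def hinge_loss(y, f):     # SVM
    return np.maximum(0.0, 1.0 - y * f)

def exp_loss(y, f):       # AdaBoost-style boosting
    return np.exp(-y * f)

def log_loss(y, f):       # logistic regression
    return np.log1p(np.exp(-y * f))

y, f = 1.0, 2.0           # a confidently correct prediction
# Hinge loss gives no penalty beyond the margin; squared loss still does.
```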
There are also many choices for the regularization function Ω(w). It is generally a monotonically increasing function of model complexity: the more complex the model, the larger the regularization value. For example, the regularization term can be a norm of the model's parameter vector. Different choices constrain w differently and give different results; the common ones in the literature are the L0 norm, L1 norm, L2 norm, trace norm, Frobenius norm, nuclear norm, and so on.
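For reference, here is how these norms are computed in NumPy on a small example vector and matrix of my own choosing:

```python
import numpy as np

w = np.array([3.0, 0.0, -4.0, 0.0])

l0 = np.count_nonzero(w)        # L0 "norm": number of nonzero entries
l1 = np.abs(w).sum()            # L1: sum of absolute values
l2 = np.sqrt((w ** 2).sum())    # L2: Euclidean length

# For a parameter *matrix* W, the Frobenius norm is the elementwise L2,
# and the nuclear norm is the sum of singular values (promotes low rank).
W = np.array([[1.0, 0.0], [0.0, 2.0]])
fro = np.linalg.norm(W, 'fro')
nuc = np.linalg.svd(W, compute_uv=False).sum()
```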
1. L0 Norm and L1 norm

The L0 norm of a vector is the number of its non-zero elements. If we regularize a parameter matrix W with the L0 norm, we are asking for most of W's elements to be 0, i.e., for W to be sparse.
The L1 norm is the sum of the absolute values of the vector's elements; as a regularizer it is also known as the lasso. Why can the L1 norm make the weights sparse? One answer is that it is the best convex approximation of the L0 norm. A nicer answer: any regularizer that is non-differentiable at wᵢ = 0 and can be decomposed into a sum over coordinates can induce sparsity. The L1 norm is a sum of absolute values, and |w| is not differentiable at w = 0, so it qualifies.
Why, if both L0 and L1 can induce sparsity, is L1 the common choice? First, the L0 norm is hard to optimize (it leads to an NP-hard problem); second, the L1 norm is the best convex approximation of the L0 norm and is far easier to optimize. Hence attention has shifted to the L1 norm.
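The sparsity-inducing effect of the non-differentiability at 0 can be seen directly in the proximal operators of the two penalties (a standard result; the specific weights and λ below are arbitrary). The L1 prox, soft-thresholding, sets small entries exactly to 0; the L2 prox only rescales:

```python
import numpy as np

def prox_l1(w, lam):
    # Soft-thresholding: the exact minimizer of 0.5*(v - w)^2 + lam*|v|.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def prox_l2(w, lam):
    # Minimizer of 0.5*(v - w)^2 + 0.5*lam*v^2: a pure rescaling.
    return w / (1.0 + lam)

w = np.array([0.05, -0.3, 1.2, -0.02])
lam = 0.1

w_l1 = prox_l1(w, lam)
w_l2 = prox_l2(w, lam)
# L1 snaps the two small entries exactly to 0; L2 keeps all four nonzero.
```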

In conclusion, both the L1 norm and the L0 norm can induce sparsity; L1 is widely used because it is far easier to optimize than L0.
So what are the benefits of sparse parameters? Two main ones:
1) Feature selection (Feature Selection):
A key reason people flock to sparse regularization is that it performs automatic feature selection. In general, many of the features xᵢ are irrelevant to the output yᵢ or provide no information about it. Including them when minimizing the objective can reduce the training error slightly, but at prediction time this useless information interferes with predicting yᵢ correctly. A sparse regularizer carries out this feature selection automatically: it learns to remove the uninformative features by driving their corresponding weights to 0.
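A minimal sketch of this effect, using ISTA (iterative soft-thresholding) to solve a lasso problem; the data, dimensions, and λ are my own synthetic choices. Only two of twenty features actually drive y, and the L1 penalty zeroes out the rest:

```python
import numpy as np

rng = np.random.default_rng(2)

# 100 samples, 20 features, but only features 0 and 1 matter.
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[0], w_true[1] = 3.0, -2.0
y = X @ w_true + rng.normal(0, 0.1, 100)

# ISTA for min_w 0.5*||Xw - y||^2 + lam*||w||_1:
# a gradient step on the squared loss, then soft-thresholding (L1 prox).
lam = 5.0
step = 1.0 / np.linalg.norm(X.T @ X, 2)   # 1 / Lipschitz constant
w = np.zeros(20)
for _ in range(2000):
    w = w - step * (X.T @ (X @ w - y))    # gradient of 0.5*||Xw - y||^2
    w = np.sign(w) * np.maximum(np.abs(w) - step * lam, 0.0)

selected = np.nonzero(w)[0]
# The informative features survive; the other weights are exactly 0.
```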
2) Explanatory (interpretability):
Another reason to favor sparsity is interpretability. Suppose y is the probability of some disease, and the collected data x is 1000-dimensional; we want to discover how these 1000 factors affect the probability of disease. Say the model is a regression y = w1*x1 + w2*x2 + ... + w1000*x1000 + b (to keep y in the range [0, 1], a logistic function is usually applied). If, after learning, only a few of the wᵢ are non-zero, say 5 of them, then we have reason to believe that the corresponding features carry decisive information about the disease. In other words, whether a patient has the disease depends essentially on those 5 factors, which makes a doctor's analysis much easier. If all 1000 wᵢ were non-zero, a doctor facing 1000 factors would be overwhelmed.

2. The L2 norm

Besides the L1 norm, another popular regularizer is the L2 norm, ||w||2. It also goes by two well-known names: in regression it is called ridge regression, and it is also known as weight decay. Weight decay has an additional benefit: it can make the objective function strictly convex, so that gradient descent or L-BFGS converges to the global optimum.
The L2 norm is the square root of the sum of the squares of the elements. Minimizing the L2 regularization term ||w||2 makes every element of w very small, close to 0; but unlike the L1 norm, it does not force them to be exactly 0, only close to it. Smaller parameters mean a simpler model, and a simpler model adapts better to different data sets and is less prone to overfitting. Why do smaller parameters mean a simpler model? One understanding is that limiting the size of the parameters limits the influence of certain polynomial components (recall the polynomial fits in the earlier linear-regression figures), which is tantamount to reducing the number of parameters. My own understanding is that with small enough parameters, shifting the data a little does not change the result much; that is, the model is robust to perturbations.
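This "close to 0 but never exactly 0" behavior can be verified with the closed-form ridge solution (the data and λ values below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0])
y = X @ w_true + rng.normal(0, 0.1, 50)

def ridge(X, y, lam):
    n_features = X.shape[1]
    # Closed-form ridge solution: (X^T X + lam*I)^{-1} X^T y.
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

w_small = ridge(X, y, 1.0)
w_big = ridge(X, y, 1000.0)

# Heavier regularization shrinks the weight vector toward zero,
# but no individual weight is set exactly to 0 (unlike lasso).
```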
In conclusion, the L2 norm restricts the model space and thereby helps avoid overfitting to a degree. What are the benefits of the L2 norm? Again two points:
1) From the perspective of learning theory:
The L2 norm prevents overfitting and improves the model's ability to generalize.
2) From the perspective of optimization:
From the point of view of optimization and numerical computation, the L2 norm helps with the difficulty of matrix inversion when the condition number is bad. (Optimization faces two major difficulties: local minima and ill-conditioning.) The condition number measures how much the output changes when the input changes slightly, i.e., the system's sensitivity to small perturbations. A small condition number means the problem is well-conditioned; a large one means it is ill-conditioned. See Reference 1 for an intuitive explanation.
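A small demonstration of how the L2 term repairs conditioning (the near-collinear design below is a contrived example): adding λI lifts every eigenvalue of XᵀX by λ, collapsing a huge condition number to a modest one.

```python
import numpy as np

rng = np.random.default_rng(4)

# Two nearly collinear features make X^T X ill-conditioned: tiny input
# changes would produce large swings in the least-squares solution.
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])
gram = X.T @ X

lam = 1.0
cond_plain = np.linalg.cond(gram)
cond_reg = np.linalg.cond(gram + lam * np.eye(2))

# The regularized condition number is capped at roughly
# (eig_max + lam) / lam, orders of magnitude smaller.
```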

3. L1 vs. L2

1) Descent speed:
Both L1 and L2 are regularization methods: we add the weight parameters to the cost function with either an L1 or an L2 penalty, and the model then tries to minimize those weights. This minimization is like descending a hill, and the difference between L1 and L2 is the "slope": L1 descends along the slope of the absolute-value function, which is constant, while L2 descends along the slope of the quadratic, which shrinks near 0. Around 0, L1 therefore descends faster than L2 and drives weights to 0 very quickly.
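The difference in "slope" near zero is just the two penalty derivatives (a tiny sanity check, with an arbitrary small weight):

```python
import numpy as np

w = 0.001  # a weight already very close to zero

grad_l1 = np.sign(w)   # d|w|/dw  : magnitude 1, regardless of w
grad_l2 = 2 * w        # d(w^2)/dw: magnitude vanishes as w -> 0

# Near zero the L1 penalty keeps pushing with constant force, so the
# weight is driven all the way to 0; the L2 push fades out first.
```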

2) Limitations of model space:
In fact, L1 and L2 regularization can equivalently be written in constrained form:

min over w of  Σᵢ L(yᵢ, f(xᵢ; w))   subject to  ||w||1 ≤ C   (or ||w||2 ≤ C for L2)

That is, we limit the model space to an L1-ball around w = 0. To visualize this, consider the two-dimensional case: the contour lines of the objective function can be drawn in the (w1, w2) plane, and the constraint becomes a norm ball of radius C in that plane. The first point where a contour line touches the norm ball is the optimal solution:

The regularization coefficient λ controls the size of the norm ball: the larger λ, the smaller the ball (the black square in the figure above), and vice versa. The difference between the L1-ball and the L2-ball is that the L1-ball has "corners" where it meets the axes. Unless the contours of the objective happen to be positioned very favorably, they will usually first touch the ball at a corner. Note that a corner is a sparse position: at the intersection point in the figure, w1 = 0. In higher dimensions (imagine the three-dimensional L1-ball), besides the corners there are many edges and faces that also have a large probability of being the first point of contact, and these likewise produce sparsity. The L2-ball, by contrast, has no corners, so the probability that the first contact occurs at a sparse position is very small. This gives an intuitive explanation of why L1 regularization produces sparsity and L2 regularization does not.
One-sentence summary: L1 tends to select a small number of features and set the rest to 0, while L2 keeps more features, all with weights close to 0. Lasso is therefore very useful for feature selection, while ridge is just a regularizer.

4. Selecting the regularization parameter λ

λ is a hyperparameter. The larger λ is, the more the regularization term outweighs the training error: we then prefer the model to satisfy the constraint Ω(w) over fitting the data. But generalization performance is not a simple function of λ: it has many local optima, and the search space is large.
In practice, the first approach is to try many values based on experience; another is to choose λ by analyzing the model. That is, before training, roughly compute the magnitudes of the loss term and of Ω(w), then set λ according to their ratio; this heuristic narrows the search space. The other, most common approach is cross-validation.
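The validation-based approach can be sketched as a simple hold-out grid search (the synthetic data, the log-scale grid, and the use of ridge are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 10))
w_true = rng.normal(size=10)
y = X @ w_true + rng.normal(0, 0.5, 120)

# Hold out a third of the data for validation.
X_tr, y_tr = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Grid-search lambda on a log scale; keep the value with the lowest
# validation error.
grid = [10.0 ** k for k in range(-3, 4)]
val_err = {lam: np.mean((X_val @ ridge(X_tr, y_tr, lam) - y_val) ** 2)
           for lam in grid}
best_lam = min(val_err, key=val_err.get)
```

Full k-fold cross-validation works the same way, averaging the validation error over folds instead of using a single hold-out split.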
For L1, the larger λ is, the easier it is for the penalized objective to attain its minimum at w = 0, i.e., the more weights are driven exactly to 0.
For L2, the larger λ is, the stronger the decay: each θj shrinks faster, the radius of the L2 ball gets smaller, and the parameters at the minimum of the cost function end up smaller.

References:
1. "Norm regularization in machine learning (I): L0, L1 and L2 norms" (very well written)
2. "Norm regularization in machine learning (II): the nuclear norm and choosing the regularization parameter"
3. "Machine Learning Notes 4: Regularization"
4. Li Hang, "Statistical Learning Methods"
5. Bishop, "Pattern Recognition and Machine Learning" (PRML)
6. Andrew Ng's machine learning videos
