L2 norm
In addition to the L1 norm, there is an even more popular regularization norm: the L2 norm, ||w||2. It is no less capable than the L1 norm, and it goes by two nice names: in regression it is called "ridge regression", and it is also known as "weight decay". It gets used a great deal, because its great strength is tackling one of the most important problems in machine learning: overfitting. As explained above, overfitting means the error during training is very small, but the error during testing is very large; that is, our model is so complicated that it can fit all of our training samples, yet it makes a mess when actually predicting new samples. Loosely speaking, its test-taking ability is very strong but its practical ability is very poor: it is good at memorizing knowledge but does not know how to apply it flexibly. An example (from Ng's course):
The top row of the figure shows linear regression and the bottom row shows logistic regression, i.e., classification. From left to right the cases are underfitting (also called high bias), a good fit, and overfitting (also called high variance). As you can see, if the model is complex enough (it can fit arbitrarily complex functions), it can fit all the data points with essentially no error. For regression, the fitted curve passes through every data point, as in the rightmost panel. For classification, the decision boundary classifies every data point correctly, again as in the rightmost panel. Both of these situations are clearly overfitting.
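To make this concrete, here is a minimal numpy sketch (my own toy data, not Ng's actual example): as the polynomial degree grows, the training error falls to essentially zero while the test error explodes.

```python
# A toy sketch (my own data, not Ng's example): as the polynomial degree
# grows, training error falls to ~0 while test error explodes -- overfitting.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.2e}, test MSE {test_mse:.2e}")
```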
Okay, now we come to a very critical question: why can the L2 norm prevent overfitting? Before answering it, let's first look at what the L2 norm actually is.
The L2 norm is the square root of the sum of the squares of the elements: ||w||2 = sqrt(w1² + w2² + … + wn²). Minimizing the L2 regularization term ||w||2 makes every element of w very small, close to 0; but unlike the L1 norm, it does not make them exactly equal to 0, only close to 0, and that is a big difference. The smaller the parameters, the simpler the model; and the simpler the model, the less prone it is to overfitting. Why do smaller parameters mean a simpler model? I don't fully understand this either. My understanding is: if the parameters are constrained to be very small, the influence of certain components of the polynomial is limited to be very small (see the linear regression fitting figure above), which is effectively like reducing the number of parameters. I don't claim to understand this deeply; I hope someone can enlighten me.
Here is a summary: through the L2 norm, we can constrain the model space, thereby avoiding overfitting to a certain extent.
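A small sketch of the shrinkage effect on the same toy data as above (the value of λ here is hand-picked purely for illustration): adding the L2 penalty pulls the wild degree-9 coefficients down toward 0.

```python
# A sketch of the shrinkage effect (same toy data as above, lambda hand-picked):
# the plain degree-9 least-squares fit needs huge coefficients to thread every
# noisy point; the L2-penalized (ridge) fit keeps them small.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)

X = np.vander(x, 10)                       # degree-9 polynomial features
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
lam = 1e-3
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

print("max |w| without L2 penalty:", np.abs(w_ols).max())    # huge
print("max |w| with L2 penalty:   ", np.abs(w_ridge).max())  # far smaller
```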
What are the benefits of the L2 norm? Again, let's talk about two points:
1) Learning theory perspective: from the viewpoint of learning theory, the L2 norm can prevent overfitting and improve the generalization ability of the model.
2) Optimization perspective:
From the viewpoint of optimization or numerical computation, the L2 norm helps handle the difficulty of matrix inversion when the condition number is poor. Ah, wait, what is a condition number? Let me Google it first.
Here let me also say a little about optimization. Optimization has two major difficulties: local minima and ill-conditioning. I won't dwell on the former; you understand that we are looking for the global minimum, and if there are too many local minima our optimization algorithm can easily fall into one and never climb out, which is obviously not the ending the audience wants to see. Let's talk about ill-conditioning instead. Ill-conditioned is the opposite of well-conditioned. What do they mean? Suppose we have a system of equations Ax = b and we need to solve for x. If a small change in A or b causes a large change in the solution x, then the system is ill-conditioned, and otherwise it is well-conditioned. Let's look at a concrete example:
Look at the example on the left first. Suppose the first row is our original Ax = b. In the second row we change b slightly, and the solution x changes dramatically. In the third row we change the coefficient matrix A slightly, and again the result changes a lot. In other words, the solution of this system is far too sensitive to A and b. Since our A and b are estimated from experimental data, they contain errors. If our system could tolerate those errors, fine; but the system is so sensitive to them that the error in the solution gets amplified, making the solution unreliable. So this system of equations is ill-conditioned: abnormal, unstable, problematic. Haha. The example on the right, by contrast, is a well-conditioned system.
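The figure's exact numbers aren't reproduced here, but this classic textbook example shows the same phenomenon in numpy:

```python
# A classic illustration: a tiny perturbation of b flips the solution entirely.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b1 = np.array([2.0, 2.0001])
b2 = np.array([2.0, 2.0002])   # b changed by 1e-4 in one entry

print(np.linalg.solve(A, b1))  # [1. 1.]
print(np.linalg.solve(A, b2))  # [0. 2.]  -- a completely different solution
print(np.linalg.cond(A))       # ~4e4, a large condition number
```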
Let's say it once more. For an ill-conditioned system, a small change in the input produces a large change in the output. This is bad; it means the system is not practical. For example, for a regression problem y = f(x), we use training samples x to train the model f so that y tries to output the expected value, say 0. If we then encounter a sample x' that differs only slightly from a training sample x, the system should output something close to the y above, say 0.00001; instead it outputs 0.9999, which is clearly wrong. It's like failing to recognize someone you know very well because a pimple appeared on their face; your brain really is that bad, haha. So if a system is ill-conditioned, we should doubt its results. But how much should we trust it? We need a standard to measure that, because some systems are only mildly ill and their results can still be believed; we cannot take a one-size-fits-all attitude. This is exactly what the condition number measures: the reliability of a system. The condition number measures how much the output changes when the input changes slightly, i.e., the system's sensitivity to small perturbations. If the condition number is small, the system is well-conditioned; if it is large, the system is ill-conditioned.
The condition number of a matrix A is the norm of A multiplied by the norm of its inverse: κ(A) = ||A|| · ||A⁻¹||. Its exact value therefore depends on which norm you choose. If the square matrix A is singular, its condition number is positive infinity; every invertible square matrix has a finite condition number. To compute it we need to know the norm of the matrix and the machine epsilon (machine precision). Why do we need a norm? A norm is essentially a measure of the size of a matrix. A matrix itself has no built-in notion of size, but when we ask "if we change the matrix A or the vector b a little, how much does the solution x change?", we need something to measure the magnitude of matrices and vectors, right? That is exactly what a norm does: it measures the size of a matrix or the length of a vector. OK, after a fairly simple derivation, for Ax = b we can draw the following conclusion:
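Writing δb for a small change in b and δx for the resulting change in the solution, the standard form of this bound (for the case where b is perturbed; the case of perturbing A is analogous) is:

$$
\frac{\lVert \delta x \rVert}{\lVert x \rVert} \;\le\; \kappa(A)\,\frac{\lVert \delta b \rVert}{\lVert b \rVert},
\qquad \kappa(A) = \lVert A \rVert \cdot \lVert A^{-1} \rVert .
$$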
That is to say, the relative change of our solution x is bounded by the relative change of A or b, multiplied by κ(A). See it? κ(A) acts as an amplification factor on the change in x.
To sum up the condition number: it is a measure of the stability, or sensitivity, of a matrix (or of the linear system it describes). If a matrix's condition number is close to 1, it is well-conditioned; if it is much greater than 1, it is ill-conditioned. If a system is ill-conditioned, don't put too much trust in its output.
Well, that was quite a digression. By the way, why were we talking about this? Back to the opening sentence: from the viewpoint of optimization or numerical computation, the L2 norm helps handle the difficulty of matrix inversion when the condition number is poor. For linear regression, if the objective function is quadratic, there is actually a closed-form solution: take the derivative, set it to zero, and we obtain the optimum w* = (XᵀX)⁻¹XᵀY.
However, if the number of samples in X is smaller than the dimension of each sample, the matrix XᵀX is not full rank, i.e., XᵀX is not invertible, so w* cannot be computed directly. Or, more precisely, there are infinitely many solutions (because the number of equations is smaller than the number of unknowns; compare compressed sensing, y = Dx, where the measurement matrix is m × n with m < n). That is to say, our data are not sufficient to determine a unique solution, and if we simply pick one solution at random from all the feasible ones, it is very likely not the right one. In short: overfitting.
However, if we add the L2 regularization term, the problem becomes min_w ||Y − Xw||² + λ||w||², and the solution becomes w* = (XᵀX + λI)⁻¹XᵀY.
An important point here: to obtain this solution in practice, we usually do not invert the matrix directly, but compute it by solving the linear system (e.g., by Gaussian elimination). When there is no regularization term, i.e., when λ = 0, if the matrix XᵀX has a large condition number, the solution of the linear system will be numerically quite unstable; introducing the regularization term λI improves the condition number.
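A quick numerical check of this claim (the matrix sizes and λ below are arbitrary illustrative choices):

```python
# With fewer samples than dimensions, X^T X is rank-deficient (numerically
# singular, i.e., an astronomically large condition number); adding lambda*I
# makes it comfortably well-conditioned.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))                 # 5 samples, 20 dimensions -> rank <= 5
G = X.T @ X                                  # 20x20, rank-deficient

print(np.linalg.cond(G))                     # astronomically large
lam = 0.1
print(np.linalg.cond(G + lam * np.eye(20)))  # small and well-behaved
```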
In addition, if an iterative optimization algorithm is used, a large condition number still causes trouble: it slows down the convergence of the iterations. From the optimization perspective, the regularization term actually turns the objective function into a λ-strongly convex function. Alas, here comes another term: what is λ-strongly convex?
When f satisfies f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + (λ/2)·||y − x||² for all x and y, we call f a λ-strongly convex function, where the parameter λ > 0. When λ = 0, this reduces to the definition of an ordinary convex function.
Before understanding strong convexity intuitively, let's first look at ordinary convexity. Suppose we take the first-order Taylor approximation of f at a point x (have we forgotten the first-order Taylor expansion? f(x) = f(a) + f′(a)(x − a) + o(|x − a|)). Convexity requires the function to lie above this linear approximation everywhere: f(y) ≥ f(x) + ⟨∇f(x), y − x⟩.
Intuitively, convexity means the function curve lies above its tangent line at every point, i.e., above its linear approximation, while strong convexity further requires the function to lie above a quadratic function at every point. In other words, the function must not be too "flat"; it must have a certain "upward curvature". Concretely, convexity guarantees that the function lies above its first-order Taylor approximation at every point, while strong convexity guarantees that the function has a very nice quadratic lower bound at every point. This is of course a strong assumption, but it is also a very important one. It may not be easy to grasp in words, so let's draw a picture to understand it visually.
One look at the figure above and it all becomes clear, right? No need for me to ramble on. But let me ramble anyway. Take the point w* where our optimal solution lies. If our function f(w) looks like the left plot, i.e., the red function lies above the blue dashed quadratic, then even when w_t is close to w*, the values f(w_t) and f(w*) still differ noticeably. That is to say, there is a sizable gradient near the optimal solution w*, so we can reach w* within a relatively small number of iterations. In the right plot, however, the red function f(w) is only constrained to lie above the blue dashed line (the linear lower bound). If it is unfortunately very "flat", then when w_t is still far from our optimum w*, the approximate gradient (f(w_t) − f(w*))/(w_t − w*) is already very small, and the gradient ∂f/∂w at w_t is even smaller. As a result, the gradient descent update w_{t+1} = w_t − α·(∂f/∂w) changes w extremely slowly, crawling toward our optimum w* like a snail; within a limited number of iterations, it is still far away from the optimum.
So convexity alone does not guarantee that the point w obtained after gradient descent runs for a limited number of iterations is a good approximation of the optimum w* (as an aside, stopping the iterations before reaching the optimum is itself sometimes used as a way to regularize, i.e., to improve generalization performance). As analyzed above, if f(w) is very flat around the global minimum w*, we may end up at a point very far away from it. With strong convexity, however, the situation is under control and we can obtain an approximate solution with guaranteed quality. How good? There is a bound, and the quality of this bound depends on the size of the strong convexity constant λ. Have you guessed the punchline? If we want the objective to be strongly convex, the simplest way is to add a term (λ/2)·||w||² to it.
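Here is a toy sketch of that effect (the function, step size, and λ are arbitrary choices): gradient descent on a convex function that is very flat near its minimum, with and without the added (λ/2)·w² term.

```python
# A toy sketch: gradient descent on f(w) = w^4 (convex but very flat near
# its minimum at 0) versus f(w) = w^4 + (lam/2)*w^2 (strongly convex).
# The strong-convexity term keeps the gradient alive near the optimum,
# so the iterates approach 0 much faster.
def gd(grad, w=1.0, step=0.01, iters=500):
    for _ in range(iters):
        w -= step * grad(w)
    return w

lam = 1.0
w_flat = gd(lambda w: 4 * w**3)              # plain convex: gradient of w^4
w_strong = gd(lambda w: 4 * w**3 + lam * w)  # strongly convex: extra lam*w term

print(f"after 500 steps: flat {w_flat:.6f}, strongly convex {w_strong:.6f}")
```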
Er, it took this much space to talk about strong convexity. In fact, in gradient descent, the upper bound on the convergence rate of the objective function is related to the condition number of the matrix XᵀX: the smaller the condition number of XᵀX, the smaller the upper bound, i.e., the faster the convergence.
Conclusion: the L2 norm not only prevents overfitting, it also makes our optimization solution stable and fast.
The difference between L1 and L2: why does minimizing the absolute value differ so much from minimizing the square? I have seen two kinds of geometric intuition:
1) Speed of descent:
We know that L1 and L2 are both ways of regularizing: we put the weight parameters into the cost function in L1 or L2 form, and the model then tries to minimize these weights. This minimization is like a downhill process, and the difference between L1 and L2 lies in the "slope" of the hill: L1 descends along the "slope" of the absolute value function, while L2 descends along the "slope" of a quadratic function. Near 0, L1 therefore descends faster than L2, driving the weights to 0 very quickly. That said, I don't find this explanation entirely convincing; of course, that may just be a gap in my own understanding.
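A tiny sketch of the two "slopes" (the step size is an arbitrary choice): shrinking a weight using only the penalty gradients, L1 removes a constant amount per step and reaches 0 exactly, while L2's shrinkage is proportional to w and stalls near 0.

```python
# Shrink a weight using only the penalty gradients. d|w|/dw = sign(w) is
# constant, so L1 removes a fixed amount each step and hits 0; d(w^2/2)/dw = w
# shrinks proportionally, so L2 approaches 0 but never reaches it.
import math

step = 0.1
w_l1 = w_l2 = 1.0
for _ in range(20):
    # max(0, ...) is a simple clamp so L1 stops at 0 instead of oscillating
    w_l1 = max(0.0, w_l1 - step * math.copysign(1.0, w_l1))
    w_l2 = w_l2 - step * w_l2

print(f"after 20 steps: L1 -> {w_l1}, L2 -> {w_l2:.4f}")  # L1 -> 0.0, L2 -> ~0.12
```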
L1 goes by the name "lasso" in the trade, and L2 by "ridge". The two names are rather confusing, though: looking at the picture above, lasso looks more like a ridge, and ridge looks more like a lasso.
2) Constraints on the model space:
In fact, for the L1- and L2-regularized cost functions, we can equivalently rewrite them in the following constrained form:
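In standard notation (a squared-error loss is used here for concreteness, with C the constraint size playing the role of the regularization strength), the two problems read:

$$
\min_w \; \sum_{i}\big(y_i - w^\top x_i\big)^2 \quad \text{s.t.} \quad \lVert w \rVert_1 \le C
\qquad\text{and}\qquad
\min_w \; \sum_{i}\big(y_i - w^\top x_i\big)^2 \quad \text{s.t.} \quad \lVert w \rVert_2^2 \le C .
$$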
That is to say, we restrict the model space to an L1-ball around w = 0. To make this easy to visualize, we can plot the contour lines of the objective function on the (w1, w2) plane, while the constraint is a norm ball of radius C on the same plane. The first place where a contour line touches the norm ball is the optimal solution:
We can see that the difference between the L1-ball and the L2-ball is that L1 has "corners" where it meets the coordinate axes, and unless the contours of the objective function are positioned exceptionally well, they will mostly first touch the ball at a corner. Note that a corner position produces sparsity; in this example the point of contact has w1 = 0. In higher dimensions (can you picture what a three-dimensional L1-ball looks like?), besides corners there are also many edges and faces, which likewise have a high probability of being the first point of contact, and likewise produce sparsity.
In contrast, the L2-ball has no such property: since there are no corners, the probability that the first point of contact lands at a sparse position is very small. This gives an intuitive explanation of why L1 regularization produces sparsity and L2 regularization does not.
Therefore, in a word: L1 tends to keep a small number of features and set the remaining weights to 0, while L2 tends to keep more features, all with weights close to 0. Lasso is thus very useful for feature selection, while ridge is just a regularizer.
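A minimal scikit-learn sketch of this contrast (the data and the alpha values are arbitrary choices): lasso recovers the sparsity pattern exactly, while ridge keeps every coefficient small but nonzero.

```python
# Only 3 of the 20 features actually matter; lasso zeroes out the rest,
# while ridge merely shrinks all 20 coefficients.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [3.0, -2.0, 1.5]            # only the first 3 features matter
y = X @ w_true + rng.normal(0, 0.1, 100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))   # ~3
print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))   # 20
```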
Article reference: http://blog.csdn.net/zouxy09/article/details/24971995