Reprinted article: Norm Regularization in Machine Learning (I): L0, L1, and L2 Norms
http://blog.csdn.net/zouxy09
Today we talk about two problems that come up constantly in machine learning: overfitting and regularization. We first take a simple look at the commonly used L0, L1, L2, and nuclear norm regularizers, and finally discuss how to choose the regularization parameter. Because of the length, and so as not to scare everyone off, I will split the material into two blog posts. My knowledge is limited; what follows are some of my superficial views, and if I have misunderstood anything, I hope you will correct me. Thank you.
A supervised machine learning problem is essentially "minimize your error while regularizing your parameters": minimize the error while regularizing the parameters. Minimizing the error makes our model fit the training data; the regularization term keeps the model from overfitting that data. What a minimalist philosophy! Too many parameters drive up model complexity and make overfitting easy, meaning the training error becomes very small. But a small training error is not our ultimate goal; our goal is a small test error, that is, accurate predictions on new samples. So on top of minimizing training error, we need to keep the model "simple" so that the learned parameters generalize well (the test error is also small), and making the model "simple" is exactly what the regularization function does. In addition, regularization can be used to constrain the characteristics of our model. This is how people's prior knowledge gets incorporated into learning: we force the learned model to have the characteristics we want, such as sparsity, low rank, or smoothness. You know, sometimes a prior is very important. Previous experience saves you many detours, which is why, when learning something new, it is best to find an expert to guide you: one sentence can part the dark clouds and give you a clear blue sky. The same holds for machine learning: if we give it a little nudge, it can certainly learn the task faster. But since communication between human and machine is not yet that direct, the regularizer has to serve as the medium.
There are several ways to look at regularization. Regularization conforms to the principle of Occam's razor. What a name, razor! But its idea is very approachable: among all possible models, we should choose the one that explains the known data well and is as simple as possible. From the viewpoint of Bayesian estimation, the regularization term corresponds to a prior probability on the model. It is also said that regularization implements the structural risk minimization strategy, which adds a regularizer or penalty term on top of the empirical risk.
In general, supervised learning can be seen as minimizing an objective function of the following form:

    w* = argmin_w  Σᵢ L(yᵢ, f(xᵢ; w)) + λ Ω(w)
The first term, L(yᵢ, f(xᵢ; w)), measures the error between our model's prediction f(xᵢ; w) for the i-th sample and its true label yᵢ. Since our model should fit our training samples, we want this term to be small, i.e. we want the model to fit the training data as well as possible. But, as mentioned above, we want not only a small training error but a small test error, so we add the second term: the regularization function Ω(w) on the parameters w, which constrains the model to be as simple as possible.
OK, at this point, if you have been doing machine learning for a few years, you will notice that, wow, most models out there are alike not just in shape but in spirit. Yes, most of them simply vary these two terms. For the first term, the loss function: if it is the square loss, we get least squares; if it is the hinge loss, that is the famous SVM; if it is the exp-loss, that is boosting; if it is the log-loss, that is logistic regression; and so on. Different loss functions have different fitting properties, which must be analyzed case by case for the specific problem. But here we will not investigate the loss function further; we turn our gaze to the regularization term Ω(w).
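To make the family resemblance concrete, here is a small sketch comparing these losses. I write them in "margin" form m = y·f(x) with labels in {−1, +1}, which is one common convention for side-by-side comparison (my own choice here, not something from the original post):

```python
import numpy as np

# margin m = y * f(x): positive means a correct, confident prediction
m = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

square = (1 - m) ** 2            # square loss -> least squares
hinge  = np.maximum(0, 1 - m)    # hinge loss  -> SVM
exp_l  = np.exp(-m)              # exp-loss    -> boosting
log_l  = np.log1p(np.exp(-m))    # log-loss    -> logistic regression

print(hinge)   # [3. 2. 1. 0. 0.]: zero once the margin exceeds 1
```

Note how the hinge loss is exactly zero past margin 1, while the exp- and log-losses decay smoothly but never reach zero; this is one source of their different fitting behavior.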
The regularization function Ω(w) also comes in many choices; it is usually a monotonically increasing function of model complexity, so the more complex the model, the larger the regularization value. For example, the regularization term can be a norm of the model's parameter vector. Different choices constrain w differently and give different results; the common ones in papers are the L0 norm, L1 norm, L2 norm, trace norm, Frobenius norm, nuclear norm, and so on. So many norms. What exactly do they mean? What can they do? When do they come into play? When should each be used? No hurry, let us pick a few common ones to explain.
I. The L0 norm and the L1 norm
The L0 norm is the number of non-zero elements in a vector. If we use the L0 norm to regularize a parameter matrix W, we are asking most elements of W to be 0. That is as intuitive and explicit as it gets: in other words, make the parameters W sparse. OK, seeing the word "sparse", we should snap awake from the currently red-hot fields of "compressed sensing" and "sparse coding": so this is the thing through which "sparsity" is achieved. But then you start to wonder: is that right? In the world of papers, isn't sparsity always achieved through the L1 norm? Isn't the shadow of ||w||₁ everywhere, practically in front of your nose whenever you look up? Yes, that is exactly why this post puts L0 and L1 together: they have an unusual relationship. So let us ask: what is the L1 norm? Why can it achieve sparsity? And why do people use the L1 norm rather than the L0 norm to achieve sparsity?
The L1 norm is the sum of the absolute values of a vector's elements, and it has the flattering name "sparsity regularizer" (as in Lasso regularization). Now let us analyze this million-dollar question: why does the L1 norm make weights sparse? One answer is: "It is the optimal convex approximation of the L0 norm." In fact, there is a nicer answer: any regularizer that is non-differentiable at wᵢ = 0 and can be decomposed into a coordinate-wise "sum" can achieve sparsity. By that criterion, the L1 norm of w is a sum of absolute values, and |w| is non-differentiable at w = 0; but this is still not intuitive enough. The intuition comes from a comparative analysis with the L2 norm, so for a visual understanding of the L1 norm, see Section II below.
Yes, there is another question: since L0 achieves sparsity, why not use L0 instead of L1? My personal understanding is, first, that the L0 norm is hard to optimize (an NP-hard problem), and second, that the L1 norm is the optimal convex approximation of the L0 norm and much easier to optimize. That is why we turn our gaze, and our countless favors, to the L1 norm.
OK, one sentence to sum up: both the L1 norm and the L0 norm can achieve sparsity; L1 is widely used because it has better optimization properties than L0.
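Before moving on, the three norms are easy to check numerically on a toy vector (a small sketch of my own):

```python
import numpy as np

w = np.array([0.0, 3.0, 0.0, -4.0, 0.0])

l0 = np.count_nonzero(w)        # L0 "norm": how many entries are non-zero
l1 = np.abs(w).sum()            # L1 norm: sum of absolute values
l2 = np.sqrt((w ** 2).sum())    # L2 norm: square root of the sum of squares

print(l0, l1, l2)               # 2 7.0 5.0
```

(The L0 "norm" is in quotes because it is not a true norm: it is not homogeneous, which is one more hint of why it is awkward to optimize.)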
Well, we now roughly know that L1 can achieve sparsity. But why would we want sparsity? What are the benefits of sparse parameters? Here are two:
1) Feature selection:
A key reason people flock to sparse regularization is that it performs automatic feature selection. In general, most elements of xᵢ (that is, most features) are irrelevant to the final output yᵢ or provide no information about it. Including these extra features when minimizing the objective yields a smaller training error, but when predicting new samples this useless information gets weighed in and interferes with correctly predicting yᵢ. Sparse regularization is introduced precisely to accomplish the glorious mission of automatic feature selection: it learns to discard the uninformative features, setting their corresponding weights to 0.
2) Interpretability:
Another reason to favor sparsity is that the model becomes easier to interpret. For example, suppose the probability of some disease is y, and the data we collect, x, is 1000-dimensional; that is, we need to find out how these 1000 factors affect the probability of the disease. Say the model is a regression: y = w1*x1 + w2*x2 + ... + w1000*x1000 + b (of course, to restrict y to the range [0,1], one usually composes this with a logistic function). If, after learning, the learned w* has only a few non-zero elements, say only 5 non-zero wᵢ, then we have reason to believe that the corresponding features provide massive, decisive information about the disease. In other words, whether the patient has the disease depends only on these 5 factors, which makes the doctor's analysis much easier. But if all 1000 wᵢ are non-zero, a doctor facing 1000 factors would be simply overwhelmed.
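To see automatic feature selection in action, here is a minimal sketch (the toy data, the λ value, and the ISTA solver are all my own choices, not from the original post). It minimizes ½‖Xw − y‖² + λ‖w‖₁ by proximal gradient descent; the soft-thresholding step is what produces exact zeros:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -3.0, 1.5]      # only 3 of the 10 features are informative
y = X @ w_true + 0.01 * rng.standard_normal(n)

lam = 1.0
L = np.linalg.norm(X, 2) ** 2      # Lipschitz constant of the gradient
w = np.zeros(d)
for _ in range(500):
    z = w - (X.T @ (X @ w - y)) / L                        # gradient step
    w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold

print(np.count_nonzero(w))   # 3: the uninformative weights are exactly 0
```

The recovered w has non-zero entries only on the three informative coordinates; the other seven weights are set to exactly zero, which is the automatic feature selection described above.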
II. The L2 norm
Besides the L1 norm, there is an even more popular regularization norm: the L2 norm ||w||₂. It is in no way inferior to the L1 norm; it too has two flattering names: in regression, some call the resulting method "ridge regression", and others call it "weight decay". It sees heavy use because its great power is to improve on that very important problem in machine learning: overfitting. As for what overfitting is: it is when the error during training is very small but the error at test time is large; that is, our model is so complex that it can fit all of our training samples, yet makes a terrible mess of predicting new ones. Colloquially, it has strong exam-taking ability but poor practical ability: good at reciting knowledge, but unable to apply it flexibly. An example (taken from Ng's course):
The top row shows linear regression; the bottom row shows logistic regression, i.e. a classification case. From left to right are underfitting (also called high bias), a good fit, and overfitting (also called high variance). As you can see, if the model is complex enough (it can fit arbitrarily complex functions), it can fit all the data points with essentially no error. For regression, the function curve passes through every data point, as on the right. For classification, the function curve classifies every data point correctly, also as on the right. Both right-hand cases are clearly overfitting.
OK, now for the critical question: why can the L2 norm prevent overfitting? Before answering, we need to look at what the L2 norm is.
The L2 norm is the square root of the sum of the squares of the elements. Making the L2 regularization term ||w||₂ small makes every element of w small, close to 0; but unlike the L1 norm, it does not make them exactly 0, only close to 0, and that is a very big difference. Smaller parameters mean a simpler model, and a simpler model is less prone to overfitting. Why does smaller parameters mean a simpler model? My understanding is: constraining the parameters to be small effectively suppresses the polynomial components with small effect (see the linear-regression overfitting figure above), which is equivalent to reducing the number of parameters. Honestly, I am not entirely sure about this; I hope readers can point it out.
To sum up in one sentence: through the L2 norm we can constrain the model space and thereby, to some extent, avoid overfitting.
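For contrast with the L1 sketch above, here is the same kind of toy setup (again my own construction) solved with an L2 penalty via the closed-form ridge solution: every weight shrinks toward 0, but none lands exactly on 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -3.0, 1.5]       # again, only 3 informative features
y = X @ w_true + 0.01 * rng.standard_normal(n)

lam = 10.0
# ridge solution w = (X^T X + lam*I)^{-1} X^T y, computed via solve()
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(np.count_nonzero(w_ridge))    # 10: all weights shrunk, none exactly 0
```

The seven uninformative weights end up small but non-zero, which is exactly the "close to 0 but not equal to 0" behavior described above.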
What are the benefits of the L2 norm? Two points again:
1) The learning-theory angle:
From the perspective of learning theory, the L2 norm can prevent overfitting and improve the model's generalization ability.
2) The optimization angle:
From the perspective of optimization or numerical computation, the L2 norm helps with the difficulty of matrix inversion when the condition number is bad. Hey, wait, what is this condition number? Let me Google it a bit.
Here we will also pretend to be refined and talk a bit about optimization. Optimization has two big difficulties: one is local minima, the other is ill-conditioned problems. I will skip the former since everyone understands it: we seek the global minimum, and if there are too many local minima, our optimization algorithm easily falls into one and cannot extricate itself, which is obviously not the plot the audience wants to see. So let us talk about ill-conditioning. "Ill-conditioned" is the opposite of "well-conditioned". What do they mean? Suppose we have a system of equations Ax = b and need to solve for x. If a slight change in A or b causes a huge change in the solution x, the system is ill-conditioned; otherwise it is well-conditioned. Let us look at an example:
Look at the left one first. The first row is our original Ax = b; in the second row we change b slightly, and the resulting x differs enormously from before; see that? In the third row we change the coefficient matrix A slightly, and the result also changes a lot. In other words, the solution of this system is overly sensitive to the coefficient matrix A and to b. Since our A and b are estimated from experimental data, they contain error. If the system can tolerate that error, fine; but if it is so sensitive to it that the error in the solution becomes even larger, then the solution is too unreliable. So that system of equations is ill-conditioned: abnormal, unstable, problematic, haha. Now it is clear. The one on the right is a well-conditioned system.
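The original figure with the concrete systems is missing from this reprint, but a classic example of this kind (the matrix below is my own stand-in, not necessarily the one in the figure) is easy to reproduce numerically:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b = np.array([2.0, 2.0001])

x      = np.linalg.solve(A, b)                          # solution [1, 1]
x_pert = np.linalg.solve(A, b + np.array([0, 0.0001]))  # b barely changed ...
print(x, x_pert)          # ... yet the solution jumps from [1, 1] to [0, 2]
print(np.linalg.cond(A))  # ~4e4, far greater than 1: A is ill-conditioned
```

A perturbation of b in the fourth decimal place completely changes the solution, and the condition number computed below quantifies exactly this sensitivity.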
To repeat: for an ill-conditioned system, a slight change in the input changes the output a lot. That is bad; it means the system is not practical. You know, for example, for a regression problem y = f(x), suppose we use training samples x to train the model f so that y outputs the values we expect, say 0. If we then encounter a sample x′ that differs only slightly from a training sample, the system should output a value close to the earlier y, say 0.00001; but instead it gives me 0.9999, which is clearly wrong. It is as if, because a familiar person got a pimple on his face, you no longer recognize him; your brain would be rather poor, haha. So if a system is ill-conditioned, we will doubt its results. But how much should we believe them? We need a standard to measure that, since some systems are not so severely ill and their results can still be trusted. Finally, we come back: the condition number is exactly the measure of how ill-conditioned a system is. It measures how much the output changes when the input changes slightly, that is, the system's sensitivity to small perturbations. A small condition number means well-conditioned; a large one means ill-conditioned.
If a square matrix A is non-singular, then A's condition number is defined as

    κ(A) = ‖A‖ · ‖A⁻¹‖
that is, the norm of the matrix A multiplied by the norm of its inverse. The specific value depends on which norm you choose. If the square matrix A is singular, its condition number is simply infinity. In fact, every invertible square matrix has a condition number, but to compute it we need to know the norm and the machine epsilon (precision). Why the norm? The norm measures the size of a matrix: we know a matrix has no intrinsic "size", yet above we needed to measure how much a change in the matrix A or the vector b changes the solution x, so there must be something to quantify the size of matrices and vectors. Right, that something is the norm, which expresses the size of a matrix or the length of a vector. OK, after a relatively simple proof, for Ax = b we can reach the following conclusion:
    ‖δx‖ / ‖x‖ ≤ κ(A) · ‖δb‖ / ‖b‖

That is, the relative change of our solution x is related to the relative change of A or b as above, where the value κ(A) acts like an amplification factor; see? It bounds the change in x.
One sentence on the condition number: it is a measure of the stability, or sensitivity, of a matrix (or of the linear system it describes). If a matrix's condition number is near 1, it is well-conditioned; if it is far greater than 1, it is ill-conditioned, and the output of an ill-conditioned system should not be trusted too much.
Well, that was already a lot of words for one concept. By the way, why did we bring this up? Back to the first sentence: from the perspective of optimization or numerical computation, the L2 norm helps with matrix inversion when the condition number is bad. Because the objective function is quadratic, linear regression actually has an analytic solution; taking the derivative and setting it to zero gives the optimal solution:

    w* = (XᵀX)⁻¹ Xᵀ y
However, if the number of samples x is smaller than the dimension of each sample, the matrix XᵀX is not full rank, i.e. XᵀX is not invertible, so w* cannot be computed directly. More precisely, there are infinitely many solutions (because the number of equations is less than the number of unknowns). That is, our data is insufficient to determine a solution, and if we pick one at random from all feasible solutions, it is probably not a truly good one. In a nutshell, we are overfitting.
But if we add the L2 regularization term, it becomes the following, and the matrix can be inverted directly:

    w* = (XᵀX + λI)⁻¹ Xᵀ y
Here, the more professional description is: to obtain this solution, we usually do not compute the matrix inverse directly, but solve the linear system of equations (for example by Gaussian elimination). With no regularization term, i.e. λ = 0, if the condition number of XᵀX is large, solving the linear system is numerically quite unstable; introducing the regularization term improves the conditioning.
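A small numeric sketch of this situation (the dimensions are my own choice): with fewer samples than dimensions, XᵀX is rank-deficient and singular, and adding λI restores invertibility.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 10                         # fewer samples than dimensions
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

G = X.T @ X
print(np.linalg.matrix_rank(G))      # 5 < 10: X^T X is rank-deficient, singular

lam = 1e-3
G_reg = G + lam * np.eye(d)
print(np.linalg.matrix_rank(G_reg))  # 10: now invertible
w = np.linalg.solve(G_reg, X.T @ y)  # solve the linear system, don't invert
```

Note the design choice in the last line: following the advice above, the solution is obtained by solving a linear system rather than by explicitly forming the inverse.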
Also, if an iterative optimization algorithm is used, a too-large condition number causes a problem: it slows down the convergence of the iterations. From the optimization point of view, the regularization term actually turns the objective function into a λ-strongly convex function. Ouch, here comes another term: what is λ-strong convexity?
When f satisfies

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (λ/2)‖y − x‖²   for all x, y,

we call f a λ-strongly convex function, where the parameter λ > 0. When λ = 0 this reduces to the definition of an ordinary convex function.
Before the intuitive description of strong convexity, let us first look at what ordinary convexity is. Take the first-order Taylor approximation of f at a point x (forgot the first-order Taylor expansion? f(x) = f(a) + f′(a)(x − a) + o(‖x − a‖)); convexity says the function lies above it:

    f(y) ≥ f(x) + ∇f(x)ᵀ(y − x)
Intuitively, convexity means the function curve lies above its tangent at each point, that is, above its linear approximation; strong convexity further requires it to lie above a quadratic at that point, which means the function must not be too "flat" but must guarantee a certain "upward bend". The professional statement is: convexity guarantees the function lies above its first-order Taylor expansion at every point, while strong convexity guarantees that at every point the function has a very beautiful quadratic lower bound. This is of course a strong assumption, but also an important one. It may be hard to grasp, so let us draw a picture to understand it visually.
As soon as we see the picture above, we all understand it, so I will not nag too much. Well, let me nag a little anyway. Take the optimal solution w*. If our function f(w), see the left figure, the red one, lies above the quadratic drawn as the blue dashed line, then even when wₜ is close to w*, the values f(wₜ) and f(w*) still differ substantially; that is, near the optimum w* there is still a sizeable gradient, so we can reach w* within relatively few iterations. On the right, however, the red f(w) is only constrained to lie above a linear blue dashed line; if we are unlucky (the function is very flat on the right), then when wₜ is still far from our optimal point w*, the approximate gradient (f(wₜ) − f(w*))/(wₜ − w*) is already very small, and the approximate gradient ∂f/∂w at wₜ is even smaller. Under gradient descent wₜ₊₁ = wₜ − α·(∂f/∂w), w then changes extremely slowly, crawling toward the optimum w* like a snail, and within a finite number of iterations it remains far from the optimum.
So convexity alone does not guarantee that, under gradient descent with finitely many iterations, the point w we reach is a good approximation of the global minimum w* (incidentally, stopping the iterations near the optimum is itself a form of regularization, a way to improve generalization). As analyzed above, if f(w) is very flat around the global minimum w*, we may find a point far away from it. But with strong convexity, we can control the situation and obtain a better approximate solution. How good? There is a bound, and this bound depends on the size of the constant α that measures the quality of the strong convexity. See here: have we learned to be clever yet? What do you do if you want strong convexity? The simplest way is to add the term (α/2)·‖w‖² to the objective.
Well, it took this much space to explain strong convexity. In fact, in gradient descent, the upper bound on the convergence rate of the objective function is related to the condition number of the matrix XᵀX: the smaller the condition number of XᵀX, the smaller the upper bound, that is, the faster the convergence.
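A minimal experiment to see this (the diagonal H, the value of λ, and the helper `gd_iters` are my own construction): adding λI lowers the condition number of the quadratic's Hessian, and gradient descent reaches a fixed accuracy in far fewer steps.

```python
import numpy as np

def gd_iters(H, g, tol=1e-6, max_iter=100000):
    """Gradient descent on f(w) = 0.5*w'Hw - g'w; return steps to reach tol."""
    w = np.zeros(len(g))
    step = 1.0 / np.max(np.linalg.eigvalsh(H))   # step size 1/L
    w_star = np.linalg.solve(H, g)               # exact minimizer, for reference
    for k in range(max_iter):
        if np.linalg.norm(w - w_star) < tol:
            return k
        w = w - step * (H @ w - g)
    return max_iter

H = np.diag([100.0, 1.0])          # condition number 100
g = np.array([1.0, 1.0])
H_reg = H + 10.0 * np.eye(2)       # condition number 10 after adding lam*I

print(np.linalg.cond(H), np.linalg.cond(H_reg))   # 100.0 10.0
print(gd_iters(H, g), gd_iters(H_reg, g))
# the better-conditioned (regularized) objective converges in far fewer steps
```

Each run converges to its own minimizer, so this compares conditioning, not solution quality; the point is only the iteration count.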
That was a lot of optimization talk. One sentence to summarize: the L2 norm not only prevents overfitting but also makes our optimization solution stable and fast.
OK, now to honor the earlier promise and talk intuitively about the difference between L1 and L2. Why does minimizing absolute values versus minimizing squares make such a big difference? I have seen two geometrically intuitive explanations:
1) Descent speed:
We know that L1 and L2 are both regularization methods: we add the weights, penalized in the L1 or L2 fashion, to the cost function, and the model then tries to minimize these weight parameters. This minimization is like a downhill process; the difference between L1 and L2 is that the "slope" differs: L1 descends along the "slope" of the absolute-value function, while L2 descends along the "slope" of the quadratic. So near 0, L1 descends faster than L2 and drops to 0 very quickly. I do not find this explanation entirely convincing, though; of course, I may be the one misunderstanding it.
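The difference shows up already in one dimension (a toy setting of my own): minimizing ½(w − a)² plus each penalty has a closed-form solution, and only the L1 one lands exactly on zero.

```python
import numpy as np

a, lam = 0.3, 0.5     # least-squares target and penalty strength

# argmin_w 0.5*(w - a)^2 + lam*|w|   -> soft-thresholding
w_l1 = np.sign(a) * max(abs(a) - lam, 0.0)
# argmin_w 0.5*(w - a)^2 + lam*w^2   -> plain shrinkage
w_l2 = a / (1.0 + 2.0 * lam)

print(w_l1, w_l2)     # 0.0 0.15: L1 snaps to exactly zero, L2 only shrinks
```

Because the L1 penalty's slope stays at λ all the way down to 0, any target with |a| ≤ λ is pushed exactly to zero, while the L2 penalty's pull vanishes as w approaches 0 and so only rescales the target.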
In the jargon of the field, L1 is called Lasso and L2 is called Ridge. The two names are quite confusing, though: in the figure above, Lasso's plot looks like a ridge, and Ridge's plot looks like a lasso.
2) Constraints on the model space:
In fact, for L1- and L2-regularized cost functions, we can equivalently write the constrained form:

    min_w  Σᵢ L(yᵢ, f(xᵢ; w))    subject to  ‖w‖₁ ≤ C

(and similarly with ‖w‖₂² ≤ C for L2).
In other words, we restrict the model space to an L1-ball in w. For ease of visualization, consider the two-dimensional case: on the (w1, w2) plane we can draw the contour lines of the objective function, while the constraint becomes a norm ball of radius C on that plane. The optimal solution is where the contour lines first touch the norm ball:
As you can see, the difference between the L1-ball and the L2-ball is that L1 has "corners" where it meets each coordinate axis, and the contours of the objective function will touch the ball at a corner most of the time, unless they happen to be positioned exceptionally well. Notice that a corner position produces sparsity: in the example, the touching point has w1 = 0. In higher dimensions (can you picture what a three-dimensional L1-ball looks like?), besides the corners there are also many edges and faces that have a large probability of being the first point of contact, and these also produce sparsity.
By contrast, the L2-ball has no such property: it has no corners, so the probability that the first contact occurs at a sparse position is very small. This explains intuitively why L1 regularization produces sparsity while L2 regularization does not.
Therefore, in one sentence: L1 tends to keep a small number of features and set the rest to 0, while L2 selects more features, all of which end up close to 0. Lasso is extremely useful for feature selection, whereas Ridge is just a regularizer.
OK, that is all for this post. In the next post we will talk about the nuclear norm and the selection of the regularization parameter. The full list of references appears in the next post and is not repeated here. Thank you.
Copyright notice: this article is the blogger's original work; do not reproduce it without the blogger's permission.