On norm regularization in machine learning

I. Introduction to supervised learning

A supervised machine learning problem boils down to one instruction: "Minimize your error while regularizing your parameters." Minimizing the error lets our model fit the training data; regularizing the parameters prevents it from overfitting that data. What a minimalist philosophy! Too many free parameters drive the model's complexity up and make overfitting easy: the training error becomes very small, but a small training error is not our ultimate goal. Our goal is a small test error, i.e., accurate predictions on new samples. So we need the model to stay "simple" while minimizing the training error, so that the learned parameters generalize well (the test error is also small), and keeping the model "simple" is exactly what the regularization function does. Regularization can also constrain the characteristics of our model. In this way, prior knowledge can be incorporated into learning, forcing the learned model to have the properties people want, such as sparsity, low rank, or smoothness. And a good prior matters: previous experience saves you many detours, which is why we all hope to learn from a master whose one sentence can part the dark clouds and show us the blue sky. The same is true for machine learning: if we give the model a little nudge, it can surely learn the task faster. Since communication between human and machine is not yet that direct, regularization has to serve as the medium.

1.1 Ways to look at regularization

There are several ways to look at regularization. One view is that it follows the principle of Occam's razor. What a good name, razor! Its idea is very approachable: among all admissible models, we should choose one that explains the known data well and is as simple as possible. From the viewpoint of Bayesian estimation, the regularization term corresponds to a prior probability over models. It is also said that regularization implements the strategy of structural risk minimization: adding a regularizer, or penalty term, on top of the empirical risk.

1.2 General form of supervised learning

In general, supervised learning can be seen as minimizing the following objective function:

w* = argmin_w  Σ_i L(yi, f(xi; w)) + λ·Ω(w)

Here the first term, L(yi, f(xi; w)), measures the error between our model's (classification or regression) prediction f(xi; w) for the i-th sample and its true label yi. Since the model should fit the training samples, we ask this term to be as small as possible, which is to ask the model to fit the training data as well as it can. But as mentioned above, we want not only a small training error but also a small test error, so we add the second term: a regularization function Ω(w) on the parameters w that constrains the model to be as simple as possible.
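To make this general form concrete, here is a minimal sketch in Python (the names are illustrative; I assume a linear model and squared loss purely for demonstration):

```python
import numpy as np

def objective(w, X, y, lam, regularizer):
    """Empirical risk plus regularization: sum_i L(yi, f(xi; w)) + lam * Omega(w)."""
    y_hat = X @ w                          # linear model f(x; w) = w . x
    data_term = np.sum((y - y_hat) ** 2)   # squared loss as the per-sample L
    return data_term + lam * regularizer(w)

# Two common choices of Omega(w):
l1 = lambda w: np.sum(np.abs(w))   # ||w||_1, promotes sparsity
l2 = lambda w: np.sum(w ** 2)      # ||w||_2^2, promotes small weights
```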

1.3 Choosing the regularization function

There are many choices for the regularization function Ω(w); it is generally a monotonically increasing function of model complexity: the more complex the model, the larger the regularization value. For example, the regularization term can be a norm of the model's parameter vector. Different choices constrain w differently and produce different results; the ones commonly seen in papers are the L0 norm, L1 norm, L2 norm, trace norm, Frobenius norm, nuclear norm, and so on. So many norms! What exactly do they mean? What can each of them do? When should each be used? No hurry, let's pick a few common ones to explain.

II. The L0 and L1 norms

2.1 The L0 norm

The L0 norm of a vector is the number of its non-zero elements. If we use the L0 norm to regularize a parameter matrix W, we are asking for most of W's elements to be 0; in other words, we want the parameters to be sparse. It is that intuitive, that explicit. OK, seeing the word "sparse", we should wake up and realize that this is how the currently red-hot fields of "compressed sensing" and "sparse coding" achieve sparsity. But then you start to wonder: is that really it? In the world of papers, isn't sparsity achieved through the L1 norm everywhere you look? Shadows of ||w||_1 fill the mind; nearly every paper you open has one. Yes, and that is exactly why this section puts L0 and L1 together: they have an unusual relationship. So what is the L1 norm? Why can it achieve sparsity? And why do we use the L1 norm rather than the L0 norm to achieve sparsity?

2.2 The L1 norm

The L1 norm is the sum of the absolute values of a vector's elements. It also has a laudatory name, the "sparsity-inducing operator" (Lasso regularization). Now let's analyze this million-dollar question: why does the L1 norm make weights sparse? One answer is, "It is the optimal convex approximation of the L0 norm." In fact, there is a more elegant answer: any regularizer that is non-differentiable at wi = 0 and can be decomposed into a "sum" over coordinates can induce sparsity. That is the reasoning for L1: it is a sum of absolute values, and |w| is not differentiable at w = 0. But this alone is not intuitive enough; we need a comparative analysis against the L2 norm. For a visual understanding of the L1 norm, see the geometric discussion in item 2) near the end of Section III.

Yes, there is another question: since L0 can induce sparsity, why not use L0 directly instead of L1? My personal understanding is, first, that the L0 norm is hard to optimize (an NP-hard problem), and second, that the L1 norm is the optimal convex approximation of the L0 norm and is much easier to optimize. That is why we turn our gaze, and a myriad of favors, to the L1 norm.

OK, one sentence to sum up: both the L1 norm and the L0 norm can induce sparsity; L1 is widely used because it is easier to optimize than L0.

Well, now we roughly know that L1 can induce sparsity, but we would still like to ask: why sparsity? What are the benefits of sparse parameters? Here are two points:

2.3 The benefits of sparsity

1) Feature selection:

A key reason people flock to sparse regularization is that it performs automatic feature selection. In general, most elements of xi (that is, most features) are irrelevant to the final output yi, or provide no information about it. Taking these extra features into account when minimizing the objective function can yield a smaller training error, but when predicting a new sample, that useless information interferes with predicting the correct yi. The sparsity-inducing regularizer is introduced to accomplish the glorious mission of automatic feature selection: it learns to discard the uninformative features, setting their corresponding weights to 0.
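As a small, concrete sketch of this effect (using scikit-learn's Lasso on synthetic data; the alpha value here is just an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # 100 samples, 20 candidate features
true_w = np.zeros(20)
true_w[:3] = [4.0, -2.0, 3.0]         # only the first 3 features matter
y = X @ true_w + 0.1 * rng.normal(size=100)

model = Lasso(alpha=0.1).fit(X, y)    # alpha plays the role of lambda
print(np.flatnonzero(model.coef_))    # typically recovers just [0 1 2]
```

The uninformative features end up with weights exactly 0, which is the automatic feature selection described above.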

2) Interpretability:

Another reason to favor sparsity is that a sparse model is easier to interpret. For example, suppose the probability of some disease is y, and the data we collect, x, is 1000-dimensional; that is, we need to figure out how these 1000 factors affect the probability of the disease. Say we use a regression model: y = w1*x1 + w2*x2 + ... + w1000*x1000 + b (of course, to restrict y to the range [0,1], you would usually wrap this in a logistic function). If the learned w* has only a few non-zero elements, say only 5 non-zero wi, then we have reason to believe that the corresponding features carry huge, decisive information for analyzing this disease. In other words, whether the patient has the disease depends mainly on these 5 factors, which makes the doctor's analysis much easier. But if all 1000 wi are non-zero, a doctor facing 1000 factors would be exhausted.

III. The L2 norm

Besides the L1 norm, there is an even more popular regularization norm: the L2 norm, ||w||_2. It is no less prestigious than the L1 norm and has two laudatory names: in regression, the model that uses it is called "ridge regression", and some people also call it "weight decay". It sees very heavy use, because its powerful effect addresses a very important problem in machine learning: overfitting. What is overfitting? It is when the error during training is very small but the error at test time is very large: the model is so complex that it can fit every training sample, yet it makes a terrible mess when predicting new samples. Colloquially, it has strong exam-taking ability but poor practical ability; good at reciting knowledge, but unable to apply it flexibly. (Andrew Ng's course has a classic illustration of underfitting versus overfitting; the figure is omitted here.)

1) From the perspective of learning theory:

From the perspective of learning theory, the L2 norm can prevent overfitting and enhance the generalization ability of the model.

2) From the perspective of optimization:

From the viewpoint of optimization or numerical computation, the L2 norm helps to handle the difficulty of matrix inversion when the condition number is bad. Hey, wait, what's this condition number? Let me Google it a little.

Here let's put on an elegant air and talk a bit about optimization. Optimization has two major difficulties: one is local minima, the other is ill-conditioned problems. The former I won't discuss; everyone understands that we are looking for the global minimum, and if there are too many local minima, our optimization algorithm easily falls into one and cannot extricate itself, which is obviously not a plot the audience wants to see. So let's talk about ill-conditioning. "Ill-conditioned" is the opposite of "well-conditioned". What do they each mean? Say we have a system of equations Ax = b and we need to solve for x. If a slight change in A or b causes a great change in the solution x, then the system is ill-conditioned; otherwise it is well-conditioned. Let's look at an example:

(The original post shows two worked systems side by side; the figures are omitted here.) In the left-hand example, the first line states a system Ax = b; in the second line, b is changed slightly, and the resulting x differs enormously from before; in the third line, the coefficient matrix A is changed a little, and again the solution changes drastically. In other words, the solution of that system is overly sensitive to A and b. Since our A and b are estimated from experimental data, they carry error; if the system tolerates that error, fine, but if it is so sensitive that the error in the solution is even larger, the solution is too unreliable. So that system of equations is ill-conditioned: abnormal, unstable, problematic, haha. The system on the right, by contrast, is well-conditioned.
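Since the original figures are not available, here is a small numerical stand-in (my own illustrative 2x2 system, not the one from the original post):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])      # nearly singular -> ill-conditioned
b = np.array([2.0, 2.0001])

print(np.linalg.solve(A, b))       # exact solution: [1. 1.]

b2 = b + np.array([0.0, 0.0001])   # perturb b in the 4th decimal place
print(np.linalg.solve(A, b2))      # solution jumps to [0. 2.]

print(np.linalg.cond(A))           # condition number ~ 4e4 (see below)
```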

To repeat: for an ill-conditioned system, a slight change in the input changes the output a great deal, and that is bad; it means the system is not practical. Think of a regression problem y = f(x): we train a model f on training samples x so that y outputs the values we expect, say 0. If we then encounter a sample x' that differs only slightly from a training sample, the system ought to output a similar value, say 0.00001, but instead it gives us 0.9999, which is clearly wrong. It is as if you fail to recognize someone you know well just because a pimple appeared on their face; your brain would be too fragile, haha. So if a system is ill-conditioned, we will doubt its results. But how much should we believe them? We need a standard to measure this, because some systems are not so severely ill and their results can still be trusted. At last we arrive: the condition number is that measure. It measures how much the output changes when the input changes slightly, i.e., the system's sensitivity to small perturbations. A small condition number means well-conditioned; a large one means ill-conditioned.

If the square matrix A is non-singular, then A's condition number is defined as:

κ(A) = ||A|| · ||A⁻¹||

that is, the norm of A multiplied by the norm of its inverse. The specific value depends on which norm you choose. If the square matrix A is singular, its condition number is simply infinity. In fact, every invertible square matrix has a condition number, and to compute it we need the norm and the machine epsilon (precision). Why a norm? The norm measures the size of a matrix; a matrix has no single "magnitude", yet above we wanted to measure how much the solution x changes when the matrix A or the vector b changes, so there must be something that measures the size of matrices and vectors. That something is the norm, which indicates the size of a matrix or the length of a vector. OK, after a relatively simple proof, for Ax = b we can reach the following conclusion:

||Δx|| / ||x||  ≤  κ(A) · ( ||ΔA|| / ||A|| + ||Δb|| / ||b|| )    (to first order)

That is, the relative change in the solution x is bounded by the relative change in A or b multiplied by κ(A). See? κ(A) is effectively a magnification factor, a bound on how much x can change.

One sentence on the condition number: it is a measure of the stability, or sensitivity, of a matrix (or of the linear system it describes). If a matrix's condition number is near 1, it is well-conditioned; if it is far greater than 1, it is ill-conditioned, and the output of an ill-conditioned system should not be trusted too much.

Well, for such a thing, that is already much better said. But, by the way, why did we talk about all this? Back to the opening sentence: from the viewpoint of optimization or numerical computation, the L2 norm helps handle matrix inversion when the condition number is bad. Because the objective function is quadratic, linear regression actually has an analytic solution; taking the derivative and setting it to zero gives the optimum:

w* = (XᵀX)⁻¹ Xᵀ y

(My personal understanding: this is the analytic solution of Xw = y, using the generalized inverse.)

However, if the number of samples in X is smaller than the dimension of each sample, the matrix XᵀX will not be full-rank, i.e., XᵀX becomes non-invertible, so w* cannot be computed directly. More precisely, there are infinitely many solutions (because we have fewer equations than unknowns). That is, our data does not suffice to determine a solution, and if we pick one at random from all the feasible solutions, it is probably not a truly good one. In a nutshell: we are overfitting.

But if we add the L2 regularization term, the situation becomes the following, and the matrix can be inverted directly:

w* = (XᵀX + λI)⁻¹ Xᵀ y

Here, a more professional description is: to obtain this solution, we usually do not invert the matrix directly but compute it by solving the linear system (e.g., by Gaussian elimination). Without the regularization term, i.e., when λ = 0, if the condition number of XᵀX is large, solving the linear system is numerically quite unstable; introducing the regularizer improves the condition number.
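A quick numerical check of this point (a sketch with synthetic data; the value of λ is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))              # 5 samples, 10 dimensions: n < d
y = rng.normal(size=5)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))         # 5 < 10, so XtX is singular

lam = 0.1
w = np.linalg.solve(XtX + lam * np.eye(10), X.T @ y)   # ridge: always solvable
print(w[:3])                              # a unique, well-defined solution
```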

Also, if you use an iterative optimization algorithm, a too-large condition number causes another problem: it slows the convergence of the iterations. From the optimization point of view, the regularization term actually turns the objective function into a λ-strongly convex one. Ouch, here comes another term: what is λ-strong convexity?

When f satisfies:

f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (λ/2)·||y − x||²    for all x, y

we call f a λ-strongly convex function, where the parameter λ > 0. When λ = 0, this falls back to the definition of an ordinary convex function.

Before describing strong convexity intuitively, let's first look at what ordinary convexity is. Suppose we take a first-order Taylor approximation of f at a point a (forgot the first-order Taylor expansion? f(x) = f(a) + f′(a)(x − a) + o(||x − a||)). Convexity requires f(x) ≥ f(a) + f′(a)(x − a) for every a and x:

Intuitively, convexity means the function curve lies above its tangent, i.e., above its linear approximation, at every point, while strong convexity further requires that it lie above a quadratic function there. That means the function must not be too "flat"; it must have a guaranteed "upward bend". Professionally speaking: convexity guarantees that the function lies above its first-order Taylor approximation at every point, while strong convexity guarantees that the function has a very beautiful quadratic lower bound at every point. This is of course a strong assumption, but also a very important one. It may be hard to grasp in words, so the original post draws a picture (omitted here) to make the image concrete.

One look at that picture and it all becomes clear, so I won't nag too much. Well, just a little. Consider our current point w_t and the optimal solution w*. If our function f(w) — see the left panel, the red curve — lies above the quadratic function drawn as the blue dashed line, then even when w_t is close to w*, the values f(w_t) and f(w*) still differ noticeably. This guarantees that near the optimum w* there is still a sizable gradient, so we can reach w* within a relatively small number of iterations. In the right panel, however, the red f(w) is only constrained to lie above a linear blue dashed line; if the function is unluckily very flat near w*, then while w_t is still far from the optimum, the approximate gradient (f(w_t) − f(w*)) / (w_t − w*) is already very small, and the gradient ∂f/∂w at w_t is even smaller. The gradient descent update w_{t+1} = w_t − α·(∂f/∂w) then changes w extremely slowly, crawling like a snail toward w*, and within a limited number of iterations it remains far from the optimum.

So convexity alone does not guarantee that, under gradient descent with a finite number of iterations, the point we reach is a good approximation of the global minimum w*. (Incidentally, stopping the iterations near, but not at, the optimum is itself a way to regularize and improve generalization.) As analyzed above, if f(w) is very flat around the global minimum w*, we may end up at a point far away. But if we have strong convexity, the situation is under control and we can obtain a better approximate solution. How good? There is a bound on that, and the bound depends on the size of the strong-convexity constant λ. Seeing this, have you gotten smart yet? What do you do if you want strong convexity? The simplest way is to add the term (λ/2)·||w||² to the objective.
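Here is a toy illustration of that point (my own example: f(w) = w^4 is convex but very flat around its minimum at 0; adding (λ/2)·w² with λ = 0.5 makes it strongly convex):

```python
def gd(grad, w=1.0, lr=0.1, steps=200):
    """Plain gradient descent: w <- w - lr * grad(w)."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

flat = gd(lambda w: 4 * w**3)              # f(w) = w^4: gradient dies near 0
strong = gd(lambda w: 4 * w**3 + 0.5 * w)  # f(w) = w^4 + 0.25 * w^2

print(flat)    # still around 0.08 after 200 steps: a snail's crawl
print(strong)  # ~1e-5: strong convexity keeps the gradient alive
```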

Well, it took a lot of space to tell the story of strong convexity. In fact, under gradient descent, the upper bound on the convergence rate of the objective function is related to the condition number of the matrix XᵀX: the smaller the condition number of XᵀX, the smaller the upper bound, i.e., the faster the convergence.

This one optimization angle says so many things. One sentence to summarize: the L2 norm not only prevents overfitting, it also makes our optimization solution stable and fast.

OK, now to honor the promise made above and talk intuitively about the difference between L1 and L2: why does minimizing absolute values versus minimizing squares make such a big difference? I have seen two geometrically intuitive explanations:

1) Descent speed:

We know that both L1 and L2 regularize by putting the weights into the cost function, which the model then tries to minimize. This minimization is like a descent process, and the difference between L1 and L2 is the "slope" they descend on: L1 descends along the slope of the absolute-value function, while L2 descends along the slope of a quadratic. Near 0, L1 therefore descends faster than L2 and drives weights to 0 very quickly. I don't find this explanation entirely convincing, though of course the misunderstanding may be my own.
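The same contrast shows up in the proximal (shrinkage) step that each penalty induces; a small sketch (the step size 0.1 is arbitrary):

```python
import numpy as np

def prox_l1(w, t):
    """Soft-thresholding, the proximal step of t * |w|: snaps small weights to 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def prox_l2(w, t):
    """Proximal step of (t/2) * w^2: rescales weights but never reaches 0."""
    return w / (1.0 + t)

w = np.array([0.05, -0.3, 1.2])
print(prox_l1(w, 0.1))   # [ 0.    -0.2    1.1   ]  -> exact zeros appear
print(prox_l2(w, 0.1))   # [ 0.0455 -0.2727 1.0909]  -> shrunk, never zero
```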

2) Constraints on the model space:

In fact, for L1- and L2-regularized cost functions, we can rewrite the problem in the following constrained form:

min_w  Σ_i L(yi, f(xi; w))   subject to   ||w||_1 ≤ C      (L1)
min_w  Σ_i L(yi, f(xi; w))   subject to   ||w||_2² ≤ C     (L2)

In other words, we restrict the model space to an L1-ball around w = 0. To make this easy to visualize, consider the two-dimensional case: the contours of the objective function can be drawn on the (w1, w2) plane, while the constraint becomes a norm ball of radius C in that plane. The optimal solution is the first place where the contour lines touch the norm ball (the original post illustrates this with a figure, omitted here):

As you can see, the difference between the L1-ball and the L2-ball is that the L1-ball has "corners" where it meets the axes, and the contours of the objective function will touch at a corner most of the time, unless the contours happen to be positioned unusually well. Notice that positions at a corner are sparse: the intersection point in the example has w1 = 0. In higher dimensions (can you picture the three-dimensional L1-ball?), besides the corners there are also many edges and faces with a large probability of being the first point of contact, and these likewise produce sparsity.

By contrast, the L2-ball has no such property: since it has no corners, the probability that the first point of contact lies in a sparse position is very small. This explains intuitively why L1 regularization produces sparsity and L2 regularization does not.

Therefore, the one-sentence summary is: L1 tends to produce a small number of non-zero features, with all the rest exactly 0, while L2 selects more features, all with weights close to 0. Lasso is very useful for feature selection; ridge is merely a regularizer.

IV. The nuclear norm

The nuclear norm ||W||_* is the sum of a matrix's singular values. Compared with the fiery L1 and L2 above, it is probably less familiar. Then why is it used? Here comes its domineering debut: to constrain a matrix to be low-rank. OK, OK, then we need to know what low rank is and what it is used for.

Consider a system of three linear equations (the original post shows a specific one, omitted here; a system of the same flavor is, for instance:

x1 + x2 = 2
2x1 − x2 = 1
4x1 − 2x2 = 2

where the third line is just the second multiplied by 2). The first equation and the second describe different constraints, while the second and the third have exactly the same solution set. In this sense the third equation is "superfluous": it brings no new information, and removing it leaves a system with the same solutions as the original. To strip such redundant equations out of a system, the concept of the "rank of a matrix" arises naturally.

Remember how we computed the rank of a matrix by hand? To find the rank of a matrix A, we use elementary row operations to bring A to row-echelon form; if the echelon form has r non-zero rows, then rank(A) = r. The physical meaning of rank is the degree of correlation among the rows (or columns) of a matrix. If every row or column is linearly independent, the matrix is full-rank, i.e., the rank equals the number of rows. Back to the system above: a linear system can be described by a matrix, and the rank tells us how many useful equations it contains. The system above has 3 equations, but only 2 are useful and one is superfluous, so the rank of the corresponding matrix is 2.
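numpy agrees with the hand computation on the illustrative system above:

```python
import numpy as np

A = np.array([[1.0,  1.0],
              [2.0, -1.0],
              [4.0, -2.0]])        # row 3 = 2 * row 2, as in the example above

print(np.linalg.matrix_rank(A))   # 2: only two independent equations
```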

OK. Since rank measures correlation, and correlation among a matrix's rows reflects the matrix's structural information: if the rows are strongly correlated, the matrix can in fact be projected onto a lower-dimensional linear subspace, i.e., it can be fully expressed with just a few vectors; it is low-rank. What we can sum up is this: if a matrix expresses structured information, such as an image, a user-recommendation table, and so on, its rows are correlated to some degree, and the matrix is generally low-rank.

If X is an m-row, n-column numerical matrix and rank(X) is its rank, and if rank(X) is much smaller than both m and n, we call X a low-rank matrix. Every row (or column) of a low-rank matrix can be expressed as a linear combination of the other rows (or columns); visibly, it contains a large amount of redundant information. With this redundancy, missing entries can be recovered, and features can be extracted from the data.

Good, now we have low rank. But constraining low rank is just constraining rank(W), so what does it have to do with this section's nuclear norm? Their relationship is the same as that of L0 to L1. Because rank(·) is non-convex and hard to solve for in optimization problems, we need to look for its convex approximation. Yes, you guessed it: the convex approximation of rank(W) is the nuclear norm ||W||_*.

Well, beyond this I don't have much to say, because I have only skimmed this topic myself and haven't studied it in depth. But I've found that it has a lot of interesting applications, so let's look at a few examples.

1) Matrix completion:

Let's first talk about where matrix completion is used. A mainstream application is in recommender systems. We know that one approach a recommender system takes is to analyze a user's history to make recommendations. For example, while watching movies we rate the ones we like, say 3 stars; the system, say a well-known site like Netflix, then analyzes the data to work out what the theme of each film is and what kinds of films each user likes, and recommends films with similar themes. But there is a problem: the site has many users and many movies, not every user has seen every movie, and not every user who saw a movie rated it. If we describe these records with a "user-movie" matrix, you can imagine that many entries are blank. With all those blanks, analyzing the matrix is difficult, so before the analysis one generally has to complete it first. This is also called matrix filling.

So how do we fill it in, creating something out of nothing? Is the information for each blank entry hidden in the other, observed entries, and if so, how do we extract it? Yes, this is where low rank comes into effect. This is called low-rank matrix recovery, and it can be expressed by the following model: the known data form a given m×n matrix A; if some entries are lost for some reason, can we restore them based on the entries of the other rows and columns? With no other assumptions, of course, the missing data can hardly be determined. But if we know that rank(A) << m and rank(A) << n, then we can find the missing entries through the linear correlation between the matrix's rows (or columns). You may ask: is it reasonable to assume the matrix we want to recover is low-rank? It actually is: for instance, a user's rating of a movie can be seen as a linear combination of other users' ratings of that movie. So, with low-rank recovery, we can predict how much a user will like an unrated movie, and the matrix gets filled in.
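As a toy sketch of how the nuclear norm does this in practice, here is a bare-bones singular-value-thresholding loop on a synthetic rank-2 "ratings" matrix (all parameter values are heuristic choices of mine, not tuned):

```python
import numpy as np

def complete(X_obs, mask, tau=150.0, step=1.0, iters=500):
    """Iteratively shrink singular values, then re-impose the observed entries."""
    Z = np.zeros_like(X_obs)
    W = Z
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        W = U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt   # nuclear-norm shrinkage
        Z = Z + step * mask * (X_obs - W)                # refit observed entries
    return W

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 30))  # true rank-2 matrix
mask = rng.random(A.shape) < 0.5                         # observe ~half the entries
A_hat = complete(A * mask, mask)
print(np.linalg.norm(A_hat - A) / np.linalg.norm(A))     # relative error, typically small
```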

2) Robust PCA:

Principal component analysis can effectively find the most "principal" elements and structures in data, remove noise and redundancy, reduce the complexity of the original data, and reveal the simple structure hidden behind complex data. The simplest method of this kind is PCA. From the linear-algebra point of view, the goal of PCA is to re-describe the data space using another, new basis, under which the relationships among the existing data are revealed as clearly as possible; these new directions are the most important "principal components". The goal of PCA is to find such principal components so as to remove redundancy and noise disturbances to the greatest extent.

Robust principal component analysis (robust PCA) considers the problem that our data matrix X contains both structural information and noise. We can then decompose X into two matrices: one low-rank (the internal structure makes rows or columns linearly correlated) and one sparse (it holds the noise, and the noise is sparse). Robust PCA can then be written as the following optimization problem:

min_{A,E}  rank(A) + λ·||E||_0    subject to    X = A + E
Like the classical PCA problem, robust PCA is essentially about finding the best projection of the data onto a low-dimensional space. For a low-rank observation matrix X, if X is corrupted by random (sparse) noise, its low rank is destroyed and X becomes full-rank, so we need to decompose X into the sum of a low-rank matrix containing its true structure and a sparse noise matrix. Finding the low-rank matrix amounts to finding the data's intrinsic low-dimensional space. With PCA available, why do we need robust PCA, and where is it "robust"? Because PCA assumes the noise in the data is Gaussian; for large noise or serious outliers, PCA is badly affected and stops working properly. Robust PCA drops this assumption: it only assumes the noise is sparse, regardless of how strong it is.

Since rank and the L0 norm are both non-convex and non-smooth in optimization, we generally convert the problem into the following relaxed convex optimization problem:

min_{A,E}  ||A||_* + λ·||E||_1    subject to    X = A + E
Let's mention an application. Consider multiple images of the same face: if each face image is viewed as a row vector and these vectors are stacked into a matrix, then in theory the matrix should certainly be low-rank. In practice, however, each image is affected to some degree by occlusion, noise, illumination changes, translation and so on; the effect of these interfering factors can be treated as a noise matrix. So we can take many pictures of the same person's face under different conditions, stack them into a matrix, and perform low-rank and sparse decomposition on it to obtain the clean face images (the low-rank matrix) and the noise (the sparse matrix): illumination, occlusion, and so on. As for what that is useful for, you know.
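Here is a compact sketch of the relaxed problem above, in the style of the inexact augmented-Lagrange-multiplier method (the parameter heuristics follow common choices in the RPCA literature; treat this as a demo, not a tuned implementation):

```python
import numpy as np

def rpca(X, iters=200):
    """Split X into low-rank L plus sparse S via min ||L||_* + lam*||S||_1, X = L + S."""
    m, n = X.shape
    lam = 1.0 / np.sqrt(max(m, n))            # standard principal-component-pursuit weight
    mu = m * n / (4.0 * np.abs(X).sum())      # common step-size heuristic
    soft = lambda M, t: np.sign(M) * np.maximum(np.abs(M) - t, 0.0)
    Y = np.zeros_like(X)                      # dual variable
    S = np.zeros_like(X)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = U @ np.diag(np.maximum(s - 1.0 / mu, 0.0)) @ Vt   # singular-value shrinkage
        S = soft(X - L + Y / mu, lam / mu)                    # entrywise shrinkage
        Y = Y + mu * (X - L - S)                              # dual ascent
    return L, S

rng = np.random.default_rng(0)
L0 = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 40))   # "clean faces": rank 2
S0 = np.where(rng.random((40, 40)) < 0.05,
              10 * rng.normal(size=(40, 40)), 0.0)         # sparse corruption
L, S = rpca(L0 + S0)
print(np.linalg.norm(L - L0) / np.linalg.norm(L0))         # small if the separation worked
```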

3) Background Modeling:

The simplest scenario of background modeling is separating background and foreground in video captured by a fixed camera. We stretch the pixel values of each video frame into a single column vector, so that multiple frames give multiple column vectors forming an observation matrix. Because the background is relatively stable, the frames of the sequence are highly similar, so the matrix composed of the background pixels is low-rank; and because the foreground consists of moving objects that occupy a small fraction of the pixels, the matrix of foreground pixels is sparse. The observed video matrix is the superposition of these two kinds of matrices, so video background modeling is precisely a process of low-rank matrix recovery.

4) Transform invariant low rank texture (TILT):

The low-rank approximation methods for images discussed above considered only the similarity of pixels across images, not the regularity of an individual image as a two-dimensional array of pixels. In fact, for an unrotated image, its symmetry and self-similarity let us regard it as a low-rank matrix plus noise. Once the image is rotated away from upright, its symmetry and regularity are damaged, meaning the linear correlation among the rows of pixels is damaged, and the rank of the matrix increases.

The transform-invariant low-rank texture algorithm (TILT) uses low rank and the sparsity of noise to restore textures. Its idea is to find a geometric transformation that adjusts the image region represented by D into a regular region, one with properties such as horizontal-vertical alignment and symmetry, which can then be characterized by its low-rank property.

There are many more low-rank applications; if you are interested, find some material and dig deeper.

V. Choosing the regularization parameter

Now let's go back and look at our objective function:

w* = argmin_w  Σ_i L(yi, f(xi; w)) + λ·Ω(w)
Besides the loss and the regularization term, there is one more player: the parameter λ. It too has a domineering name: hyper-parameter. Don't underestimate it; it is very important, and its value largely determines the performance of our model, its life or death. It balances the loss term against the regularization term: the larger λ is, the more the regularization term matters relative to the training error, meaning we care more about the model satisfying our constraint Ω(w) than about fitting the data; and vice versa. Take the extreme case λ = 0: the second term disappears and minimizing the cost depends entirely on the first term, making the difference between output and expected output as small as possible. When is that difference smallest? Of course, when our function or curve passes through every point, with error close to 0: that is overfitting. Such a model can represent or memorize all the training samples in a complicated way but has no generalization ability for new samples; after all, new samples will differ from the training samples.

What do we really need? We want our model to fit our data and, at the same time, to have the characteristics we constrained it to have. Only the perfect combination of the two lets the model perform powerfully on our task. So pleasing λ is very important. Here you may have some personal experience: remember reproducing a paper, and the reproduced code never reaches the accuracy reported in the paper, sometimes missing by a mile? You then wonder: is it the paper's problem or your implementation's? Beyond those two, there is another question worth asking: does the model in the paper have hyper-parameters? Does the paper give the values used in the experiments? Empirical values, or cross-validated values? This question cannot be dodged, because almost any problem or model has hyper-parameters; sometimes they are just hidden and you cannot see them. But once you find them, you two are destined: try tuning them, and a "miracle" may happen.

OK, back to the problem itself: what is our target when choosing the parameter λ? We want the model's training error and its generalization ability both to be strong. At this point you might reflect: doesn't that make generalization performance a function of λ? So why not just pick, by optimization, the λ that maximizes generalization performance? Oh, sorry to tell you: generalization performance is not a simple function of λ! It has many local maxima, and its search space is large. So when determining λ, one way is to rely on plenty of hands-on experience; this is what separates the masters who have crawled through the field from the rest of us, and for some models the masters have even written down their tuning experience for us, for example Hinton's "A Practical Guide to Training Restricted Boltzmann Machines". Another way is to choose by analyzing our model: before training, roughly compute the value of the loss term and the value of Ω(w), then determine λ from their ratio; this heuristic narrows the search space. The other most common approach is cross-validation: split the training data into several folds, take part as training set and part as validation set, train N models with different values of λ on the training set, test all N models on the validation set, and take the λ whose model has the smallest validation error as our final λ.

If the model takes a long time to train, then in limited time we can only test very few values of λ. For example, suppose the model needs one day of training, which is commonplace in deep learning, and we have one week: then we can only test 7 different values of λ, and the best of those is all we will ever get. What can be done? Two things. One is to make those 7 values count, keeping the search space wide; this is why the λ grid is usually taken as powers of 2, from about 2^-10 up to 2^10 or so. But this method is still not reliable; the best strategy is to reduce the model's training time. Say we optimize the training to bring it down to 2 hours: then in one week we can train the model 7*24/2 = 84 times, i.e., we can search for the best λ among 84 candidates, which gives us a much better chance of meeting the best λ. This is why we choose optimization algorithms with fast convergence, why we use GPUs, multiple cores, clusters and so on for model training, and why industry, with its powerful computing resources, can do many things academia cannot (well, big data is a reason too).
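For the cross-validation route, scikit-learn makes the loop short; here is a sketch with a powers-of-2 grid as suggested above (the data is synthetic, and Ridge stands in for whatever model is being tuned):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = X @ rng.normal(size=30) + rng.normal(size=200)

# lambda is called alpha in scikit-learn; search powers of 2, wide on purpose.
grid = {"alpha": [2.0 ** k for k in range(-10, 11)]}
search = GridSearchCV(Ridge(), grid, cv=5).fit(X, y)
print(search.best_params_)   # the cross-validated choice of lambda
```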

Try hard to become a master "tuner"! I wish you all the best!

VI. References

http://www.cnblogs.com/TenosDoIt/p/3708996.html

http://fastml.com/large-scale-l1-feature-selection-with-vowpal-wabbit/

http://www.stat.purdue.edu/~vishy/introml/notes/Optimization.pdf

http://www.stanford.edu/~boyd/cvxbook/bv_cvxbook.pdf

http://nm.mathforcollege.com/mws/gen/04sle/mws_gen_sle_spe_adequacy.pdf
