Norm Regularization in Machine Learning (II): The Nuclear Norm and the Choice of the Regularization Parameter
http://blog.csdn.net/zouxy09
In the previous post, we talked about the L0, L1 and L2 norms. In this article, we ramble on about the nuclear norm and the choice of the regularization parameter.
My knowledge is limited, so what follows are just my own superficial views; if there are errors in my understanding, I hope you will point them out. Thank you.
III. The nuclear norm
The nuclear norm ||W||* is the sum of the singular values of a matrix; in English it is called the nuclear norm (or trace norm). Compared with the red-hot L1 and L2 norms above, it is probably less familiar to most people. So why is it used? Here is its claim to fame: constraining a matrix to be low-rank. OK, so first we need to know what low-rank means, and what it is good for.
Let's first take a look at what "rank" means in linear algebra, with a simple example: consider a system of three linear equations in which the third equation is just a multiple of the second.
For this system, the first and second equations have different solution sets, while the second and third equations have exactly the same solutions. In this sense the third equation is "redundant": it brings no information at all, and removing it leaves a system with exactly the same solutions as the original. To strip such redundant equations out of a system, the concept of the "rank of a matrix" arises naturally.
Remember how we compute the rank of a matrix by hand? To find the rank of a matrix A, we reduce A to row-echelon form by elementary row operations; if the echelon form has r non-zero rows, then rank(A) = r. Physically, the rank of a matrix measures the correlation between its rows (or columns). If the rows (or columns) of a matrix are all linearly independent, the matrix has full rank, i.e. its rank equals the number of rows. Going back to the linear system above: since a linear system can be described by a matrix, the rank tells us how many genuinely useful equations there are. The system above has 3 equations, but only 2 of them are useful and one is redundant, so the rank of the corresponding matrix is 2.
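To make this concrete, here is a tiny check with NumPy. Since the original figure with the three equations did not survive, the coefficient matrix below is a made-up stand-in whose third row is simply twice the second:

```python
import numpy as np

# Hypothetical coefficient matrix: the third row is twice the second,
# so only two equations carry independent information.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 1.0],
              [4.0, 2.0, 2.0]])

print(np.linalg.matrix_rank(A))  # prints 2: one equation is redundant
```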
OK. Since rank can measure correlation, and the correlation of a matrix reflects its structure: if the rows of a matrix are strongly correlated, then the matrix can actually be projected onto a lower-dimensional linear subspace, that is, it can be fully expressed with just a few vectors, so it is low-rank. To sum up: if a matrix expresses structured information (an image, a user-recommendation table, and so on), then its rows are correlated to some degree, and the matrix is generally low-rank.
Suppose X is an m by n numerical matrix and rank(X) is its rank. If rank(X) is much smaller than both m and n, we call X a low-rank matrix. Every row (or column) of a low-rank matrix can be expressed as a linear combination of the other rows (or columns), so clearly it contains a large amount of redundant information.
With this redundant information, missing data can be recovered, and features can be extracted from the data.
Alright, now we have low rank; but constraining low rank just means constraining rank(W), so what does that have to do with the nuclear norm of this section? Their relationship is the same as the relationship between the L0 norm and the L1 norm: because rank() is non-convex and hard to handle in optimization problems, we need to find a convex approximation of it.
Yes, you guessed it: the convex approximation of rank(W) is the nuclear norm ||W||*.
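As a small illustration of why the nuclear norm tracks the rank, the sketch below builds a low-rank matrix from two thin random factors and computes both quantities with NumPy (the matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 50x40 matrix of rank 5 from two thin factors.
X = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))

s = np.linalg.svd(X, compute_uv=False)   # singular values of X
print(np.linalg.matrix_rank(X))          # 5
print((s > 1e-10).sum())                 # also 5: only 5 singular values are non-zero
print(s.sum())                           # the nuclear norm ||X||* = sum of singular values
```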
Well, up to this point I do not have much more to say, because I have only skimmed what follows and have not studied it in depth. But I have found that it has a lot of very interesting applications; let me list a few typical ones below.
1) Matrix completion (matrix filling):
Let's first talk about where matrix completion is used. One mainstream application is in recommender systems. We know that one way a recommender system works is by analyzing a user's history in order to make recommendations. For example, when we watch a movie, if we like it we give it a rating, say 3 stars. The system, for example well-known sites such as Netflix, then analyzes this data to work out what the subject matter of each film is and what kind of films each user likes, and recommends films with similar subject matter to the corresponding users. But there is a problem: the site has a huge number of users and a huge number of videos; not every user has seen every movie, and not every user who has seen a movie will rate it. Suppose we use a "user-movie" matrix to describe these records; then there will be a lot of blank entries. With these blanks present, it is hard to analyze the matrix, so before the analysis we generally have to complete it first. This is also called matrix filling.
[Figure: a "user-movie" rating matrix with many missing entries]
So how exactly do we fill it in? How can we create something out of nothing? Is the information for each blank entry hidden in the other, observed entries? If it is, how do we extract it? Yes, this is where low rank comes into play. This is called low-rank matrix reconstruction, and it can be expressed by the following model: the known data form an m*n matrix A, and some of its elements are missing for some reason; can we recover those elements from the elements in the other rows and columns? Of course, if there are no other constraints, it is hard to determine the missing data. But if we know that rank(A) << m and rank(A) << n, then we can find the missing elements through the linear correlation between the rows (or columns) of the matrix. You may ask: is it reasonable to assume that the matrix we are recovering is low-rank? It actually is quite reasonable; for example, one user's rating of a movie can be seen as a linear combination of the other users' ratings of that movie.
As a result, low-rank reconstruction lets us predict how much each user will like the videos they have not yet rated, and the matrix is then filled in.
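As a minimal sketch of this idea, the snippet below uses a soft-impute style iteration (repeatedly filling the missing entries with the current estimate and soft-thresholding the singular values). This is just one standard way to approach the nuclear-norm relaxation, not the method of any particular recommender system, and the toy rating matrix is made up:

```python
import numpy as np

def soft_impute(M, mask, lam=1.0, n_iters=200):
    """Fill the missing entries of M (mask is True where an entry is observed)
    by iterative singular-value soft-thresholding."""
    X = np.where(mask, M, 0.0)
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - lam, 0.0)             # shrink singular values -> low rank
        X_low = (U * s) @ Vt
        X = np.where(mask, M, X_low)             # keep observed ratings, fill the rest
    return X

# Toy "user-movie" matrix: rank 2, with roughly 40% of the ratings missing.
rng = np.random.default_rng(0)
true_ratings = rng.random((30, 2)) @ rng.random((2, 20))
mask = rng.random(true_ratings.shape) > 0.4      # True = rating observed
filled = soft_impute(true_ratings, mask, lam=0.1)
print(np.abs(filled - true_ratings)[~mask].mean())  # error on the unrated entries
```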
2) Robust PCA:
Principal component analysis is a method that can effectively find the most "important" elements and structure in data, remove noise and redundancy, reduce the complexity of the original data, and reveal the simple structure hidden behind complex data. We know that the simplest method of principal component analysis is PCA. From the linear-algebra point of view, the goal of PCA is to re-describe the data space with a new set of basis directions, in the hope that under this new basis the relationships among the data are revealed as much as possible. Such a direction is the most important "principal component". The goal of PCA is to find these "principal components" and remove as much of the redundancy and noise as possible.
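A minimal sketch of plain PCA via the SVD on made-up 2-D data, just to make the "find the principal directions" goal concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points that mostly vary along one direction, plus small noise.
data = rng.standard_normal((200, 1)) @ np.array([[2.0, 1.0]]) + 0.1 * rng.standard_normal((200, 2))

Xc = data - data.mean(axis=0)                 # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
principal_directions = Vt                     # rows: directions of decreasing variance
explained = s ** 2 / (s ** 2).sum()
print(explained)                              # almost all variance lies along the first direction
scores = Xc @ Vt[0]                           # 1-D representation of the data
```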
Robust principal component analysis (Robust PCA) considers the problem that our data matrix X contains structural information as well as noise. In that case we can decompose the matrix into the sum of two matrices: one is low-rank (because the internal structural information makes the rows or columns linearly correlated), and the other is sparse (because the noise it contains is sparse). Robust PCA can then be written as the following optimization problem:
min_{A,E}  rank(A) + λ||E||_0,    s.t.   X = A + E

(where A is the low-rank part and E is the sparse noise part)
As with the classical PCA problem, Robust PCA is essentially also a matter of finding the best projection of the data onto a low-dimensional space.
For a low-rank observation matrix X, if X is corrupted by random (sparse) noise, its low-rank property is destroyed and X becomes full-rank. So we need to decompose X into the sum of a low-rank matrix containing its true structure and a sparse noise matrix. Finding the low-rank matrix in effect finds the essential low-dimensional space of the data. Since we already have PCA, why do we need Robust PCA? Where is it "robust"? Because PCA assumes that the noise in the data is Gaussian, and in the presence of large noise or severe outliers PCA is thrown off and fails to work properly. Robust PCA does not make this assumption; it only assumes that the noise is sparse, no matter how strong the noise is.
Since both rank() and the L0 norm are non-convex and non-smooth in optimization, we usually relax them and solve the following convex optimization problem instead:

min_{A,E}  ||A||_* + λ||E||_1,    s.t.   X = A + E
Let's mention an application. Consider multiple images of the same person's face: if we treat each face image as a row vector and stack these vectors into a matrix, then we can be sure that, in theory, this matrix should be low-rank. However, in practice each image is affected to some extent by occlusion, noise, illumination changes, translation and so on, and the effect of these interfering factors can be viewed as a noise matrix. So we can stretch each picture of the same person's face under different conditions into a column, put the columns together into a matrix, and perform a low-rank and sparse decomposition of this matrix; we then obtain clean face images (the low-rank matrix) and a noise matrix (the sparse matrix) capturing illumination changes, occlusion and so on.
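Here is a minimal, illustrative sketch of that low-rank plus sparse decomposition, using the usual convex relaxation solved by a simple augmented-Lagrangian loop (soft-thresholding for the sparse part, singular-value thresholding for the low-rank part). The parameter defaults follow common heuristics, but this is only a sketch, not the exact solver used in the papers:

```python
import numpy as np

def shrink(M, tau):
    """Elementwise soft-thresholding."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_shrink(M, tau):
    """Singular-value soft-thresholding (the nuclear-norm proximal step)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def robust_pca(X, lam=None, mu=None, n_iters=500, tol=1e-7):
    """Decompose X into a low-rank part L and a sparse part S, with X close to L + S."""
    m, n = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    mu = mu if mu is not None else 0.25 * m * n / np.abs(X).sum()
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)                          # dual variable
    for _ in range(n_iters):
        L = svd_shrink(X - S + Y / mu, 1.0 / mu)  # update the low-rank part
        S = shrink(X - L + Y / mu, lam / mu)      # update the sparse part
        residual = X - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * np.linalg.norm(X):
            break
    return L, S
```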
As for what this can be used for, well, you know.
3) Background Modeling:
The simplest scenario for background modeling is separating the background and the foreground in video captured by a fixed camera. We stretch the pixel values of each frame of the video sequence into a single column vector; multiple frames, i.e. multiple column vectors, then form an observation matrix.
Because the background is relatively stable, there is great similarity between the frames of the image sequence, so a matrix consisting only of background pixels has the low-rank property; at the same time, because the foreground consists of moving objects occupying a small proportion of the pixels, the foreground-pixel matrix has the sparse property. The video observation matrix is the superposition of these two kinds of matrices, so video background modeling is exactly a process of low-rank matrix recovery.
[Figure: video frames decomposed into a low-rank background part and a sparse foreground part]
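Continuing that sketch (this reuses the robust_pca function from the snippet in the Robust PCA subsection above, and the toy video below is entirely made up), the separation looks roughly like this:

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, n_frames = 32, 32, 60

# Toy video: a fixed random background plus a small bright square that moves each frame.
background_img = rng.random((height, width))
frames = np.repeat(background_img[None, :, :], n_frames, axis=0)
for t in range(n_frames):
    start = t % (width - 5)
    frames[t, 5:10, start:start + 5] += 1.0      # the moving "object"

X = frames.reshape(n_frames, height * width).T   # one column per frame

L, S = robust_pca(X)                             # robust_pca: the sketch defined above

recovered_bg = L.T.reshape(n_frames, height, width)  # low-rank part: the static background
foreground   = S.T.reshape(n_frames, height, width)  # sparse part: the moving square
```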
4) Transform Invariant Low-rank Textures (TILT):
The low-rank approximation algorithms for images described in the sections above only consider the similarity between image samples (pixels), and do not consider the regularity of the image itself as a two-dimensional array of pixels. In fact, for an un-rotated image, we can regard it as a noisy low-rank matrix because of its symmetry and self-similarity. When the image is rotated away from upright, its symmetry and regularity are destroyed, which means the linear correlation between the rows of pixels is destroyed, and therefore the rank of the matrix increases.
The Transform Invariant Low-rank Textures (TILT) algorithm is a low-rank texture recovery algorithm that exploits the low-rank property together with the sparsity of the noise. Its idea is to rectify, via a geometric transformation τ, the image region represented by D into a regular region with properties such as horizontal/vertical alignment and symmetry; these properties can then be characterized by low rank.
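Written the way the TILT literature usually states it (this notation is my summary of the standard formulation: A is the recovered low-rank texture, E the sparse error, and τ the geometric transform applied to the observed patch D), the problem is roughly:

min_{A,E,τ}  ||A||_* + λ||E||_1,    s.t.   D∘τ = A + E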
[Figure: examples of low-rank texture rectification with TILT]
There are a great many applications of low rank; if you are interested, you can dig up some material and study them in depth.
IV. Choosing the regularization parameter λ
Now let's look back at our objective function:
w* = argmin_w  Σ_i L(y_i, f(x_i; w)) + λ Ω(w)
Besides the loss term and the regularization term, there is also a parameter λ. It too has an imposing name: hyper-parameter. Don't think it is weak; it is very important. Its value largely determines the performance of our model; it is a matter of the model's life and death. It mainly balances the loss term and the regularization term: the larger λ is, the more important the regularization term is relative to the training error, i.e. compared with fitting our data, we would rather have our model satisfy the properties of the constraint Ω(w); and vice versa. Take an extreme case: when λ = 0, the second term disappears and minimizing the cost function depends entirely on the first term, i.e. making the output match the expected output as closely as possible. When is that difference smallest? Of course, when our function or curve passes through every point; then the error is close to 0, i.e. we overfit. Such a model can represent or memorize all the training samples in a complicated way, but it has no generalization ability for new samples; after all, new samples will differ from the training samples.
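A minimal sketch of this trade-off, fitting a degree-9 polynomial to a noisy sine curve with ridge regression (the data, the degree and the λ values below are all made up for illustration). Typically the training error is smallest at λ = 0, while the test error is usually better at some moderate λ:

```python
import numpy as np

rng = np.random.default_rng(0)

def poly_features(x, degree=9):
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    # Solve min ||Xw - y||^2 + lam * ||w||^2 via an augmented least-squares system.
    d = X.shape[1]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    y_aug = np.concatenate([y, np.zeros(d)])
    return np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]

# Noisy samples from a sine curve; a degree-9 polynomial can easily overfit them.
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for lam in [0.0, 1e-4, 1e-1]:
    w = ridge_fit(poly_features(x_train), y_train, lam)
    train_err = np.mean((poly_features(x_train) @ w - y_train) ** 2)
    test_err = np.mean((poly_features(x_test) @ w - y_test) ** 2)
    print(f"lambda={lam:g}  train error={train_err:.4f}  test error={test_err:.4f}")
```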
So what do we really need? We want our model to fit our data and, at the same time, to have the properties we constrain it to have. Only by combining the two well can our model perform powerfully on our task. So how to tune λ well is very important. At this point, those of you who have been through it may have a deep appreciation of this.
Remember when you reproduced a lot of papers and found that the accuracy of your re-implementation never reached what the paper claimed, sometimes not even close? At that point you wonder: is it the paper's problem, or your implementation's problem? In fact, besides these two, there is another question we need to think about carefully: does the model presented in the paper have hyper-parameters? Does the paper give the values used in its experiments? Are they empirical values, or values obtained by cross-validation? This question cannot be dodged, because almost every problem or model has hyper-parameters; sometimes they are just hidden and you cannot see them, but once you find them, you two are destined for each other. Try adjusting them, and a "miracle" may happen.
OK, back to the problem itself. What is our goal in choosing the parameter λ? We hope that the model both achieves a small training error and generalizes well.
At this point you may also realize: doesn't that mean generalization performance is a function of the parameter λ? So why not just optimize that function and pick the λ that maximizes generalization performance? Oh, sorry to tell you, but generalization performance is not a simple function of λ! It has lots of local maxima, and its search space is huge. So when determining this parameter, one way is to rely on plenty of hands-on experience; nothing beats a master who has crawled and rolled around in this field. Of course, for some models, the masters have also written up some of their experience for us.
For example, Hinton's A Practical Guide to Training Restricted Boltzmann Machines, and so on. Another way is to choose λ by analyzing our model.
How? Before training, we roughly compute the value of the loss term and the value of Ω(w) at that point, and then set λ according to their ratio. This rough-and-ready approach narrows our search space. The other, most common approach is cross validation.
Split our training data into several parts, take one part as the training set and one part as the test set, train N models on the training set with different values of λ, then evaluate them on the test set, and take the λ whose model has the smallest test error as our final λ.
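A minimal sketch of this procedure with K-fold cross-validation of ridge regression over a coarse power-of-two grid of λ (the data and the grid are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: 100 samples, 20 features, only 5 informative weights.
X = rng.standard_normal((100, 20))
w_true = np.concatenate([rng.standard_normal(5), np.zeros(15)])
y = X @ w_true + 0.5 * rng.standard_normal(100)

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, k=5):
    """Average validation error of ridge regression over k folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(errs))

lambdas = [2.0 ** p for p in range(-10, 11, 2)]    # coarse power-of-two grid
best_lam = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print("chosen lambda:", best_lam)
```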
Now suppose our model takes a very long time to train. Then obviously, within a limited amount of time, we can only try very few values of λ. For example, if our model needs one day to train, which is commonplace in deep learning, and we have one week, then we can only try 7 different values of λ.
If one of those happens to be the best λ, that is good fortune accumulated from a past life.
So what can we do? Two things. One is to try to pick 7 relatively reliable values of λ, or to make the search space as wide as possible; the search space for λ is therefore generally taken as powers of 2, with exponents from -10 to 10 or so. But this approach is still not very reliable; the best approach is to reduce the training time of our model as much as possible. Suppose, for example, that we optimize the training so that it only takes 2 hours. Then in one week we can train the model 7*24/2 = 84 times; in other words, we can look for the best λ among 84 candidate values.
That gives you the best chance of running into the best λ.
This is why we have to choose optimization algorithms with fast convergence, why we use GPUs, multi-core machines, clusters and so on to train models, and why industry, with its powerful computing resources, can do many things that academia cannot (of course, big data is another reason).
Strive to become a master of "parameter tuning"!
I wish you all the best!
Copyright notice: this is an original article by the blogger and may not be reproduced without permission.