Norms and Regularization in Machine Learning (II): The Nuclear Norm and Regularization Parameter Selection


http://blog.csdn.net/zouxy09

In the previous post we talked about the L0, L1, and L2 norms. In this one we ramble on about the nuclear norm and the selection of the regularization parameter. My knowledge is limited, so the following are just some of my superficial views; if I have misunderstood anything, I hope you will point it out. Thank you.

III. The Nuclear Norm

The nuclear norm ||W||* is the sum of the singular values of the matrix W; in English it is called the nuclear norm. Compared with the L1 and L2 norms that get all the attention above, it may be less familiar. So why is it used? Here comes its claim to fame: constraining a matrix to be low-rank. OK, so first we need to know what low-rank means and what it is used for.

Let's first recall what "rank" means in linear algebra. Here is a simple example:
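(The image with the example system of equations has not survived. A minimal system of the kind described below, which is my own illustration rather than the original figure, would be:

\begin{cases} x_1 + x_2 = 2 \\ x_1 - x_2 = 0 \\ 2x_1 - 2x_2 = 0 \end{cases}

Here the third equation is simply twice the second.)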

For this system of linear equations, the first and second equations describe different solution sets, while the second and third equations have exactly the same solutions. In that sense the third equation is "superfluous": it brings no extra information, and removing it leaves a system with the same solutions as the original one. To strip such redundant equations out of a system, the concept of the "rank of a matrix" arises naturally.

Remember how we compute the rank of a matrix by hand? To find the rank of a matrix A, we use elementary row operations to reduce A to row echelon form; if the echelon form has r nonzero rows, then the rank of A, rank(A), equals r. In physical terms, the rank of a matrix measures the correlation between its rows (or columns). If every row or column of a matrix is linearly independent of the others, the matrix is full-rank, that is, the rank equals the number of rows (or columns). Going back to the system above: a system of linear equations can be described by a matrix, and the rank tells us how many useful equations there are. The system above has 3 equations, but only 2 of them are useful and one is superfluous, so the rank of the corresponding matrix is 2.
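As a quick sanity check, here is a tiny NumPy snippet (my own illustration, not from the original post) that computes the rank of the coefficient matrix of a redundant system like the one above:

```python
import numpy as np

# Coefficient matrix of a 3-equation, 2-unknown system in which the
# third row is just twice the second, so only 2 equations carry information.
A = np.array([[1.0,  1.0],
              [1.0, -1.0],
              [2.0, -2.0]])

print(np.linalg.matrix_rank(A))  # prints 2: one equation is redundant
```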

OK. So rank can measure correlation, and the correlation of a matrix reflects its structural information. If the rows of a matrix are strongly correlated, then the matrix can effectively be projected onto a lower-dimensional linear subspace, that is, it can be fully expressed with just a few vectors, and so it is low-rank. What we can sum up is this: if a matrix expresses structural information, for example a user-recommendation table, then there is usually some correlation between its rows, and the matrix is generally low-rank.

If X is an m-by-n numerical matrix and rank(X) is its rank, and rank(X) is much smaller than both m and n, then we call X a low-rank matrix. Each row or column of a low-rank matrix can be expressed as a linear combination of the other rows or columns, so clearly it contains a lot of redundant information. Using this redundancy, missing entries can be recovered, and features can be extracted from the data.

Well, now we have low rank, and constraining low rank just means constraining rank(W). But what does that have to do with the nuclear norm, the subject of this section? Their relationship is the same as the relationship between L0 and L1. Because rank() is non-convex and hard to handle in optimization problems, we need to look for a convex approximation of it. Yes, you guessed it: the convex approximation of rank(W) is the nuclear norm ||W||*.
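To make the definition concrete, here is a small NumPy sketch (mine, not the author's) that computes the nuclear norm as the sum of singular values and compares it with the rank of a deliberately low-rank matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# An obviously low-rank matrix: 100 x 80, but built from rank-5 factors.
W = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))

singular_values = np.linalg.svd(W, compute_uv=False)

rank = np.linalg.matrix_rank(W)        # counts the non-negligible singular values
nuclear_norm = singular_values.sum()   # ||W||_* : sum of all singular values

print(rank)          # 5
print(nuclear_norm)  # a convex, continuous surrogate for the (discrete) rank
```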

Well, at this point I don't have much more to say, because I have only skimmed this topic myself and have not studied it in depth. But I have found quite a few interesting applications of it, so let me give a couple of examples.

1) Matrix Completion:

Let's start by talking about where matrix completion is used. One mainstream application is recommender systems. We know that one approach recommender systems take is to analyze a user's history in order to make recommendations to that user. For example, when we watch a movie, if we like it we give it a rating, say 3 stars. The system, for example a well-known site like Netflix, then analyzes this data to figure out what the topic of each movie is and what kind of movies each user likes, and then recommends movies with similar topics to that user. But there is a problem: the site has many users and many movies, not every user has watched every movie, and not every user who has watched a movie has rated it. Suppose we describe these records with a "user-movie" matrix; then there will be many blank entries. With these blanks it is hard to analyze the matrix, so before the analysis we generally need to complete it first. This is also called matrix completion.

So how do we fill it in? How can we create something out of nothing? Is the information for each blank entry contained in the other, observed entries? If so, how do we extract it? Yes, this is where low rank comes into play. It is called low-rank matrix reconstruction, and it can be expressed with the following model: the known data form a given m*n matrix A, and some of its entries have been lost for some reason. Can we recover those entries from the entries in the other rows and columns? Of course, without any other assumptions it is hard to determine the missing data. But if we know that rank(A) << m and rank(A) << n, then we can find the missing entries through the linear correlations between the rows (or columns) of the matrix. You may ask: is it reasonable to assume that the matrix we are recovering is low-rank? Actually it is quite reasonable; for example, one user's rating of a movie can be seen as a linear combination of other users' ratings of that movie. So with low-rank reconstruction we can predict how much a user will like the movies they have not rated, and the matrix gets filled in.
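The formula image from the original post is missing here; the standard low-rank completion model that this paragraph describes is usually written as (my reconstruction, not the original image):

\min_{X}\ \operatorname{rank}(X) \quad \text{s.t.}\quad X_{ij} = A_{ij},\ (i,j) \in \Omega

which, after replacing the rank with its convex surrogate, relaxes to

\min_{X}\ \|X\|_{*} \quad \text{s.t.}\quad X_{ij} = A_{ij},\ (i,j) \in \Omega

where \Omega denotes the set of observed entries of A.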

2) Robust PCA:

Principal component analysis: this kind of method can effectively find the most "important" elements and structure in the data, remove noise and redundancy, reduce the complexity of the original data, and reveal the simple structure hidden behind complex data. We know that the simplest principal component analysis method is PCA. From the perspective of linear algebra, the goal of PCA is to re-describe the data space using another set of basis vectors, hoping that under this new basis the relationships among the data are revealed as clearly as possible; those directions are the most important "principal components". The goal of PCA is to find such "principal components" and remove as much of the redundancy and noise as possible.
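As a reminder of what classical PCA does, here is a tiny sketch (my own illustration) that finds the new basis, the principal components, from the SVD of the centered data:

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 samples in 3-D that really live near a 2-D plane, plus a little Gaussian noise.
latent = rng.standard_normal((500, 2))
data = latent @ rng.standard_normal((2, 3)) + 0.05 * rng.standard_normal((500, 3))

# Classical PCA: center the data and take its SVD; the rows of Vt are the
# new basis vectors (the "principal components").
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

explained = s**2 / np.sum(s**2)
print(explained)  # nearly all of the variance is captured by the first two components
```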

Robust principal component analysis (Robust PCA) considers the case where our data matrix X contains both structural information and noise. We can then decompose X into the sum of two matrices: one is low-rank (because of the internal structure, the rows or columns are linearly correlated), and the other is sparse (because the noise is assumed to be sparse). Robust PCA can then be written as the following optimization problem:
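The formula image is missing again; the decomposition described above is usually written as (my reconstruction of the standard formulation):

\min_{A,\,E}\ \operatorname{rank}(A) + \lambda \|E\|_{0} \quad \text{s.t.}\quad X = A + E

where A is the low-rank part and E is the sparse noise.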

As with the classical PCA problem, Robust PCA is essentially the problem of finding the best projection of the data onto a low-dimensional space. For a low-rank data observation matrix X, if X is corrupted by random (sparse) noise, its low rank is destroyed and X becomes full-rank. So we need to decompose X into the sum of a low-rank matrix containing its true structure and a sparse noise matrix. Once the low-rank matrix is found, the intrinsic low-dimensional space of the data has effectively been found. Since we already have PCA, why do we need Robust PCA? Where does the "robust" come from? Because PCA assumes that the noise in the data is Gaussian, large noise or serious outliers throw PCA off and keep it from working properly. Robust PCA does not make this assumption; it only assumes that the noise is sparse, regardless of how strong it is.

Since the rank and the L0 norm are non-convex and non-smooth in optimization, we generally relax them and solve the following convex optimization problem instead:
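The relaxed formula did not survive either; replacing the rank with the nuclear norm and the L0 norm with the L1 norm gives the usual convex program (again my reconstruction):

\min_{A,\,E}\ \|A\|_{*} + \lambda \|E\|_{1} \quad \text{s.t.}\quad X = A + E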

Let me mention one application. Consider multiple images of the same face. If each face image is treated as a row vector and those vectors are stacked into a matrix, then in theory the matrix should certainly be low-rank. In practice, however, each image is affected to some degree by occlusion, noise, illumination changes, translation, and so on. The effect of these interfering factors can be seen as the action of a noise matrix. So we can take many pictures of the same person's face under different conditions, stack them into a matrix, and apply the low-rank and sparse decomposition to it; the result is a clean face image (the low-rank matrix) and a noise matrix (the sparse matrix) capturing illumination, occlusion, and so on. As for what that is useful for, you know.
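As a concrete illustration of this low-rank plus sparse decomposition, here is a minimal sketch (mine, not the author's code) of principal component pursuit solved with a simple augmented-Lagrangian / singular-value-thresholding loop; the same recipe applies to the face example above and to the background-modeling example in the next section:

```python
import numpy as np

def shrink(M, tau):
    """Soft-thresholding: the proximal operator of the L1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svt(M, tau):
    """Singular value thresholding: the proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def robust_pca(X, lam=None, mu=None, n_iter=200, tol=1e-7):
    """Split X into low-rank L plus sparse S by an ADMM-style loop for
       min ||L||_* + lam * ||S||_1   s.t.   X = L + S."""
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))                 # common default in the RPCA literature
    if mu is None:
        mu = 0.25 * m * n / (np.abs(X).sum() + 1e-12)  # step-size heuristic
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    Y = np.zeros_like(X)                               # dual variable
    norm_X = np.linalg.norm(X)
    for _ in range(n_iter):
        L = svt(X - S + Y / mu, 1.0 / mu)
        S = shrink(X - L + Y / mu, lam / mu)
        residual = X - L - S
        Y = Y + mu * residual
        if np.linalg.norm(residual) <= tol * norm_X:
            break
    return L, S

# Toy usage: a rank-2 matrix corrupted by a few large sparse errors.
rng = np.random.default_rng(0)
clean = rng.standard_normal((60, 2)) @ rng.standard_normal((2, 40))
noise = np.zeros_like(clean)
mask = rng.random(clean.shape) < 0.05                  # roughly 5% of entries corrupted
noise[mask] = 10 * rng.standard_normal(mask.sum())
L, S = robust_pca(clean + noise)
print(np.linalg.matrix_rank(L, tol=1e-3))              # should recover a low rank
```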

3) Background Modeling:

The simplest scenario for background modeling is to separate the background and the foreground in video captured by a fixed camera. We stretch the pixel values of each frame of the video sequence into a single column vector, so that multiple frames, that is, multiple column vectors, form an observation matrix. Because the background is relatively stable and consecutive frames are very similar, the matrix formed by the background pixels alone is low-rank; and because the foreground consists of moving objects that occupy only a small proportion of the pixels, the matrix formed by the foreground pixels is sparse. The video observation matrix is the superposition of these two kinds of matrices, so video background modeling is exactly a process of low-rank matrix recovery.

4) Transform Invariant Low-rank Textures (TILT):

The low-rank approximation algorithms for images discussed above only consider the similarity of pixels between images, and do not take into account the regularity of a single image as a two-dimensional array of pixels. In fact, for an image that has not been rotated, its symmetry and self-similarity allow us to regard it as a low-rank matrix plus noise. When the image is rotated away from upright, its symmetry and regularity are broken, which means the linear correlations between rows of pixels are broken, so the rank of the matrix increases.

The Transform Invariant Low-rank Textures (TILT) algorithm is a low-rank texture recovery algorithm that exploits both the low-rank property and the sparsity of the noise. Its idea is to adjust the image region represented by D, via a geometric transformation, into a regular region with properties such as horizontal/vertical alignment and symmetry, which can then be characterized by its low-rank property.

There are many more applications of low rank; if you are interested, look up some material and dig deeper.

IV. Selection of the Regularization Parameter

Now let's go back and look at our objective function:
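The formula image is missing here; based on Part I of this series, the objective has the general form (my reconstruction):

w^{*} = \arg\min_{w}\ \sum_{i} L\bigl(y_{i}, f(x_{i}; w)\bigr) + \lambda\,\Omega(w)

with a loss term on the left and the regularization term \Omega(w), weighted by \lambda, on the right.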

Besides the loss term and the regularization term, there is also a parameter λ. It too has an imposing name: hyper-parameter. Don't underestimate it; it is very important. It largely determines the performance of our model, its life and death. It balances the loss term and the regularization term: the larger λ is, the more important the regularization term is relative to the model's training error, which is to say we would rather our model satisfy the constrained property Ω(w) than fit our data well, and vice versa. Take an extreme case: when λ=0 the second term vanishes and minimizing the cost function depends entirely on the first term, that is, on making the output match the expected output as closely as possible. And when is that difference smallest? Of course, when our function or curve passes through every single point; then the error is close to 0, which is overfitting. The model can represent or memorize all of these samples, in however complicated a way, but it has no generalization ability for new samples. After all, new samples will differ from the training samples.

        What do we really want? We want our model to fit our data and at the same time to have the properties we constrain it to have. Only by combining the two well can the model perform powerfully on our task. So getting this balance right is very important. At this point we may have a painful memory: remember reproducing some paper, only to find that the accuracy of your re-implementation is nowhere near what the paper reports, maybe miles worse? You then wonder whether the problem lies with the paper or with your implementation. Actually, besides those two possibilities, there is a third question to consider: does the model in the paper have hyper-parameters? Does the paper give the values used in its experiments? Are they empirical values or cross-validated values? You cannot dodge this question, because almost every problem or model has hyper-parameters; sometimes they are just hidden and you do not see them. But once you find them, and it turns out the two of you were destined to meet, try adjusting them, and a "miracle" may happen.

       OK, back to the problem itself: what is our target when choosing the parameter λ? We want both the training error and the generalization ability of the model to be good. At this point you may realize: doesn't that make our generalization performance a function of the parameter λ? So why don't we just optimize that function and pick the λ that maximizes generalization performance? Sorry to tell you, but generalization performance is not a simple function of λ! It has lots of local maxima, and the search space is large. So when choosing the parameter, one option is to rely on experience gained from trying lots of values, which is where the veterans of a field differ from the rest of us. Of course, for some models the masters have also written up their tuning experience for us, for example Hinton's "A Practical Guide to Training Restricted Boltzmann Machines" and the like. Another option is to choose λ by analyzing our model: before training, roughly compute the value of the loss term and the value of Ω(w), then set λ according to their ratio. This heuristic narrows our search space. The third and most common approach is cross-validation: split our training data into several parts, take some as the training set and some as the validation set, train N models on the training set with different values of λ, test the N models on the validation set, and take the λ whose model has the smallest validation error as our final λ.

       If our model takes a long time to train, then obviously within a limited time budget we can only test very few values of λ. For example, suppose our model takes 1 day to train, which is commonplace in deep learning, and we have one week; then we can only try 7 different values of λ, and the best of those is the best we will ever get. What can we do? Two things. One is to make those 7 candidates as reliable as possible, or make the search space as wide as possible; the search space for λ is usually taken as powers of 2, say from 2^-10 to 2^10 or so. But this is still not very reliable. The better way is to shorten the training time as much as possible. For example, if we optimize the model so that training takes only 2 hours, then in one week we can train the model 7*24/2 = 84 times, which means we can look for the best λ among 84 candidates. This gives us the best chance of meeting the best λ. This is also why we choose algorithms with fast convergence, why we use GPUs, multi-core machines, and clusters to train models, and why industry, with its powerful computing resources, can do many things that academia cannot (of course, big data is another reason).
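Here is a small sketch of the cross-validation procedure described above (my own illustration, on made-up synthetic data), using closed-form ridge regression so that each candidate λ is cheap to evaluate; the λ grid follows the powers-of-2 suggestion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: 200 samples, 30 features.
n, d = 200, 30
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d) * (rng.random(d) < 0.3)   # only some features matter
y = X @ w_true + 0.5 * rng.standard_normal(n)

# Hold-out split: part of the data for training, part for validation.
n_train = 150
X_tr, y_tr = X[:n_train], y[:n_train]
X_va, y_va = X[n_train:], y[n_train:]

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Search space as powers of 2, from 2^-10 to 2^10.
lambdas = [2.0 ** k for k in range(-10, 11)]
val_errors = []
for lam in lambdas:
    w = ridge_fit(X_tr, y_tr, lam)
    val_errors.append(np.mean((X_va @ w - y_va) ** 2))

best = int(np.argmin(val_errors))
print("best lambda:", lambdas[best], "validation MSE:", val_errors[best])
```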

Try to be an "assistant" Master! I wish you all the best!

