Contents

1 L2 Penalty Term

1.1 Penalty Terms

1.2 The L2 Penalty Term and Overfitting

1.3 Linear Models with Multiple Target Values

2 Eigenvalue Decomposition

3 Singular Value Decomposition

4 Summary

5 References

1 L2 Penalty Term

1.1 Penalty Terms

In order to prevent the world from being destroyed, in order to protect world peace... sorry, that was just an empty cliché of an opening! The cost functions of some linear models include a penalty term, and we learn from books or from experience that a penalty term serves two main purposes: preventing the model from overfitting, and keeping the model simple. The common penalty terms are the L0, L1, and L2 penalties. The L0 penalty is the number of nonzero components of the weight vector w, and the L1 penalty is the sum of the absolute values of the components of w; both are good at keeping the weight vector w sparse. The L2 penalty is the modulus (Euclidean norm) of the weight vector w or, for a weight matrix W, the maximum singular value of W; the L2 penalty is good at preventing the model from overfitting. In the blog post "Norm rule in machine learning (i): L0, L1 and L2 norm", the author gives an intuitive explanation of why L1 has the advantage in keeping the model simple, while L2 has the advantage in preventing overfitting.
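As a quick illustration (the vector and matrix below are my own examples, not from the original post), the three penalties can be computed directly with numpy:

```python
import numpy as np

# A hypothetical weight vector: two zero and two nonzero components.
w = np.array([0.0, 3.0, -4.0, 0.0])

l0 = np.count_nonzero(w)   # L0: number of nonzero components
l1 = np.abs(w).sum()       # L1: sum of absolute values
l2 = np.linalg.norm(w)     # L2: modulus (Euclidean norm) of w

print(l0, l1, l2)          # 2 7.0 5.0

# For a weight matrix, the L2 penalty is the largest singular value
# (the spectral norm), which numpy exposes as the ord=2 matrix norm.
W = np.array([[3.0, 0.0],
              [0.0, 1.0]])
print(np.linalg.norm(W, 2))  # 3.0
```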

Furthermore, solving a linear model with a penalty term is essentially maximum likelihood estimation that incorporates prior information: a linear model with an L1 penalty assumes that the weight vector w obeys a double exponential (Laplace) distribution, while a linear model with an L2 penalty assumes that the weight vector obeys a Gaussian distribution. I will unravel that mystery in another blog post.

1.2 The L2 Penalty Term and Overfitting

The L0 penalty is the most primitive expression of model simplicity, and the geometric meaning of the L1 penalty and the single-target-value L2 penalty is fairly obvious, so it is easy to see from a geometric point of view how they prevent overfitting or keep the model simple. In this article, we focus primarily on the L2 penalty term.

Overfitting, put simply, is the phenomenon in which a model fits the training data too closely and then performs poorly on unseen data. In fact, overfitting is the combined result of data and model: the sampled data may fail to represent the whole, or may even deviate from it considerably, and a sufficiently complex model trained on such data will overfit. For example, suppose that in the population a feature has only a weak correlation with the target (on average), but the sampled individuals happen to show a strong correlation; if we train an unpruned decision tree on such data, it will be hard for it to make accurate predictions on new data.

The single-target-value L2 penalty term is the modulus of the weight vector w. When this penalty term is added to the cost function of a linear model, two forces act at once. On the one hand, to fit the training data better, learning tends to amplify the differences between features, that is, to increase the differences between the components of the weight vector w. On the other hand, to satisfy the penalty, the modulus of w must be confined to a certain range, which means that each component of w is confined to a certain range and the differences between components cannot become too pronounced. So we might describe the training process of a penalized linear model as "indecisive".
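This tug-of-war can be seen numerically with ridge regression, the linear model with an L2 penalty. The sketch below (with data and penalty strengths of my own choosing) uses the closed-form solution w = (XᵀX + αI)⁻¹Xᵀy and shows how the modulus of w shrinks as the penalty weight α grows:

```python
import numpy as np

# Illustrative synthetic data: 50 samples, 3 features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

def ridge(X, y, alpha):
    """Closed-form ridge solution w = (X^T X + alpha*I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

for alpha in (0.0, 1.0, 100.0):
    w = ridge(X, y, alpha)
    print(alpha, np.linalg.norm(w))
# The larger the penalty weight alpha, the smaller the modulus of w.
```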

The meaning of the multi-target-value L2 penalty, however, is not so easy to grasp: why is it the maximum singular value of the weight matrix W?

1.3 Linear Models with Multiple Target Values

To understand the meaning of the multi-target-value L2 penalty, we first need to know what a linear model with multiple target values is. Simply put, a multi-target linear model is a combination of several single-target linear models (which sounds like stating the obvious...), meaning that the weight vector w becomes a weight matrix W, and the target value vector y becomes a target value matrix Y. With m features, n samples, and l targets, the multi-target linear model is represented as follows:
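The model equation itself did not survive extraction. Writing m for the number of features, n for the sample size, and l for the number of targets, with each sample stored as a column of the feature matrix X, it can be reconstructed as:

```latex
Y = W X, \qquad
W \in \mathbb{R}^{l \times m}, \quad
X \in \mathbb{R}^{m \times n}, \quad
Y \in \mathbb{R}^{l \times n},
```

so that the k-th row of Y is produced by the k-th row of W: $y_k = w_k X$.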

As we can see, the k-th row vector of the weight matrix W, multiplied by the sample feature matrix X, generates the k-th row vector of the target value matrix Y.

2 Eigenvalue Decomposition

Let's first simplify the model by setting the number of targets l equal to the number of features m. The weight matrix W then becomes an m-order square matrix.

Perhaps for course credit, or for the postgraduate entrance examination, we once learned how to perform eigenvalue decomposition and worked through many related exercises. Yet a large proportion of us may never have understood why we decompose into eigenvalues, or what its geometric meaning is. Let's return to the essentials and obtain the following properties of eigenvalues and eigenvectors from the definition:

The eigenvectors are a special set of vectors: transformed by the original matrix W (here, the weight matrix), they do not change direction but only size, and the degree of scaling is given by the corresponding eigenvalue. In addition, when we can find a set of m linearly independent eigenvectors, any individual x_j can be expressed in terms of them:
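The two formulas referred to here were lost in extraction; with q_i the eigenvectors, λ_i the eigenvalues, and m the order of W, they can be reconstructed as:

```latex
W q_i = \lambda_i q_i \quad (i = 1, \dots, m),
\qquad
x_j = \sum_{i=1}^{m} a_i q_i .
```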

In the linear model, we multiply the weight matrix W on the right by the sample feature matrix X; for an individual x_j:
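The missing formula here follows directly from expanding x_j in the eigenvector basis (q_i the eigenvectors, λ_i the eigenvalues, a_i the coordinates of x_j):

```latex
W x_j = W \sum_{i=1}^{m} a_i q_i = \sum_{i=1}^{m} a_i \lambda_i q_i .
```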

It is not hard to see that after multiplication by the weight matrix W, the sample differs from the original only in that it has been scaled along the directions of the eigenvectors, and the degree of scaling is the corresponding eigenvalue. From a geometric point of view, multiplying the vector x_j by the matrix W is essentially a scaling along the eigenvector directions in m-dimensional space.

At this point, what determines the target value of an individual x_j? If the absolute value of one eigenvalue is much larger than the others, the target value of x_j will be approximately the correspondingly stretched eigenvector. The following 3-dimensional example illustrates this well:

There are 3 eigenvectors q1, q2, and q3, with corresponding eigenvalues 1, 5, and 1. If x_j is represented as (2, 2, 2) in this basis, then W·x_j equals (2, 10, 2), and the target value approximates the eigenvector q2 stretched 5-fold. From this example we know that when the weight matrix W is square, the eigenvalue with the largest absolute value determines the bias of the target value (toward the correspondingly stretched eigenvector). So when the largest eigenvalue is large, a sample to be predicted will, after multiplication by the weight matrix W, tend toward the correspondingly stretched eigenvector. This is exactly the phenomenon of overfitting: the bias reflects the model's best effort to fit the training data, but it does not match the real situation.
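The example is easy to check numerically. Taking the eigenvectors to be the standard basis for simplicity, W is just diag(1, 5, 1); with a general eigenbasis Q one would instead form W = Q · diag(λ) · Q⁻¹:

```python
import numpy as np

# Eigenvalues 1, 5, 1 with the standard basis vectors as eigenvectors.
W = np.diag([1.0, 5.0, 1.0])
x_j = np.array([2.0, 2.0, 2.0])  # coordinates (2, 2, 2) in the eigenbasis

y = W @ x_j
print(y)  # [ 2. 10.  2.] -- dominated by the q2 direction, stretched 5-fold
```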

Thus, when the weight matrix W is square, it is not unreasonable to choose the eigenvalue of maximum absolute value as the multi-target-value L2 penalty.

3 Singular Value Decomposition

When the weight matrix W is not square, eigenvalue decomposition cannot be performed, and we can only perform singular value decomposition. From the definition, we obtain the following properties:

Above, V consists of the eigenvectors from the eigenvalue decomposition of WᵀW (an m-order matrix), the σ_i are the corresponding singular values (the square roots of the eigenvalues of WᵀW), and U consists of l-dimensional column vectors. Unlike in eigenvalue decomposition, each eigenvector q becomes a pair of vectors v and u. We can understand this as follows: under multiplication by W, each m-dimensional vector v_i is carried to its corresponding vector u_i in l-dimensional space, undergoing no change of direction beyond scaling. As a result, we can again re-represent the sample and compute:
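The formulas lost here can be reconstructed from the definitions just given (v_i the right singular vectors, u_i the left singular vectors, σ_i the singular values, a_i the coordinates of x_j):

```latex
W = U \Sigma V^{\mathsf{T}},
\qquad
W v_i = \sigma_i u_i,
\qquad
x_j = \sum_{i=1}^{m} a_i v_i
\;\Longrightarrow\;
W x_j = \sum_{i=1}^{m} a_i \sigma_i u_i .
```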

It is the same kind of formula, with a familiar flavor: we can use the maximum singular value to represent the L2 penalty of an arbitrary weight matrix W.
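A quick numerical check (on a hypothetical non-square weight matrix of my own making): the spectral norm used as the L2 penalty equals the largest singular value.

```python
import numpy as np

# A non-square weight matrix: eigenvalue decomposition is impossible here.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))

sigma = np.linalg.svd(W, compute_uv=False)  # singular values, descending
spectral_norm = np.linalg.norm(W, 2)        # matrix 2-norm

print(np.isclose(sigma[0], spectral_norm))  # True
```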

4 Summary

The derivation of matrix problems often starts from square matrices and then extends to arbitrary matrices. Eigenvalue decomposition and singular value decomposition both describe how a matrix transforms a vector (or matrix): the eigenvalues (singular values) describe the strength of the transformation, and the eigenvectors describe its direction. The transformation in eigenvalue decomposition takes place within a single space, while the transformation in singular value decomposition takes place between two different spaces.

5 References

- Norm rule in machine learning (i): L0, L1 and L2 norm

- Two or three things you may not know about linear models (3): the magic of eigenvalues and singular values