On the L0, L1, and L2 Norms and Their Applications


Link to the original article: A Brief Talk on the L0, L1, and L2 Norms and Their Applications

A Brief Talk on the L0, L1, and L2 Norms and Their Applications

In branches of mathematics such as linear algebra and functional analysis, a norm is a function that assigns a length or size to each vector in a vector space (or to each matrix). The zero vector is assigned length zero. Intuitively, the larger the norm of a vector or matrix, the "larger" we can say the vector or matrix is. Some norms go by more familiar names: the absolute value is the norm on the one-dimensional vector space of real or complex numbers, and the Euclidean distance is likewise derived from a norm.

A general definition of the norm: for a real number p ≥ 1, the p-norm is defined as:

\|x\|_p := \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p} \qquad (1)

When p = 1, this is called the taxicab norm, also known as the Manhattan norm. The name comes from the distance a taxi driver has to cover to get from one point to another on Manhattan's grid-like streets. This is the L1 norm we will discuss: the sum of the absolute values of all elements in a vector. When p = 2, it is the most common Euclidean norm, also known as the Euclidean distance; this is the L2 norm we will discuss. When p = 0, the expression no longer satisfies all the axioms of a norm, so strictly speaking it is not a norm, but many people still call it the L0 norm. These three norms have many interesting properties, particularly in their applications to regularization and sparse coding in machine learning.
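As a quick illustration of definition (1), here is a minimal NumPy sketch (the example vector is made up) that computes the p-norm both directly from the formula and via np.linalg.norm:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0, 1.0])  # made-up example vector

def p_norm(x, p):
    """p-norm from definition (1): (sum_i |x_i|^p)^(1/p), for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

# L1 (Manhattan) norm and L2 (Euclidean) norm, by formula and via NumPy.
print(p_norm(x, 1), np.linalg.norm(x, ord=1))  # 8.0 8.0
print(p_norm(x, 2), np.linalg.norm(x, ord=2))  # ~5.099 ~5.099
```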

The following figure shows how the shape of the Lp unit ball changes as p decreases.


1-L0 Norm

Although the L0 "norm" is strictly speaking not a norm, we can still write down a definition in the same form:

\|x\|_0 := \sqrt[0]{\sum_{i=1}^{n} x_i^0} \qquad (2)

This formula is not rigorous: an exponent of 0 and a 0-th root are not strictly well defined. Therefore, in practical applications, most people use the following alternative definition instead:

\|x\|_0 = \#\left( i \mid x_i \neq 0 \right) \qquad (3)

That is, the L0 norm is the number of nonzero elements in a vector. Precisely this property makes it well suited to sparse coding and feature selection in machine learning: the sparsest set of features can be found by minimizing the L0 norm. Unfortunately, minimizing the L0 norm is NP-hard in practice. Therefore, in many cases the L0 optimization problem is relaxed to a higher-order norm problem, such as L1-norm or L2-norm minimization.
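Following definition (3), the L0 norm is just a count of nonzero entries; a minimal NumPy sketch (example vector made up):

```python
import numpy as np

x = np.array([0.0, 2.5, 0.0, -1.0, 0.0])  # made-up sparse vector

l0 = np.count_nonzero(x)   # number of nonzero entries, per definition (3)
l1 = np.sum(np.abs(x))     # L1 norm, the usual convex surrogate for L0
print(l0, l1)              # 2 3.5
```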

2-L1 Norm

For vector x, the L1 norm is defined as follows:

\|x\|_1 := \sum_{i=1}^{n} |x_i| \qquad (4)

Its applications are very broad. For example, the sum of absolute differences (SAD) and the mean absolute error (MAE) in computer vision are both defined using the L1 norm.

The solution of an L1 optimization problem is sparse: it tends to select a few large values and drive the many insignificant small ones to zero. L2 optimization, by contrast, produces few particularly large values but many relatively small ones, each of which still contributes noticeably to the optimal solution. In terms of the smoothness of the solutions, the L1 norm admits fewer optimal solutions than the L2 norm, but they are often the truly optimal ones, whereas the L2 problem has many solutions that tend more toward local optima.



However, since the L1 norm has no smooth functional form, L1 optimization problems were initially very hard to solve; with the growth of computing power, many convex optimization algorithms have since made L1 optimization practical.
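One such convex-optimization approach is proximal gradient descent (ISTA), which handles the non-smooth L1 term with a soft-thresholding step. Below is a minimal sketch for the Lasso problem min_w (1/2m)‖Xw − y‖² + λ‖w‖₁; the data is synthetic, and the step size and iteration count are illustrative assumptions, not tuned values:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def lasso_ista(X, y, lam, lr=0.1, n_iter=1000):
    """Proximal gradient (ISTA) for (1/2m)||Xw - y||^2 + lam * ||w||_1."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / m            # gradient of the smooth part
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

# Synthetic data: only 2 of 10 features are truly relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.zeros(10)
true_w[0], true_w[3] = 3.0, -2.0
y = X @ true_w + 0.1 * rng.normal(size=100)

w_hat = lasso_ista(X, y, lam=0.1)
print(np.round(w_hat, 2))  # irrelevant coefficients end up at or near zero
```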

3-L2 Norm

The most common and most famous norm is, of course, the L2 norm. Its applications cover almost every field of science and engineering. It is defined as follows:

\|x\|_2 := \sqrt{\sum_{i=1}^{n} x_i^2} \qquad (5)

It is also called the Euclidean norm; when it is used to measure the difference between two vectors, it gives the Euclidean distance.

A typical optimization problem over the Euclidean norm can be written as:

\min \|x\|_2 \quad \text{subject to } Ax = b \qquad (6)

Using Lagrange multipliers, this optimization problem can be solved in closed form.
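A minimal NumPy sketch of this minimum-norm problem: under the assumption that A has full row rank, the Lagrange-multiplier solution has the closed form x* = Aᵀ(AAᵀ)⁻¹b, which coincides with the Moore-Penrose pseudoinverse solution (the A and b below are made up for illustration):

```python
import numpy as np

# Underdetermined system: 2 equations, 4 unknowns (made-up example).
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 3.0]])
b = np.array([1.0, 2.0])

# Closed-form Lagrange-multiplier solution (assumes A has full row rank).
x_star = A.T @ np.linalg.solve(A @ A.T, b)

# The Moore-Penrose pseudoinverse returns the same minimum-L2-norm solution.
x_pinv = np.linalg.pinv(A) @ b

print(np.allclose(x_star, x_pinv), np.allclose(A @ x_star, b))  # True True
```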

Starting from the L2 norm, we can also define the infinity norm, that is, the L-infinity norm:

\|x\|_\infty := \sqrt[\infty]{\sum_{i=1}^{n} x_i^\infty} \qquad (7)

At first glance the formula above still looks a bit odd, but a simple transformation clears it up. Suppose x_j is the largest-magnitude element of the vector; then, as the exponent grows without bound, we have:

x_j^\infty \gg x_i^\infty \quad \forall\, i \neq j

It follows that

\sum_{i=1}^{n} x_i^\infty = x_j^\infty

Then, according to the definition of formula (7), we can get:

\|x\|_\infty = \sqrt[\infty]{\sum_{i=1}^{n} x_i^\infty} = \sqrt[\infty]{x_j^\infty} = |x_j|

So we can say that the L-infinity norm is simply the absolute value of the largest-magnitude element of the vector x:

\|x\|_\infty = \max_j |x_j| \qquad (8)
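A quick NumPy check of equation (8) (example vector made up):

```python
import numpy as np

x = np.array([1.0, -7.0, 3.5])  # made-up example vector
print(np.linalg.norm(x, ord=np.inf), np.max(np.abs(x)))  # 7.0 7.0
```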

4-Applications in Machine Learning

I don't know how many people first came to these norms through regularization and feature selection in machine learning; I certainly did. The L0 norm itself is the most direct and ideal criterion for feature selection, but as mentioned above it is non-differentiable and hard to optimize, so in practice we use the L1 norm as the best convex approximation of L0. The L2 norm is smoother than L1 and often gives better predictive performance. When faced with two features that are both helpful for prediction, L1 tends to pick just one of them, whereas L2 is more inclined to combine the two.
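A small sketch of this tendency, assuming scikit-learn is available (the data is synthetic and made up for illustration; exact coefficients depend on the data and the regularization strength): with two highly correlated features, Lasso (L1) typically concentrates the weight on one of them, while Ridge (L2) tends to spread it across both.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
f = rng.normal(size=200)
# Two nearly identical (highly correlated) features.
X = np.column_stack([f, f + 0.01 * rng.normal(size=200)])
y = 2.0 * f + 0.1 * rng.normal(size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)  # weight tends to load on one feature
print(Ridge(alpha=1.0).fit(X, y).coef_)  # weight tends to be shared between both
```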

4-1 Regularization

Regularization in machine learning prevents ill-posed problems or overfitting by adding extra information to the loss function. Generally this extra information penalizes the complexity of the model (Occam's razor). The general form is as follows:

\text{Loss}(x, y) = \text{Error}(x, y) + \alpha \|w\| \qquad (9)

The penalty term \|w\| can be the L1 or the L2 norm. Different models have different loss functions; for linear regression, choosing the L1 penalty gives Lasso regression, while choosing L2 gives Ridge regression. Below are the regularized loss functions of several common models (from Andrew Ng's machine learning course):

regularized Logistic Regression

J(\theta) = -\frac{1}{m}\left[ \sum_{i=1}^{m} y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

regularized Neural Network

J(\Theta) = -\frac{1}{m}\left[ \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + \left(1 - y_k^{(i)}\right) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2

Soft Margin SVM

\frac{1}{2}\|w\|^2 + C \sum_{i} \max\left(0,\, 1 - y_i (w^\top x_i + b)\right)
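As a concrete instance of form (9), here is a minimal NumPy sketch of the regularized logistic regression cost J(θ) shown above; by convention the bias term θ₀ is excluded from the penalty, and the data below is made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_cost(theta, X, y, lam):
    """J(theta) = -(1/m) sum[y log h + (1-y) log(1-h)] + (lam/2m) sum_{j>=1} theta_j^2."""
    m = len(y)
    h = sigmoid(X @ theta)
    data_term = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)  # bias theta_0 is not penalized
    return data_term + penalty

# Tiny made-up usage: 4 samples, an intercept column plus 2 features.
X = np.array([[1.0, 0.5, 1.2], [1.0, -1.0, 0.3], [1.0, 2.0, -0.7], [1.0, -0.5, 1.5]])
y = np.array([1.0, 0.0, 1.0, 0.0])
print(regularized_logistic_cost(np.zeros(3), X, y, lam=1.0))  # log(2) ~= 0.693 at theta = 0
```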

From these loss functions we can see that the most commonly used regularization term is the L2 norm. Besides preventing overfitting, it has another advantage: it can improve the conditioning of an ill-posed problem. In particular, when the number of training samples is small compared with the number of features, the design matrix is not of full column rank, the system tends to have infinitely many solutions, and X^T X is not invertible; its condition number is very large. On the one hand, the resulting optimum is very unstable: a small change in a feature variable can cause a large deviation in the final result. On the other hand, it is difficult to find the optimal solution by matrix inversion. For linear regression, the optimal analytic solution is:

\hat{w} = (X^\top X)^{-1} X^\top y

After adding the L2 regularization term, it becomes:

\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y

Now the matrix can be inverted directly, and the condition number is improved.
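A minimal NumPy sketch of this effect (the nearly collinear design matrix is made up for illustration): adding λI to XᵀX reduces the condition number by many orders of magnitude and makes direct inversion stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))
X[:, 4] = X[:, 3] + 1e-6 * rng.normal(size=n)   # two nearly collinear columns
y = X @ np.array([1.0, 0.5, -1.0, 2.0, 0.0]) + 0.01 * rng.normal(size=n)

lam = 1.0
A_ols = X.T @ X                       # nearly singular
A_ridge = X.T @ X + lam * np.eye(d)   # regularized

# The condition number drops by many orders of magnitude after adding lam * I.
print(np.linalg.cond(A_ols), np.linalg.cond(A_ridge))

w_ridge = np.linalg.solve(A_ridge, X.T @ y)   # (X^T X + lam I)^{-1} X^T y
print(np.round(w_ridge, 2))
```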

For problems with no analytic solution that are solved by iterative optimization, L2 regularization can also accelerate convergence by making the objective function λ-strongly convex.

4-2 Bayesian Prior

From the viewpoint of Bayesian learning theory, the regularization term is equivalent to a prior. That is, when training a model it is not enough to rely solely on the current training data; to achieve better predictive (generalization) performance, we should also incorporate a prior. The L1 penalty is equivalent to placing a Laplace prior on the weights and selecting the MAP (maximum a posteriori) hypothesis, while the L2 penalty corresponds to a Gaussian prior, as illustrated in the figure below:


As the figure suggests, the L1 (Laplace) prior is tolerant of both very large and very small values, while the L2 (Gaussian) prior tends to homogenize values, pulling them toward a moderate common scale.
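This correspondence can be made explicit by writing out the MAP estimate; a standard derivation sketch (assuming i.i.d. priors on the weights, with constants absorbed into the regularization strength):

\hat{w}_{\text{MAP}} = \arg\max_w \, p(w \mid D) = \arg\min_w \left[ -\log p(D \mid w) - \log p(w) \right]

\text{Gaussian prior: } p(w_j) \propto e^{-w_j^2 / 2\sigma^2} \;\Rightarrow\; -\log p(w) = \frac{1}{2\sigma^2} \|w\|_2^2 + \text{const} \quad \text{(L2, Ridge)}

\text{Laplace prior: } p(w_j) \propto e^{-|w_j| / b} \;\Rightarrow\; -\log p(w) = \frac{1}{b} \|w\|_1 + \text{const} \quad \text{(L1, Lasso)}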

4-3 Feature Selection and Sparse Coding

In the machine learning community, feature selection methods are usually divided into three kinds. The first is statistical: features are screened and a subset is selected as the model input, for example via hypothesis tests and p-values. The second uses an established learning algorithm for feature selection, such as using information gain in decision trees to select features. The third is automatic feature selection built into the model's own optimization: with the L1 norm as a regularization term, the learned feature weights tend to become spiky (sparse), which realizes effective feature selection.

Sparse coding, likewise, seeks to express an input vector x using as few features (basis vectors) as possible:

\min_{a_i^{(j)},\, \phi_i} \sum_{j=1}^{m} \left\| x^{(j)} - \sum_{i=1}^{k} a_i^{(j)} \phi_i \right\|^2 + \lambda \sum_{i=1}^{k} S\!\left(a_i^{(j)}\right)

Here \phi_i are the basis vectors to be found, and a_i^{(j)} are the weights on each basis vector that we want to optimize. The rightmost term is the regularization penalty, also called the sparsity cost; in practice the L1 norm is usually used.
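A minimal NumPy sketch of evaluating this objective with an L1 sparsity cost S(a) = |a| (the dictionary Φ, codes a, and data x are random placeholders; actually learning them, e.g. by alternating minimization, is not shown):

```python
import numpy as np

def sparse_coding_objective(X, Phi, A, lam):
    """Sum over examples of ||x_j - Phi @ a_j||^2 + lam * ||a_j||_1.

    X   : (d, m) data matrix, one example x^(j) per column
    Phi : (d, k) dictionary of basis vectors phi_i
    A   : (k, m) codes, one weight vector a^(j) per column
    """
    recon_err = np.sum((X - Phi @ A) ** 2)   # reconstruction term
    sparsity = lam * np.sum(np.abs(A))       # L1 sparsity cost S(a) = |a|
    return recon_err + sparsity

# Tiny random placeholder instance.
rng = np.random.default_rng(0)
X, Phi, A = rng.normal(size=(8, 5)), rng.normal(size=(8, 3)), rng.normal(size=(3, 5))
print(sparse_coding_objective(X, Phi, A, lam=0.1))
```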


By willheng (Jianshu author)
Original link: http://www.jianshu.com/p/bf860ad177dd
