For reprinting, please cite the source: http://www.cnblogs.com/ymingjingr/p/4271742.html
14 Regularization
14.1 Regularized Hypothesis Set
In the previous chapter, five measures to prevent overfitting were mentioned; this chapter introduces one of them: regularization.
The main idea of regularization: step back from a high-order polynomial hypothesis to a lower-order one, much as stepping on the brakes reduces a car's speed. In Figure 14-1, the right image shows a high-order polynomial fit, which clearly overfits, while the left image shows the lower-order fit obtained with regularization.
Figure 14-1 Regularized fit versus overfitting
A high-order polynomial hypothesis set contains the lower-order ones, so the relationship between the high-order and low-order hypothesis sets is as shown in Figure 14-2. The subject of this chapter is how to step back from the high-order hypothesis set to a lower-order one, that is, how to return from the outer circle to the inner circle.
Figure 14-2 Relationship between the high-order and low-order hypothesis sets
The term "regularization" comes from function approximation for ill-posed problems, that is, problems where the approximation admits multiple solutions and one must decide how to choose among them.
How is the order reduced? Using the polynomial transform and linear regression introduced in earlier chapters, the order-reduction problem can be converted into an optimization problem with constraints. The following uses a 10th-order polynomial and a 2nd-order polynomial as examples to understand regularization; their hypothesis expressions in terms of the weight vector w are given in Equation 14-1 and Equation 14-2, respectively.
(Equation 14-1)  h_10(x) = w_0 + w_1 x + w_2 x^2 + ... + w_10 x^10
(Equation 14-2)  h_2(x) = w_0 + w_1 x + w_2 x^2
Equation 14-2 can be expressed using Equation 14-1 together with the constraint w_3 = w_4 = ... = w_10 = 0.
Therefore, the hypothesis set of the 10th-order polynomial and the corresponding minimization problem are given in Equation 14-3 and Equation 14-4, respectively.
(Equation 14-3)  H_10 = { w ∈ R^(10+1) }
(Equation 14-4)  min_{w ∈ H_10} E_in(w)
Following the same reasoning, the 2nd-order hypothesis set and its minimization problem are given in Equation 14-5 and Equation 14-6, respectively.
(Equation 14-5)  H_2 = { w ∈ R^(10+1) | w_3 = w_4 = ... = w_10 = 0 }
(Equation 14-6)  min_{w ∈ H_2} E_in(w)
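The two hypothesis sets can be tried out numerically. Below is a minimal sketch (not part of the original notes; the sinusoidal target, noise level, and variable names are assumptions) that fits H_10 by unconstrained least squares and fits H_2 by forcing w_3 = ... = w_10 = 0 on the same 11-dimensional weight vector.

```python
# Minimal sketch: H_10 versus H_2 on the same 11-dimensional weight vector.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-1, 1, 20))
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(20)   # noisy illustrative target

Z = np.vander(x, N=11, increasing=True)                  # columns: 1, x, x^2, ..., x^10

# H_10: ordinary least squares over all 11 weights.
w10, *_ = np.linalg.lstsq(Z, y, rcond=None)

# H_2: same weight vector, but with w_3 = ... = w_10 constrained to 0,
# which reduces to least squares on the first three columns only.
w2 = np.zeros(11)
w2[:3], *_ = np.linalg.lstsq(Z[:, :3], y, rcond=None)

print("E_in under H_10:", np.mean((Z @ w10 - y) ** 2))
print("E_in under H_2 :", np.mean((Z @ w2 - y) ** 2))
```

As expected, H_10 achieves a smaller in-sample error, which is exactly the overfitting danger discussed above.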
If the constraint is designed more loosely, limiting only the number of non-zero weights rather than which weights are zero, the hypothesis set H'_2 in Equation 14-7 is obtained.
(Equation 14-7)  H'_2 = { w ∈ R^(10+1) | Σ_{q=0}^{10} [w_q ≠ 0] ≤ 3 }
The corresponding optimization problem is shown in Equation 14-8.
(Equation 14-8)  min_{w ∈ H'_2} E_in(w)
The relationship between this hypothesis set, H_2 and H_10 is shown in Equation 14-9.
(Equation 14-9)  H_2 ⊂ H'_2 ⊂ H_10
H'_2 is also known as a sparse hypothesis set, because many of its parameters are 0. Note, however, that the Boolean constraint [w_q ≠ 0] in Equation 14-8 makes the optimization problem NP-hard. The hypothesis set therefore needs to be improved further, which gives the hypothesis set H(C) shown in Equation 14-10.
(Equation 14-10)  H(C) = { w ∈ R^(10+1) | Σ_{q=0}^{10} w_q^2 ≤ C },  i.e.  ||w||^2 ≤ C
The optimization problem over the hypothesis set H(C) is shown in Equation 14-11.
(Equation 14-11)  min_{w ∈ R^(10+1)} E_in(w)   s.t.   w^T w ≤ C
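To get a feel for the constrained problem in Equation 14-11, here is a small sketch of projected gradient descent: take an ordinary gradient step on E_in, then project w back onto the ball w^T w ≤ C. This is not the method used in the lecture (which solves the problem analytically in the next section), only an illustration of what the constraint set H(C) does; Z and y are assumed to be the transformed data and labels from the previous snippet.

```python
import numpy as np

def fit_constrained(Z, y, C, eta=0.01, steps=5000):
    """Projected gradient descent for  min E_in(w)  s.t.  ||w||^2 <= C  (sketch)."""
    N, d = Z.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2.0 / N) * Z.T @ (Z @ w - y)   # gradient of squared-error E_in
        w = w - eta * grad                     # unconstrained descent step
        norm_sq = w @ w
        if norm_sq > C:                        # project back onto the ball H(C)
            w = w * np.sqrt(C / norm_sq)
    return w

# A smaller C forces a shorter weight vector and hence a smoother fit.
```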
H(C) overlaps with H'_2 but is not exactly the same. As C increases, the hypothesis set H(C) grows, which gives the relationship shown in Equation 14-12.
(Equation 14-12)  H(C_1) ⊂ H(C_2) ⊂ ... ⊂ H(∞) = H_10,  for 0 ≤ C_1 ≤ C_2 ≤ ... ≤ ∞
The hypothesis set H(C) is called the regularized hypothesis set, that is, the hypothesis set obtained by imposing the constraint. The best hypothesis found in the regularized hypothesis set is denoted by the symbol w_REG.
14.2 Weight Decay Regularization
For brevity, the optimization problem of Equation 14-11 from the previous section is written in vector-matrix form, as shown in Equation 14-13.
(Equation 14-13)  min_w E_in(w) = (1/N) (Zw − y)^T (Zw − y)   s.t.   w^T w ≤ C
Ordinarily, the Lagrangian is used to handle optimization problems with constraints; here teacher Lin instead explains the meaning behind the Lagrange multiplier in greater depth.
First, draw the picture of the constrained optimization, Figure 14-4. The blue part of the figure represents E_in, whose contours are ellipses, and the red part represents the constraint w^T w ≤ C, a circle (a hypersphere in high-dimensional space).
Figure 14-4 Optimization with a constraint
From earlier chapters we know that, without constraints, the minimum is found by moving opposite to the gradient, so −∇E_in(w) is the descent direction. The difference from the plain regression problem is the extra constraint: the descent direction must not take w outside the constrained region. The red vector in Figure 14-4 is the normal vector of the ball at the current w; moving in that direction would leave the region, so w can only roll along the surface of the ball, in the direction of the green vector, the component of −∇E_in tangent to the ball. When is E_in minimized? When the actual rolling direction (the blue vector −∇E_in in the figure) has no component along the tangent direction; in other words, when −∇E_in(w_REG) is parallel to the normal vector w_REG, as shown in Equation 14-14. Such a w_REG is the optimal solution of the regularized problem.
(Equation 14-14)  −∇E_in(w_REG) ∝ w_REG
Introducing a Lagrange multiplier λ, this parallel condition can be written as an equation, as in Equation 14-15.
(Equation 14-15)  ∇E_in(w_REG) + (2λ/N) w_REG = 0
Substituting the gradient expression obtained for linear regression (the derivation in section 9.2) into Equation 14-15 gives Equation 14-16.
(Equation 14-16)  (2/N) (Z^T Z w_REG − Z^T y) + (2λ/N) w_REG = 0
Solving for w_REG gives Equation 14-17.
(Equation 14-17)  w_REG = (Z^T Z + λI)^(−1) Z^T y
Z^T Z is positive semi-definite, so as long as λ > 0 the matrix Z^T Z + λI is guaranteed to be positive definite and therefore invertible. This form of regression is called ridge regression.
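Equation 14-17 can be computed directly. Below is a minimal sketch (the function and variable names are assumptions; Z is the transformed data matrix and y the label vector):

```python
import numpy as np

def ridge(Z, y, lam):
    """Ridge regression: w_REG = (Z^T Z + lam * I)^(-1) Z^T y  (sketch)."""
    d = Z.shape[1]
    # Z^T Z is positive semi-definite; adding lam * I with lam > 0 makes the
    # matrix positive definite, hence invertible.
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
```

With lam = 0 (and Z^T Z invertible), this reduces to the plain linear-regression solution below.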
Recall the direct (closed-form) solution of linear regression, shown in Equation 14-18.
(Equation 14-18)  w_LIN = (Z^T Z)^(−1) Z^T y
Equation 14-15 can also be viewed as a gradient set to zero; integrating it with respect to w gives Equation 14-19.
(Equation 14-19)  E_aug(w) = E_in(w) + (λ/N) w^T w
Solving Equation 14-15 is therefore equivalent to minimizing Equation 14-19. The expression E_aug(w) is called the augmented error, and the term w^T w in it is the regularizer. The constrained problem of the previous section has thus been replaced by an unconstrained one. This is in fact just the Lagrangian at work, but teacher Lin derives it in reverse. It is called the augmented error because it adds a regularizer on top of the ordinary in-sample error. For a given λ > 0, or for λ = 0 (in which case the solution is just that of linear regression), the solution minimizing E_aug over w is written as in Equation 14-20.
(Equation 14-20)  w_REG ← argmin_w E_aug(w),  for a given λ ≥ 0
Therefore, instead of specifying the parameter C of the constrained minimization in the previous section, one only needs to specify the parameter λ of the augmented error.
Observe the effect of the parameter λ on the final fit, as shown in Figure 14-5.
Figure 14-5 The effect of the parameter λ on the final fit
At λ = 0 the model overfits; as λ keeps increasing, the model moves toward underfitting. A larger λ corresponds to a shorter weight vector w, and also to a smaller constraint radius C. (Recall how overfitting was handled in section 14.1: make C as small as possible, or more precisely, look for a short weight vector w.) Because this regularization makes the weights w smaller, it is called weight-decay regularization. It can be combined with any transform function and any linear model.
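A quick way to see the weight-decay behaviour is to sweep λ with the hypothetical ridge() sketch above and watch the norm of w shrink (Z and y as in the earlier snippets; the particular λ values are arbitrary):

```python
# As lambda grows, ||w_REG|| shrinks (weight decay) and E_in rises:
# the fit moves from overfitting toward underfitting.
for lam in (0.0, 1e-4, 1e-2, 1.0, 100.0):
    w = ridge(Z, y, lam)
    print(f"lambda = {lam:>8}: ||w|| = {np.linalg.norm(w):8.3f}, "
          f"E_in = {np.mean((Z @ w - y) ** 2):.4f}")
```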
Note: when a polynomial feature transform is used, the high-order terms x^q take very small values for x in [−1, +1], so for the weight components corresponding to the high-order terms to have an influence comparable to the low-order terms, those weights would have to be very large; yet the regularized solution looks for a particularly small weight vector w, so the high-order dimensions get punished unfairly. It is therefore better for the transformed polynomial terms to be decoupled, i.e. the transform should use orthonormal basis functions; for polynomials these are the Legendre polynomials, the first 5 of which are shown in Figure 14-6.
Figure 14-6 The first 5 Legendre polynomials
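For reference, numpy's polynomial module can produce this basis; the short sketch below prints the first 5 Legendre polynomials in ordinary power-series form (constant term first), roughly matching the expressions of Figure 14-6.

```python
import numpy as np
from numpy.polynomial import legendre

for q in range(5):
    c = np.zeros(q + 1)
    c[q] = 1.0                               # select the q-th Legendre basis L_q
    # leg2poly converts Legendre coefficients to power-series coefficients.
    print(f"L_{q}(x): power coefficients =", legendre.leg2poly(c))
```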
14.3 Regularization and VC Theory
This section describes the relationship between regularization and VC theory, that is, it explains from the viewpoint of VC theory why regularization works well (section 14.1 introduced the reason from the viewpoint of overfitting).
Minimizing the augmented error E_aug(w) is connected to minimizing E_in(w) under the constraint w^T w ≤ C, because the parameter λ plays a role similar to the parameter C. Using the results of section 7.4, the VC bound for the constrained problem can be written in the form of Equation 14-21.
(Equation 14-21)  E_out(w) ≤ E_in(w) + Ω(H(C))
Therefore minimizing E_aug(w) indirectly, through the VC bound, helps keep E_out(w) small.
For ease of comparison, the expression for the augmented error is repeated in Equation 14-22.
(Equation 14-22)  E_aug(w) = E_in(w) + (λ/N) w^T w
The more general form of the VC bound can be written as Equation 14-23.
(Equation 14-23)  E_out(w) ≤ E_in(w) + Ω(H)
Comparing Equation 14-22 with Equation 14-23 makes it easier to understand why minimizing E_aug(w) can achieve better results than merely minimizing E_in(w). The regularizer (λ/N) w^T w in Equation 14-22 measures the complexity Ω(w) of a single hypothesis, whereas Ω(H) in Equation 14-23 measures the complexity of the whole hypothesis set; if (λ/N) Ω(w) represents Ω(H) well, then E_aug is a better proxy for E_out than E_in is.
The above uses the VC bound in a heuristic way to illustrate the advantage of regularization; next, the advantage is explained through the VC dimension.
The minimization being solved is written in Equation 14-24.
(Equation 14-24)  min_{w ∈ R^(d+1)} E_aug(w) = E_in(w) + (λ/N) w^T w
According to the VC theory of chapter 7, all hypotheses in H are nominally considered when solving this minimization. But because of the parameter C, or more directly λ, only the hypotheses inside H(C) are effectively considered. Therefore the effective VC dimension d_EFF(H, A) depends on two parts: the hypothesis set H and the algorithm A (which includes λ). A small effective VC dimension means the model complexity is low.
14.4 General Regularizers
The regularizer described in the previous sections of this chapter is the weight-decay (L2) regularizer, Ω(w) = w^T w, where λ is a scalar and w a vector. So how should more general regularizers be designed, or what are the design principles of a general regularizer? There are mainly three, as follows:
Target-dependent: design the regularizer according to properties of the target function. For example, if the target function is known to be an even (symmetric) function, all the odd-order components of the weight vector should be suppressed, so the regularizer can be designed in the form Ω(w) = Σ [q is odd] w_q^2, penalizing only the odd-order weights;
Plausible: the regularizer should prefer hypotheses that are as smooth or as simple as possible, because neither stochastic noise nor deterministic noise is smooth. Smooth means differentiable, such as L2; simple means easy to solve, such as the L1 or sparsity regularizer Ω(w) = Σ |w_q|, described later;
Friendly: easy to optimize, such as L2.
Even if the regularizer is not well designed, there is no need to worry, because there is still the parameter λ: when λ = 0, the regularizer has no effect.
Recall the design principles of the error measure in section 8.3; they are similar to these: user-dependent, plausible, and friendly.
Therefore, the final augmented error consists of the error function and the regularizer, as shown in Equation 14-25.
(Equation 14-25)  E_aug(w) = E_in(w) + (λ/N) Ω(w)
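As a sketch of Equation 14-25, the augmented error can be written with a pluggable regularizer Ω; the function and regularizer names below are illustrative assumptions, with squared error standing in for E_in.

```python
import numpy as np

def augmented_error(w, Z, y, lam, omega):
    """E_aug(w) = E_in(w) + (lam / N) * Omega(w), with squared-error E_in (sketch)."""
    N = len(y)
    e_in = np.mean((Z @ w - y) ** 2)
    return e_in + (lam / N) * omega(w)

omega_l2 = lambda w: np.sum(w ** 2)          # weight-decay (L2) regularizer
omega_l1 = lambda w: np.sum(np.abs(w))       # sparsity (L1) regularizer
```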
The design principles above are made concrete by comparing the two commonly used regularizers, L2 and L1.
The L2 regularizer is shown in Figure 14-7 and given in Equation 14-26.
Figure 14-7 The L2 regularizer
(Equation 14-26)  Ω(w) = Σ_{q=0}^{d} w_q^2 = ||w||_2^2
The L2 regularizer is convex and differentiable everywhere, so it is comparatively easy to optimize.
Next a different regularizer, L1, is introduced; it is shown in Figure 14-8 and given in Equation 14-27.
Figure 14-8 The L1 regularizer
(Equation 14-27)  Ω(w) = Σ_{q=0}^{d} |w_q| = ||w||_1
The L1 ball is also convex, but it is not differentiable everywhere, for example at its corners. Why does L1 lead to sparse solutions? Suppose the current w has all non-zero components; then the gradient of the L1 term is a sign vector whose entries are all ±1. As long as −∇E_in(w) is not parallel to this sign vector, w keeps sliding along the face of the diamond toward a vertex, so the optimal solution tends to sit at a vertex, where some components are exactly 0. The solution is therefore sparse, which also makes prediction fast.
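The sparsity of the L1 solution is easy to observe with scikit-learn's Ridge and Lasso (an illustration, not part of the original notes; alpha plays the role of λ, and Z, y are assumed to be the transformed data from the earlier snippets):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

w_l2 = Ridge(alpha=0.1, fit_intercept=False).fit(Z, y).coef_
w_l1 = Lasso(alpha=0.1, fit_intercept=False).fit(Z, y).coef_

print("zero weights with L2:", int(np.sum(np.isclose(w_l2, 0.0))))  # usually none
print("zero weights with L1:", int(np.sum(w_l1 == 0.0)))            # usually several
```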
Before concluding this chapter, observe how the parameter λ should be chosen under different noise conditions. The target function is designed as a 15th-order polynomial. Figure 14-9 fixes the deterministic noise and varies the stochastic noise, showing the best choice of λ in each case: the horizontal axis is the choice of λ, the vertical axis is E_out, and the bold points mark the optimal λ for each noise level. (Because the goal here is to observe how λ should be chosen under different amounts of noise, the target function is assumed to be known, which is impossible in reality; the same applies to the next example and will not be repeated.)
Figure 14-9 Choice of λ under different levels of stochastic noise
The target function is again designed as a 15th-order polynomial. Figure 14-10 fixes the stochastic noise and varies the deterministic noise, showing the best choice of λ in each case: the horizontal axis is the choice of λ, the vertical axis is E_out, and the bold points mark the optimal λ for each noise level.
Figure 14-10 Choice of λ under different levels of deterministic noise
It is not hard to conclude from the two figures above that the more noise there is, the more regularization is needed, just as a bumpier road calls for more braking. But a more important problem remains unsolved: how to choose the parameter λ when the noise is unknown. That is the topic of the next chapter.