Readers familiar with machine learning know that optimization methods are among the most important topics. The most common scenario is to use derivatives of the objective function to solve an unconstrained optimization problem through repeated iterations. Because these methods are simple to implement and convenient to code, they are essential tools for training models.
2. Several mathematical concepts
1) Gradient (first-order derivative)
Consider a mountain whose height at the point (x1, x2) is f(x1, x2). The gradient at a point points in the direction in which the slope is steepest, and the magnitude of the gradient tells us how steep the slope is. Note that the gradient also tells us the rate of change in directions other than the steepest one (in the two-dimensional case, a circle tilted along the gradient direction projects to an ellipse on the plane). For a scalar function of n variables, that is, a function that takes an n-dimensional vector as input and outputs a single value, the gradient can be defined as:
$$\nabla f(x) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n} \right)^T$$
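As a minimal sketch of this definition, the gradient can be approximated with central finite differences and compared with the analytic result. The function `height` and the helper `numerical_gradient` are hypothetical names chosen only for illustration.

```python
import numpy as np

def height(x):
    # Hypothetical "mountain": f(x1, x2) = -(x1^2 + 2*x2^2)
    return -(x[0] ** 2 + 2.0 * x[1] ** 2)

def numerical_gradient(f, x, h=1e-6):
    # Central-difference approximation of (df/dx1, ..., df/dxn)
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

point = np.array([1.0, -0.5])
g = numerical_gradient(height, point)
# The analytic gradient is (-2*x1, -4*x2), which is (-2, 2) at this point
print(g)

# The rate of change along any unit direction u is the dot product of g and u
u = np.array([1.0, 0.0])
print(g @ u)
```

The dot product at the end is what the remark about "other directions" refers to: the gradient gives the slope along every direction, not only the steepest one.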
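2) Hessian matrix (second-order derivative)
The Hessian (Hesse) matrix is often used when large-scale optimization problems are solved with Newton's method (described later). Its general form is as follows:
$$H(f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}$$
When f(x) is a quadratic function, the gradient and the Hessian matrix are easy to obtain. A quadratic function can be written in the following form:
$$f(x) = \frac{1}{2} x^T A x + b^T x + c$$
where x is a column vector, A is a symmetric matrix of order n, b is an n-dimensional vector, and c is a constant. The gradient of f(x) is Ax + b, and the Hessian matrix equals A.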
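The claim that a quadratic function has gradient Ax + b and Hessian A can be checked numerically. The sketch below assumes the form f(x) = 1/2 x^T A x + b^T x + c; the values of `A`, `b`, `c` and the helper names `fd_gradient` and `fd_hessian` are made up for illustration.

```python
import numpy as np

# Hypothetical example data: A is symmetric, b is a vector, c is a constant
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 0.5])
c = 4.0

def f(x):
    # f(x) = 1/2 x^T A x + b^T x + c
    return 0.5 * x @ A @ x + b @ x + c

def fd_gradient(func, x, h=1e-4):
    # Central-difference gradient
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (func(x + e) - func(x - e)) / (2 * h)
    return grad

def fd_hessian(func, x, h=1e-4):
    # Column j is the central difference of the gradient along x_j
    H = np.zeros((x.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        H[:, j] = (fd_gradient(func, x + e) - fd_gradient(func, x - e)) / (2 * h)
    return H

x = np.array([0.7, -1.3])
print(np.allclose(fd_gradient(f, x), A @ x + b, atol=1e-6))  # gradient vs. Ax + b
print(np.allclose(fd_hessian(f, x), A, atol=1e-4))           # Hessian vs. A
```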
3) Jacobian matrix
The Jacobian matrix is the gradient matrix of a vector-valued function. Suppose F: R^n → R^m is a function that maps n-dimensional Euclidean space to m-dimensional Euclidean space. This function consists of m real-valued functions: y1(x1, ..., xn), ..., ym(x1, ..., xn). The partial derivatives of these functions (if they exist) form a matrix with m rows and n columns (m by n), which is called the Jacobian matrix:
$$J_F(x_1, \dots, x_n) = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix}$$
a) If f(x) is a scalar function, then the Jacobian matrix is a vector equal to the gradient of f(x), and the Hessian matrix is a two-dimensional matrix. If f(x) is a vector-valued function, then the Jacobian matrix is a two-dimensional matrix and the Hessian is a three-dimensional array.
b) The gradient is a special case of the Jacobian matrix, and the Jacobian matrix of the gradient is the Hessian matrix (this is the relationship between first-order partial derivatives and second-order derivatives).
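Both remarks can be illustrated with a small finite-difference sketch. The helper `numerical_jacobian`, the map `F`, and the gradient `grad_f` are hypothetical examples, not part of the original post.

```python
import numpy as np

def numerical_jacobian(F, x, h=1e-6):
    # Central-difference Jacobian: row i holds the partials of output i
    x = np.asarray(x, dtype=float)
    m = np.asarray(F(x)).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (np.asarray(F(x + e)) - np.asarray(F(x - e))) / (2 * h)
    return J

def F(x):
    # Hypothetical map from R^2 to R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def grad_f(x):
    # Gradient of the scalar function f(x) = 0.5*x1^2 + 0.2*x2^2 + 0.6*x3^2
    return np.array([x[0], 0.4 * x[1], 1.2 * x[2]])

J = numerical_jacobian(F, [1.0, 2.0])
print(J.shape)  # m rows, n columns

# The Jacobian of a gradient is the Hessian of the underlying scalar function
H = numerical_jacobian(grad_f, [-2.0, 2.0, -2.0])
print(np.round(H, 6))
```

For this particular `grad_f` the result is approximately the diagonal matrix with entries 1, 0.4 and 1.2.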
3. Optimization method
1) Gradient descent
Gradient descent, also known as steepest descent, finds a local optimum of a function using first-order gradient information. It is also the simplest and most commonly used optimization method in machine learning. Gradient descent is a line search method, and its main iteration formula is as follows:
$$x_{k+1} = x_k + \alpha_k p_k$$
where p_k is the direction of movement chosen at iteration k (in steepest descent, the negative gradient direction) and α_k is the distance moved at iteration k, chosen by a line search. The step can be the same at every iteration or can differ between iterations; it is sometimes also called the learning rate. Mathematically, the step length can be found by a line search for the minimum along that direction, that is, the point where the derivative along the direction is zero. In practice this computation is too expensive, so we generally set the step to a constant. Consider a function of three variables, $f(x_1, x_2, x_3) = 0.5 x_1^2 + 0.2 x_2^2 + 0.6 x_3^2$, whose gradient is $(x_1, 0.4 x_2, 1.2 x_3)$. With learning rate = 1, the code is as follows:

    # Code from Machine Learning: An Algorithmic Perspective
    # by Stephen Marsland (http://seat.massey.ac.nz/personal/s.r.marsland/mlbook.html)
    # Gradient descent using steepest descent
    import numpy as np

    def Jacobian(x):
        # Gradient of the objective: (x1, 0.4*x2, 1.2*x3)
        return np.array([x[0], 0.4 * x[1], 1.2 * x[2]])

    def steepest(x0):
        i = 0
        iMax = 10  # the value is missing in the source; 10 is assumed here
        x = x0
        Delta = 1
        alpha = 1  # learning rate
        while i < iMax and Delta > 10 ** (-5):
            p = -Jacobian(x)  # move along the negative gradient
            xOld = x
            x = x + alpha * p
            Delta = np.sum((x - xOld) ** 2)  # squared length of the last step
            print('epoch', i, ':')
            print(x, '\n')
            i += 1

    x0 = np.array([-2, 2, -2])
    steepest(x0)

The steepest descent method finds a local optimum. If the objective function is convex, the local optimum is also the global optimum. The ideal optimization behavior is shown in the following figure. Note that the direction of movement at each iteration is perpendicular to the contour line through the starting point of that step:
It should also be noted that in some cases the steepest descent method exhibits a sawtooth (zig-zagging) phenomenon, which slows convergence:
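The sawtooth behavior can be reproduced numerically. The sketch below is an illustration rather than the post's original figure: it assumes a quadratic f(x) = 1/2 x^T A x with a diagonal matrix A and uses an exact line search, for which the step is alpha = g^T g / (g^T A g). The helper name `steepest_descent_exact` and the starting points are hypothetical.

```python
import numpy as np

def steepest_descent_exact(A, x0, tol=1e-8, max_iter=10000):
    # Minimise f(x) = 0.5 * x^T A x with an exact line search along -gradient
    x = np.asarray(x0, dtype=float)
    path = [x.copy()]
    for _ in range(max_iter):
        g = A @ x                       # gradient of the quadratic
        if np.linalg.norm(g) < tol:
            break
        alpha = (g @ g) / (g @ A @ g)   # exact minimiser along -g
        x = x - alpha * g
        path.append(x.copy())
    return np.array(path)

well = steepest_descent_exact(np.diag([1.0, 1.0]), [10.0, 1.0])
ill = steepest_descent_exact(np.diag([1.0, 50.0]), [50.0, 1.0])
print(len(well) - 1, "iterations with eigenvalues (1, 1)")
print(len(ill) - 1, "iterations with eigenvalues (1, 50)")
print(ill[:4, 1])  # the second coordinate changes sign at every step
```

With equal eigenvalues the contours are circles and one step suffices. With eigenvalues 1 and 50 the iterates bounce across the narrow valley and need far more steps.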
Roughly speaking, for a quadratic function the shape of the ellipsoidal contours is determined by the condition number of the Hessian matrix. The long and short axes point along the eigenvectors belonging to the minimum and maximum eigenvalues, and their lengths are inversely proportional to the square roots of those eigenvalues. The greater the difference between the maximum and minimum eigenvalues, the flatter the ellipsoid and the lower the computational efficiency.
2) Newton's method
The steepest descent method mainly uses local properties of the objective function and therefore has a certain "blindness". Newton's method uses local first-order and second-order derivative information to infer the shape of the whole objective function. From this we can obtain the global minimum of the approximating function and then set the current iterate to that minimizer. Compared with steepest descent, Newton's method has a certain global foresight and better convergence properties. The main derivation of Newton's method is as follows:
In the first step, we use a Taylor series to obtain the second-order approximation of the original objective function:
$$f(x) \approx f(x^k) + \nabla f(x^k)^T (x - x^k) + \frac{1}{2} (x - x^k)^T H(x^k) (x - x^k)$$
In the second step, x is treated as the variable and all terms containing x^k are treated as constants. Setting the first derivative to 0 gives the minimizer of the approximating function:
$$\nabla f(x^k) + H(x^k)(x - x^k) = 0 \quad\Rightarrow\quad x = x^k - H(x^k)^{-1} \nabla f(x^k)$$
In the third step, the current iterate is set to the minimizer of the approximating function (or the displacement is multiplied by a step length α):
$$x^{k+1} = x^k - \alpha H(x^k)^{-1} \nabla f(x^k)$$
For the same optimization problem as in 1), the code for Newton's method is as follows:

    # Code from Machine Learning: An Algorithmic Perspective
    # by Stephen Marsland (http://seat.massey.ac.nz/personal/s.r.marsland/mlbook.html)
    # Gradient descent using Newton's method
    import numpy as np

    def Jacobian(x):
        # Gradient of the objective: (x1, 0.4*x2, 1.2*x3)
        return np.array([x[0], 0.4 * x[1], 1.2 * x[2]])

    def Hessian(x):
        # Constant Hessian of the quadratic objective
        return np.array([[1, 0, 0], [0, 0.4, 0], [0, 0, 1.2]])

    def Newton(x0):
        i = 0
        iMax = 10  # the value is missing in the source; 10 is assumed here
        x = x0
        Delta = 1
        alpha = 1
        while i < iMax and Delta > 10 ** (-5):
            # Newton direction: -H^{-1} * gradient
            p = -np.dot(np.linalg.inv(Hessian(x)), Jacobian(x))
            xOld = x
            x = x + alpha * p
            Delta = np.sum((x - xOld) ** 2)
            i += 1
        print(x)

    x0 = np.array([-2, 2, -2])
    Newton(x0)

In the above example, the objective function is a convex quadratic function, so the second-order Taylor expansion equals the original function and the optimal solution is obtained in a single step.
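The one-step behavior is easy to confirm. The sketch below assumes the same gradient (x1, 0.4*x2, 1.2*x3) and constant Hessian as the example; the function names are hypothetical. It solves the linear system H p = -g with `np.linalg.solve` rather than forming the inverse, which is a design choice of this sketch and not something the original code does.

```python
import numpy as np

def gradient(x):
    # Gradient of f(x) = 0.5*x1^2 + 0.2*x2^2 + 0.6*x3^2
    return np.array([x[0], 0.4 * x[1], 1.2 * x[2]])

def hessian(x):
    # Constant Hessian of the same quadratic
    return np.diag([1.0, 0.4, 1.2])

def newton_step(x):
    # Solve H p = -g for the Newton direction, then take a full step
    p = np.linalg.solve(hessian(x), -gradient(x))
    return x + p

x0 = np.array([-2.0, 2.0, -2.0])
x1 = newton_step(x0)
print(x1)            # the minimiser of this quadratic is the origin
print(gradient(x1))  # the gradient vanishes after a single step
```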
The main problems of Newton's method are as follows. When the Hessian matrix is not invertible, its inverse cannot be computed. Inverting the matrix has a complexity of order n cubed, so the computational cost becomes very large when the problem is large; the usual remedy is to approximate with quasi-Newton methods such as BFGS, L-BFGS, DFP, or Broyden's algorithm. Finally, if the initial value is too far from the local minimum, the Taylor expansion is not a good approximation of the original function.
3) Levenberg–Marquardt algorithm
The Levenberg–Marquardt algorithm (LMA) combines the advantages of the two optimization methods above and improves on the shortcomings of both. Unlike line search methods, LMA is a "trust region" method. Newton's method can in fact also be regarded as a trust region method, in the sense that local information is used to model the function and the local minimum of that model is taken. A trust region method starts from an initial point and assumes a maximum displacement s that can be trusted. Within the region centered on the current point with radius s, it finds the optimum of an approximating (quadratic) function of the objective to obtain the actual displacement. After the displacement is obtained, the value of the objective function is computed. If the decrease in the objective value satisfies a certain condition, the displacement is considered reliable and the iteration continues by the same rule. If it does not, the trust region is shrunk and the subproblem is solved again.
LMA was first proposed to solve the least squares curve fitting problem. For data points (x_i, y_i), a fitted curve function f, and parameters β (for example, randomly initialized), the objective is:
$$S(\beta) = \sum_{i=1}^{m} \left[ y_i - f(x_i, \beta) \right]^2$$
The fitted curve function is approximated to first order with its Jacobian matrix:
$$f(x_i, \beta + \delta) \approx f(x_i, \beta) + J_i \delta, \qquad J_i = \frac{\partial f(x_i, \beta)}{\partial \beta}$$
The behavior of S around the current point can then be inferred:
$$S(\beta + \delta) \approx \sum_{i=1}^{m} \left[ y_i - f(x_i, \beta) - J_i \delta \right]^2 = \lVert y - f(\beta) - J\delta \rVert^2$$
For which displacement δ does S take its minimum value? Geometrically, S is smallest when the residual is perpendicular to the column space of the matrix J (for the reason, please refer to the last section of the previous blog post):
$$(J^T J)\,\delta = J^T \left[ y - f(\beta) \right]$$
We modify this formula slightly by adding a damping coefficient λ:
$$(J^T J + \lambda I)\,\delta = J^T \left[ y - f(\beta) \right]$$
This is the Levenberg–Marquardt method. It only computes first-order partial derivatives, and the matrix involved is not the Jacobian of the objective function but the Jacobian of the fitted function. When λ is large, the trust region is small and the algorithm approaches the steepest descent method. When λ is small, the trust region is large and the algorithm approaches the Gauss–Newton method.
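The two limiting cases can be made explicit. The following is a sketch that assumes the damped update has the standard form $(J^T J + \lambda I)\,\delta = J^T r$, where $r = y - f(\beta)$ denotes the residual vector, $\lambda$ the damping coefficient, and $S(\beta) = \lVert r \rVert^2$ (this notation is assumed here). Since $\nabla S(\beta) = -2 J^T r$:

```latex
\begin{aligned}
\lambda \to \infty:&\quad (J^T J + \lambda I)\,\delta \approx \lambda\,\delta = J^T r
  \;\Rightarrow\; \delta \approx \tfrac{1}{\lambda}\, J^T r = -\tfrac{1}{2\lambda}\,\nabla S(\beta)
  && \text{(a short steepest descent step)} \\
\lambda \to 0:&\quad (J^T J)\,\delta = J^T r
  \;\Rightarrow\; \delta = (J^T J)^{-1} J^T r
  && \text{(the Gauss--Newton step)}
\end{aligned}
```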
The algorithm proceeds as follows:
Given an initial value x0.
While the stopping criterion is not met and the maximum number of iterations has not been reached, repeat:
    Compute the displacement vector.
    Compute the updated value.
    Compute the ratio of the actual reduction of the objective function to the predicted reduction.
    If the ratio lies in the acceptable range: accept the updated value.
    Else if the ratio shows that the approximation is good: accept the updated value and expand the trust region (that is, reduce the damping coefficient).
    Else: the objective function is increasing, so reject the updated value and shrink the trust region (that is, increase the damping coefficient).
Until the maximum number of iterations is reached.
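The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the exponential `model`, the acceptance threshold 0.75, the factor of 10 for changing the damping coefficient, and the initial `lam` are all hypothetical choices, since the original thresholds are not given in the text.

```python
import numpy as np

def model(x, beta):
    # Hypothetical fitted curve: f(x, beta) = beta0 * exp(beta1 * x)
    return beta[0] * np.exp(beta[1] * x)

def model_jacobian(x, beta):
    # Partial derivatives of the fitted curve w.r.t. beta, shape (m, 2)
    e = np.exp(beta[1] * x)
    return np.column_stack([e, beta[0] * x * e])

def levenberg_marquardt(x, y, beta0, lam=1e-2, max_iter=200, tol=1e-12):
    beta = np.asarray(beta0, dtype=float)
    r = y - model(x, beta)              # residual vector
    cost = r @ r                        # sum of squared residuals
    for _ in range(max_iter):
        J = model_jacobian(x, beta)
        # Damped normal equations: (J^T J + lam * I) delta = J^T r
        delta = np.linalg.solve(J.T @ J + lam * np.eye(beta.size), J.T @ r)
        if np.linalg.norm(delta) < tol:
            break
        r_new = y - model(x, beta + delta)
        cost_new = r_new @ r_new
        # Reduction predicted by the linearised model
        predicted = cost - np.sum((r - J @ delta) ** 2)
        rho = (cost - cost_new) / predicted if predicted > 0 else -1.0
        if rho > 0:
            # Objective decreased: accept the update
            beta, r, cost = beta + delta, r_new, cost_new
            if rho > 0.75:
                lam /= 10.0             # good model: expand the trust region
        else:
            lam *= 10.0                 # reject: shrink the trust region
    return beta

x = np.linspace(0.0, 1.0, 20)
true_beta = np.array([2.0, -1.5])
y = model(x, true_beta)                 # noise-free synthetic data
beta_hat = levenberg_marquardt(x, y, beta0=[1.0, 0.0])
print(beta_hat)
```

On this noise-free synthetic data the estimate should end up close to `true_beta`. A rejected step leaves `beta` unchanged and only increases the damping, which mirrors the "reject and shrink the trust region" branch of the procedure.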
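Wikipedia's article on gradient descent illustrates the method with the Rosenbrock function, which has a long, narrow curved valley:
$$f(x_1, x_2) = (1 - x_1)^2 + 100\,(x_2 - x_1^2)^2$$
Gradient descent on this function shows the zig-zagging (sawtooth) phenomenon:
How can LMA optimize this efficiently? Applying our previous LMA formula, we have: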
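One way to cast the Rosenbrock function $(1 - x_1)^2 + 100\,(x_2 - x_1^2)^2$ as a least squares problem is sketched below. The choice of "fitted function" $g$ and "target" $y$ is an assumption made here for illustration, since the original formulas are not available; any split with the same squared residuals would work equally well.

```latex
\begin{aligned}
g(x) &= \begin{bmatrix} x_1 \\ 10\,(x_2 - x_1^2) \end{bmatrix}, \qquad
y = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad
r(x) = y - g(x) = \begin{bmatrix} 1 - x_1 \\ -10\,(x_2 - x_1^2) \end{bmatrix} \\
S(x) &= \lVert r(x) \rVert^2 = (1 - x_1)^2 + 100\,(x_2 - x_1^2)^2 \\
J(x) &= \frac{\partial g}{\partial x} = \begin{bmatrix} 1 & 0 \\ -20\,x_1 & 10 \end{bmatrix} \\
\left( J^T J + \lambda I \right) \delta &= J^T r(x), \qquad x \leftarrow x + \delta
\end{aligned}
```

Because the residual vanishes at the minimizer $(1, 1)$ and $J$ is invertible everywhere (its determinant is 10), the undamped step solves $J\delta = r$ exactly, and the damping coefficient $\lambda$ only shortens the step when the linear model is not trustworthy.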