Introduction to several common optimization algorithms for machine learning

1. Gradient descent method (Gradient Descent)
2. Newton's method and quasi-Newton methods (Newton's Method & Quasi-Newton Methods)
3. Conjugate gradient method (Conjugate Gradient)
4. Heuristic optimization methods
5. Solving constrained optimization problems: the Lagrange multiplier method
Each of us encounters a variety of optimization problems in life or at work; for example, every enterprise and individual has to consider questions such as "how to maximize profit at a given cost". Optimization methods are mathematical techniques: the general term for the disciplines that study how to search for certain factors, under given constraints, so that one or more indicators reach their optimum. As my study has deepened, I have increasingly come to appreciate the importance of optimization methods. Most problems encountered in study and work can be modeled as optimization problems and solved as such; for example, most of the machine learning algorithms we are now studying essentially amount to building an optimization model and using an optimization method to optimize the objective function (or loss function) in order to train the best model. The most common optimization methods are the gradient descent method, Newton's method and quasi-Newton methods, the conjugate gradient method, and so on.
1. Gradient descent method (Gradient Descent)
The gradient descent method is the simplest and most commonly used optimization method. It is simple to implement, and when the objective function is convex, the solution found by gradient descent is the global optimum. In general, however, the solution is not guaranteed to be globally optimal, and gradient descent is not necessarily the fastest method. The idea of gradient descent is to take the negative gradient direction at the current position as the search direction; since this is the direction of steepest descent at the current position, the method is also called the "steepest descent method". The closer steepest descent gets to the target value, the smaller the step becomes and the slower the progress. The iterative update of gradient descent takes the following form:
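(The original formula images are not reproduced here; the standard update they refer to is the following.) For an objective function $J(\theta)$ and learning rate $\alpha$, each gradient descent iteration updates the parameters as

$$\theta := \theta - \alpha \, \nabla_{\theta} J(\theta),$$

that is, every component of $\theta$ takes a small step along the negative gradient direction.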
Disadvantages of the gradient descent method:
(1) Convergence slows down near the minimum, as shown in the figure;
(2) Line search may cause some problems;
(3) The descent path may "zigzag".
It can be seen from the figure that the convergence rate of gradient descent slows down noticeably in the region near the optimal solution, so solving with gradient descent requires many iterations.
In machine learning, two variants of the basic gradient descent method have been developed: the stochastic gradient descent method and the batch gradient descent method.
For example, for a linear regression (Linear Regression) model, suppose h(x) below is the function to fit and J(θ) is the loss function; θ is the parameter vector whose values are solved for iteratively, and once θ is solved, the fitted function h_θ(x) is obtained. Here m is the number of samples in the training set and n is the number of features.
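The original formula images are not reproduced here; a standard form consistent with the surrounding text (an assumption on my part) is

$$h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j, \qquad J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,$$

where $x^{(i)}$ and $y^{(i)}$ are the features and target value of the $i$-th training sample.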
1) Batch gradient descent method (Batch Gradient Descent, BGD)
(1) Take the partial derivative of J(θ) with respect to each θ_j to obtain the gradient component corresponding to that parameter (the formula is sketched after this list).
(2) Since the risk function is to be minimized, update each θ_j in the negative gradient direction of that parameter (see the update formula after this list).
(3) From the above formulas it can be seen that this yields a global optimal solution, but every iteration step uses all the data in the training set; if m is large, the iteration speed of this method will obviously be quite slow. This motivates another method: stochastic gradient descent.
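A sketch of the two formulas referred to in (1) and (2) above, under the squared-error loss assumed earlier:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \qquad \theta_j := \theta_j - \alpha \, \frac{\partial J(\theta)}{\partial \theta_j}.$$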
For the batch gradient descent method, with m samples and n-dimensional feature vectors x, one iteration needs to bring all m samples into the calculation, so the computational cost of a single iteration is m·n².
2) Stochastic gradient descent method (Stochastic Gradient Descent, SGD)
(1) The risk function above can be rewritten as an average of per-sample losses: the loss function corresponds to the granularity of a single sample in the training set, whereas the batch gradient descent above corresponds to all training samples (a sketch of this form is given after this list).
(2) Take the partial derivative of each sample's loss function with respect to θ to obtain the corresponding gradient, and use it to update θ (see the formulas after this list).
(3) Stochastic gradient descent updates θ once per sample as it iterates through the training set. If the sample size is very large (say, hundreds of thousands), then perhaps after using only tens of thousands or even thousands of samples, θ has already been iterated to the optimal solution; by comparison, one iteration of the batch gradient descent above needs to use all hundred thousand training samples, one iteration is unlikely to reach the optimum, and iterating 10 times requires traversing the training set 10 times. However, one problem with SGD is that it has more noise than BGD, so not every SGD iteration moves in the direction of the overall optimum.
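A sketch of the formulas referred to in (1) and (2) above, under the same squared-error assumption as before: the risk function can be written as an average of per-sample losses,

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \operatorname{cost}\!\left(\theta, (x^{(i)}, y^{(i)})\right), \qquad \operatorname{cost}\!\left(\theta, (x^{(i)}, y^{(i)})\right) = \frac{1}{2} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,$$

and the stochastic update for a single sample $(x^{(i)}, y^{(i)})$ is

$$\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}.$$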
Stochastic gradient descent uses only one sample per iteration, so the cost of one iteration is n²; when the number of samples is large, stochastic gradient descent iterates much faster than batch gradient descent. The relationship between the two can be understood as follows: the stochastic gradient descent method trades a small loss in accuracy and an increase in the number of iterations for an improvement in overall optimization efficiency, and the increase in the number of iterations is far smaller than the number of samples.
Summary of the batch gradient descent method and the stochastic gradient descent method:
Batch gradient descent---minimizes the loss function over all training samples, so the final solution is the global optimum; that is, the solved parameters minimize the risk function. However, it is inefficient for large-scale sample problems.
Stochastic gradient descent---minimizes the loss function of each individual sample. Although not every iteration moves the loss function toward the global optimum, the overall direction is toward the global optimal solution, and the final result is often near the global optimum. It is suitable for large-scale training samples.
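To make the comparison concrete, here is a minimal MATLAB sketch (not from the original post; the problem data, learning rate, and iteration counts are illustrative assumptions) of batch versus stochastic gradient descent on a least-squares problem:

% Synthetic linear regression data: y = X * theta_true + noise.
m = 200; n = 5;
X = randn(m, n);
theta_true = randn(n, 1);
y = X * theta_true + 0.01 * randn(m, 1);
alpha = 0.01;                                 % learning rate

% Batch gradient descent: every update uses all m samples.
theta_bgd = zeros(n, 1);
for iter = 1:500
    grad = (X' * (X * theta_bgd - y)) / m;    % gradient of (1/(2m))*||X*theta - y||^2
    theta_bgd = theta_bgd - alpha * grad;
end

% Stochastic gradient descent: every update uses one random sample.
theta_sgd = zeros(n, 1);
for iter = 1:500
    i = randi(m);                             % pick a random sample index
    grad_i = X(i, :)' * (X(i, :) * theta_sgd - y(i));
    theta_sgd = theta_sgd - alpha * grad_i;
end
% Both estimates move toward theta_true; the SGD path is noisier per step.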
2. Newton's method and quasi-Newton methods (Newton's Method & Quasi-Newton Methods)
1) Newton's method (Newton's Method)
Newton's method is a method for approximately solving equations over the real and complex fields. It uses the first few terms of the Taylor series of a function f(x) to find roots of the equation f(x) = 0. The greatest characteristic of Newton's method is that it converges fast.
Specific steps:
First, choose a point x0 close to a zero of the function f(x), and compute the corresponding f(x0) and tangent slope f'(x0) (here f' denotes the derivative of f). Then compute the x-coordinate of the intersection of the x-axis with the line passing through the point (x0, f(x0)) with slope f'(x0); this x-coordinate is the solution of the following equation:
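(The original formula image is not reproduced here; the tangent-line equation it refers to is the following.)

$$0 = f(x_0) + f'(x_0)\,(x - x_0),$$

whose solution is $x = x_0 - f(x_0)/f'(x_0)$.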
We name the x-coordinate of the newly obtained point x1; usually x1 is closer than x0 to the solution of the equation f(x) = 0. We can therefore use x1 to start the next iteration. The iterative formula can be written as follows:
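(Reconstructed in standard notation:)

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.$$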
It has been proven that if f' is continuous and the zero x* being sought is isolated, then there is a neighborhood of x* such that, as long as the initial value x0 lies in this neighborhood, Newton's method is guaranteed to converge. Moreover, if f'(x*) is not 0, Newton's method exhibits quadratic convergence. Roughly speaking, this means that with every iteration the number of significant digits of the Newton iterate roughly doubles. The figure shows an example of the process of executing Newton's method.
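As a quick illustration, a minimal MATLAB sketch (not from the original post; the example function f(x) = x^2 - 2 is an assumption chosen for illustration):

% Newton's method for a root of f(x) = x^2 - 2, whose derivative is f'(x) = 2x.
f  = @(x) x.^2 - 2;
df = @(x) 2*x;
x = 1.5;                      % initial guess x0
for k = 1:10
    x = x - f(x) / df(x);     % x_{n+1} = x_n - f(x_n)/f'(x_n)
end
disp(x)                       % close to sqrt(2) = 1.4142...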
Because Newton's method determines the next position from the tangent at the current position, it is also vividly known as the "tangent method". The search path of Newton's method (in the two-dimensional case) is as follows:
Dynamic illustration of a Newton's method search:
Comparison of efficiency between Newton's method and gradient descent method:
In essence, Newton's method has second-order convergence while gradient descent has first-order convergence, so Newton's method is faster. Put more plainly, suppose you want to find the shortest path to the bottom of a basin: at each step the gradient descent method only chooses the steepest direction from your current position, whereas Newton's method, when choosing a direction, considers not only whether the slope is steep enough but also whether the slope will become steeper after you take a step. So it can be said that Newton's method sees a little farther than the gradient descent method and can reach the bottom faster. (Newton's method takes a longer view, so it takes fewer detours; by contrast, the gradient descent method only considers the local optimum, without the global perspective.)
According to the explanation on the wiki, Newton's method uses a quadratic surface to fit the local surface at your current position, while the gradient descent method uses a plane to fit the current local surface. Usually the quadratic surface fits better than the plane, so the descent path chosen by Newton's method is more consistent with the true optimal descent path.
Note: the red curve is the iterative path of Newton's method, and the green one is the iterative path of the gradient descent method.
A summary of the advantages and disadvantages of Newton's method:
Advantages: Second order convergence, fast convergence speed;
Disadvantage: Newton's method is an iterative algorithm, and every step needs to solve for the inverse of the Hessian matrix of the objective function, which makes the computation relatively complex.
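For minimization (as opposed to root finding), the Newton step referred to above takes the following standard form, reconstructed here in standard notation rather than reproduced from the original post, where H_k is the Hessian of f at x_k:

$$x_{k+1} = x_k - H_k^{-1} \nabla f(x_k), \qquad H_k = \nabla^2 f(x_k).$$

Inverting (or factorizing) H_k at every step is exactly the cost that the quasi-Newton methods below try to avoid.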
2) Quasi-Newton method (Quasi-Newton Methods)
The quasi-Newton method is one of the most effective methods for solving nonlinear optimization problems; it was proposed in the 1950s by W. C. Davidon, a physicist at Argonne National Laboratory. The algorithm Davidon designed was one of the most creative inventions in the field of nonlinear optimization at the time. Soon afterwards, R. Fletcher and M. J. D. Powell confirmed that this new algorithm was much faster and more reliable than other methods, which made the subject of nonlinear optimization advance by leaps and bounds overnight.
The essential idea of the quasi-Newton method is to remedy Newton's method's drawback of having to solve for the inverse of a complex Hessian matrix at every iteration: it uses a positive definite matrix to approximate the inverse of the Hessian, thereby simplifying the computation. Like the steepest descent method, the quasi-Newton method only requires the gradient of the objective function at each iteration. By measuring the change in the gradient, it constructs a model of the objective function that is good enough to produce superlinear convergence. This class of methods is much better than the steepest descent method, especially for difficult problems. In addition, because the quasi-Newton method does not need second-derivative information, it is sometimes more efficient than Newton's method. Today, optimization software contains a large number of quasi-Newton algorithms for solving unconstrained, constrained, and large-scale optimization problems.
Specific steps:
The basic idea of the quasi-Newton method is as follows. First, construct a quadratic model of the objective function at the current iterate x_k:

$$m_k(p) = f(x_k) + \nabla f(x_k)^T p + \frac{1}{2} p^T B_k p.$$

Here B_k is a symmetric positive definite matrix. We take the minimizer of this quadratic model as the search direction,

$$p_k = -B_k^{-1} \nabla f(x_k),$$

and obtain the new iterate

$$x_{k+1} = x_k + \alpha_k p_k,$$

where the step length α_k is required to satisfy the Wolfe conditions. Such an iteration is similar to Newton's method; the difference is that the approximating matrix B_k is used in place of the true Hessian. The key point of the quasi-Newton method is therefore how the matrix B_k is updated at each iteration. Now suppose we have obtained a new iterate x_{k+1} and construct a new quadratic model:

$$m_{k+1}(p) = f(x_{k+1}) + \nabla f(x_{k+1})^T p + \frac{1}{2} p^T B_{k+1} p.$$

We use the information from the previous step as much as possible to choose B_{k+1}. In particular, we require that the gradient of the new model match the gradient of the objective function at the two iterates x_k and x_{k+1}, thereby obtaining

$$B_{k+1} (x_{k+1} - x_k) = \nabla f(x_{k+1}) - \nabla f(x_k).$$

This formula is called the secant equation. The commonly used quasi-Newton methods are the DFP algorithm and the BFGS algorithm.

3. Conjugate gradient method (Conjugate Gradient)

The conjugate gradient method is a method between the steepest descent method and Newton's method. It only needs first-derivative information, but it overcomes the slow convergence of the steepest descent method and avoids Newton's method's need to store and compute the Hessian matrix and its inverse. The conjugate gradient method is not only one of the most useful methods for solving large systems of linear equations, it is also one of the most effective algorithms for solving large-scale nonlinear optimization problems. Among the various optimization algorithms, the conjugate gradient method is very important. Its advantages are that it requires little storage, has finite-step convergence, is highly stable, and does not require any external parameters. For concrete implementation steps, please refer to the Wikipedia article on the conjugate gradient method. A comparison of the paths taken by the conjugate gradient method and the gradient descent method when searching for the optimal solution:

Note: green is the gradient descent method, red represents the conjugate gradient method.
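A brief note on the idea (a standard formulation, added here for context rather than taken from the original post): for a symmetric positive definite matrix A, the conjugate gradient method solves A x = b, which is equivalent to minimizing the quadratic

$$f(x) = \frac{1}{2} x^T A x - b^T x,$$

by searching along directions p_0, p_1, ... that are mutually A-conjugate, i.e. $p_i^T A p_j = 0$ for $i \neq j$. The MATLAB code below implements this for a linear system.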
MATLAB code:

function x = conjgrad(A, b, x)
    % Conjugate gradient for A*x = b, with A symmetric positive definite.
    r = b - A * x;                   % initial residual
    p = r;                           % first search direction
    rsold = r' * r;
    for i = 1:length(b)
        Ap = A * p;
        alpha = rsold / (p' * Ap);   % step length along p
        x = x + alpha * p;
        r = r - alpha * Ap;          % update residual
        rsnew = r' * r;
        if sqrt(rsnew) < 1e-10
            break;                   % residual small enough: converged
        end
        p = r + (rsnew / rsold) * p; % next A-conjugate search direction
        rsold = rsnew;
    end
end
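A hypothetical usage example (the matrix, right-hand side, and starting point below are illustrative assumptions):

% A must be symmetric positive definite for the conjugate gradient method.
A  = [4 1; 1 3];
b  = [1; 2];
x0 = zeros(2, 1);
x  = conjgrad(A, b, x0);   % converges to the same solution as A\b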
4. Heuristic optimization methods
Heuristic methods are the methods people discover, based on rules of thumb, when solving problems. Their characteristic is that they draw on past experience in solving a problem and select methods that have already proven effective, rather than systematically following fixed, deterministic steps to seek an answer. There are many kinds of heuristic optimization methods, including the classical simulated annealing method, genetic algorithms, ant colony algorithms, particle swarm optimization, and so on.
There is also a special class of optimization algorithms called multi-objective optimization algorithms, which are mainly aimed at optimization problems that optimize multiple objectives (two or more) simultaneously. The more classical algorithms in this area include the NSGA-II algorithm, the MOEA/D algorithm, and artificial immune algorithms.
This part will be summarized in detail in a later blog post; please stay tuned. An introductory overview of this part has already been given in the post "[Evolutionary algorithm] Evolutionary algorithm introduction"; interested readers may refer to it (2015.12.13).
5. Solving constrained optimization problems: the Lagrange multiplier method
For an introduction to the Lagrange multiplier method, see my other blog post: The Lagrange Multiplier Method.
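As a brief reminder (a standard formulation, not taken from the linked post): to minimize f(x) subject to an equality constraint g(x) = 0, one forms the Lagrangian

$$L(x, \lambda) = f(x) + \lambda\, g(x)$$

and looks for stationary points of L with respect to both x and the multiplier λ.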
Poll's Notes
Blog Source: http://www.cnblogs.com/maybe2030/
This article is copyrighted by the author and the blog site. Reprints are welcome, but please indicate the source.