Happy New Year!
At the start of the new year, while sending good wishes to others, many of us quietly set a few small goals for ourselves. So get started now: run faster, and time will seem to pass more slowly ~
Today's topic is
"Classic Optimization Algorithms"
Scenario Description
For the various optimization problems encountered in practice, researchers have proposed a variety of algorithms, each with its own application scenarios, and have gradually developed a research field with rigorous theoretical support: convex optimization [1]. Among these algorithms, several classical ones are worth remembering; understanding their application scenarios helps us know where to start when facing a new optimization problem.
Problem Description
Suppose you are given an unconstrained optimization problem
$$\min_{\theta} L(\theta),$$
where the objective function L(·) is smooth. Which optimization algorithms can solve this problem, and what are their application scenarios?
Prior knowledge: Basic concepts of calculus, linear algebra, convex optimization
Solutions and Analysis
Classical optimization algorithms can be divided into two main categories: direct methods and iterative methods.
A direct method, as the name implies, gives the optimal solution of the optimization problem directly. This sounds very powerful, but it is not omnipotent: the direct method requires the objective function to satisfy two conditions. The first condition is that L(·) is a convex function. What is a convex function? Its rigorous definition can be found in Chapter 3 of [1]: for any x and y and any 0 ≤ λ ≤ 1,
$$L(\lambda x + (1-\lambda) y) \le \lambda L(x) + (1-\lambda) L(y).$$
An intuitive explanation is that if we connect any two points on the function's surface with a line segment, no point of that segment lies below the function surface, as illustrated below.
[Figure: a line segment connecting two points on the function surface; no point of the segment lies below the surface]
If L(·) is a convex function, then by a conclusion on page 140 of [1], θ* is an optimal solution of the optimization problem if and only if the gradient of L(·) at θ* is 0, i.e.
$$\nabla L(\theta^*) = 0.$$
Therefore, to solve for θ* directly, the second condition is that the above equation has a closed-form solution. A classic example satisfying both conditions is ridge regression, whose objective function is
$$L(\theta) = \|X\theta - y\|_2^2 + \lambda \|\theta\|_2^2.$$
A little derivation gives the optimal solution (try deriving it yourself):
$$\theta^* = (X^\top X + \lambda I)^{-1} X^\top y.$$
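As a quick illustration, here is a minimal NumPy sketch of this closed-form solution (the names X, y, and lam are illustrative, not from the original):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form ridge regression: theta* = (X^T X + lam * I)^{-1} X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solving the linear system is more stable than forming the inverse explicitly
    return np.linalg.solve(A, X.T @ y)

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=100)
print(ridge_closed_form(X, y, lam=0.1))  # close to true_theta
```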
These two requirements limit the broad applicability of direct methods, so in many practical problems we turn to iterative methods. These methods iteratively refine the estimate of the optimal solution: assuming the current estimate is θ_t, we solve the optimization problem
$$\delta_t = \arg\min_{\delta} L(\theta_t + \delta)$$
to obtain a better estimate θ_{t+1} = θ_t + δ_t. Iterative methods can be divided into two kinds: first-order methods and second-order methods.
First-order methods apply a first-order Taylor expansion to L(θ_t + δ), obtaining the approximation
$$L(\theta_t + \delta) \approx L(\theta_t) + \nabla L(\theta_t)^\top \delta.$$
Since this approximation is accurate only when δ is small, we add an L2 regularization term and solve the approximate optimization problem
$$\delta_t = \arg\min_{\delta} \left( L(\theta_t) + \nabla L(\theta_t)^\top \delta + \frac{1}{2\alpha} \|\delta\|_2^2 \right) = -\alpha \nabla L(\theta_t).$$
So the iterative update formula of the first-order method is
$$\theta_{t+1} = \theta_t - \alpha \nabla L(\theta_t).$$
The first-order method is also called gradient descent, and the gradient is the first-order information of the objective function.
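For concreteness, here is a minimal sketch of gradient descent; the function names and the quadratic test objective are illustrative assumptions, not part of the original text:

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, n_iters=1000, tol=1e-8):
    """Plain gradient descent: theta_{t+1} = theta_t - alpha * grad(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        g = grad(theta)
        if np.linalg.norm(g) < tol:  # stop when the gradient is (nearly) zero
            break
        theta = theta - alpha * g
    return theta

# Usage on the convex quadratic L(theta) = ||theta - 3||^2
grad = lambda theta: 2.0 * (theta - 3.0)
print(gradient_descent(grad, theta0=np.zeros(2)))  # approaches [3., 3.]
```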
Second-order methods apply a second-order Taylor expansion to L(θ_t + δ), giving the approximation
$$L(\theta_t + \delta) \approx L(\theta_t) + \nabla L(\theta_t)^\top \delta + \frac{1}{2}\delta^\top \nabla^2 L(\theta_t)\, \delta,$$
where ∇²L(θ_t) is the Hessian matrix of L(·) at θ_t. We solve the approximate optimization problem
$$\delta_t = \arg\min_{\delta} \left( L(\theta_t) + \nabla L(\theta_t)^\top \delta + \frac{1}{2}\delta^\top \nabla^2 L(\theta_t)\, \delta \right)$$
to obtain the iterative update formula of the second-order method:
$$\theta_{t+1} = \theta_t - \left(\nabla^2 L(\theta_t)\right)^{-1} \nabla L(\theta_t).$$
The second-order method is also called Newton's method, and the Hessian matrix is the second-order information of the objective function. Its convergence is generally much faster than that of first-order methods, but in high dimensions the computational cost of inverting the Hessian matrix is large, and when the objective function is non-convex, the second-order method may converge to a saddle point.
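Below is a minimal sketch of Newton's method, again with illustrative names and a toy quadratic objective; note that it solves a linear system rather than explicitly inverting the Hessian:

```python
import numpy as np

def newton_method(grad, hess, theta0, n_iters=20, tol=1e-10):
    """Newton's method: theta_{t+1} = theta_t - H(theta_t)^{-1} grad(theta_t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        g = grad(theta)
        if np.linalg.norm(g) < tol:
            break
        # Solve H * delta = g instead of forming the inverse explicitly
        delta = np.linalg.solve(hess(theta), g)
        theta = theta - delta
    return theta

# Usage on the quadratic L(theta) = theta^T A theta / 2 - b^T theta
# (Newton's method converges in a single step for quadratics)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda th: A @ th - b
hess = lambda th: A
print(newton_method(grad, hess, theta0=np.zeros(2)))  # equals A^{-1} b
```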
Extended Reading
In 1983, the famous Russian mathematician Yurii Nesterov proposed an acceleration scheme for first-order methods [2], whose convergence rate attains the theoretical lower bound for first-order methods. To address the high computational cost of inverting the Hessian in second-order methods, Charles George Broyden, Roger Fletcher, Donald Goldfarb, and David Shanno independently proposed a quasi-Newton algorithm in 1970, later known as BFGS [3-6]; in 1989 it was extended to the limited-memory L-BFGS algorithm [7].
[Photo: Charles George Broyden, Roger Fletcher, Donald Goldfarb, and David Shanno]
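For readers who want to try L-BFGS in practice, SciPy exposes it through scipy.optimize.minimize. The sketch below minimizes the standard Rosenbrock test function; the test function and variable names are illustrative choices, not from the original:

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock function: a standard non-convex test problem with minimum at (1, 1)
def rosenbrock(theta):
    return 100.0 * (theta[1] - theta[0] ** 2) ** 2 + (1.0 - theta[0]) ** 2

def rosenbrock_grad(theta):
    return np.array([
        -400.0 * theta[0] * (theta[1] - theta[0] ** 2) - 2.0 * (1.0 - theta[0]),
        200.0 * (theta[1] - theta[0] ** 2),
    ])

result = minimize(rosenbrock, x0=np.zeros(2), jac=rosenbrock_grad, method="L-BFGS-B")
print(result.x)  # close to the minimizer [1., 1.]
```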
References:
[1] Boyd, Stephen, and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[2] Nesterov, Yurii. "A method of solving a convex programming problem with convergence rate O(1/k^2)." Soviet Mathematics Doklady. Vol. 27. No. 2. 1983.
[3] Broyden, Charles G. "The convergence of a class of double-rank minimization algorithms: 2. The new algorithm." IMA Journal of Applied Mathematics 6.3 (1970): 222-231.
[4] Fletcher, Roger. "A new approach to variable metric algorithms." The Computer Journal 13.3 (1970): 317-322.
[5] Goldfarb, Donald. "A family of variable-metric methods derived by variational means." Mathematics of Computation 24.109 (1970): 23-26.
[6] Shanno, David F. "Conditioning of quasi-Newton methods for function minimization." Mathematics of Computation 24.111 (1970): 647-656.
[7] Liu, Dong C., and Jorge Nocedal. "On the limited memory BFGS method for large scale optimization." Mathematical Programming 45.1 (1989): 503-528.
Next Topic Preview
"The classical variant of the random gradient descent algorithm "
Scenario Description
When it comes to optimization methods in deep learning, people think of stochastic gradient descent (SGD). But SGD is not a cure-all; sometimes it becomes a pitfall. If, when designing a deep neural network, you only know how to train it with SGD, you may get a poor training result and give up investing further effort in that deep model. Yet the real cause may be that SGD failed during optimization, and you have lost an opportunity for a new discovery.
Problem Description
The most commonly used optimization method in deep learning is SGD, but SGD sometimes fails to give satisfactory training results. Why is that? What modifications have researchers made to improve SGD, and what are the characteristics of these SGD variants?
Hulu Machine Learning Questions and Answers Series | 16: Classic Optimization Algorithms