Unconstrained Optimization Methods

This article reviews several common gradient-based methods for unconstrained optimization, mainly gradient descent, Newton's method, and the BFGS and L-BFGS algorithms. The unconstrained optimization problem has the following form: for $x \in \mathbb{R}^n$, minimize the objective function

\[\min_{x} f(x)\]

Taylor series

Gradient-based methods rely on the Taylor series, so here is a brief introduction. If the function $f(x)$ has $n+1$ derivatives in a neighborhood of the point $x_0$, then in that neighborhood $f(x)$ can be expanded as an $n$th-order Taylor series:

\[f(x) = f(x_0) + \nabla f(x_0)(x - x_0) + \frac{\nabla^2 f(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{\nabla^n f(x_0)}{n!}(x - x_0)^n\]
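For a concrete one-dimensional illustration (an added example, not part of the derivation below), expand $f(x) = e^x$ around $x_0 = 0$, where every derivative equals $1$:

\[e^x \approx 1 + x + \frac{x^2}{2!} + \cdots + \frac{x^n}{n!}\]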

Gradient Descent method

The gradient descent method, also called the steepest descent method, finds an extremum by iteration using only the first-order derivative information of the objective function: each iteration moves in the direction along which the objective function decreases fastest. In which direction does the objective function $f(x)$ decrease fastest? The answer is the negative gradient direction, and here is a proof. Suppose $x$ is updated along a direction $d \in \mathbb{R}^n$, where $d$ is a unit vector, i.e. $\|d\| = 1$, and let the step-size parameter $\alpha$ denote how far each step moves along $d$. Expand $f(x)$ at $x$ as a first-order Taylor approximation:

\[f(x + \alpha d) \approx f(x) + \alpha \nabla f(x)^T d\]

The "$\approx$" should be taken strictly in the sense that only the first order of Taylor is taken. The goal is to make the iteration after the target function value $f (x + \alpha D) $ as small as possible, even if $f (x) –f (X+AD) $ as large as possible, according to:

\[f(x) - f(x + \alpha d) \approx -\alpha \nabla f(x)^T d\]

Ignoring the step-size parameter $\alpha$, maximizing $f(x) - f(x + \alpha d)$ is equivalent to:

\[\min_{d} \nabla f(x)^T d\]

$\nabla f(x)^T d$ is the inner product of two vectors; for vectors of fixed magnitude, it attains its minimum when the two directions are opposite. So, to make the objective function decrease as fast as possible at each iteration, $d$ should point opposite to the gradient, that is, in the negative gradient direction: $x$ should be updated along $-\nabla f(x)$. The gradient descent update is:

\[x_{k+1} := x_k - \alpha \nabla f(x_k)\]
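As an illustration of this update rule, here is a minimal sketch in Python; the gradient callable grad_f, the fixed step size alpha, and the stopping test on the gradient norm are all assumptions made for the example:

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-6, max_iter=1000):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)                  # first-order information only
        if np.linalg.norm(g) < tol:    # (near-)stationary point reached
            break
        x = x - alpha * g              # step along the negative gradient
    return x

# Hypothetical example: f(x) = x_1^2 + 10 x_2^2 with gradient (2 x_1, 20 x_2)
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(gradient_descent(grad_f, x0=[3.0, -1.5], alpha=0.05))  # approaches [0, 0]
```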

Newton method

Newton's method is also an iterative method; unlike gradient descent, it uses second-order derivative information. Suppose the current iterate is $x_k$, and expand the objective function $f(x)$ at $x_k$ as a Taylor series:

\[f(x) \approx \phi(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k)\]

This $\phi(x)$ is a quadratic function of $x$, and the point where its derivative equals $0$ can be taken as the next iterate $x_{k+1}$. This is why Newton's method is often described as fitting a quadratic surface: the surface is the quadratic approximation of $f$ over a neighborhood of $x_k$. Setting the derivative with respect to $x$ to zero gives:

\[\nabla \phi(x) = \nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0\]

i.e.: \[x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k)\]

In this formula, $\nabla f$ is the gradient vector and $\nabla^2 f$ is the Hessian matrix. For convenience, write $\nabla f(x_k)$ as $g_k$ and $\nabla^2 f(x_k)$ as $H_k$; they have the following form:

\[g = \nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \qquad
H = \nabla^2 f = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}\]
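A minimal sketch of the update $x_{k+1} = x_k - H_k^{-1} g_k$ follows; the callables grad_f and hess_f supplying $g_k$ and $H_k$ are assumptions for the example, and the code solves a linear system rather than forming the inverse explicitly:

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, tol=1e-8, max_iter=50):
    """Iterate x_{k+1} = x_k - H_k^{-1} g_k using gradient and Hessian callables."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        H = hess_f(x)
        d = np.linalg.solve(H, g)   # solve H d = g instead of inverting H
        x = x - d
    return x

# Hypothetical example: f(x) = x_1^2 + 10 x_2^2 is quadratic, so one step suffices
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
hess_f = lambda x: np.diag([2.0, 20.0])
print(newton_method(grad_f, hess_f, x0=[3.0, -1.5]))  # approaches [0, 0]
```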

Newton's method is quadratically convergent, i.e. its order of convergence is 2. A general objective function behaves like a quadratic function near the optimum, so one can expect Newton iteration to converge relatively quickly near the optimum. In the first few steps of the search, on the other hand, gradient descent converges relatively fast, so the two methods can be combined to achieve a satisfactory result.

Newton's method requires the Hessian matrix to be positive definite

Although Newton's method is quadratically convergent, it requires the initial point to be as close as possible to the minimum point; otherwise it may not converge. During the computation, the second derivatives of the objective function must be evaluated repeatedly, which is expensive. Moreover, the Hessian matrix cannot always be kept positive definite; when it is not, the Newton direction is not guaranteed to be a descent direction of $f(x)$ at $x_k$, and the method fails (only when the Hessian is positive definite is $f(x)$ guaranteed to decrease at $x_k$).

To summarize: Newton's method converges at second order, but every iteration requires the second derivatives of the objective function, which are difficult to compute, and positive definiteness of the Hessian cannot be guaranteed. Newton's method fits the local surface at the current position with a quadratic surface, whereas gradient descent fits it with a plane; a quadratic fit is normally better than a planar one, so the descent path chosen by Newton's method tends to follow the true optimal descent path more closely. Quasi-Newton methods avoid computing second derivatives by replacing the Hessian with an approximate matrix.

Quasi-Newton method

BFGS
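As a rough sketch of the BFGS idea (illustrative only; the objective f, gradient grad_f, and the simple backtracking line search below are assumptions for demonstration), the method maintains an approximation $B_k$ of the inverse Hessian and updates it from the step $s_k = x_{k+1} - x_k$ and the gradient difference $y_k = g_{k+1} - g_k$:

```python
import numpy as np

def bfgs(f, grad_f, x0, tol=1e-6, max_iter=200):
    """Quasi-Newton minimization: maintain B ~ inverse Hessian via BFGS updates."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    B = np.eye(n)                        # initial inverse-Hessian approximation
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -B @ g                       # quasi-Newton search direction
        alpha = 1.0                      # simple backtracking (Armijo) line search
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:                # curvature condition keeps B positive definite
            rho = 1.0 / (y @ s)
            I = np.eye(n)
            B = (I - rho * np.outer(s, y)) @ B @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Hypothetical example: the same convex quadratic as above
f = lambda x: x[0]**2 + 10.0 * x[1]**2
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(bfgs(f, grad_f, x0=[3.0, -1.5]))  # approaches [0, 0]
```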

L-BFGS
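L-BFGS is the limited-memory variant: instead of a full $n \times n$ matrix it stores only a short history of $(s_k, y_k)$ pairs, which makes it practical for high-dimensional problems. In practice one typically calls an existing implementation; a hedged usage sketch with SciPy's L-BFGS-B solver on the same assumed quadratic might look like:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0]**2 + 10.0 * x[1]**2
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])

# "L-BFGS-B" is SciPy's limited-memory BFGS (with optional bounds, unused here).
res = minimize(f, x0=np.array([3.0, -1.5]), jac=grad_f, method="L-BFGS-B")
print(res.x)  # approaches [0, 0]
```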

Some additional notes on Newton's method. In a neighborhood of $x_k$, $f$ is expanded as a Taylor series, which is a quadratic function of $x$; the point where the derivative of this quadratic function is $0$ is an extreme point of the quadratic.

1. Fitting a curve to the three pieces of information $f(x_k)$, $f'(x_k)$, and $f''(x_k)$ at the point $x_k$ gives, in essence, a parabola (this is the origin of the "parabola" description mentioned in class; other literature may not phrase it this way, but it follows directly from the Taylor expansion, as the worked expansion after this list shows).
2. Since the second derivative $f''(x_k)$ may be $0$ or even negative, the next iterate $x_{k+1}$ may not lie in a descent direction (and if $f''(x_k) = 0$, the update cannot be computed at all).
3. Because the second derivative is used, the method can roughly be regarded as quadratically convergent; to be exact, one computes the limit of the ratio of the error to the square of the previous error.
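As a worked illustration of point 1 (added here for clarity), the one-dimensional quadratic model built from $f(x_k)$, $f'(x_k)$, and $f''(x_k)$ is

\[\phi(x) = f(x_k) + f'(x_k)(x - x_k) + \frac{1}{2} f''(x_k)(x - x_k)^2,\]

and setting $\phi'(x) = 0$ gives the one-dimensional Newton step

\[x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)},\]

which is undefined when $f''(x_k) = 0$ and points uphill when $f''(x_k) < 0$, matching point 2.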

