Unconstrained Optimization Problems


Some readers may find this topic very mathematical and seemingly unrelated to natural language processing. But if you have heard of the maximum entropy model and conditional random fields, know that they are widely used in natural language processing, and perhaps even know that their core parameter-training algorithm is called L-BFGS, then this article is for you: a preliminary introduction to the quasi-Newton methods used to solve this kind of unconstrained optimization problem.
In fact, the author of this series is my senior classmate Jianzhu, whose research on Chinese word segmentation and language models runs deep; if you are interested in the SRILM source code, the "SRILM Reading Documents" series on his personal blog is very helpful. I had once asked him to contribute, but he said he was spending his spare time studying mathematics and was rather busy, so I assumed he would not have time to write for 52NLP. Unexpectedly, this evening he suddenly handed me this document. It is fairly long, so I will post it in several parts. Thanks to him for his support of 52NLP; the author of what follows is Jianzhu.

This article introduces the unconstrained optimization problem together with a class of algorithms commonly used to solve it, namely quasi-Newton methods. Before introducing the problem itself, we first build intuition for what unconstrained optimization is, and then introduce the two key concepts involved in solving such problems: step length and search direction. The choice of step length leads to the important concept of line search, and the choice of direction leads to the quasi-Newton method itself. This document is therefore divided into the following parts: an introduction to unconstrained optimization, line search, the quasi-Newton method, and an algorithm summary.

1. Unconstrained Optimization
Readers unfamiliar with unconstrained optimization may ask what it actually is. The following example illustrates the problem.

(Figure: graph of a univariate function f(x), showing one global minimum point and one local minimum point.)
The figure above shows the graph of a univariate function f(x). The unconstrained optimization problem is to find the minimum of f(x) with no restriction on its domain or range. Two minimum points are marked: one is the global minimum point and the other is a local minimum point. Constrained by algorithmic complexity, most unconstrained optimization algorithms can only guarantee finding a local minimum point. The reader will naturally ask: if only a local minimum can be found, why are such algorithms still useful? The reason is that in practice many problems are abstracted into convex functions, and for a convex function a local minimum point is also the global minimum point; so as long as any minimum point of such a function is found, it must be the global minimum.
Having understood the unconstrained optimization problem, we can begin to introduce how it is solved. For unconstrained optimization we first need to select an initial point x_0, as follows:

(Figure: the same function f(x) with an initial point x_0 marked on the curve.)
Once the initial point is selected, the minimum point can be found by any of various unconstrained optimization algorithms. Two main concepts are involved in the solution process: from the current point, in which direction and how far do we move to arrive at the next point? "How far" is the concept of step length; "which direction" is the concept of search direction.
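To make these two concepts concrete, here is a minimal Python sketch of the iteration (an illustration added to this text, not from the original author); the function `descent` and the toy objective are our own, with the direction fixed to the negative gradient and the step length held constant:

```python
import numpy as np

def descent(grad, x0, step=0.1, tol=1e-8, max_iter=1000):
    """Generic iteration x_{k+1} = x_k + a_k * p_k with p_k = -grad(x_k), fixed a_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # gradient near zero: (local) minimum reached
            break
        p = -g                        # direction: steepest descent
        a = step                      # step length: fixed constant here
        x = x + a * p                 # the update x_{k+1} = x_k + a_k * p_k
    return x

# Toy objective f(x, y) = (x - 1)^2 + 2*(y + 2)^2, minimum at (1, -2)
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)])
print(descent(grad, [0.0, 0.0]))      # converges near [1.0, -2.0]
```

The sections below replace the fixed step length with a line search and, in section 3, the negative-gradient direction with a quasi-Newton direction.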

2. Line Search
Line search is primarily used to address the step-length concept mentioned above: with the direction determined, we must decide how far to move from the current point x_k along that direction to reach the next suitable point x_{k+1}. If p_k denotes the direction from the k-th point to the (k+1)-th point, x_k the current point, x_{k+1} the next point, and a_k the step length, the following equation holds:

x_{k+1} = x_k + a_k * p_k    (1)

A brief note on p_k: most line search methods require p_k to be a descent direction, that is, moving from the current point along p_k causes the function value to decrease. Since the goal is to find the minimum of the function, the ideal case would be to find the a_k at which f(x_{k+1}) attains its global minimum along the direction p_k, which can be expressed via:

φ(a_k) = f(x_{k+1}) = f(x_k + a_k * p_k)    (2)

Finding the a_k that makes φ(a_k) a global minimum would involve a large number of evaluations of f(x_k + a_k * p_k), and approaching the minimum via derivatives would additionally involve computing ∇f_{k+1}; the computational cost is high. From a computational standpoint, therefore, we adopt the following compromise strategy.
After the direction is determined, each line search step mainly involves two questions: 1) what kind of a_k is reasonable, and 2) how to choose such an a_k. We discuss these in turn: first "what kind of a_k is reasonable"; once that question is settled, we can choose a_k on that basis.

2.1 Discussion of the Rationality of a_k
The following discusses the two conditions that a_k must satisfy; once a_k satisfies both, the step from the point x_k to the point x_{k+1} can be considered determined. The first is the sufficient decrease condition; intuitively, it ensures that the function value at x_{k+1} is smaller than the function value at x_k, which makes global convergence possible. The second is the curvature condition; intuitively, it ensures that after moving from x_k by the step a_k to reach x_{k+1}, the function is changing more slowly at x_{k+1} than at x_k, that is, the directional derivative along p_k has become less steep.

2.1.1 Sufficient Decrease Condition
The chosen a_k must make the function value satisfy the sufficient decrease condition, which can be described by the following inequality:

f(x_{k+1}) ≤ f(x_k) + c_1 * a_k * ∇f_k^T * p_k    (3)

Substituting formula (1) into it gives:

f(x_k + a_k * p_k) ≤ f(x_k) + c_1 * a_k * ∇f_k^T * p_k    (4)
Some explanation of the symbols in the inequality above:
a) f(x_k) is the value of the function at the point x_k
b) ∇f_k is the gradient of the function at the point x_k
c) p_k is the direction from the point x_k to the point x_{k+1}
d) a_k is the step length from the point x_k to the point x_{k+1} along the direction p_k
e) c_1 is a constant satisfying 0 < c_1 < 1, typically c_1 = 1e-4 (note: the quasi-Newton method requires 0 < c_1 < 0.5)
When p_k is a descent direction of the function, we have:

∇f_k^T * p_k < 0    (5)

Therefore inequality (4) requires:

f(x_{k+1}) < f(x_k)    (6)

From a graphical point of view, at the point x_k only a_k is a variable among the quantities above; the others are constants. So (4) can be restated as the following inequality:

φ(a_k) ≤ l(a_k)    (7)

where

l(a_k) = f(x_k) + c_1 * a_k * ∇f_k^T * p_k
The following is a graphical representation of inequality (7):

(Figure: the curve φ(a) and the line l(a); the acceptable intervals are those where φ(a) lies below l(a).)

Therefore, as long as the chosen step a_k keeps φ(a_k) within the acceptable interval, the sufficient decrease condition is satisfied.
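As a small illustrative sketch (our own addition; the helper name `sufficient_decrease` is assumed, not from the original), condition (4) can be checked as follows:

```python
import numpy as np

def sufficient_decrease(f, x_k, g_k, p_k, a_k, c1=1e-4):
    """Check inequality (4): f(x_k + a_k*p_k) <= f(x_k) + c1*a_k*(grad_f_k . p_k)."""
    return f(x_k + a_k * p_k) <= f(x_k) + c1 * a_k * np.dot(g_k, p_k)
```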

2.1.2 Curvature Condition
The chosen a_k must make the gradient of the function satisfy the curvature condition, which can be described by the following inequality:

∇f(x_k + a_k * p_k)^T * p_k ≥ c_2 * ∇f_k^T * p_k    (8)

that is,

∇f_{k+1}^T * p_k ≥ c_2 * ∇f_k^T * p_k    (9)
Some explanation of the symbols in the inequality above:
a) ∇f_{k+1} is the gradient of the function at the point x_{k+1}
b) ∇f_k is the gradient of the function at the point x_k
c) p_k is the direction from the point x_k to the point x_{k+1}
d) c_2 is a constant satisfying 0 < c_1 < c_2 < 1, typically c_2 = 0.9
When p_k is a descent direction of the function, we have:

∇f_k^T * p_k < 0

Therefore inequality (9) requires:

φ'(a_k) ≥ c_2 * φ'(0)    (10)

where φ'(a_k) = ∇f_{k+1}^T * p_k and φ'(0) = ∇f_k^T * p_k.

From a graphical point of view, inequality (10) requires that the function change more slowly at the point x_{k+1} than at the point x_k, which can be read off from the gradients (slopes) at the two points, as shown in the figure below.

(Figure: tangent slopes of φ(a) at a = 0 and at a = a_k, illustrating the curvature condition.)
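Similarly, a sketch of the curvature check in the form of inequality (9) (again our own illustration, with an assumed helper name):

```python
import numpy as np

def curvature(g_next, g_k, p_k, c2=0.9):
    """Check inequality (9): grad_f_{k+1} . p_k >= c2 * (grad_f_k . p_k)."""
    return np.dot(g_next, p_k) >= c2 * np.dot(g_k, p_k)
```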

2.1.3 Wolfe Conditions
The so-called Wolfe conditions are the combination of the sufficient decrease condition and the curvature condition; that is, a_k must satisfy the following two conditions:

f(x_k + a_k * p_k) ≤ f(x_k) + c_1 * a_k * ∇f_k^T * p_k    (11)
∇f(x_k + a_k * p_k)^T * p_k ≥ c_2 * ∇f_k^T * p_k    (12)
Graphically, this looks as follows:

(Figure: intervals of a on which both Wolfe conditions hold.)
In practice, the strong Wolfe conditions, derived from the Wolfe conditions, are often used instead; that is, a_k must satisfy the following two conditions:

f(x_k + a_k * p_k) ≤ f(x_k) + c_1 * a_k * ∇f_k^T * p_k    (13)
|∇f(x_k + a_k * p_k)^T * p_k| ≤ c_2 * |∇f_k^T * p_k|    (14)

The only difference from the Wolfe conditions is that the absolute value in (14) also rules out the case where ∇f(x_k + a_k * p_k)^T * p_k is a large positive number.
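Putting the two checks together, here is a sketch of a strong Wolfe test for a candidate step a_k (our own illustration; the name `strong_wolfe` and its signature are assumptions):

```python
import numpy as np

def strong_wolfe(f, grad, x_k, p_k, a_k, c1=1e-4, c2=0.9):
    """Check the strong Wolfe conditions (13)-(14) for a candidate step a_k."""
    d0 = np.dot(grad(x_k), p_k)           # phi'(0); < 0 for a descent direction
    x_next = x_k + a_k * p_k
    decrease = f(x_next) <= f(x_k) + c1 * a_k * d0               # condition (13)
    curv = abs(np.dot(grad(x_next), p_k)) <= c2 * abs(d0)        # condition (14)
    return decrease and curv
```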

2.2 Selection of the Step Length a_k
Having established what makes a_k reasonable, we have in effect obtained a ruler; on this basis we can choose an appropriate strategy to find a_k. Every line search procedure needs an initial value a_0 when computing the step for an iteration, and then generates a sequence {a_i} from it until some a_i satisfies the conditions of section 2.1, at which point a_k is set to a_i; otherwise no suitable a_k is found. Here we introduce only the currently common strategies of quadratic and cubic interpolation. This section therefore has two parts: section 2.2.1 describes the quadratic and cubic interpolation methods commonly used to select a_k, and section 2.2.2 describes the concrete procedure for computing the step length a_k from x_k to x_{k+1} once the direction p_k is determined.

2.2.1 Quadratic and Cubic Interpolation
Given an initial step a_0, if it satisfies the Wolfe conditions (or strong Wolfe conditions), then a_k is set to a_0 and the step computation at the current point x_k ends. Otherwise, using the three known quantities φ(a_0), φ(0), and φ'(0), we construct a quadratic interpolation polynomial to approximate φ(a). The quadratic interpolation polynomial is:

φ_q(a) = [(φ(a_0) - φ(0) - a_0 * φ'(0)) / a_0^2] * a^2 + φ'(0) * a + φ(0)    (15)
Observe that this quadratic interpolation polynomial satisfies the interpolation conditions:

φ_q(0) = φ(0),  φ_q'(0) = φ'(0),  φ_q(a_0) = φ(a_0)

Differentiating the quadratic interpolation polynomial (15) and setting the derivative to zero gives the value at which the polynomial attains its minimum:

a_1 = - (φ'(0) * a_0^2) / (2 * (φ(a_0) - φ(0) - φ'(0) * a_0))    (16)
If a_1 satisfies the Wolfe conditions (or strong Wolfe conditions), then a_k is set to a_1. Otherwise a cubic interpolation polynomial is constructed from φ(0), φ'(0), φ(a_0), and φ(a_1), and the value minimizing that polynomial is obtained:

φ_c(a) = c * a^3 + b * a^2 + φ'(0) * a + φ(0), with b and c determined by the interpolation conditions φ_c(a_0) = φ(a_0) and φ_c(a_1) = φ(a_1); its minimizer is

a_2 = (-b + sqrt(b^2 - 3 * c * φ'(0))) / (3 * c)    (17)
If a_{i+1} satisfies the Wolfe conditions (or strong Wolfe conditions), then a_k is set to a_{i+1}; otherwise cubic interpolation is applied again, using φ(a_{i+1}) and φ'(a_{i+1}) at the new point to replace the values at either a_{i-1} or a_i. Once it has been decided which of a_{i-1} and a_i is replaced, the same one is replaced each subsequent time with the values at the new a_{i+1}, until an a_{i+1} satisfying the Wolfe conditions (or strong Wolfe conditions) is found, in which case a_k is set to a_{i+1}, or until no suitable a_{i+1} can be found.
It is worth briefly explaining why only cubic interpolation polynomials are used rather than higher-order ones. The reason is that a cubic polynomial fits the values of a function near a given point well while still resisting overfitting.
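The two interpolation formulas (16) and (17) can be sketched in Python as follows (our own illustration; the helper names are assumptions, and the cubic coefficients are obtained by solving the two interpolation conditions):

```python
import numpy as np

def quad_interp_step(phi0, dphi0, a0, phi_a0):
    """Minimizer (16) of the quadratic model built from phi(0), phi'(0), phi(a0)."""
    return -dphi0 * a0**2 / (2.0 * (phi_a0 - phi0 - dphi0 * a0))

def cubic_interp_step(phi0, dphi0, a0, phi_a0, a1, phi_a1):
    """Minimizer (17) of the cubic model built from phi(0), phi'(0), phi(a0), phi(a1)."""
    denom = a0**2 * a1**2 * (a1 - a0)
    r0 = phi_a0 - phi0 - dphi0 * a0
    r1 = phi_a1 - phi0 - dphi0 * a1
    c = (a0**2 * r1 - a1**2 * r0) / denom   # coefficient of a^3
    b = (-a0**3 * r1 + a1**3 * r0) / denom  # coefficient of a^2
    if abs(c) < 1e-12:                      # nearly quadratic: fall back to (16)
        return quad_interp_step(phi0, dphi0, a0, phi_a0)
    return (-b + np.sqrt(b * b - 3.0 * c * dphi0)) / (3.0 * c)
```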
Finally, a note on the choice of the initial step a_0. For the Newton and quasi-Newton methods the initial trial step a_0 is always chosen as 1; this ensures that the unit step is taken whenever it satisfies the Wolfe conditions (or strong Wolfe conditions), which is what allows Newton and quasi-Newton methods to achieve their fast convergence rate. When computing the 0th step, the initial step a_0 is determined by the following formula:
a_0 = 1 / ||∇f_0||    (18)
The first and subsequent iterations set the initial step a_0 directly to 1.

2.2.2 Computing the Step Length a_k
This section is a summary: it integrates the conditions a step must satisfy with the iterative interpolation formulas above into a concrete procedure for computing the step length. Suppose we are at the point x_k and want to select a step a_k satisfying the Wolfe (or strong Wolfe) conditions in order to move to the next point x_{k+1}. We select the initial trial value a_0 = 1 (in the Newton and quasi-Newton methods the initial step is always chosen as 1). The procedure is then:

1) Initialize a_x ← 0 and a_y ← 0, with the corresponding φ(a_x), φ(a_y), φ'(a_x), φ'(a_y), as well as the bounds a_min and a_max
2) Initialize a_i ← a_0 and compute φ(a_i), φ'(a_i)
3) If φ(a_i) > φ(a_x):

This means the step a_i is too large: the step a_k that satisfies the Wolfe (or strong Wolfe) conditions must lie between a_x and a_i. Apply quadratic and cubic interpolation to a_x and a_i to obtain two new trial steps, a_quadratic and a_cubic, and take whichever of the two is closer to a_x; assume here it is a_quadratic. The new trial step is set to a_quadratic, and the following updates are performed:

a_y ← a_i
φ(a_y) ← φ(a_i)
φ'(a_y) ← φ'(a_i)
a_{i+1} ← a_quadratic

In this case it is also established that a_k lies between a_x and a_y (the bracket has been found).

4) Else if φ'(a_i) * φ'(a_x) < 0:
This also means the step a_i is too large: the step a_k satisfying the Wolfe (or strong Wolfe) conditions lies between a_x and a_i. Apply quadratic and cubic interpolation to a_x and a_i to obtain two new trial steps, a_quadratic and a_cubic, and take whichever is closer to a_i; assume here it is a_quadratic. The new trial step is set to a_quadratic, and the following updates are performed:

a_y ← a_x
φ(a_y) ← φ(a_x)
φ'(a_y) ← φ'(a_x)

a_x ← a_i
φ(a_x) ← φ(a_i)
φ'(a_x) ← φ'(a_i)

a_{i+1} ← a_quadratic

In this case, too, it is established that a_k lies between a_x and a_y.

5) Else if |φ'(a_i)| < |φ'(a_x)|:
This means the step a_i is too small: the step a_k satisfying the Wolfe (or strong Wolfe) conditions lies between a_i and a_y. Apply quadratic and cubic interpolation to a_x and a_i to obtain two new trial steps, a_quadratic and a_cubic, and take whichever is closer to a_i; assume here it is a_quadratic. The new trial step is set to a_quadratic, and the following updates are performed:

a_x ← a_i
φ(a_x) ← φ(a_i)
φ'(a_x) ← φ'(a_i)

a_{i+1} ← a_quadratic

In this case, too, a_k lies between a_x and a_y.

6) Else (|φ'(a_i)| ≥ |φ'(a_x)|):
This means the step a_i is too small. If the interval between a_x and a_y that contains a_k has already been determined, then the step a_k satisfying the Wolfe (or strong Wolfe) conditions lies between a_i and a_y; apply cubic interpolation to a_i and a_y to obtain a new trial step a_cubic, and set the new trial step to a_cubic. Otherwise, if a_x < a_i, the new trial step is still not large enough, so set it to a_max; if a_x ≥ a_i, the new trial step is not small enough, so set it to a_min. Also perform the following updates:

a_x ← a_i
φ(a_x) ← φ(a_i)
φ'(a_x) ← φ'(a_i)

If the interval containing a_k has been determined:
    a_{i+1} ← a_cubic
else if a_x < a_i:
    a_{i+1} ← a_max
else (a_x ≥ a_i):
    a_{i+1} ← a_min

7) Check whether a_{i+1} makes the two strong Wolfe conditions (13) and (14) hold.

If so, a reasonable step a_k has been found; set

a_k ← a_{i+1}

and the step computation at the point x_k ends.
Otherwise set a_i ← a_{i+1}, recompute φ(a_i) and φ'(a_i) as in 2), and go to 3) to continue searching for a reasonable step.
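The full procedure above mirrors production implementations; as a rough illustration only, the following much-simplified sketch brackets a strong Wolfe step by bisection instead of the quadratic/cubic case analysis (our own code; the names and the bisection shortcut are assumptions):

```python
import numpy as np

def line_search(f, grad, x_k, p_k, c1=1e-4, c2=0.9, a_max=1e20, max_iter=30):
    """Much-simplified strong Wolfe line search: bracket a step, then bisect."""
    phi = lambda a: f(x_k + a * p_k)
    dphi = lambda a: np.dot(grad(x_k + a * p_k), p_k)
    phi0, dphi0 = phi(0.0), dphi(0.0)   # dphi0 < 0 for a descent direction
    lo, hi = 0.0, a_max
    a = 1.0                             # Newton/quasi-Newton initial trial step
    for _ in range(max_iter):
        if phi(a) > phi0 + c1 * a * dphi0:
            hi = a                      # sufficient decrease fails: step too large
        else:
            da = dphi(a)
            if abs(da) <= c2 * abs(dphi0):
                return a                # strong Wolfe conditions (13)-(14) hold
            if da < 0.0:
                lo = a                  # slope still negative: step too small
            else:
                hi = a                  # slope turned positive: step too large
        a = 0.5 * (lo + hi) if hi < a_max else 2.0 * lo  # bisect or expand
    return a                            # give up: return the last trial step
```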

3. Quasi-Newton Method
In section 2 we learned the concept of the step length and how to compute it with a line search when moving from x_k to x_{k+1}. But we have so far set aside the other key concept, the direction: to move from each point x_k to the next point x_{k+1} we need a direction to move along, and only once the direction is determined can the line search find the corresponding step length. Having solved the step-length computation, we now look at how the direction of each step is determined. This section has two parts: first we introduce the concept of the direction through Newton's method and, on that basis, the quasi-Newton method; we then introduce an important quasi-Newton method, BFGS, and on its basis the L-BFGS algorithm for large-scale computation, which concludes this section.

3.1 Newton Method
We may still vaguely remember Newton's method for finding a root of a univariate function from calculus. The method can be illustrated by the following figure:

(Figure: Newton's method for root finding; successive tangent lines intersect the x-axis ever closer to the root.)
First we select an initial point x_0 and compute the corresponding f(x_0). The method then approximates the curve y = f(x) by its tangent at (x_0, f(x_0)); the intersection of the tangent with the x-axis is recorded as x_1, which is usually closer to the required root than x_0. In the same way the tangent at x_1 approximates the curve, its intersection with the x-axis gives x_2, and so on, until a point sufficiently close to the true solution is found. In other words, Newton's method obtains the (k+1)-th approximation x_{k+1} from the k-th approximation x_k as the intersection of the tangent at x_k with the x-axis.
The tangent equation is:

y - f(x_k) = f'(x_k) * (x - x_k)
=> 0 - f(x_k) = f'(x_k) * (x - x_k)
=> f'(x_k) * x = f'(x_k) * x_k - f(x_k)
=> x = x_k - f(x_k) / f'(x_k)

Therefore x_{k+1} = x_k - f(x_k) / f'(x_k)
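As a quick illustration (our own sketch, not from the original), the iteration x_{k+1} = x_k - f(x_k) / f'(x_k) in Python:

```python
def newton_root(f, fprime, x0, tol=1e-10, max_iter=100):
    """Newton's method for a root of f: x_{k+1} = x_k - f(x_k) / f'(x_k)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:        # close enough to a root
            break
        x = x - fx / fprime(x)   # intersection of the tangent with the x-axis
    return x

# Toy example: root of f(x) = x^2 - 2 (i.e., sqrt(2)), starting from x0 = 1
print(newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0))
```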
From the root-finding process above we can extract the pattern: first select an initial point, then build a model at that point to approximate the function. The tangent model is:

m_k(x) = f(x_k) + f'(x_k) * (x - x_k)

The tangent model at the current point is used to approximate the function; the root of this approximate model is then computed to obtain the next point, closer to the root of the function, and the process continues until the root is found. From the approximate model constructed at each point it can be seen that the model is greatly simplified relative to the original function and therefore easier to solve.
Now consider the problem of finding the minimum of a function; the approach is similar. First we select an initial point and construct an approximate model of the function at that point. For root finding the model we constructed was the tangent model; here we construct a parabolic (quadratic) model:

m_k(x) = f(x_k) + ∇f(x_k)^T * (x - x_k) + (1/2) * (x - x_k)^T * ∇²f(x_k) * (x - x_k)

We then compute the gradient of the model and set it to zero, that is, ∇m_k(x+) = 0, and solve for x+, the point at which the model attains its minimum. Differentiating the parabolic model m_k and setting the derivative to zero yields the following formula:
x+ = x_k - [∇²f(x_k)]^{-1} * ∇f(x_k), that is, x_{k+1} = x_k + p_k with p_k = -[∇²f(x_k)]^{-1} * ∇f(x_k)
As can be seen above, in Newton's method the direction of each step is p_k and the step length is 1. It is also easy to see that moving from each approximation x_k to the next approximation x_{k+1} involves computing the Hessian matrix ∇²f(x_k) (the second derivatives) of the function. The Hessian, however, is not guaranteed to be positive definite at every point, whereas a positive definite matrix satisfies the following inequality for any nonzero vector p:

p^T * ∇²f(x_k) * p > 0

Because the Hessian cannot be guaranteed to be positive definite, the Newton direction p_k is not guaranteed to be a descent direction; how the quasi-Newton method addresses this is the subject of what follows.
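To illustrate the role of positive definiteness (our own sketch; the helper name and the Cholesky-based check are assumptions, not the original's method), one Newton step for minimization might look like:

```python
import numpy as np

def newton_step(grad_f, hess_f, x_k):
    """One Newton step: p_k = -[hess f(x_k)]^{-1} grad f(x_k), unit step length."""
    g = grad_f(x_k)
    H = hess_f(x_k)
    try:
        np.linalg.cholesky(H)         # succeeds only if H is positive definite
        p = -np.linalg.solve(H, g)    # Newton direction (solve, don't invert)
    except np.linalg.LinAlgError:
        p = -g                        # H not positive definite: p_k may not be
                                      # a descent direction, fall back to -grad
    return x_k + p                    # Newton's method uses step length 1
```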
