Optimizing the learning rate (1): backtracking line search and quadratic interpolation line search


This chapter summarizes what I know about optimizing the learning rate. The prerequisite knowledge is linear regression and the gradient descent algorithm, so if this chapter reads like fog to you, or you don't even know what a learning rate is, you need to get the prerequisites sorted out first.


Other Notes

Because the prerequisites for this summary are linear regression and gradient descent, everything below is aimed at finding the minimum of an objective function f(x).

But don't worry about finding maxima: adding a minus sign to f(x) directly converts a maximization problem into a minimization problem.

By the way, my personal feeling is that because so much research is about finding minima, for convenience people usually convert a maximization problem into a minimization problem first, which is why you will often see the sentence "by convention, we convert it into a minimization problem"....


What does it mean to optimize the learning rate?

Let me illustrate with gradient descent.

Suppose that with the learning rate fixed at 1 it takes me 10 iterations to converge; if instead I set the learning rate to 8 for the first step and 2 for the second step, it might converge in two steps. (The numbers here are not meant to be exact, but that's the idea.)

Obviously the latter converges faster, and adjusting the learning rate like this is what optimizing the learning rate means.

PS: It's like a person walking down a hill. If every step covers 1 m, he may need 10 hours to reach the bottom; if he starts with 8 m steps and switches to 2 m steps once he is more than halfway down, the time to reach the foot of the hill is significantly shortened.

That said, even while writing this article I keep hearing things like "at work we mostly just pick a learning rate off the top of our heads, and if convergence feels too slow we tweak it a bit", so it may seem unnecessary to spend much effort on this.

But my personal feeling is that if you master this material, you will be that much more effective at work.


How to optimize the learning rate

This takes a bit of explaining.

Since our goal is now the learning rate, we first need to switch perspective. Look: for a function, if our goal is to find the parameter x, we treat the function as a function of x; likewise, since our goal here is the learning rate α, we treat the function as a function of α. The general model is f(x_k + α·d_k), where d_k is the search direction (for gradient descent, d_k is the negative gradient direction), so after switching perspective we view it as a function h of α, namely:

h(α) = f(x_k + α·d_k)

PS: In general the learning rate is positive, i.e. α > 0.

Since we now have a function of α, for gradient descent the goal changes from "find the x that makes the function value smallest" (note: the sample x_k is known, and the current search direction d_k can be computed from the gradient, but which next x makes the function smallest is unknown), i.e.:

argmin_x f(x_k + α·d_k)

to "given x_k and d_k, find the α that makes the function fall the fastest", that is, "for the function h(α) of α, which α makes h(α) smallest?":

argmin_{α>0} h(α)

That's easy: differentiate h(α) and set the derivative to 0, namely:

h′(α) = ∇f(x_k + α·d_k)ᵀ d_k = 0

The resulting α is a very good learning rate for the current x_k.
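As a quick worked illustration (my own example, not from the original article): take f(x) = (x − 3)² with x_k = 0 and d_k = −f′(0) = 6. Then h(α) = f(0 + 6α) = (6α − 3)², so h′(α) = 12·(6α − 3) = 0 gives α = 0.5, and the update x_{k+1} = 0 + 0.5·6 = 3 lands exactly on the minimizer. (One step only because this toy f is a one-dimensional quadratic; see the caveat below.)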

Pictured, it looks like this (the two figures are missing from the source; they showed that once the learning rate is fixed, the function can only fall a limited amount per step, while the method above falls that far in a single step, with the tangent drawn in pink).

However, if you are paying attention, you may be a bit confused by now: isn't h(α) just f(x_k + α·d_k)? So isn't the minimum of h(α) exactly the minimum of f(x_k + α·d_k)? By that logic, shouldn't the function converge in a single step? How come the figure above doesn't converge in one step? Surely the drawing is wrong!

Well, the figure above is indeed somewhat imprecise; it is only there to aid understanding. So let me add the following sentence: differentiating h(α) and setting the derivative to 0 yields a very good learning rate α for the current x_k, but this α may not make the original function converge in one step.

The reason is that we previously switched perspective to turn the function into a function h(α) of α. If you draw a new set of axes, the x-axis becomes the α-axis; that is, the graph becomes a graph over α, which is different from the original function's graph over x. So setting h′(α) = 0 finds the minimum point of the graph over α and thereby a "good learning rate α" for the current x_k; but because the graph over α and the original function's graph over x are two different graphs, this α may not make the original function converge in one step.

However, even though the original function cannot converge in a single step, this is in any case better than fixing α, right?


Now let's summarize the steps above (a runnable sketch follows the list):

1. Differentiate h(α) with respect to α and solve for the learning rate α;

2. Update x along the search direction (e.g., the negative gradient direction for gradient descent);

3. Repeat the two steps above until convergence.
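Here is a minimal Python sketch of that loop (my own illustration, not from the original article). It uses a quadratic objective f(x) = ½·xᵀA·x − bᵀx, because there h′(α) = 0 has the closed-form solution α = (gᵀg)/(gᵀA·g) with g = ∇f(x_k):

import numpy as np

def gradient_descent_exact_step(A, b, x, tol=1e-8, max_iter=100):
    # Minimize f(x) = 0.5*x^T A x - b^T x with an exact line search.
    for _ in range(max_iter):
        g = A @ x - b                    # gradient at the current point
        if np.linalg.norm(g) < tol:      # step 3: stop at convergence
            break
        alpha = (g @ g) / (g @ (A @ g))  # step 1: solve h'(alpha) = 0
        x = x - alpha * g                # step 2: move along -gradient
    return x

A = np.array([[2.0, 0.0], [0.0, 10.0]])  # a toy problem of my choosing
b = np.array([1.0, 1.0])
print(gradient_descent_exact_step(A, b, np.zeros(2)))  # ~ [0.5, 0.1]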


A simpler approach (bisection line search)

To tell the truth, although the steps above are fine, what if solving h′(α) = 0 analytically makes you sick?

(╯‵-′)╯︵┻━┻ Can we get away without solving it?

Of course we can!

Let me explain step by step.

First, for f(x_k + α·d_k), if you set α = 0, then:

h′(0) = ∇f(x_k + 0·d_k)ᵀ d_k = ∇f(x_k)ᵀ d_k

Here ∇f(x_k) is the gradient (just the derivative of the original function), and d_k is the search direction. If d_k is the negative gradient direction, d_k = −∇f(x_k), then h′(0) = −∇f(x_k)ᵀ∇f(x_k) < 0; that is, at α = 0 we have h′(0) < 0.

At this point, if we can find a large enough α such that h′(α) > 0, then (by the intermediate value theorem, assuming h′ is continuous) there must exist some α in between with h′(α) = 0, and that α is the learning rate we are looking for.

So! Yes! Binary search does the job! For example:

Given h′(α1) < 0 and h′(α2) > 0:

if h′((α1 + α2)/2) < 0, set α1 = (α1 + α2)/2;

if h′((α1 + α2)/2) > 0, set α2 = (α1 + α2)/2.

Repeat the steps above until you find h′(α) = 0, and that α is the one we want.
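A minimal bisection sketch in Python (my own illustration; h_prime stands for whatever routine computes h′(α) for your problem):

def bisection_line_search(h_prime, alpha_hi=1.0, tol=1e-8, max_iter=100):
    # Find alpha with h'(alpha) ~ 0, given h'(0) < 0 for a descent direction.
    lo = 0.0
    while h_prime(alpha_hi) < 0:       # grow until the sign flips
        lo = alpha_hi
        alpha_hi *= 2.0
    hi = alpha_hi
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        d = h_prime(mid)
        if abs(d) < tol:
            break
        if d < 0:
            lo = mid                   # the root lies to the right
        else:
            hi = mid                   # the root lies to the left
    return 0.5 * (lo + hi)

# Example: f(x) = (x - 3)^2 at x_k = 0, d_k = 6, so h'(a) = 12*(6a - 3):
print(bisection_line_search(lambda a: 12 * (6 * a - 3)))  # ~ 0.5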


A little more optimization (backtracking line search)

The bisection above is convenient, but instead of using h′(α) = 0 as the condition for accepting α, it is even more convenient to use the Armijo criterion of backtracking line search as the acceptance condition.

First, the Armijo criterion: start by giving a large learning rate, then keep shrinking it; if the current learning rate makes the function f(x_k + α·d_k) decrease from the current value f(x_k) by at least a predetermined expected amount, then that learning rate meets the requirement.

What does that mean? Look: for the function f(x_k + α·d_k), since we are seeking the minimum, the current value f(x_k) leads to a new value f(x_{k+1}). So if we use f(x_{k+1}) as the preset expectation, we want a learning rate α that reduces f(x_k) by a certain amount so that it reaches f(x_{k+1}); that α is the one we want, right? And that amount of reduction is the formula in the Armijo criterion:

c1·α·∇f(x_k)ᵀ d_k,  c1 ∈ (0, 1)

Because d_k is usually the negative gradient direction, this term is negative, and the sentence above written as a formula is:

f(x_{k+1}) = f(x_k) + c1·α·∇f(x_k)ᵀ d_k,  c1 ∈ (0, 1)

But hitting this equality exactly in computation is very troublesome: since we keep shrinking the learning rate, unless we are careful about how much we shrink it, we will get situations where one iteration gives f(x_{k+1}) < f(x_k) + c1·α·∇f(x_k)ᵀd_k and the next gives f(x_{k+1}) > f(x_k) + c1·α·∇f(x_k)ᵀd_k.

So for convenience, f(x_{k+1}) ≤ f(x_k) + c1·α·∇f(x_k)ᵀd_k is good enough, and this inequality is the Armijo criterion.
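Written as a predicate in Python (a small sketch of my own; f and grad stand for your objective and its gradient, with x and d as NumPy arrays):

import numpy as np

def armijo_ok(f, grad, x, d, alpha, c1=1e-4):
    # Armijo: f(x + alpha*d) <= f(x) + c1 * alpha * grad(x)^T d
    return f(x + alpha * d) <= f(x) + c1 * alpha * np.dot(grad(x), d)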

PS: Why give a large learning rate first and then keep shrinking it? Wouldn't it be better to pick that large learning rate directly?

Look (the figure is missing from the source; it showed a curve with points x1, x2 and x3 marked on it):

If the learning rate is too large, we jump straight from x1 to x2, and then what kind of minimum are we even finding?

So we set an expectation, for example: as long as you fall to f(x3), that's fine. This constrains the learning rate to a value we expect.


Similarities and differences between backtracking and bisection line search

The goal of bisection line search is to find the (approximately) optimal step size, i.e. h′(α) ≈ 0, whereas backtracking line search relaxes the constraint on the step size: any step that makes the function value decrease enough is acceptable.

Bisection line search can reduce the number of descent iterations, but computing the optimal step size costs a lot; backtracking line search settles for an approximate step size, cheaply.


Backtracking line search code (from instructor Shambo of the Academy of Small Elephants)

(The code screenshot is missing from the source; the red box and green box below refer to regions of that screenshot.)

Red box: if the learning rate a does not yet satisfy h′(a) > 0, double it; so by the time execution reaches the green box, a already satisfies h′(a) > 0. Put differently, the red box finds the first learning rate that no longer decreases the function value. Under normal circumstances, for a small enough step along the negative gradient direction, next < now always holds, so repeatedly executing the red-box code keeps raising the learning rate until it finds one that violates the condition (i.e. one that is "too big"), which guarantees that the green box that follows executes meaningfully.

Green box: if the learning rate does not satisfy the Armijo criterion, halve it and check whether the new learning rate satisfies it.

Finally, return the learning rate.
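Since the screenshot is unavailable, here is a Python sketch of the procedure just described (my reconstruction under the stated assumptions, not the instructor's original code): grow a until the step stops decreasing f, then halve it until the Armijo criterion holds.

import numpy as np

def backtracking_line_search(f, grad, x, c1=1e-4, a=1.0, max_halvings=50):
    d = -grad(x)                   # search direction d_k
    now = f(x)
    slope = np.dot(grad(x), d)     # h'(0), negative for a descent direction
    # "Red box": double a until the step no longer decreases f.
    while f(x + a * d) < now:
        a *= 2.0
    # "Green box": halve a until the Armijo criterion holds.
    for _ in range(max_halvings):
        if f(x + a * d) <= now + c1 * a * slope:
            break
        a *= 0.5
    return a

# Example on a hypothetical toy objective:
f = lambda x: (x[0] - 1) ** 2 + 4 * (x[1] + 2) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 8 * (x[1] + 2)])
print(backtracking_line_search(f, grad, np.array([0.0, 0.0])))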


An improvement on backtracking line search: quadratic interpolation

What, it's still not over?

I thought so too, but improvement never ends; such is fate....

However, before introducing the interpolation method, we need a very simple piece of preparatory knowledge, as follows:

If you know 3 points, you can determine a quadratic curve that passes through those three known points. In other words, determining a quadratic curve requires 3 pieces of information. So we can reason: if a problem gives the function value y1 = f(x1) at a point x1, the derivative f′(x1) at x1 (the slope of the tangent there), and the function value y2 = f(x2) at a point x2, that also determines a quadratic function (the accompanying figure is missing from the source).

And if x1 = 0 and x2 = a, the equation of this quadratic function works out as follows (the formula image is missing from the source; reconstructed via the PS below):

h_q(x) = ((f(a) − f(0) − f′(0)·a) / a²)·x² + f′(0)·x + f(0)

PS: It is computed like this: assume the quadratic is f(x) = p·x² + q·x + z. Since f(0), f′(0) and f(a) are known, and f′(x) = 2p·x + q, substituting the three known values into f(x) and f′(x) lets you solve for p, q and z, which gives the formula above.

The way to find the extremum of such an equation:

suppose the quadratic is h(x) = p·x² + q·x + z;

then x attains the extremum at x = −q/(2p).

This is junior-high-school knowledge, so don't ask why; just take it and use it, the way a junior high school student would.
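A quick sanity check of the fitted quadratic with sympy (my own addition, not part of the original article):

import sympy as sp

x, a, f0, df0, fa = sp.symbols('x a f0 df0 fa')
p = (fa - f0 - df0 * a) / a**2     # leading coefficient from the 3 facts
q, z = df0, f0
h_q = p * x**2 + q * x + z

# The fitted quadratic reproduces the three known pieces of information:
assert sp.simplify(h_q.subs(x, 0) - f0) == 0               # h_q(0)  = f(0)
assert sp.simplify(sp.diff(h_q, x).subs(x, 0) - df0) == 0  # h_q'(0) = f'(0)
assert sp.simplify(h_q.subs(x, a) - fa) == 0               # h_q(a)  = f(a)

# Extremum at x = -q/(2p), which equals -df0*a**2 / (2*(fa - f0 - df0*a)):
print(sp.simplify(-q / (2 * p)))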

All right, the preparation is done; the interpolation method is described below.

From the earlier content we already know: one way to find a learning rate is to find one α with h′(α) < 0 and one with h′(α) > 0, and then keep bisecting until h′(α) = 0.

Is there a better way? There is.

First, for the learning rate α, we treat its function h(α) as (approximately) a quadratic function.

PS: Each α plugged in gives a value h(α); plotting these on axes with α horizontal and h(α) vertical traces out a curve, and we approximate that curve with a quadratic.

Then let's summarize the known information about this quadratic:

1. h(α) must pass through two points: the value h(0) at α = 0, and the value h(α0) at the learning rate α0 we are currently about to try. Here h(0) is the current function value f(x_k), because for h(α) = f(x_k + α·d_k), setting α = 0 gives h(0) = f(x_k). (If the candidate learning rate α0 already satisfies the Armijo criterion we simply return it; otherwise we need to find a better learning rate.)

2. The derivative at the current x_k is available, i.e. ∇f(x_k) is known, and hence h′(0) = ∇f(x_k)ᵀd_k is known.

That is: h(0), h′(0) and h(α0) are all known.

Next, using the preparatory knowledge, we can construct a quadratic function that approximates the curve of h over the learning rate α, namely (the formula image is missing from the source; it is the fitted quadratic from the preparation with f(0) = h(0), f′(0) = h′(0), f(a) = h(α0)):

h_q(α) = ((h(α0) − h(0) − h′(0)·α0) / α0²)·α² + h′(0)·α + h(0)

PS: This is only an approximation, because a curve determined by only 3 pieces of data differs somewhat from the real h(α). Still, although this curve has some error relative to the real one, it is usable, and that is enough.

And what are we after? Right: the α with h′(α) = 0, i.e. the extremum of h(α). So just apply the junior-high knowledge directly to h_q(α) and take its extremum at −q/(2p) (the formula image, labelled (1), is missing from the source; reconstructed from the fitted coefficients):

α1 = −h′(0)·α0² / (2·(h(α0) − h(0) − h′(0)·α0))    (1)

This is much simpler; a junior high school student could do it.


Process Summary

1. Use formula (1) to compute α1;

2. If α1 satisfies the Armijo criterion, output it as the learning rate; otherwise continue iterating with it.


Code (from instructor Shambo of the Academy of Small Elephants; the screenshot is missing from the source)
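In its place, here is a Python sketch of the quadratic-interpolation line search described above (my reconstruction under the stated assumptions, not the instructor's original code):

import numpy as np

def quad_interp_line_search(f, grad, x, a0=1.0, c1=1e-4, max_iter=50):
    # h(a) = f(x + a*d); pick each next trial step as the minimizer of the
    # quadratic fitted to h(0), h'(0) and h(a).
    d = -grad(x)                     # search direction d_k
    h0 = f(x)                        # h(0) = f(x_k)
    dh0 = np.dot(grad(x), d)         # h'(0) = grad f(x_k)^T d_k, negative
    a = a0
    for _ in range(max_iter):
        ha = f(x + a * d)
        if ha <= h0 + c1 * a * dh0:  # Armijo criterion: accept
            return a
        # Formula (1): minimizer of the fitted quadratic.
        a_new = -dh0 * a**2 / (2.0 * (ha - h0 - dh0 * a))
        a = a_new if 0.0 < a_new < a else 0.5 * a  # safeguard: halve instead
    return a

# Example on the same hypothetical toy objective as before:
f = lambda x: (x[0] - 1) ** 2 + 4 * (x[1] + 2) ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 8 * (x[1] + 2)])
print(quad_interp_line_search(f, grad, np.array([0.0, 0.0])))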

Final summary

Generally speaking, backtracking line search and quadratic-interpolation line search can basically meet practical needs.

