Objective
This article only gives a brief introduction to and simple comparison of some common optimization methods; for the detailed contents and formulas of each method you will have to chew through the original papers, which are not repeated here.
SGD
SGD here refers to mini-batch gradient descent; the specific differences between batch gradient descent, stochastic gradient descent, and mini-batch gradient descent are not elaborated here. Nowadays SGD generally means mini-batch gradient descent.
SGD is the most common optimization method: at each iteration it computes the gradient on the current mini-batch and then uses it to update the parameters.
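Written out in a common notation (the parameters are denoted $\theta$ and the objective $f$), the update at step $t$ is:
$$ g_t = \nabla_{\theta_{t-1}} f(\theta_{t-1}) $$
$$ \Delta\theta_t = -\eta \, g_t $$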
Here, $\eta$ is the learning rate and $g_t$ is the gradient. SGD depends entirely on the gradient of the current batch, so $\eta$ can be understood as how much the gradient of the current batch is allowed to influence the parameter update.
Disadvantages: (it is precisely because of these shortcomings that so many people have developed the various follow-up algorithms)
- Choosing a suitable learning rate is difficult
- The same learning rate is used for all parameter updates. For sparse data or features, we may sometimes want to update infrequently occurring features faster and frequently occurring features more slowly, and SGD cannot easily meet this requirement
- SGD easily converges to a local optimum and, in some cases, may get trapped at a saddle point. (This originally read "easily trapped at a saddle point"; after reviewing the literature it turns out that, with appropriate initialization and step size, the influence of saddle points is not that large. Thanks to @Ice Orange.)
Momentum
Momentum borrows the concept of momentum from physics: it accumulates the previous momentum to replace the true gradient.
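In the same notation, the standard momentum update accumulates a velocity term $m_t$:
$$ m_t = \mu \, m_{t-1} + g_t $$
$$ \Delta\theta_t = -\eta \, m_t $$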
Here, $\mu$ is the momentum factor.
Characteristics:
- In the early stage of descent, the previous parameter update is reused; since the descent directions are consistent, multiplying by a larger $\mu$ gives good acceleration
- In the middle and later stages of descent, when the updates oscillate back and forth around a local minimum and the gradient becomes small, $\mu$ increases the update magnitude and helps jump out of the trap
- When the gradient changes direction, $\mu$ can reduce the update
In summary, the momentum term can accelerate SGD in the relevant direction, suppress oscillations, and speed up convergence.
Nesterov
The Nesterov term applies a correction to the gradient update, which avoids moving forward too fast while improving responsiveness.
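Expanding the momentum update from the previous section (substituting $m_t = \mu \, m_{t-1} + g_t$ into $\Delta\theta_t = -\eta \, m_t$) gives:
$$ \Delta\theta_t = -\eta \, \mu \, m_{t-1} - \eta \, g_t $$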
As can be seen, $m_{t-1}$ does not directly change the current gradient $g_t$, so Nesterov's improvement is to let the previous momentum directly affect the current momentum.
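Concretely, the standard Nesterov-style update evaluates the gradient at the look-ahead point:
$$ g_t = \nabla_{\theta_{t-1}} f(\theta_{t-1} - \eta \, \mu \, m_{t-1}) $$
$$ m_t = \mu \, m_{t-1} + g_t $$
$$ \Delta\theta_t = -\eta \, m_t $$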
Therefore, after the Nesterov term is added, the gradient is computed after the large jump and used to correct the current update, as illustrated below:
Momentum first computes a gradient (short blue vector) and then makes a large jump in the direction of the accumulated gradient (long blue vector); Nesterov first makes a large jump in the direction of the previously accumulated gradient (brown vector), then computes the gradient and makes a correction (green vector).
In fact, both the momentum term and the Nesterov term are meant to make gradient updates more flexible and better targeted to different situations. However, manually setting a fixed learning rate is still somewhat crude, so the next sections introduce several adaptive learning rate methods.
Adagrad
Adagrad actually imposes a constraint on the learning rate.
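In the same notation, the Adagrad update accumulates the squared gradients in $n_t$:
$$ n_t = n_{t-1} + g_t^2 $$
$$ \Delta\theta_t = -\frac{\eta}{\sqrt{n_t + \epsilon}} \, g_t $$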
Here, a recursion over $g_r$ from $r = 1$ to $t$ forms a constraint term (regularizer), $-\frac{1}{\sqrt{\sum_{r=1}^{t} g_r^2 + \epsilon}}$, where $\epsilon$ is used to ensure the denominator is not 0.
Characteristics:
- In the early stage, when $g_t$ is small, the regularizer is large and can amplify the gradient
- In the later stage, when $g_t$ is large, the regularizer is small and can constrain the gradient
- Suitable for processing sparse gradients
Disadvantages:
- As can be seen from the formula, it is still dependent on the manual setting of a global learning rate
- If the global learning rate is set too large, the regularizer becomes too sensitive and adjusts the gradient too aggressively
- In the middle and later stages, the sum of squared gradients in the denominator keeps growing, shrinking the updates toward 0 and ending training prematurely.
Adadelta
Adadelta is an extension of Adagrad. The initial scheme still adaptively constrains the learning rate but simplifies the computation: Adagrad accumulates all previous squared gradients, whereas Adadelta only accumulates terms over a fixed-size window, and does not even store them directly but instead approximates the corresponding running average.
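Replacing Adagrad's cumulative sum with an exponentially decaying average (decay rate $\nu$) gives:
$$ n_t = \nu \, n_{t-1} + (1 - \nu) \, g_t^2 $$
$$ \Delta\theta_t = -\frac{\eta}{\sqrt{n_t + \epsilon}} \, g_t $$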
At this point Adadelta still depends on the global learning rate, but the author applies some further processing based on an approximate Newton iteration.
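Following the Adadelta paper, the resulting update (decay rate $\rho$) replaces the global learning rate with the RMS of the previous parameter updates:
$$ E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2 $$
$$ \Delta x_t = -\frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t $$
$$ E[\Delta x^2]_t = \rho \, E[\Delta x^2]_{t-1} + (1 - \rho) \, \Delta x_t^2 $$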
Here, $E[\cdot]$ denotes taking an expectation (in practice, an exponentially decaying running average).
At this point, you can see that Adadelta is not dependent on the global learning rate.
Characteristics:
- In the early and middle stages of training, the acceleration effect is good and fast
- In the later stage of training, it repeatedly jitters around a local minimum
Rmsprop
Rmsprop can be regarded as a special case of Adadelta: when the decay rate is set to 0.5, the running average above becomes a simple average of the squared gradients.
Taking the square root of this average gives the RMS (root mean square).
This RMS can then be used as a constraint on the learning rate.
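Written out (assuming the equal-weight decay of 0.5 implied by the "average" description above), the three steps are:
$$ E[g^2]_t = 0.5 \, E[g^2]_{t-1} + 0.5 \, g_t^2 $$
$$ RMS[g]_t = \sqrt{E[g^2]_t + \epsilon} $$
$$ \Delta x_t = -\frac{\eta}{RMS[g]_t} \, g_t $$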
Characteristics:
- In fact, Rmsprop still relies on the global learning rate
- Rmsprop is a development of Adagrad and a variant of Adadelta; its effect tends to lie between the two
- Suitable for non-stationary objectives; works well for RNNs
Adam
Adam (Adaptive Moment Estimation) is essentially Rmsprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first and second moment estimates of the gradient. The main advantage of Adam is that, after bias correction, the learning rate at each iteration has a definite range, which makes the parameter updates more stable.
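In the usual notation, with decay rates $\mu$ and $\nu$ for the two moment estimates, the Adam update is:
$$ m_t = \mu \, m_{t-1} + (1 - \mu) \, g_t $$
$$ n_t = \nu \, n_{t-1} + (1 - \nu) \, g_t^2 $$
$$ \hat{m}_t = \frac{m_t}{1 - \mu^t}, \qquad \hat{n}_t = \frac{n_t}{1 - \nu^t} $$
$$ \Delta\theta_t = -\frac{\hat{m}_t}{\sqrt{\hat{n}_t} + \epsilon} \, \eta $$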
Here, $m_t$ and $n_t$ are the first and second moment estimates of the gradient, which can be regarded as estimates of the expectations $E[g_t]$ and $E[g_t^2]$; $\hat{m}_t$ and $\hat{n}_t$ are corrections to $m_t$ and $n_t$ that approximate unbiased estimates of these expectations. It can be seen that estimating the moments directly from the gradient places no extra demands on memory and adapts dynamically to the gradient, while the term $-\hat{m}_t / (\sqrt{\hat{n}_t} + \epsilon)$ forms a dynamic constraint on the learning rate with a clear range.
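As a concrete sketch of the formulas above, here is a minimal NumPy implementation of a single Adam step; the function name is illustrative, and the default hyperparameters are the values suggested in the Adam paper ($\eta = 0.001$, $\mu = 0.9$, $\nu = 0.999$, $\epsilon = 10^{-8}$):

```python
import numpy as np

def adam_step(theta, grad, m, n, t, lr=0.001, mu=0.9, nu=0.999, eps=1e-8):
    """One Adam update for the parameter array `theta` given its gradient `grad`.

    `m` and `n` are the running first- and second-moment estimates,
    and `t` is the 1-based iteration counter.
    """
    m = mu * m + (1 - mu) * grad          # first-moment (mean) estimate
    n = nu * n + (1 - nu) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - mu ** t)             # bias corrections
    n_hat = n / (1 - nu ** t)
    theta = theta - lr * m_hat / (np.sqrt(n_hat) + eps)
    return theta, m, n

# Usage: start with m = n = np.zeros_like(theta) and call with t = 1, 2, ...
```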
Characteristics:
- Combines Adagrad's strength in handling sparse gradients with Rmsprop's strength in handling non-stationary objectives
- Low memory requirements
- Calculate different adaptive learning rates for different parameters
- Also suitable for most non-convex optimization problems, as well as for large datasets and high-dimensional spaces
Adamax
Adamax is a variant of Adam, which provides a simpler range for the upper limit of the learning rate. The changes in the formula are as follows:
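With the same notation as for Adam, the changed lines are (the small $\epsilon$ in the denominator is kept here only for symmetry with the earlier formulas):
$$ n_t = \max(\nu \, n_{t-1}, |g_t|) $$
$$ \Delta\theta_t = -\frac{\hat{m}_t}{n_t + \epsilon} \, \eta $$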
It can be seen that the bound on the Adamax learning rate is simpler.
Nadam
Nadam is similar to Adam with the Nesterov momentum term. The formula is as follows:
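A commonly quoted form with a constant momentum factor $\mu$ is shown below (the original formulation uses a momentum schedule $\mu_t$, which is omitted here for simplicity):
$$ m_t = \mu \, m_{t-1} + (1 - \mu) \, g_t, \qquad n_t = \nu \, n_{t-1} + (1 - \nu) \, g_t^2 $$
$$ \hat{m}_t = \frac{m_t}{1 - \mu^t}, \qquad \hat{n}_t = \frac{n_t}{1 - \nu^t} $$
$$ \bar{m}_t = \mu \, \hat{m}_t + \frac{(1 - \mu) \, g_t}{1 - \mu^t} $$
$$ \Delta\theta_t = -\eta \, \frac{\bar{m}_t}{\sqrt{\hat{n}_t} + \epsilon} $$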
It can be seen that Nadam places a stronger constraint on the learning rate and also has a more direct influence on the gradient update. In general, wherever you would want to use Rmsprop or Adam, you can usually achieve better results with Nadam.
Experience
- For sparse data, try to use an optimization method with an adaptive learning rate; there is no need to tune the learning rate manually, and it is best to use the default values
- SGD usually takes longer to train, but with a good initialization and learning rate schedule the results are more reliable
- If you care about faster convergence and need to train deeper, more complex networks, an adaptive learning rate method is recommended
- Adadelta, Rmsprop, and Adam are relatively similar algorithms and behave roughly the same in similar situations
- Wherever you would want to use Rmsprop or Adam, you can mostly use Nadam to get better results
Finally, here are two great animations; everything is in the pictures, and the text above is hardly needed anymore...
Figure 5: SGD optimization on loss surface contours
Figure 6: SGD optimization on saddle point
Reprinted from: https://zhuanlan.zhihu.com/p/22252270?utm_source=qq&utm_medium=social
Deep interpretation of the most popular optimization algorithms: Gradient Descent (Lite version)