Deep interpretation of the most popular optimization algorithms: Gradient Descent (Lite version)

Objective

This article only introduces and briefly compares some common optimization methods. For the detailed contents and formulas of each method you will have to work through the original papers; they are not repeated here.

SGD

Here SGD refers to mini-batch gradient descent. The specific differences between batch gradient descent, stochastic gradient descent, and mini-batch gradient descent will not be elaborated; in current usage, SGD generally means mini-batch gradient descent.

SGD is the most common optimization method: at every iteration it computes the gradient on a mini-batch and then updates the parameters, that is:

g_t = \nabla_{\theta_{t-1}} f(\theta_{t-1})

\Delta\theta_t = -\eta \cdot g_t

where η is the learning rate and g_t is the gradient. SGD depends entirely on the gradient of the current batch, so η can be understood as how much the current batch gradient is allowed to affect the parameter update.
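A minimal sketch of this update in NumPy (the gradient is assumed to have already been computed on the current mini-batch; the function name sgd_update and the example values are illustrative, not from any particular library):

```python
import numpy as np

def sgd_update(theta, grad, lr=0.01):
    """Plain (mini-batch) SGD step: theta <- theta - lr * grad."""
    return theta - lr * grad

# usage: grad would normally come from backpropagation on the current mini-batch
theta = np.zeros(3)
grad = np.array([0.5, -0.2, 0.1])
theta = sgd_update(theta, grad, lr=0.1)
```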

Cons: (it is precisely because of these shortcomings that so many subsequent algorithms were developed)

    • Choosing a proper learning rate is difficult, and the same learning rate is used for all parameter updates. For sparse data or features, we might sometimes want to update infrequently occurring features faster and frequently occurring features more slowly, which SGD cannot easily accommodate
    • SGD converges easily to a local optimum and in some cases may be trapped at a saddle point (the original wording was "is easily trapped at a saddle point"; after reviewing the papers it turns out that, with appropriate initialization and step size, the impact of saddle points is not that large. Thanks to @Ice Orange)
Momentum

Momentum borrows the concept of momentum from physics: it accumulates the previous updates to replace the raw gradient. The formula is as follows:

m_t = \mu \cdot m_{t-1} + g_t

\Delta\theta_t = -\eta \cdot m_t

where μ is the momentum factor.
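A minimal sketch of the momentum update, following the two formulas above (the velocity m must be carried between steps; the function name and default values are illustrative):

```python
import numpy as np

def momentum_update(theta, grad, m, lr=0.01, mu=0.9):
    """Accumulate m_t = mu * m_{t-1} + g_t, then step by -lr * m_t."""
    m = mu * m + grad
    theta = theta - lr * m
    return theta, m

# usage: keep m (initialized to zeros) alongside the parameters
theta, m = np.array([1.0, -2.0]), np.zeros(2)
theta, m = momentum_update(theta, np.array([0.3, 0.1]), m)
```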

Characteristics:

    • In the early stage of the descent, the direction is consistent with the previous parameter update, so multiplying by a larger μ gives good acceleration
    • In the middle and late stages, when the iterate oscillates back and forth around a local minimum and the gradient approaches 0, μ increases the update amplitude and helps jump out of the trap
    • When the gradient changes direction, μ can reduce the update

In summary, momentum can accelerate SGD in the relevant direction, suppress oscillation, and speed up convergence.
Nesterov

The Nesterov term applies a correction to the gradient update, which avoids moving forward too fast while improving sensitivity. Expanding the formula from the previous section gives:

\Delta\theta_t = -\eta \cdot \mu \cdot m_{t-1} - \eta \cdot g_t

As can be seen, the previous momentum m_{t-1} does not directly change the current gradient g_t, so Nesterov's improvement is to let the previous momentum directly affect the current gradient. That is:

g_t = \nabla_{\theta_{t-1}} f(\theta_{t-1} - \eta \cdot \mu \cdot m_{t-1})

m_t = \mu \cdot m_{t-1} + g_t

\Delta\theta_t = -\eta \cdot m_t

Therefore, with the Nesterov term added, the gradient is computed after the big jump and then used to correct the current update. For example:

Momentum first computes a gradient (short blue vector) and then makes a big jump in the direction of the accumulated gradient (long blue vector), whereas Nesterov first makes a big jump in the direction of the previously accumulated gradient (brown vector), then computes the gradient there and makes a correction (green gradient vector).
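A minimal sketch of the Nesterov update described above, assuming a callable grad_fn that returns the gradient at an arbitrary point (grad_fn and the toy quadratic loss in the usage lines are illustrative assumptions):

```python
import numpy as np

def nesterov_update(theta, grad_fn, m, lr=0.01, mu=0.9):
    """Evaluate the gradient at the look-ahead point theta - lr*mu*m
    (the 'big jump'), then accumulate momentum and take the step."""
    grad = grad_fn(theta - lr * mu * m)
    m = mu * m + grad
    theta = theta - lr * m
    return theta, m

# usage on a toy loss f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta, m = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(10):
    theta, m = nesterov_update(theta, lambda x: x, m)
```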

In fact, both the momentum and Nesterov terms are designed to make gradient updates more flexible and better targeted to different situations. However, manually setting the learning rate is still somewhat crude, so several adaptive learning rate methods are introduced next.

Adagrad

Adagrad is actually a constraint on the learning rate. That is:

n_t = n_{t-1} + g_t^2

\Delta\theta_t = -\frac{\eta}{\sqrt{n_t + \epsilon}} \cdot g_t

Here the squared gradients are accumulated recursively from r = 1 to t, so that -\frac{1}{\sqrt{\sum_{r=1}^{t} g_r^2 + \epsilon}} forms a regularizer on the learning rate, and ε ensures that the denominator is not 0.
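A minimal sketch of the Adagrad update above (the accumulator n of squared gradients must be carried between steps; names and default values are illustrative):

```python
import numpy as np

def adagrad_update(theta, grad, n, lr=0.01, eps=1e-8):
    """Accumulate squared gradients and scale the step by 1/sqrt(n + eps)."""
    n = n + grad ** 2
    theta = theta - lr * grad / np.sqrt(n + eps)
    return theta, n
```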

Characteristics:

    • In the early stage, when g_t is small, the regularizer is large and amplifies the gradient
    • In the late stage, when g_t is large, the regularizer is small and constrains the gradient
    • Suitable for dealing with sparse gradients

Disadvantages:

    • As the formula shows, it still depends on a manually set global learning rate
    • If the global learning rate is set too large, the regularizer becomes too sensitive and adjusts the gradient too strongly
    • In the middle and late stages, the accumulated squared gradients in the denominator keep growing, so the updates shrink toward zero and training ends prematurely
Adadelta

Adadelta is an extension of Adagrad. The initial scheme still constrains the learning rate adaptively, but simplifies the computation. Adagrad accumulates all previous squared gradients, whereas Adadelta only accumulates over a fixed-size window, and instead of storing those terms directly it approximates the corresponding running average. That is:

n_t = \nu \cdot n_{t-1} + (1 - \nu) \cdot g_t^2

\Delta\theta_t = -\frac{\eta}{\sqrt{n_t + \epsilon}} \cdot g_t

At this point Adadelta still depends on the global learning rate, but the author did some further processing; after an approximate Newton iteration it becomes:

E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1 - \rho) \cdot g_t^2

\Delta x_t = -\frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_t

where E denotes the expectation (a running average).

At this point, you can see that Adadelta no longer depends on a global learning rate.
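A minimal sketch of the learning-rate-free form of Adadelta described above, keeping running averages of both the squared gradients and the squared updates (names and the decay value are illustrative):

```python
import numpy as np

def adadelta_update(theta, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
    """Adadelta step without a global learning rate: the update is scaled by
    the RMS of past updates divided by the RMS of past gradients."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2    # E[g^2]_t
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2    # E[dx^2]_t for the next step
    return theta + dx, Eg2, Edx2
```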

Characteristics:

    • In the early and middle stages of training, the acceleration effect is good and training is fast
    • In the late stage of training, it jitters repeatedly around the local minimum
RMSprop

RMSprop can be regarded as a special case of Adadelta:

When ρ = 0.5, E[g^2]_t = \rho \cdot E[g^2]_{t-1} + (1 - \rho) \cdot g_t^2 becomes an average of the squared gradients.

Taking the square root turns it into the RMS (root mean square):

RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}

This RMS can then be used as a constraint on the learning rate:

\Delta x_t = -\frac{\eta}{RMS[g]_t} \cdot g_t
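A minimal sketch of the RMSprop update above (the decaying average Eg2 must be carried between steps; names and default values are illustrative):

```python
import numpy as np

def rmsprop_update(theta, grad, Eg2, lr=0.001, rho=0.9, eps=1e-8):
    """Keep a decaying average of squared gradients and divide the step by its root (RMS)."""
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    theta = theta - lr * grad / np.sqrt(Eg2 + eps)
    return theta, Eg2
```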

Characteristics:

    • RMSprop still relies on a global learning rate
    • RMSprop is a development of Adagrad and a variant of Adadelta; its effect tends to fall between the two
    • Suitable for non-stationary objectives, and works well for RNNs
Adam

Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first-moment and second-moment estimates of the gradient. The main advantage of Adam is that, after bias correction, the learning rate at each iteration has a definite range, which makes the parameter updates more stable. The formula is as follows:

m_t = \mu \cdot m_{t-1} + (1 - \mu) \cdot g_t

n_t = \nu \cdot n_{t-1} + (1 - \nu) \cdot g_t^2

\hat{m}_t = \frac{m_t}{1 - \mu^t}

\hat{n}_t = \frac{n_t}{1 - \nu^t}

\Delta\theta_t = -\frac{\hat{m}_t}{\sqrt{\hat{n}_t} + \epsilon} \cdot \eta

where m_t and n_t are the first-moment and second-moment estimates of the gradient, which can be regarded as estimates of E[g_t] and E[g_t^2]; \hat{m}_t and \hat{n}_t are bias corrections of m_t and n_t, which approximate unbiased estimates of those expectations. As can be seen, the moment estimates place no additional memory requirements on the gradient and adjust dynamically as the gradient changes, while the term -\hat{m}_t / (\sqrt{\hat{n}_t} + \epsilon) forms a dynamic constraint on the learning rate η with a clear range.
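A minimal sketch of the Adam update with bias correction, following the formulas above (t is the 1-based step counter; names and default hyperparameters are illustrative):

```python
import numpy as np

def adam_update(theta, grad, m, n, t, lr=0.001, mu=0.9, nu=0.999, eps=1e-8):
    """One Adam step: decayed moment estimates plus bias correction."""
    m = mu * m + (1 - mu) * grad            # first-moment estimate
    n = nu * n + (1 - nu) * grad ** 2       # second-moment estimate
    m_hat = m / (1 - mu ** t)               # bias-corrected first moment
    n_hat = n / (1 - nu ** t)               # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(n_hat) + eps)
    return theta, m, n
```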

Characteristics:

    • Combines Adagrad's strength at handling sparse gradients with RMSprop's strength at handling non-stationary objectives
    • Low memory requirements
    • Computes different adaptive learning rates for different parameters
    • Also suitable for most non-convex optimization, as well as for large datasets and high-dimensional spaces
Adamax

Adamax is a variant of Adam that provides a simpler bound on the upper limit of the learning rate. The formula changes as follows:

n_t = \max(\nu \cdot n_{t-1}, |g_t|)

\Delta x_t = -\frac{\hat{m}_t}{n_t + \epsilon} \cdot \eta

As can be seen, the boundary range of the Adamax learning rate is simpler.
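A minimal sketch of the Adamax variant, where the second-moment accumulator of Adam is replaced by a running maximum (names and default values are illustrative):

```python
import numpy as np

def adamax_update(theta, grad, m, n, t, lr=0.002, mu=0.9, nu=0.999, eps=1e-8):
    """Adamax step: an element-wise max replaces the squared-gradient average."""
    m = mu * m + (1 - mu) * grad
    m_hat = m / (1 - mu ** t)               # bias-corrected first moment
    n = np.maximum(nu * n, np.abs(grad))    # simpler bound on the step size
    theta = theta - lr * m_hat / (n + eps)
    return theta, m, n
```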

Nadam

Nadam is similar to Adam with a Nesterov momentum term. The formula is as follows:

\hat{g}_t = \frac{g_t}{1 - \prod_{i=1}^{t} \mu_i}

m_t = \mu_t \cdot m_{t-1} + (1 - \mu_t) \cdot g_t

\hat{m}_t = \frac{m_t}{1 - \prod_{i=1}^{t+1} \mu_i}

n_t = \nu \cdot n_{t-1} + (1 - \nu) \cdot g_t^2

\hat{n}_t = \frac{n_t}{1 - \nu^t}

\bar{m}_t = (1 - \mu_t) \cdot \hat{g}_t + \mu_{t+1} \cdot \hat{m}_t

\Delta\theta_t = -\eta \cdot \frac{\bar{m}_t}{\sqrt{\hat{n}_t} + \epsilon}

It can be seen that Nadam places a stronger constraint on the learning rate and also affects the gradient update more directly. In general, wherever you would use RMSprop with momentum, or Adam, Nadam can usually achieve better results.
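A minimal sketch of the Nadam update, simplified by assuming a constant momentum schedule (μ_t = μ for every step, so the products over μ_i reduce to powers of μ; names and default values are illustrative):

```python
import numpy as np

def nadam_update(theta, grad, m, n, t, lr=0.002, mu=0.9, nu=0.999, eps=1e-8):
    """Nadam step with a constant momentum schedule."""
    m = mu * m + (1 - mu) * grad
    n = nu * n + (1 - nu) * grad ** 2
    g_hat = grad / (1 - mu ** t)            # bias-corrected gradient
    m_hat = m / (1 - mu ** (t + 1))         # bias-corrected momentum
    n_hat = n / (1 - nu ** t)
    m_bar = (1 - mu) * g_hat + mu * m_hat   # Nesterov-style combination
    theta = theta - lr * m_bar / (np.sqrt(n_hat) + eps)
    return theta, m, n
```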

Experience
    • For sparse data, try to use an optimization method with an adaptive learning rate; there is no need to tune the learning rate manually, and the default values usually work best
    • SGD usually takes longer to train, but the result is more reliable given a good initialization and learning rate schedule
    • If you care about fast convergence and need to train a deep or complex network, an adaptive learning rate method is recommended
    • Adadelta, RMSprop, and Adam are fairly similar algorithms and behave about the same in similar situations
    • Wherever you would use RMSprop or Adam, you can mostly use Nadam to get better results

Finally, here are two great animations; everything is in the pictures, and the text above is hardly needed anymore...

Figure 5: SGD optimization on loss surface contours

Figure 6: SGD optimization on saddle point

Reprinted from: https://zhuanlan.zhihu.com/p/22252270?utm_source=qq&utm_medium=social
