SGD, Momentum, RMSprop, and Adam: differences and connections

Reprinted from: https://zhuanlan.zhihu.com/p/32488889

Optimization algorithm framework: (1) compute the gradient of the objective function at the current parameters, g_t = ∇f(w_t); (2) compute the first-order and second-order momentum from the historical gradients, m_t = φ(g_1, ..., g_t) and V_t = ψ(g_1, ..., g_t); (3) compute the descent step for the current moment, η_t = α · m_t / √V_t; (4) update the parameters with that step, w_{t+1} = w_t − η_t.

The most important difference lies in the descent step of the third step: the first part, α / √V_t, is the effective learning rate (i.e., the step size), and the second part, m_t, is the actual descent direction. Different optimization algorithms differ precisely in how they handle these two parts.
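To make the framework concrete, here is a minimal NumPy sketch (the names first_moment, second_moment, and optimizer_step are illustrative, not from the original post, and the small eps is an assumed stabilizer for the denominator). Swapping in different choices for the two momenta yields each of the algorithms below.

```python
import numpy as np

def first_moment(grads):
    # plain-SGD choice: m_t = g_t (no momentum); other algorithms change this
    return grads[-1]

def second_moment(grads):
    # plain-SGD choice: V_t = 1 in every dimension (no adaptive learning rate)
    return np.ones_like(grads[-1])

def optimizer_step(w, grads, lr=0.01, eps=1e-8):
    # grads is the gradient history g_1, ..., g_t (step 1 is done by the caller)
    m_t = first_moment(grads)                 # step 2: first-order momentum
    V_t = second_moment(grads)                # step 2: second-order momentum
    eta_t = lr * m_t / (np.sqrt(V_t) + eps)   # step 3: descent step = step size * direction
    return w - eta_t                          # step 4: parameter update
```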

The simplest optimization algorithm is SGD, which has neither momentum nor an adaptive learning rate, yet it is still widely used.


SGD

Gradient update rule:

With no momentum (m_t = g_t) and no second-order momentum (V_t = I), the update is simply w_{t+1} = w_t − α · g_t. This is the simplest form of SGD.
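As a sketch under the assumption of a toy objective f(w) = 0.5 · w², whose gradient is simply w (used purely for illustration), one SGD step looks like this:

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    # SGD: m_t = g_t and V_t = I, so the update is simply w <- w - lr * g_t
    return w - lr * grad

# toy objective f(w) = 0.5 * w^2, whose gradient is w
w = np.array([5.0])
for _ in range(100):
    w = sgd_step(w, grad=w)
print(w)  # close to the minimum at 0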

Problem:

Because updates are frequent and each one follows a noisy gradient, the cost function oscillates severely, and the iterate may end up stuck at a local minimum or a saddle point.

To be able to escape local minima and saddle points, the concept of momentum was introduced.


SGD with Momentum

Gradient update rule: m_t = β1 · m_{t−1} + (1 − β1) · g_t, and the parameters are updated as w_{t+1} = w_t − α · m_t.

Momentum adds inertia to the gradient descent process: in dimensions where the gradient direction stays the same, the update speeds up, and in dimensions where the gradient direction keeps changing, the update slows down. This accelerates convergence and reduces oscillation.

The first-order momentum is an exponential moving average of the gradients; the empirical value of β1 is 0.9, meaning the main descent direction at moment t is determined by the descent direction at moment t−1, nudged slightly toward the gradient at moment t.
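A minimal sketch of this rule, again assuming the toy objective f(w) = 0.5 · w². The exponential-moving-average form of momentum is used here to match the description above; some texts use the unnormalized form m = β1 · m + g instead.

```python
import numpy as np

def sgdm_step(w, m, grad, lr=0.1, beta1=0.9):
    # first-order momentum: exponential moving average of the gradients
    m = beta1 * m + (1.0 - beta1) * grad
    # the step at moment t mostly follows the t-1 direction, nudged toward g_t
    return w - lr * m, m

# toy objective f(w) = 0.5 * w^2, gradient = w
w, m = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, m = sgdm_step(w, m, grad=w)
print(w)  # close to the minimum at 0
```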

Problems: first, there is no foresight; if the method knew, for example, that an uphill stretch were coming and it should slow down, its adaptability would be better. Second, parameters are not updated to different degrees according to their importance.


SGD with Nesterov acceleration

NAG addresses the first problem of SGDM: the gradient is computed not at the current position but at the point reached by following the accumulated momentum one step ahead, i.e., the gradient at moment t is taken at roughly w_t − α · m_{t−1}, and is then fed into the same momentum update as in SGDM.
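A hedged sketch of the look-ahead idea on the same assumed toy objective; conventions differ slightly across texts (some scale the look-ahead step by β1), and this sketch uses the plain accumulated-momentum look-ahead.

```python
import numpy as np

def grad_f(w):
    # assumed toy objective f(w) = 0.5 * w^2, so its gradient is just w
    return w

def nag_step(w, m, lr=0.1, beta1=0.9):
    # look ahead: evaluate the gradient one accumulated-momentum step into the future
    g_ahead = grad_f(w - lr * m)
    m = beta1 * m + (1.0 - beta1) * g_ahead
    return w - lr * m, m

w, m = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, m = nag_step(w, m)
print(w)  # close to the minimum at 0
```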

We would like to update different parameters to different degrees according to their importance, i.e., to make the learning rate adaptive. For frequently updated parameters we have already accumulated a lot of knowledge and do not want a single sample to have too large an influence, so a smaller learning rate is preferable; for rarely updated parameters we know too little and want to learn as much as possible from each occasional sample, so a larger learning rate is preferable.


Adagrad

Gradient update rule: w_{t+1} = w_t − α · g_t / (√V_t + ε).

The second-order momentum V_t is the sum of the squares of all gradient values seen so far in that dimension.

To avoid a zero denominator, a small perturbation ε is added.
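A minimal Adagrad sketch on the same assumed toy objective, with ε as the small denominator perturbation:

```python
import numpy as np

def adagrad_step(w, V, grad, lr=0.5, eps=1e-8):
    # second-order momentum: running sum of squared gradients, per dimension
    V = V + grad ** 2
    # eps keeps the denominator away from zero;
    # the accumulated V makes the effective step size shrink over time
    return w - lr * grad / (np.sqrt(V) + eps), V

# toy objective f(w) = 0.5 * w^2, gradient = w
w, V = np.array([5.0]), np.zeros(1)
for _ in range(500):
    w, V = adagrad_step(w, V, grad=w)
print(w)
```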


Problem:

The denominator keeps accumulating, so the learning rate keeps shrinking and eventually becomes vanishingly small.


RMSprop

Gradient update rule:

To fix Adagrad's ever-shrinking learning rate, RMSprop changes how the second-order momentum is computed: instead of summing over the entire history, it uses an exponentially weighted moving average over a sliding window, V_t = β2 · V_{t−1} + (1 − β2) · g_t², with the update otherwise unchanged, w_{t+1} = w_t − α · g_t / (√V_t + ε).

Hinton recommends setting β2 to 0.9 and the learning rate to 0.001.
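A minimal RMSprop sketch using the values quoted above (β2 = 0.9, learning rate 0.001), again on the assumed toy objective:

```python
import numpy as np

def rmsprop_step(w, V, grad, lr=0.001, beta2=0.9, eps=1e-8):
    # second-order momentum: exponential moving average of squared gradients,
    # so old gradients are gradually forgotten instead of accumulating forever
    V = beta2 * V + (1.0 - beta2) * grad ** 2
    return w - lr * grad / (np.sqrt(V) + eps), V

# toy objective f(w) = 0.5 * w^2, gradient = w
w, V = np.array([5.0]), np.zeros(1)
for _ in range(10000):
    w, V = rmsprop_step(w, V, grad=w)
print(w)  # heads toward the minimum at 0
```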

An algorithm that combines momentum with an adaptive learning rate should then be about the best currently available: that algorithm is Adam.


Adam

Gradient update rule: m_t = β1 · m_{t−1} + (1 − β1) · g_t, V_t = β2 · V_{t−1} + (1 − β2) · g_t², and w_{t+1} = w_t − α · m_t / (√V_t + ε).

Adam = Adaptive + Momentum; as the name implies, Adam integrates the first-order momentum of SGDM and the second-order momentum of RMSprop.

The two most commonly tuned hyperparameters of the algorithm both appear here: β1 controls the first-order momentum and β2 controls the second-order momentum.
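A minimal Adam sketch combining the two momenta on the same assumed toy objective; the bias-correction terms come from the original Adam paper and are not discussed explicitly in this post.

```python
import numpy as np

def adam_step(w, m, V, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad        # first-order momentum (from SGDM)
    V = beta2 * V + (1.0 - beta2) * grad ** 2   # second-order momentum (from RMSprop)
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for the zero initialization
    V_hat = V / (1.0 - beta2 ** t)              # (as in the Adam paper)
    return w - lr * m_hat / (np.sqrt(V_hat) + eps), m, V

# toy objective f(w) = 0.5 * w^2, gradient = w
w, m, V = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 10001):
    w, m, V = adam_step(w, m, V, grad=w, t=t)
print(w)  # heads toward the minimum at 0
```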

Adding Nesterov acceleration on top of Adam gives Nadam.
