CS231n Course Note 6.1: Iterative Optimization Algorithms (SGD, Momentum, Nesterov Momentum, Adagrad, RMSProp, Adam)

cs231n Introduction

See cs231n Course Notes 1: Introduction.
Note: italics indicate the author's own thinking; its correctness has not been validated, and suggestions are welcome.

Iterative Optimization Algorithms

A note up front: Karpathy recommends Adam as the default algorithm; if you can train on the full batch (i.e., all sampling noise is removed), try L-BFGS (a second-order optimization algorithm; search for the details). For implementations of these optimization algorithms, see cs231n Assignment Note 2.3: Optimization Algorithms (Momentum, RMSProp, Adam) and cs231n Assignment Note 1.4: Stochastic Gradient Descent (SGD).

1. SGD (Stochastic Gradient Descent / Vanilla Update)

The simplest iterative update rule is to subtract learning_rate times the gradient from the parameters.

The name "stochastic" stands in contrast to training on the full training set: each update uses only a small portion of the training set (a mini-batch); see Optimization: Stochastic Gradient Descent.
One problem with this algorithm is that the step size is determined directly by the gradient value: if the gradient magnitudes of the dimensions differ greatly, the iterate keeps oscillating along the steep dimensions while progressing slowly along the shallow ones, so convergence is slow, as shown in the figure.
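Since no code is shown here, a minimal numpy-style sketch of this update (following the cs231n convention that x is the parameter array and dx is the gradient computed on the current mini-batch; the learning rate value is just a placeholder):

    def sgd_update(x, dx, learning_rate=1e-2):
        # Vanilla SGD: step along the negative gradient, scaled by the learning rate.
        return x - learning_rate * dx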
2. Momentum Update

As shown in the figure, consider the update iteration as a physics problem: simulate a ball rolling down and settling at the bottom of a bowl. The gradient corresponds to the force (acceleration) on the ball at a given point, the amount of each update corresponds to the ball's velocity at that moment, and the parameter value corresponds to the ball's position (height, energy). Note that the force does not affect the position directly; it changes the velocity, which in turn changes the position. This physical simulation is also the origin of the name momentum.

The optimization algorithm is as follows:
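Since the figure is not reproduced here, a numpy-style sketch of the momentum update as described below (variable names v, dx, x follow the text; the mu and learning rate values are placeholders):

    def momentum_update(x, dx, v, learning_rate=1e-2, mu=0.9):
        # The gradient (force) changes the velocity, not the position directly.
        v = mu * v - learning_rate * dx   # mu*v is the retained part of the old velocity
        x = x + v                         # the velocity then changes the position
        return x, v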

Here v corresponds to the velocity, dx to the acceleration, and x to the position. To let the ball settle at the bottom, an energy-loss mechanism is added: each update keeps only mu*v of the old velocity before adding the new gradient contribution. mu is usually 0.5, 0.9, or 0.99, and v is initialized to 0.
This answers the problem raised for SGD: in momentum, the gradient modifies the velocity and only indirectly affects the position, so its effect is delayed and accumulated, which effectively damps the oscillation. At the same time, the velocity keeps accumulating along dimensions with small but consistent gradients, which speeds up convergence.

3. Nesterov Momentum Update

This algorithm is an improvement on momentum: instead of taking the gradient at the current position, each step looks one move ahead and uses the future gradient value. As mentioned in the previous section, for the momentum algorithm, if we ignore the change that the gradient makes to v, the next position will be x + mu*v (called the look-ahead step here); the gradient at that point is more conducive to convergence (this can be proven), as shown in the following figure.

The original formula is shown in the following illustration:

Notice that the gradient here is no longer evaluated at the current position, so this form is incompatible with the interface of the original SGD and momentum code. By rewriting the formula with a change of variables, the interface can be made compatible, as shown in the following illustration:
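Since the figures are not reproduced here, a numpy-style sketch of both forms (following the cs231n notes). The first form evaluates the gradient at the look-ahead point x + mu*v, where grad is a hypothetical function returning the gradient at a given position; the second form is the rewritten, interface-compatible version that only needs the gradient dx at the current x:

    def nesterov_update(x, v, grad, learning_rate=1e-2, mu=0.9):
        # Form 1: use the gradient at the look-ahead position.
        x_ahead = x + mu * v
        v = mu * v - learning_rate * grad(x_ahead)
        x = x + v
        return x, v

    def nesterov_update_compat(x, dx, v, learning_rate=1e-2, mu=0.9):
        # Form 2: the same update rewritten so that only the gradient at the
        # current x is needed, keeping the SGD/momentum interface.
        v_prev = v
        v = mu * v - learning_rate * dx
        x = x - mu * v_prev + (1 + mu) * v
        return x, v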
4. Adagrad

Notice that the learning rate in the three algorithms above is fixed, so the step size changes little from update to update. In practice, however, we want a high learning rate at the beginning, to converge quickly to the neighborhood of the optimum, and then a low learning rate, to settle onto the optimum itself (if the step is too large, the parameters will oscillate around the optimum and never converge; this oscillation is not the kind that momentum can fix). Here we need to introduce adaptive algorithms that reduce the learning rate sensibly over the course of the iteration, so that convergence is better.
Adagrad is the first adaptive algorithm here: it divides the update by the square root of the sum of all previous squared gradient values, which makes the step size monotonically decrease. Because this cache is accumulated separately for each dimension, the method also addresses the problem of gradient values differing across dimensions.
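A numpy-style sketch of the Adagrad update (following the cs231n notes; cache holds the per-dimension sum of squared gradients, and eps is a small constant that avoids division by zero):

    import numpy as np

    def adagrad_update(x, dx, cache, learning_rate=1e-2, eps=1e-8):
        cache = cache + dx ** 2                               # accumulate squared gradients per dimension
        x = x - learning_rate * dx / (np.sqrt(cache) + eps)   # large past gradients lead to smaller steps
        return x, cache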

Note that in the Adagrad algorithm the learning rate decreases monotonically, and this way of adjusting the step size is sometimes too greedy: the learning rate may shrink prematurely, and the parameters end up stopping short of the optimum. RMSProp is an improved version of Adagrad.

5. RMSProp

Simply put, RMSProp's improvement over Adagrad is the way the cache is updated. Instead of accumulating the squared gradients forever, RMSProp introduces a leak: the cache loses a portion of its value at each step, so the step size is no longer monotonically decreasing.
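A numpy-style sketch (following the cs231n notes): the only change relative to Adagrad is that cache becomes a leaky, exponentially decaying average controlled by decay_rate (typically around 0.9 to 0.99):

    import numpy as np

    def rmsprop_update(x, dx, cache, learning_rate=1e-2, decay_rate=0.99, eps=1e-8):
        cache = decay_rate * cache + (1 - decay_rate) * dx ** 2   # leaky accumulation of squared gradients
        x = x - learning_rate * dx / (np.sqrt(cache) + eps)
        return x, cache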
6. Adam

Adam is the default iterative algorithm that Karpathy recommends in lecture. It can be understood as a fusion of momentum and RMSProp, with bias correction introduced to handle m and v being too small (biased toward zero) during the first few iterations.
The specific algorithm is shown in the following figure. (The equation in that figure is inconsistent with the original paper; the correct method is given below.) beta1 is usually set to 0.9 and beta2 to 0.995.

The equation in the illustration above is inconsistent with the original paper; the correct method is shown in the following figure (see Adam: A Method for Stochastic Optimization for details). The difference has two points (a code sketch follows the list):
1. In the bias-correction step, the corrected m and v are used only for the current update and are not propagated to subsequent iterations; that is, the stored m and v are not overwritten by the corrected values.
2. t is incremented at the beginning of the loop, so t = 1 on the first iteration.
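A numpy-style sketch of the corrected version described above (following the cs231n notes and the two points just listed; the hyperparameter values are the ones quoted in this note):

    import numpy as np

    def adam_update(x, dx, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.995, eps=1e-8):
        t = t + 1                                  # incremented at the start, so t = 1 on the first step
        m = beta1 * m + (1 - beta1) * dx           # momentum-like first moment
        v = beta2 * v + (1 - beta2) * (dx ** 2)    # RMSProp-like second moment
        mb = m / (1 - beta1 ** t)                  # bias-corrected values are used only for this update
        vb = v / (1 - beta2 ** t)                  # and are not written back into m and v
        x = x - learning_rate * mb / (np.sqrt(vb) + eps)
        return x, m, v, t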
7. Adjusting the Learning Rate

The learning rate is a hyperparameter shared by all of the algorithms above, and its value directly affects the quality of the result. As the figure below shows, an appropriate learning rate is neither too high nor too low.

Different stages of training may also call for different learning rates. So, in addition to the adaptive optimization algorithms above, this section introduces explicit learning rate decay: three schemes that reduce the learning rate as the number of iterations grows. The details are shown in the following illustration:
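Since the figure is not reproduced here, a sketch of the three standard schemes from the cs231n notes (step decay, exponential decay, and 1/t decay); lr0 is the initial rate, and k, drop, every are placeholder decay constants:

    import numpy as np

    def step_decay(lr0, t, drop=0.5, every=10):
        # Multiply the rate by a fixed factor every few steps.
        return lr0 * (drop ** (t // every))

    def exponential_decay(lr0, t, k=0.1):
        return lr0 * np.exp(-k * t)

    def one_over_t_decay(lr0, t, k=0.1):
        return lr0 / (1.0 + k * t)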

The independent variable t is usually the iteration number, but it can also be counted in epochs.
