I. Introduction
In many machine learning and deep learning applications, we find that the most commonly used optimizer is Adam. Why?
TensorFlow provides a whole family of optimizers; for details see: https://www.tensorflow.org/api_guides/python/train
Keras likewise offers SGD, RMSprop, Adagrad, Adadelta, Adam, and others; for details see: https://keras.io/optimizers/
Besides ordinary gradient descent, then, there are several other optimizers such as Adadelta, Adagrad, and RMSprop. What are they, and how should we choose among them?
II. Overview of optimizer algorithms
First, let's look at the three most common variants of gradient descent: BGD, SGD, and MBGD.
The difference between the three lies in how much data we use to compute the gradient of the objective function,
which naturally involves a trade-off between the accuracy of the parameter update and the running time.
1. Batch Gradient Descent
Gradient update rule:
BGD uses the entire training set to compute the gradient of the cost function with respect to the parameters:
θ = θ − η·∇_θ J(θ)
Disadvantages:
Because this method computes the gradient over the entire dataset for a single update, it is very slow, becomes intractable for datasets that do not fit in memory, and cannot update the model online with new samples.
for i in range(nb_epochs):
    params_grad = evaluate_gradient(loss_function, data, params)
    params = params - learning_rate * params_grad
We first define a number of epochs; in each epoch we compute the gradient vector params_grad and then update the parameters params along the opposite direction of the gradient, where the learning rate determines how large a step we take.
Batch gradient descent converges to the global minimum for convex functions and to a local minimum for non-convex functions.
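To make this concrete, here is a minimal runnable sketch of BGD in NumPy for a least-squares problem; the toy data and the MSE gradient inside evaluate_gradient are assumptions made for this example, not part of the original pseudocode:

    import numpy as np

    # Toy data for y = 2x + 1 plus noise (assumed for this example)
    rng = np.random.default_rng(0)
    X = np.hstack([rng.uniform(-1, 1, (100, 1)), np.ones((100, 1))])
    y = X @ np.array([2.0, 1.0]) + 0.1 * rng.standard_normal(100)

    def evaluate_gradient(loss_function, data, params):
        # Hard-coded MSE gradient over *all* the data; the loss_function
        # argument is kept only to match the pseudocode's signature.
        X_, y_ = data
        return 2.0 / len(y_) * X_.T @ (X_ @ params - y_)

    params = np.zeros(2)
    learning_rate, nb_epochs = 0.5, 100
    for i in range(nb_epochs):
        params_grad = evaluate_gradient('mse', (X, y), params)
        params = params - learning_rate * params_grad
    print(params)  # approaches [2.0, 1.0]

The later sketches in this section reuse the same evaluate_gradient convention.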
2. Stochastic Gradient Descent
Gradient update rule:
Whereas BGD computes the gradient over all the data for one update, SGD performs an update for each individual sample x^(i) with label y^(i):
θ = θ − η·∇_θ J(θ; x^(i), y^(i))
For large datasets there may be many similar samples, so BGD does redundant work when computing the gradients,
while SGD performs only one update at a time, has no redundancy, is comparatively fast, and can learn from new samples online.
for i in range(nb_epochs):
    np.random.shuffle(data)
    for example in data:
        params_grad = evaluate_gradient(loss_function, example, params)
        params = params - learning_rate * params_grad
Looking at the code, the difference is that instead of one update over the whole dataset, the inner loop performs a parameter update for every single sample.
Disadvantages: because SGD updates much more frequently, the cost function oscillates severely.
While BGD converges to a local minimum, SGD's oscillation may let it jump to a better local minimum.
When we slowly decrease the learning rate, SGD shows the same convergence behavior as BGD.
3. Mini-batch Gradient Descent
Gradient update rule: MBGD uses a small batch of n samples for each update:
θ = θ − η·∇_θ J(θ; x^(i:i+n), y^(i:i+n))
On the one hand this reduces the variance of the parameter updates, so convergence is more stable;
on the other hand we can make full use of the highly optimized matrix operations in deep learning libraries to compute the gradient very efficiently.
The difference from SGD is that each inner loop iteration operates not on a single sample but on a batch of n samples.
for i in range(nb_epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=50):
        params_grad = evaluate_gradient(loss_function, batch, params)
        params = params - learning_rate * params_grad
Hyperparameter setting: n is generally chosen in the range 50 to 256.
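The get_batches helper above is not defined in the original pseudocode; a minimal sketch of what it could look like (name and behavior are assumptions):

    def get_batches(data, batch_size=50):
        # Yield successive mini-batches of batch_size samples each;
        # the final batch may be smaller.
        for start in range(0, len(data), batch_size):
            yield data[start:start + batch_size]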
Disadvantages: however, mini-batch gradient descent does not guarantee good convergence. If the learning rate is chosen too small, convergence is very slow; if too large, the loss function keeps oscillating around the minimum or even diverges.
(One remedy is to start with a larger learning rate and shrink it once the change between two iterations drops below some threshold, but the threshold has to be fixed in advance and therefore cannot adapt to the characteristics of the dataset.)
In addition, this method applies the same learning rate to all parameter updates; if the data is sparse, we would rather make larger updates for features that occur infrequently.
Also, for non-convex functions we want to avoid getting trapped at local minima or saddle points, where the error is nearly the same in every direction and the gradient is close to 0 in all dimensions, so SGD is easily stuck there.
A saddle point of a smooth function is a point where neighboring parts of the curve, surface, or hypersurface lie on different sides of the tangent at that point.
For example, a two-dimensional surface such as z = x² − y² looks like a saddle: it curves upward along the x-axis and downward along the y-axis, with the saddle point at (0, 0).
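To make the saddle point concrete, take f(x, y) = x² − y² (an illustrative surface consistent with the description above, not necessarily the one in the original figure). Its gradient ∇f = (2x, −2y) vanishes only at (0, 0), yet that point is neither a minimum nor a maximum: the second derivative is +2 along x and −2 along y. Because the gradient is exactly 0 there, a plain gradient step makes no progress, which is why SGD can stall at such points.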
To address these challenges, the following algorithms have been proposed.
4. Momentum
SGD is easily trapped in ravines, i.e., regions where the surface curves much more steeply in one dimension than in another; there SGD oscillates from side to side and cannot make progress toward the minimum.
Gradient update rule:
Momentum accelerates SGD and suppresses the oscillation by adding a fraction γ of the previous update vector v_{t−1}:
v_t = γ·v_{t−1} + η·∇_θ J(θ)
θ = θ − v_t
It is like rolling a ball down a hill: the ball accumulates momentum where nothing resists it and slows down where it meets resistance.
Adding this term speeds up movement along dimensions whose gradient keeps pointing the same way and damps movement along dimensions whose gradient changes direction, which accelerates convergence and reduces oscillation.
Hyperparameter setting: γ is usually set to about 0.9.
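A minimal sketch of the momentum update in NumPy, reusing the evaluate_gradient, data, and loss_function names from the earlier pseudocode (those, and the learning rate value, are assumptions of this sketch):

    v = np.zeros_like(params)
    gamma, learning_rate = 0.9, 0.01
    for i in range(nb_epochs):
        params_grad = evaluate_gradient(loss_function, data, params)
        v = gamma * v + learning_rate * params_grad  # accumulate velocity
        params = params - v                          # step along the velocity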
Disadvantage: in this form the ball rolls blindly down the slope. It would be better if it had some foresight, for example knowing to slow down before the slope rises again.
5. Nesterov Accelerated Gradient
Gradient update rule:
NAG uses θ − γ·v_{t−1} to approximate the parameters' next position, so the gradient is computed not at the current position but at the (approximate) future position:
v_t = γ·v_{t−1} + η·∇_θ J(θ − γ·v_{t−1})
θ = θ − v_t
Hyperparameter setting: γ is still set to about 0.9.
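Under the same assumptions, a sketch of the NAG step; the only change from momentum is that the gradient is evaluated at the look-ahead position:

    v = np.zeros_like(params)
    gamma, learning_rate = 0.9, 0.01
    for i in range(nb_epochs):
        lookahead = params - gamma * v  # approximate future position
        params_grad = evaluate_gradient(loss_function, data, lookahead)
        v = gamma * v + learning_rate * params_grad
        params = params - v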
Comparison of the two:
Momentum first computes the current gradient and then takes a big jump in the direction of the updated accumulated gradient (blue vectors in the original figure).
NAG first takes a big jump in the direction of the previously accumulated gradient (brown vector), then measures the gradient where it ends up and makes a correction (red vector); this anticipatory update prevents us from moving too fast.
NAG makes RNNs perform better on a number of tasks.
So far, we can adapt our updates to the slope of the loss function and speed up SGD in turn.
We would also like to update each parameter to a different extent according to its importance.
6. Adagrad
This algorithm makes larger updates for parameters associated with low-frequency features and smaller updates for high-frequency ones. It therefore performs well on sparse data and improves the robustness of SGD; for example, it was used to recognize cats in YouTube videos and to train GloVe word embeddings, both tasks in which infrequent features need larger updates.
Gradient update rule:
θ_{t+1,i} = θ_{t,i} − η/√(G_{t,ii} + ε) · g_{t,i}
where g_{t,i} is the gradient of the objective with respect to parameter θ_i at time step t:
g_{t,i} = ∇_θ J(θ_i)
With plain SGD, the update for θ_i at each time step would be:
θ_{t+1,i} = θ_{t,i} − η·g_{t,i}
But here the learning rate η is scaled per time step t and per parameter i:
G_t is a diagonal matrix whose (i, i) element is the sum of the squares of the gradients of θ_i up to time step t, and ε is a smoothing term that avoids division by zero.
The advantage of Adagrad is that it eliminates the need to tune the learning rate manually.
Hyperparameter setting: η is usually set to 0.01.
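A sketch of the Adagrad update under the same assumptions; note how the accumulated squared gradients G give every parameter its own effective learning rate:

    G = np.zeros_like(params)
    eta, eps = 0.01, 1e-8
    for i in range(nb_epochs):
        g = evaluate_gradient(loss_function, data, params)
        G = G + g ** 2                                # accumulate squared gradients
        params = params - eta / np.sqrt(G + eps) * g  # per-parameter step size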
Disadvantage: the sum in the denominator keeps growing, so the learning rate shrinks and eventually becomes vanishingly small.
7. Adadelta
This algorithm is an improvement on Adagrad: instead of accumulating all past squared gradients, the G in the denominator is replaced by a decaying average of the past squared gradients:
Δθ_t = −η/√(E[g²]_t + ε) · g_t
Since this denominator is just the root mean square (RMS) of the gradient, it can be abbreviated as:
Δθ_t = −η/RMS[g]_t · g_t
where the running average E[g²]_t at time step t depends on the previous average and the current gradient:
E[g²]_t = γ·E[g²]_{t−1} + (1 − γ)·g_t²
Gradient update rule:
In addition, replacing the learning rate η with RMS[Δθ]_{t−1} means we do not even need to set a learning rate in advance:
Δθ_t = −(RMS[Δθ]_{t−1}/RMS[g]_t) · g_t
θ_{t+1} = θ_t + Δθ_t
Hyperparameter setting: γ is generally set to 0.9.
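A sketch of the Adadelta update under the same assumptions; E_g2 and E_dx2 hold the decaying averages of squared gradients and squared updates, and no learning rate appears:

    E_g2 = np.zeros_like(params)
    E_dx2 = np.zeros_like(params)
    gamma, eps = 0.9, 1e-8
    for i in range(nb_epochs):
        g = evaluate_gradient(loss_function, data, params)
        E_g2 = gamma * E_g2 + (1 - gamma) * g ** 2
        dx = -np.sqrt(E_dx2 + eps) / np.sqrt(E_g2 + eps) * g  # RMS[Δθ]_{t-1} / RMS[g]_t
        E_dx2 = gamma * E_dx2 + (1 - gamma) * dx ** 2
        params = params + dx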
8. RMSprop
RMSprop is an adaptive learning rate method proposed by Geoff Hinton.
RMSprop and Adadelta were both developed, around the same time, to resolve Adagrad's radically diminishing learning rate.
Gradient update rule: RMSprop is identical to the first form of Adadelta:
E[g²]_t = 0.9·E[g²]_{t−1} + 0.1·g_t²
θ_{t+1} = θ_t − η/√(E[g²]_t + ε) · g_t
Hyperparameter setting: Hinton suggests setting γ to 0.9 and the learning rate η to 0.001.
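A sketch of the RMSprop update under the same assumptions, using Hinton's suggested values:

    E_g2 = np.zeros_like(params)
    gamma, eta, eps = 0.9, 0.001, 1e-8
    for i in range(nb_epochs):
        g = evaluate_gradient(loss_function, data, params)
        E_g2 = gamma * E_g2 + (1 - gamma) * g ** 2
        params = params - eta / np.sqrt(E_g2 + eps) * g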
9. Adam
This algorithm is yet another method that computes an adaptive learning rate for each parameter.
In addition to storing an exponentially decaying average of past squared gradients v_t like Adadelta and RMSprop, it also keeps an exponentially decaying average of past gradients m_t, similar to momentum:
m_t = β₁·m_{t−1} + (1 − β₁)·g_t
v_t = β₂·v_{t−1} + (1 − β₂)·g_t²
Since m_t and v_t are initialized as zero vectors, they are biased toward 0, so bias-corrected estimates are computed to counteract these biases:
m̂_t = m_t/(1 − β₁^t)
v̂_t = v_t/(1 − β₂^t)
Gradient update rule:
θ_{t+1} = θ_t − η/(√v̂_t + ε) · m̂_t
Hyperparameter setting: suggested values are β₁ = 0.9, β₂ = 0.999, ε = 10^−8.
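A sketch of the Adam update under the same assumptions (the learning rate η = 0.001 is the default suggested in the Adam paper):

    m = np.zeros_like(params)
    v = np.zeros_like(params)
    beta1, beta2, eta, eps = 0.9, 0.999, 0.001, 1e-8
    for t in range(1, nb_epochs + 1):  # t starts at 1 for bias correction
        g = evaluate_gradient(loss_function, data, params)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        params = params - eta / (np.sqrt(v_hat) + eps) * m_hat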
Practice shows that Adam works better than the other adaptive learning rate methods.
10. Effect comparison
Let's look at how these algorithms perform at a saddle point and on the contours of a loss surface:
In both comparisons, Adagrad, Adadelta, and RMSprop find the right direction almost immediately and converge very quickly, while the other methods are either very slow or take long detours before finding it.
The adaptive learning rate methods, i.e., Adagrad, Adadelta, RMSprop, and Adam, are better suited to such scenarios and converge better.
III. How to choose
If the data is sparse, use an adaptive method, i.e., Adagrad, Adadelta, RMSprop, or Adam.
RMSprop, Adadelta, and Adam behave similarly in many cases. Adam adds bias correction and momentum on top of RMSprop, and as gradients become sparser Adam slightly outperforms RMSprop. Overall, Adam is the best choice.
Many papers simply use vanilla SGD without momentum or anything else. SGD can reach a minimum, but it takes longer than the other algorithms and may get stuck at saddle points.
If faster convergence is needed, or a deeper and more complex neural network is being trained, an adaptive learning rate method should be used.