Objective
This article only gives a brief introduction to and simple comparison of some common optimization methods; for the detailed contents and formulas of each method you will have to chew through the original papers, which are not repeated here.
SGD
SGD here refers to mini-batch gradient descent; the specific differences between batch gradient descent, stochastic gradient descent, and mini-batch gradient descent are not elaborated here. Nowadays SGD generally means mini-batch gradient descent.
SGD is the most common optimization method: at each iteration it computes the gradient on the current mini-batch and then uses it to update the parameters.
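Written out in a common notation (the parameters are denoted $\theta$ and the objective $f$), the update at step $t$ is:
$$ g_t = \nabla_{\theta_{t-1}} f(\theta_{t-1}) $$
$$ \Delta\theta_t = -\eta \, g_t $$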
Here, $\eta$ is the learning rate and $g_t$ is the gradient. SGD depends entirely on the gradient of the current batch, so $\eta$ can be understood as how much the gradient of the current batch is allowed to influence the parameter update.
Disadvantages: (it is precisely because of these shortcomings that so many people have developed the various follow-up algorithms)
- Choosing a suitable learning rate is difficult
- The same learning rate is used for all parameter updates. For sparse data or features, we may sometimes want to update infrequently occurring features faster and frequently occurring features more slowly, and SGD cannot easily meet this requirement
- SGD easily converges to a local optimum and, in some cases, may get trapped at a saddle point. (This originally read "easily trapped at a saddle point"; after reviewing the literature it turns out that, with appropriate initialization and step size, the influence of saddle points is not that large. Thanks to @Ice Orange.)
Momentum
Momentum borrows the concept of momentum from physics: it accumulates the previous momentum to replace the true gradient.
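In the same notation, the standard momentum update accumulates a velocity term $m_t$:
$$ m_t = \mu \, m_{t-1} + g_t $$
$$ \Delta\theta_t = -\eta \, m_t $$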
Here, $\mu$ is the momentum factor.
Characteristics:
- In the early stage of descent, the previous parameter update is reused; since the descent directions are consistent, multiplying by a larger $\mu$ gives good acceleration
- In the middle and later stages of descent, when the updates oscillate back and forth around a local minimum and the gradient becomes small, $\mu$ increases the update magnitude and helps jump out of the trap
- When the gradient changes direction, $\mu$ can reduce the update
In summary, the momentum term can accelerate SGD in the relevant direction, suppress oscillations, and speed up convergence.
Nesterov
The Nesterov term applies a correction to the gradient update, which avoids moving forward too fast while improving responsiveness.
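Expanding the momentum update from the previous section (substituting $m_t = \mu \, m_{t-1} + g_t$ into $\Delta\theta_t = -\eta \, m_t$) gives:
$$ \Delta\theta_t = -\eta \, \mu \, m_{t-1} - \eta \, g_t $$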
As can be seen, $m_{t-1}$ does not directly change the current gradient $g_t$, so Nesterov's improvement is to let the previous momentum directly affect the current momentum.
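Concretely, the standard Nesterov-style update evaluates the gradient at the look-ahead point:
$$ g_t = \nabla_{\theta_{t-1}} f(\theta_{t-1} - \eta \, \mu \, m_{t-1}) $$
$$ m_t = \mu \, m_{t-1} + g_t $$
$$ \Delta\theta_t = -\eta \, m_t $$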
Therefore, after the Nesterov term is added, the gradient is computed after the large jump and used to correct the current update, as illustrated below:
Momentum first computes a gradient (short blue vector) and then makes a large jump in the direction of the accumulated gradient (long blue vector); Nesterov first makes a large jump in the direction of the previously accumulated gradient (brown vector), then computes the gradient and makes a correction (green vector).
In fact, both the momentum term and the Nesterov term are meant to make gradient updates more flexible and better targeted to different situations. However, manually setting a fixed learning rate is still somewhat crude, so the next sections introduce several adaptive learning rate methods.
Adagrad
Adagrad actually imposes a constraint on the learning rate.
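In the same notation, the Adagrad update accumulates the squared gradients in $n_t$:
$$ n_t = n_{t-1} + g_t^2 $$
$$ \Delta\theta_t = -\frac{\eta}{\sqrt{n_t + \epsilon}} \, g_t $$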
Here, a recursion over $g_r$ from $r = 1$ to $t$ forms a constraint term (regularizer), $-\frac{1}{\sqrt{\sum_{r=1}^{t} g_r^2 + \epsilon}}$, where $\epsilon$ is used to ensure the denominator is not 0.
Characteristics:
- In the early stage, when $g_t$ is small, the regularizer is large and can amplify the gradient
- In the later stage, when $g_t$ is large, the regularizer is small and can constrain the gradient
- Suitable for processing sparse gradients
Disadvantages:
- As can be seen from the formula, it is still dependent on the manual setting of a global learning rate
- If the global learning rate is set too large, the regularizer becomes too sensitive and adjusts the gradient too aggressively
- In the middle and later stages, the sum of squared gradients in the denominator keeps growing, shrinking the updates toward 0 and ending training prematurely.
Adadelta
Adadelta is an extension of Adagrad. The initial scheme still adaptively constrains the learning rate but simplifies the computation: Adagrad accumulates all previous squared gradients, whereas Adadelta only accumulates terms over a fixed-size window, and does not even store them directly but instead approximates the corresponding running average.
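Replacing Adagrad's cumulative sum with an exponentially decaying average (decay rate $\nu$) gives:
$$ n_t = \nu \, n_{t-1} + (1 - \nu) \, g_t^2 $$
$$ \Delta\theta_t = -\frac{\eta}{\sqrt{n_t + \epsilon}} \, g_t $$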
At this point Adadelta still depends on the global learning rate, but the author applies some further processing based on an approximate Newton iteration.
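Following the Adadelta paper, the resulting update (decay rate $\rho$) replaces the global learning rate with the RMS of the previous parameter updates:
$$ E[g^2]_t = \rho \, E[g^2]_{t-1} + (1 - \rho) \, g_t^2 $$
$$ \Delta x_t = -\frac{\sqrt{E[\Delta x^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} \, g_t $$
$$ E[\Delta x^2]_t = \rho \, E[\Delta x^2]_{t-1} + (1 - \rho) \, \Delta x_t^2 $$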
Here, $E[\cdot]$ denotes taking an expectation (in practice, an exponentially decaying running average).
At this point, you can see that Adadelta is not dependent on the global learning rate.
Characteristics:
- In the early and middle stages of training, the acceleration effect is good and fast
- In the later stage of training, it repeatedly jitters around a local minimum
Rmsprop
Rmsprop can be regarded as a special case of Adadelta: when the decay rate is set to 0.5, the running average above becomes a simple average of the squared gradients.
Taking the square root of this average gives the RMS (root mean square).
This RMS can then be used as a constraint on the learning rate.
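Written out (assuming the equal-weight decay of 0.5 implied by the "average" description above), the three steps are:
$$ E[g^2]_t = 0.5 \, E[g^2]_{t-1} + 0.5 \, g_t^2 $$
$$ RMS[g]_t = \sqrt{E[g^2]_t + \epsilon} $$
$$ \Delta x_t = -\frac{\eta}{RMS[g]_t} \, g_t $$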
Characteristics:
- In fact, Rmsprop still relies on the global learning rate
- Rmsprop is a development of Adagrad and a variant of Adadelta; its effect tends to lie between the two
- Suitable for non-stationary objectives; works well for RNNs
Adam
Adam (Adaptive Moment Estimation) is essentially Rmsprop with a momentum term; it dynamically adjusts the learning rate of each parameter using first and second moment estimates of the gradient. The main advantage of Adam is that, after bias correction, the learning rate at each iteration has a definite range, which makes the parameter updates more stable.
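In the usual notation, with decay rates $\mu$ and $\nu$ for the two moment estimates, the Adam update is:
$$ m_t = \mu \, m_{t-1} + (1 - \mu) \, g_t $$
$$ n_t = \nu \, n_{t-1} + (1 - \nu) \, g_t^2 $$
$$ \hat{m}_t = \frac{m_t}{1 - \mu^t}, \qquad \hat{n}_t = \frac{n_t}{1 - \nu^t} $$
$$ \Delta\theta_t = -\frac{\hat{m}_t}{\sqrt{\hat{n}_t} + \epsilon} \, \eta $$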
Here, $m_t$ and $n_t$ are the first and second moment estimates of the gradient, which can be regarded as estimates of the expectations $E[g_t]$ and $E[g_t^2]$; $\hat{m}_t$ and $\hat{n}_t$ are corrections to $m_t$ and $n_t$ that approximate unbiased estimates of these expectations. It can be seen that estimating the moments directly from the gradient places no extra demands on memory and adapts dynamically to the gradient, while the term $-\hat{m}_t / (\sqrt{\hat{n}_t} + \epsilon)$ forms a dynamic constraint on the learning rate with a clear range.
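As a concrete sketch of the formulas above, here is a minimal NumPy implementation of a single Adam step; the function name is illustrative, and the default hyperparameters are the values suggested in the Adam paper ($\eta = 0.001$, $\mu = 0.9$, $\nu = 0.999$, $\epsilon = 10^{-8}$):

```python
import numpy as np

def adam_step(theta, grad, m, n, t, lr=0.001, mu=0.9, nu=0.999, eps=1e-8):
    """One Adam update for the parameter array `theta` given its gradient `grad`.

    `m` and `n` are the running first- and second-moment estimates,
    and `t` is the 1-based iteration counter.
    """
    m = mu * m + (1 - mu) * grad          # first-moment (mean) estimate
    n = nu * n + (1 - nu) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - mu ** t)             # bias corrections
    n_hat = n / (1 - nu ** t)
    theta = theta - lr * m_hat / (np.sqrt(n_hat) + eps)
    return theta, m, n

# Usage: start with m = n = np.zeros_like(theta) and call with t = 1, 2, ...
```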
Characteristics:
- Combines Adagrad's strength in handling sparse gradients with Rmsprop's strength in handling non-stationary objectives
- Low memory requirements
- Calculate different adaptive learning rates for different parameters
- Also suitable for most non-convex optimization problems, as well as for large datasets and high-dimensional spaces
Adamax
Adamax is a variant of Adam, which provides a simpler range for the upper limit of the learning rate. The changes in the formula are as follows:
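With the same notation as for Adam, the changed lines are (the small $\epsilon$ in the denominator is kept here only for symmetry with the earlier formulas):
$$ n_t = \max(\nu \, n_{t-1}, |g_t|) $$
$$ \Delta\theta_t = -\frac{\hat{m}_t}{n_t + \epsilon} \, \eta $$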
It can be seen that the bound on the Adamax learning rate is simpler.
Nadam
Nadam is similar to Adam with the Nesterov momentum term. The formula is as follows:
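A commonly quoted form with a constant momentum factor $\mu$ is shown below (the original formulation uses a momentum schedule $\mu_t$, which is omitted here for simplicity):
$$ m_t = \mu \, m_{t-1} + (1 - \mu) \, g_t, \qquad n_t = \nu \, n_{t-1} + (1 - \nu) \, g_t^2 $$
$$ \hat{m}_t = \frac{m_t}{1 - \mu^t}, \qquad \hat{n}_t = \frac{n_t}{1 - \nu^t} $$
$$ \bar{m}_t = \mu \, \hat{m}_t + \frac{(1 - \mu) \, g_t}{1 - \mu^t} $$
$$ \Delta\theta_t = -\eta \, \frac{\bar{m}_t}{\sqrt{\hat{n}_t} + \epsilon} $$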
It can be seen that Nadam places a stronger constraint on the learning rate and also has a more direct influence on the gradient update. In general, wherever you would want to use Rmsprop or Adam, you can usually achieve better results with Nadam.
Experience
- For sparse data, try to use an optimization method with an adaptive learning rate; there is no need to tune the learning rate manually, and it is best to use the default values
- SGD usually takes longer to train, but with a good initialization and learning rate schedule the results are more reliable
- If you care about faster convergence and need to train deeper, more complex networks, an adaptive learning rate method is recommended
- Adadelta, Rmsprop, and Adam are relatively similar algorithms and behave roughly the same in similar situations
- Wherever you would want to use Rmsprop or Adam, you can mostly use Nadam to get better results
Finally, here are two great animations; everything is in the pictures, and the text above is hardly needed anymore...
Figure 5: SGD optimization on loss surface contours
Figure 6: SGD optimization on saddle point
Reprinted from: https://zhuanlan.zhihu.com/p/22252270?utm_source=qq&utm_medium=social
Deep interpretation of the most popular optimization algorithms: Gradient Descent (Lite version)