One of the most commonly used optimization methods in machine learning: a review of gradient descent optimization algorithms


Transferred from: http://www.dataguru.cn/article-10174-1.html

The gradient descent algorithm is a very widely used optimization algorithm in machine learning, and it is the most commonly used optimization method in many machine learning algorithms. Almost every state-of-the-art machine learning or deep learning library includes implementations of different variants of gradient descent. However, these variants are often used as black-box optimizers, and practical explanations of their strengths and weaknesses are hard to find. This article introduces the different variants of the gradient descent algorithm so that readers can choose among them according to their specific needs.

The article first introduces the three frameworks of the gradient descent algorithm, then discusses the problems and challenges they face, then describes how individual algorithms improve on them, and then describes how to use gradient descent in parallel or distributed environments. Finally, it points out some additional strategies that are helpful for gradient descent.

Contents

  Three gradient descent optimization frameworks
    Batch gradient descent
    Stochastic gradient descent
    Mini-batch gradient descent
  Problems and challenges
  Gradient descent optimization algorithms
    Momentum
    Nesterov accelerated gradient
    Adagrad
    Adadelta
    RMSprop
    Adam
    Visualization of the algorithms
    Which optimizer to choose?
  Parallel and distributed SGD
    Hogwild!
    Downpour SGD
    Delay-tolerant algorithms for SGD
    TensorFlow
    Elastic Averaging SGD
  More SGD optimization strategies
    Shuffling and curriculum learning
    Batch normalization
    Early stopping
    Gradient noise
  Summary
  References

Three gradient descent optimization frameworks

The gradient descent algorithm updates the model parameters θ ∈ R^d in the direction opposite to the gradient ∇_θ J(θ) of the objective function J(θ), moving toward a minimum (convergence) point of the objective function, with step size (learning rate) η.

There are three gradient descent frameworks, which differ in how many samples are used for each update of the model parameters; using a different number of samples per update trades off the accuracy of each update against the time each update takes.

Batch gradient descent

Batch gradient descent uses the full training set for every parameter update:

    θ = θ − η · ∇_θ J(θ)

Its code is as follows:

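The original code listing did not survive this repost, so here is a minimal sketch of batch gradient descent consistent with the explanation that follows; evaluate_gradient is a hypothetical helper standing in for a gradient-computation routine, and data denotes the full training set.

    # Batch gradient descent: one update per pass over the full training set.
    # 'evaluate_gradient' is a hypothetical helper returning dJ/dparams.
    for i in range(epochs):
        params_grad = evaluate_gradient(loss_function, data, params)
        params = params - learning_rate * params_grad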
epochs is the maximum number of iterations specified by the user. As can be seen from the code above, in each iteration the gradient params_grad of the loss function loss_function is computed over the entire training set, and then every parameter in params is updated in the direction opposite to the gradient using the learning rate learning_rate. In general, existing machine learning libraries provide APIs for computing gradients. If you write the gradient computation yourself, you should verify during debugging that the gradients are computed correctly.

Because batch gradient descent uses the entire training set for every update, its advantage is that each update moves in the right direction and convergence to an extremum is guaranteed (to the global extremum for convex functions, possibly to a local extremum for non-convex functions). Its disadvantages are that each update takes a long time, a sufficiently large training set may consume a large amount of memory, and batch gradient descent cannot update the model parameters online.

Stochastic gradient descent

The stochastic gradient descent algorithm randomly selects a single sample (x_i, y_i) from the training set for each update:

    θ = θ − η · ∇_θ J(θ; x_i; y_i)

The batch gradient descent algorithm uses all training samples for every update, so many of those computations are redundant, because exactly the same set of samples is used each time. Stochastic gradient descent randomly selects only one sample at a time to update the model parameters, so each update is very fast and the model can be trained online. The code is as follows:

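Again, the original listing is missing, so the following is a minimal sketch under the same assumptions (the hypothetical evaluate_gradient helper and data as the training set); note that the training data are shuffled at every epoch, which is also discussed in the shuffling section later in the article.

    import numpy as np

    # Stochastic gradient descent: one update per training example.
    for i in range(epochs):
        np.random.shuffle(data)          # shuffle the training set each epoch
        for example in data:
            params_grad = evaluate_gradient(loss_function, example, params)
            params = params - learning_rate * params_grad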
The disadvantage of stochastic gradient descent is that each update may not move in the correct direction, which can cause the optimization to fluctuate (oscillate), as shown below:

Figure 1: SGD fluctuation

On the other hand, the fluctuation of stochastic gradient descent has an advantage: for loss surfaces with many basins (that is, many local minima), this fluctuation may cause the optimization to jump from the current local minimum into another, better local minimum. For a non-convex function it is therefore possible to converge to a better local extremum or even the global extremum. Because of the fluctuation, the number of iterations (updates) increases, i.e. convergence slows down. In the end, however, stochastic gradient descent shows the same convergence behaviour as batch gradient descent: convex functions converge to the global extremum, and non-convex loss functions converge to a local extremum.

Mini-batch gradient descent

Mini-batch gradient descent combines batch gradient descent with stochastic gradient descent, balancing the speed of each update against the stability of the updates. Each update randomly selects m samples (m < n) from the training set to learn from:

    θ = θ − η · ∇_θ J(θ; x_{i:i+m}; y_{i:i+m})

The code is as follows:

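As before, this is a minimal sketch rather than the original listing; get_batches is a hypothetical helper that yields mini-batches of the chosen size (here 50 samples).

    # Mini-batch gradient descent: one update per mini-batch of m samples.
    for i in range(epochs):
        np.random.shuffle(data)
        for batch in get_batches(data, batch_size=50):   # hypothetical helper
            params_grad = evaluate_gradient(loss_function, batch, params)
            params = params - learning_rate * params_grad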
Compared with stochastic gradient descent, mini-batch gradient descent reduces the variance of the parameter updates, making the updates more stable; compared with batch gradient descent, it increases the speed of each update. It also avoids memory bottlenecks and can be computed efficiently using matrix operations. In general, each update randomly selects between 50 and 256 samples, but the size should be chosen for the specific problem; in practice you can run several experiments and pick a mini-batch size that balances update speed and the number of updates needed for convergence. Mini-batch gradient descent converges reliably and is commonly used for training neural networks.

Problems and challenges

Although the gradient descent algorithm works well and is widely used, it still has some challenges and open problems:

It is difficult to choose a reasonable learning rate. If the learning rate is too small, convergence is slow; if it is too large, it hinders convergence, i.e. the parameters oscillate near the extremum.

Learning rate schedules [11] try to change the learning rate during training, for example by annealing: the rate is reduced according to a predefined schedule, or whenever the change between iterations falls below a threshold. Either way the schedule has to be fixed in advance and cannot adapt to the characteristics of each dataset [10].

All model parameters are updated with the same learning rate. If the data features are sparse, or the features have very different statistics, it is not appropriate to use the same learning rate for every parameter in every update; features that appear rarely should use a relatively large learning rate.

For non-convex objective functions it is easy to fall into suboptimal local extrema, for example in neural networks. How can this be avoided? Dauphin et al. [19] point out that the more serious problem is not local extrema but saddle points.

Gradient descent optimization algorithms

The following discusses some gradient-based optimization methods that are widely used in the deep learning community to address the problems above; methods that are infeasible for high-dimensional data, such as Newton's method, are not discussed.

Momentum

In ravine-like regions of the loss surface (where the surface is much steeper in some directions than in others, which is common around local extrema) [1], SGD oscillates back and forth, which slows convergence. Momentum [2] can help in this case. Momentum adds a fraction of the previous update (the momentum term) to the current parameter update:

    v_t = γ · v_{t−1} + η · ∇_θ J(θ)
    θ = θ − v_t

where the momentum hyper-parameter γ < 1 is usually set to about 0.9. A minimal code sketch of this update follows; Figures 2 and 3 then illustrate the effect of adding momentum.
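The following sketch is an illustration of the momentum update above, not the article's original code; it reuses the hypothetical evaluate_gradient and get_batches helpers from the earlier snippets.

    # Momentum update: accumulate a velocity vector and move the parameters by it.
    gamma = 0.9                          # momentum coefficient, as suggested in the text
    velocity = np.zeros_like(params)
    for i in range(epochs):
        np.random.shuffle(data)
        for batch in get_batches(data, batch_size=50):
            params_grad = evaluate_gradient(loss_function, batch, params)
            velocity = gamma * velocity + learning_rate * params_grad
            params = params - velocity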

Figure 2: SGD without momentum

Figure 3: SGD with momentum

Adding momentum is like rolling a ball down a hill: the ball accumulates momentum on the way down and becomes faster and faster until it reaches the bottom. Similarly, when updating the model parameters, the updates are strengthened for parameters whose current gradient points in the same direction as the previous gradient, and reduced for parameters whose current gradient points in a different direction. This yields faster convergence and less oscillation.

Nesterov accelerated gradient (NAG)

A ball that simply rolls down the slope is blind; it would be better if it slowed down before the slope turns upward again. Nesterov accelerated gradient (NAG) not only adds the momentum term, it also evaluates the gradient of the loss function at the approximate future position of the parameters, i.e. it computes ∇_θ J(θ − γ · v_{t−1}), which anticipates where the parameters will be next:

    v_t = γ · v_{t−1} + η · ∇_θ J(θ − γ · v_{t−1})
    θ = θ − v_t

A minimal code sketch of this look-ahead update follows; Figure 4 then illustrates it.
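Again an illustrative sketch under the same hypothetical helpers; the only change compared with the momentum sketch is that the gradient is evaluated at the look-ahead position params − gamma · velocity.

    # Nesterov accelerated gradient: evaluate the gradient at the look-ahead position.
    gamma = 0.9
    velocity = np.zeros_like(params)
    for i in range(epochs):
        np.random.shuffle(data)
        for batch in get_batches(data, batch_size=50):
            lookahead = params - gamma * velocity
            params_grad = evaluate_gradient(loss_function, batch, lookahead)
            velocity = gamma * velocity + learning_rate * params_grad
            params = params - velocity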

Figure 4: NAG update

A detailed description can be found in Ilya Sutskever's PhD thesis [9]. Assume the momentum factor is γ = 0.9. Standard momentum first computes the current gradient (the small blue vector) and then takes a big jump in the direction of the accumulated gradient (the large blue vector); this is the update with only the momentum term. NAG instead first takes the big jump in the direction of the previously accumulated gradient (the momentum term), then measures the gradient where it ends up and makes a correction (the red vector), giving the final update (the green vector). This anticipatory update prevents the parameters from being updated too fast and improves responsiveness, as demonstrated for RNNs [8].

With the two methods above, each update can adapt to the slope of the loss function and thereby accelerate the convergence of SGD. The next step is to adapt the updates to each individual parameter, making larger or smaller updates depending on the importance of each parameter.

Adagrad

Adagrad [3] is also a gradient-based optimization algorithm. It adapts the learning rate to each parameter: sparse features receive larger updates and non-sparse features receive smaller updates, so this algorithm is well suited to sparse data. Dean et al. [4] found that Adagrad greatly improves the robustness of SGD, and Google used it to train large-scale neural networks (which, among other things, learned to recognize cats in YouTube videos). Pennington et al. [5] used Adagrad to train the GloVe word embeddings, giving smaller updates to frequently occurring words and larger updates to infrequent ones.

Adagrad's main advantage is that it adapts a different learning rate for each parameter, while the global learning rate is usually simply set to 0.01. Its disadvantage is that it accumulates the squared gradients in the denominator, so the effective learning rate keeps decaying and eventually becomes extremely small. Adadelta was proposed to solve this problem.

Adam

Adaptive Moment Estimation (Adam) is another per-parameter adaptive learning rate method. Like Adadelta and RMSprop it keeps an exponentially decaying average of past squared gradients, and in addition it keeps an exponentially decaying average of past gradients, similar to momentum:

    m_t = β1 · m_{t−1} + (1 − β1) · g_t
    v_t = β2 · v_{t−1} + (1 − β2) · g_t²

m_t and v_t are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively. Since they are initialized as zero vectors, Adam's authors observed that they are biased toward zero, especially during the first steps and when the decay rates β1 and β2 are close to 1. To counteract this, m_t and v_t are bias-corrected:

    m̂_t = m_t / (1 − β1^t)
    v̂_t = v_t / (1 − β2^t)

Finally, Adam's update rule is:

    θ_{t+1} = θ_t − η · m̂_t / (√v̂_t + ε)

The paper recommends the default values β1 = 0.9, β2 = 0.999 and ε = 10^−8, and compares Adam with other adaptive learning rate methods, showing that it works better in practice. (A minimal code sketch of the Adam update is given after the visualization below.)

Visualization of the algorithms

The following two images give an intuitive comparison of the optimization methods above.

Figure 5: Behaviour of the SGD optimization methods on the contours of a loss surface

As can be seen, Adagrad, Adadelta and RMSprop head off quickly in the right direction on the loss surface and converge rapidly. Momentum and NAG initially go off track, but NAG quickly corrects its course after deviating, because its look-ahead gradient improves its responsiveness.
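As promised above, here is a minimal sketch of the Adam update described earlier, using the default hyper-parameters recommended in the paper; it reuses the same hypothetical helpers as the previous sketches and is an illustration rather than a reference implementation.

    # Adam: per-parameter adaptive learning rates with bias-corrected moment estimates.
    beta1, beta2, eps = 0.9, 0.999, 1e-8     # defaults recommended in the Adam paper
    m = np.zeros_like(params)                # first-moment estimate
    v = np.zeros_like(params)                # second-moment estimate
    t = 0
    for i in range(epochs):
        np.random.shuffle(data)
        for batch in get_batches(data, batch_size=50):
            t += 1
            g = evaluate_gradient(loss_function, batch, params)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g**2
            m_hat = m / (1 - beta1**t)       # bias correction
            v_hat = v / (1 - beta2**t)
            params = params - learning_rate * m_hat / (np.sqrt(v_hat) + eps)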

Figure 6: Behaviour of the SGD optimization methods at a saddle point of the loss surface

As can be seen, at a saddle point (where the gradient is zero along some dimensions and non-zero along others), SGD, Momentum and NAG keep oscillating along the direction in which the gradient is zero and have difficulty breaking the symmetry of the saddle point, whereas Adagrad, RMSprop and Adadelta quickly move along the direction in which the gradient is non-zero.

From the two images above, the adaptive learning rate methods (Adagrad, Adadelta, RMSprop and Adam) show better convergence speed and convergence behaviour in these scenarios.

How to choose an SGD optimizer

If your data features are sparse, you are best served by one of the adaptive learning rate SGD methods (Adagrad, Adadelta, RMSprop or Adam), because you do not need to adjust the learning rate manually during training.

RMSprop is an extension of Adagrad and is similar to Adadelta, while the refined version of Adadelta uses the RMS of the parameter updates to adapt the learning rate automatically and does not require an initial learning rate to be set. Adam adds momentum and bias correction on top of RMSprop. RMSprop, Adadelta and Adam behave similarly in similar circumstances. Kingma and Ba [15] point out that, thanks to the bias correction, Adam is slightly better than RMSprop because the gradients become sparser as optimization approaches convergence. Therefore Adam may currently be the best choice among the SGD optimization methods.

Interestingly, many recent papers use the original SGD algorithm with a simple learning rate annealing schedule (and no momentum term). SGD has been shown to converge to a minimum, but it may take longer than the other methods, it relies on a robust initialization and annealing schedule, and it can get stuck in local minima or even saddle points. Therefore, if you care about convergence speed, or if you are training a deep or complex network, you should choose one of the adaptive learning rate SGD methods.

Parallel and distributed SGD

If your dataset is very large and you have a machine cluster available, parallel or distributed SGD is a very good choice, because it can greatly improve speed. The nature of the SGD algorithm makes it serial (step by step), so how to process updates asynchronously is the key question: a serial implementation guarantees convergence but becomes a speed bottleneck on large training sets, while asynchronous updates may fail to converge. The following discusses how to run SGD in parallel or in a distributed fashion, where "parallel" generally refers to multi-core processing on the same machine and "distributed" refers to processing on a cluster.

Hogwild!

Niu et al. [23] present a parallel SGD method called Hogwild!, which runs SGD in parallel on multiple CPUs. The processors access the parameters through shared memory, and the parameters are not locked. Because each update only touches a small, largely non-overlapping portion of the parameters, each CPU effectively updates only the parameters it is responsible for. The method is only suitable for sparse input data, but in that setting it achieves an almost optimal rate of convergence, because processors rarely overwrite each other's information.

Downpour SGD

Downpour SGD is an asynchronous variant of SGD used by Dean et al. [4] in DistBelief (the predecessor of Google's TensorFlow).
It trains multiple replicas of the model simultaneously on subsets of the training data. The replicas send their parameter updates to a parameter server, which is split across many machines; each machine updates only a mutually exclusive subset of the parameters, and the replicas do not communicate with each other. This can lead to diverging parameters and hinder convergence.

Delay-tolerant algorithms for SGD

McMahan and Streeter [12] extended Adagrad to the parallel setting by developing delay-tolerant algorithms that adapt not only to past gradients but also to the update delays. This method has been shown to work well in practice.

TensorFlow

TensorFlow [13] is Google's open-source large-scale machine learning library and the successor of DistBelief. It has already been used on large numbers of mobile devices and on large-scale distributed clusters, and has been tested in practice. Its distributed implementation is based on graph computation: the computation graph is split into subgraphs, each subgraph is assigned to a compute node, and the nodes communicate through Send/Receive pairs.

Elastic Averaging SGD

Zhang et al. [14] proposed Elastic Averaging SGD (EASGD), which connects the parameters of each worker to a center variable stored by the parameter server through an elastic force, allowing the workers' parameters to be updated asynchronously.

More SGD optimization strategies

The following introduces further strategies that can improve the performance of SGD; more can be found in [22].

Shuffling and curriculum learning

To keep the learning process unbiased, the samples in the training set should be randomly shuffled at every epoch. On the other hand, in many cases a problem is solved step by step, and presenting the training set in a meaningful order can improve the performance of the model and the convergence of SGD. Establishing a meaningful ordering of the training set is called curriculum learning [16]. Zaremba and Sutskever [17] used curriculum learning to train LSTMs to solve simple problems and showed that a combined or mixed strategy is better than sorting the training set by increasing difficulty.

Batch normalization

To facilitate training, we usually initialize the parameters with zero mean and unit variance. As training proceeds, the parameters are updated to different degrees and lose this zero-mean, unit-variance property, which slows down training and amplifies changes as the network gets deeper. Batch normalization [18] re-establishes the zero-mean, unit-variance normalization for every mini-batch, and the normalization is back-propagated through as part of the model. This makes it possible to use larger learning rates and to spend less effort on parameter initialization. Batch normalization also acts as a regularizer, reducing and in some cases even eliminating the need for dropout.

Early stopping

If the loss on a validation set no longer decreases noticeably over successive iterations, training should be stopped early; see the NIPS tutorial slides for details, or other methods of preventing overfitting.

Gradient noise

Gradient noise [21] adds a random error drawn from a Gaussian distribution N(0, σ_t²) to the computed gradient at each iteration:

    g_{t,i} = g_{t,i} + N(0, σ_t²)

The variance of the Gaussian noise is annealed:

    σ_t² = η / (1 + t)^γ

Adding random noise to the gradient makes the model more robust to poorly chosen initial parameter values and is especially suitable for training particularly deep and complex networks. The reason is that the added noise gives the optimization more chances to escape local extrema and find better ones, which matters more for deeper networks.

Summary

The above introduced the three frameworks of the gradient descent algorithm, of which mini-batch gradient descent is the most widely used. It then focused on some of the optimization methods for SGD: Momentum, NAG, Adagrad, Adadelta, RMSprop and Adam, as well as some asynchronous SGD methods. Finally, it introduced other strategies for improving the performance of SGD, such as shuffling and curriculum learning, batch normalization, early stopping and gradient noise.

