Adam optimization algorithm

Deep learning models often require a great deal of time and computing resources to train, and this is a major driver behind the development of better optimization algorithms. Distributed parallel training can speed up learning, but it does not reduce the total amount of computation required. Only an optimization algorithm that needs fewer resources and makes the model converge faster can fundamentally accelerate training. The Adam algorithm was created for exactly this purpose.

The Adam optimization algorithm is an extension of stochastic gradient descent that has recently seen broad adoption in deep learning applications, especially in computer vision and natural language processing tasks. This article is divided into two parts. The first part briefly introduces the characteristics of the Adam optimization algorithm and its applications in deep learning; the second part follows the original paper and explains and derives the algorithm's procedure and update rules in detail. After reading the two parts, readers should understand the following points:

1) What the Adam algorithm is and what advantages it brings to optimizing deep learning models

2) What the underlying mechanism of the Adam algorithm is, and how it differs from AdaGrad and RMSProp

3) How the Adam algorithm should be tuned, and what its commonly used configuration parameters are

4) Adam's optimization procedure and weight update rules

5) The derivation of Adam's initial bias correction

6) Adam's extended form: AdaMax

1. What is the Adam optimization algorithm?

Adam is a first-order optimization algorithm that can replace the traditional stochastic gradient descent procedure; it iteratively updates neural network weights based on the training data.

First, the name "Adam" is neither an acronym nor a person's name; it is derived from adaptive moment estimation. When introducing the algorithm, the original paper lists the advantages of applying Adam to non-convex optimization problems:

1) Straightforward to implement

2) Computationally efficient

3) Requires little memory

4) Invariant to diagonal rescaling of the gradients

5) Well suited to problems that are large in terms of data and/or parameters

6) Appropriate for non-stationary objectives

7) Appropriate for problems with very noisy or sparse gradients

8) Hyper-parameters have intuitive interpretations and typically require very little tuning

2. Basic mechanism of the Adam optimization algorithm

The Adam algorithm differs from traditional stochastic gradient descent. Stochastic gradient descent maintains a single learning rate (alpha) for all weight updates, and the learning rate does not change during training. Adam, by contrast, computes independent adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.

The author of the Adam algorithm describes it as combining the advantages of two other extensions of stochastic gradient descent, namely:

1) The Adaptive Gradient algorithm (AdaGrad), which maintains a per-parameter learning rate and improves performance on problems with sparse gradients (e.g., natural language and computer vision problems)

2) Root Mean Square Propagation (RMSProp), which also maintains per-parameter learning rates that are adapted based on the average of recent magnitudes of the gradients for each weight. This means the algorithm performs well on online and non-stationary problems.

3) Adam inherits the advantages of both AdaGrad and RMSProp. Instead of adapting the parameter learning rates based on the average first moment (the mean) as in RMSProp, Adam also makes use of the average of the second moments of the gradients (the uncentered variance). Specifically, the algorithm computes exponential moving averages of the gradient and of the squared gradient, and the hyper-parameters beta1 and beta2 control the decay rates of these moving averages.

4) The moving averages are initialized as zeros, and because beta1 and beta2 are close to 1 (the recommended values), the moment estimates are biased toward zero, especially during the first few time steps. This bias is overcome by first computing the biased estimates and then computing bias-corrected estimates. If you are interested in the implementation details and the derivation, you can continue with the second part and the original paper; a small numerical sketch of this bias correction follows this list.
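To make the bias correction in point 4 concrete, here is a small numerical sketch (plain Python, not taken from the original article; the constant gradient of 1.0 is an assumption for illustration only). It shows that the raw moving average stays far below the true gradient during the first steps, while dividing by 1 - beta1^t recovers an unbiased estimate.

```python
beta1, beta2 = 0.9, 0.999   # decay rates recommended in the paper
s, r = 0.0, 0.0             # moving averages, initialized to zero
g = 1.0                     # assume a constant gradient of 1.0 for illustration

for t in range(1, 6):
    s = beta1 * s + (1 - beta1) * g        # biased first-moment estimate
    r = beta2 * r + (1 - beta2) * g ** 2   # biased second-moment estimate
    s_hat = s / (1 - beta1 ** t)           # bias-corrected first moment
    r_hat = r / (1 - beta2 ** t)           # bias-corrected second moment
    print(t, round(s, 4), round(s_hat, 4), round(r_hat, 4))

# The raw estimate s starts at 0.1 and climbs only slowly toward the true value,
# while the corrected estimate s_hat equals 1.0 from the very first step.
```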

3. Efficiency of the Adam algorithm

Adam is very popular in the field of deep learning because it achieves good results quickly. Empirical results show that the Adam algorithm performs well in practice and compares favorably with other stochastic optimization methods.

In the original paper, the authors empirically show that the convergence of Adam is consistent with the theoretical analysis. They applied Adam to logistic regression on the MNIST handwritten-digit and IMDB sentiment-analysis datasets, to a multilayer perceptron on MNIST, and to convolutional neural networks on the CIFAR-10 image-recognition dataset. They conclude that, using large models and datasets, Adam can efficiently solve practical deep learning problems.

(Figure) Comparison of the Adam optimization algorithm and other optimization algorithms on a multilayer perceptron model

In fact, RMSProp, AdaDelta, and Adam are similar optimization algorithms that perform well in similar circumstances. However, the bias correction in Adam helps it slightly outperform RMSProp towards the end of optimization, as gradients become sparser, so Adam is generally the best overall choice. The CS231n course, for example, recommends Adam as the default optimization method.

Although Adam is often better than RMSProp in practice, SGD with Nesterov momentum is also worth trying as an alternative. In other words, for deep learning models we usually recommend either Adam or SGD with Nesterov momentum.

4. Parameter configuration for Adam

Alpha: also known as the learning rate or step size, it controls the proportion by which the weights are updated (e.g., 0.001). Larger values (e.g., 0.3) give faster initial learning before the rate is annealed, while smaller values (e.g., 1e-5) slow learning down but can converge to better performance.

Beta1: the exponential decay rate for the first-moment estimates (e.g., 0.9).

Beta2: the exponential decay rate for the second-moment estimates (e.g., 0.999). This hyper-parameter should be set close to 1 for problems with sparse gradients (e.g., NLP or computer vision tasks).

Epsilon: a very small number that prevents division by zero in the implementation (e.g., 1e-8).

In addition, learning rate decay can also be applied to Adam. The original paper uses the decay alpha_t = alpha / sqrt(t), updated at each epoch t, in its logistic regression experiment, as sketched below.
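As a small illustration of that decay schedule (a sketch derived from the formula above, not code from the paper), the step size for iteration t can be computed as follows:

```python
import math

base_alpha = 0.001  # the default step size

def decayed_alpha(t):
    """Step size alpha_t = alpha / sqrt(t) for t = 1, 2, 3, ..."""
    return base_alpha / math.sqrt(t)

print([round(decayed_alpha(t), 6) for t in (1, 4, 100)])  # [0.001, 0.0005, 0.0001]
```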

5. Parameter settings suggested in the Adam paper

For the machine learning problems tested, the paper suggests the default settings alpha=0.001, beta1=0.9, beta2=0.999, and epsilon=10^-8.

We can also see that popular deep learning libraries adopt the parameter settings recommended in the paper:

TensorFlow: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08.

Keras: lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0.

Blocks: learning_rate=0.002, beta1=0.9, beta2=0.999, epsilon=1e-08, decay_factor=1.

Lasagne: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08.

Caffe: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08.

MXNet: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8.

Torch: learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8.
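As a usage illustration (a sketch assuming the classic Keras 2 argument names listed above; newer Keras releases spell the first argument learning_rate instead of lr), the optimizer can be constructed with the paper's defaults and passed to compile():

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# A tiny model, only so that there is something to compile.
model = Sequential([Dense(10, activation='relu', input_shape=(4,)),
                    Dense(1)])

# The defaults recommended in the paper, matching the Keras values listed above.
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)
model.compile(optimizer=adam, loss='mse')
```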

Summary: in this first part, we discussed the basic characteristics and principles of the Adam optimization algorithm in deep learning:

Adam is an optimization algorithm that can replace stochastic gradient descent when training deep learning models.

Adam combines the best properties of the AdaGrad and RMSProp algorithms and provides an optimization method that handles sparse gradients and noisy problems.

Adam is relatively easy to configure, and the default parameters handle most problems well.

The abstract of the original paper summarizes these points: we propose Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, computationally efficient, and has low memory requirements.

The method is invariant to diagonal rescaling of the gradients and is well suited to problems that are large in terms of data and/or parameters. It is also appropriate for non-stationary objectives and for problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. The paper discusses Adam's relationship to similar algorithms, analyzes its theoretical convergence properties, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably with other stochastic optimization methods. Finally, the paper discusses AdaMax, a variant of Adam based on the infinity norm.

Adam algorithm

Require: step size α (suggested default: 0.001)

Require: exponential decay rates for the moment estimates, β1 and β2 in [0, 1) (suggested defaults: 0.9 and 0.999, respectively)

Require: small constant ε for numerical stability (suggested default: 1e-8)

Require: initial parameter vector θ

Initialize first- and second-moment variables s = 0, r = 0

Initialize time step t = 0

While stopping criterion not met do

Sample a minibatch of m examples {x(1), ..., x(m)} from the training set, with corresponding targets y(i), and increment the time step: t ← t + 1

Compute the gradient: g ← (1/m) ∇θ Σi L(f(x(i); θ), y(i))

Update the biased first-moment estimate: s ← β1 · s + (1 − β1) · g

Update the biased second-moment estimate: r ← β2 · r + (1 − β2) · g ⊙ g

Correct the bias in the first moment: ŝ ← s / (1 − β1^t)

Correct the bias in the second moment: r̂ ← r / (1 − β2^t)

Compute the update: Δθ ← −α · ŝ / (√r̂ + ε)

Apply the update: θ ← θ + Δθ

End while
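Putting the pseudocode together, a minimal NumPy implementation of this loop might look like the sketch below (an illustration, not the reference implementation from the paper; grad_fn is an assumed function that returns the minibatch gradient for the current parameters):

```python
import numpy as np

def adam(grad_fn, theta, steps, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates on the parameter vector `theta`."""
    s = np.zeros_like(theta)   # first-moment estimate
    r = np.zeros_like(theta)   # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)                      # (stochastic) minibatch gradient
        s = beta1 * s + (1 - beta1) * g         # biased first moment
        r = beta2 * r + (1 - beta2) * g * g     # biased second moment
        s_hat = s / (1 - beta1 ** t)            # bias-corrected first moment
        r_hat = r / (1 - beta2 ** t)            # bias-corrected second moment
        theta = theta - alpha * s_hat / (np.sqrt(r_hat) + eps)
    return theta

# Toy example: minimize f(theta) = ||theta||^2, whose gradient is 2 * theta;
# the returned parameters should be close to the origin.
print(adam(lambda th: 2.0 * th, np.array([1.0, -2.0]), steps=5000))
```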

Example:

When studying TensorFlow examples, you will find that the optimization step in the code very often uses the AdamOptimizer directly, along the following lines:
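The snippet below is a minimal self-contained sketch, not taken from the original article: it uses the TensorFlow 1.x API (tf.train.AdamOptimizer) with the defaults discussed above, and the toy linear-regression loss exists only so that the example runs on its own. In TensorFlow 2.x the equivalent is tf.keras.optimizers.Adam.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x API

# Toy linear-regression model so that the snippet is self-contained.
x = tf.placeholder(tf.float32, shape=[None, 1])
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))

# AdamOptimizer with the defaults recommended in the paper.
train_op = tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9,
                                  beta2=0.999, epsilon=1e-08).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    xs = np.random.rand(64, 1).astype(np.float32)
    ys = 3.0 * xs + 0.5
    for _ in range(200):
        sess.run(train_op, feed_dict={x: xs, y: ys})
    print(sess.run(loss, feed_dict={x: xs, y: ys}))  # the loss should have decreased
```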
