torch.optim is a package that implements a variety of optimization algorithms. Most commonly used methods are already supported, and the interface is general enough that more sophisticated optimization algorithms can be integrated in the future.
To use torch.optim, you construct an optimizer object that holds the current state and updates the parameters based on the computed gradients.
To construct an optimizer, you give it an iterable containing the parameters to optimize (all of them must be Variables). You can then specify optimizer-specific options such as the learning rate, weight decay, and so on.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr=0.0001)
self.optimizer_d_b = torch.optim.Adam(self.netD_B.parameters(), lr=opt.lr, betas=(opt.beta1, 0.999))
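The examples above assume an existing model, var1, var2, and opt. As a minimal self-contained sketch (the tiny model and the hyperparameter values here are illustrative, not taken from the original examples), construction typically looks like this:

import torch
import torch.nn as nn
import torch.optim as optim

# A small illustrative model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))

# SGD with momentum over all model parameters.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam over an explicit list of tensors that require gradients.
var1 = torch.randn(3, requires_grad=True)
var2 = torch.randn(3, requires_grad=True)
optimizer_adam = optim.Adam([var1, var2], lr=0.0001)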
An optimizer also supports specifying per-parameter options. To do this, pass in an iterable of dicts instead of an iterable of Variables. Each dict defines a separate parameter group and must contain a params key holding the list of parameters that belong to it. The other keys should match the keyword arguments accepted by the optimizer and are used as the optimization options for that group.
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
As above, model.base.parameters() will use a learning rate of 1e-2, model.classifier.parameters() will use a learning rate of 1e-3, and the momentum of 0.9 applies to all parameters. You can verify the per-group settings as sketched below.
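A small sketch of how to inspect the per-group settings, assuming torch.optim has been imported as optim and the model has base and classifier submodules as in the example above:

optimizer = optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)

# Each entry in optimizer.param_groups is a dict holding that group's options.
for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'], group['momentum'])
# Group 0 uses lr=1e-2, group 1 uses lr=1e-3; both use momentum=0.9.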
Optimization steps:
All optimizers implement a step() method that updates the parameters. It can be called in two ways:
optimizer.step()
This is the simplified version supported by most optimizers. It should be called once the gradients have been computed, e.g. with backward(), as shown below:
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
optimizer.step(closure)
Some optimization algorithms, such as Conjugate Gradient and LBFGS, need to re-evaluate the objective function multiple times, so you must pass a closure that allows them to recompute the model. The closure must clear the gradients, compute the loss, and return it. A concrete LBFGS example follows the generic loop below.
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
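As a self-contained sketch of the closure pattern with torch.optim.LBFGS (the model, data, and hyperparameters here are purely illustrative):

import torch
import torch.nn as nn

# Illustrative data and model for a tiny regression problem.
x = torch.randn(64, 5)
y = torch.randn(64, 1)
model = nn.Linear(5, 1)
loss_fn = nn.MSELoss()

# LBFGS re-evaluates the loss several times per step, so step() requires a closure.
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)

def closure():
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return loss

for _ in range(10):
    optimizer.step(closure)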
Adam algorithm:
The Adam algorithm was introduced in the paper Adam: A Method for Stochastic Optimization.
Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term. It dynamically adjusts the learning rate of each parameter using first-order and second-order moment estimates of the gradient. Its main advantage is that, after bias correction, the effective step size at each iteration lies within a definite range, which keeps the parameter updates stable. The update formulas are as follows:
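Reproducing the standard update rules from the cited paper (g_t is the gradient at step t, and β1, β2, η, ε correspond to the betas, lr, and eps arguments described below):

(1) m_t = β1 · m_{t-1} + (1 − β1) · g_t
(2) v_t = β2 · v_{t-1} + (1 − β2) · g_t²
(3) m̂_t = m_t / (1 − β1^t)
(4) v̂_t = v_t / (1 − β2^t)
(5) θ_t = θ_{t-1} − η · m̂_t / (√v̂_t + ε)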
Among them, the first two equations are the first-order and second-order moment estimates of the gradient, which can be regarded as estimates of the expectations E[g_t] and E[g_t²].
Equations 3 and 4 are bias corrections of the first- and second-order moment estimates, which can be viewed as approximately unbiased estimates of those expectations. The moment estimates are computed directly from the gradients, so they require no additional memory and adapt dynamically to the gradient. The last equation applies a dynamic constraint on the step size η, keeping it within a definite range.
class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Parameters:
params (iterable): an iterable of parameters to optimize, or dicts defining parameter groups.
lr (float, optional): learning rate (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
step(closure=None): performs a single optimization step.
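A minimal usage sketch of this interface; the model, data, and hyperparameter values below are illustrative assumptions, not prescribed by the documentation above:

import torch
import torch.nn as nn

# Illustrative classifier and a single batch of fake data.
model = nn.Linear(10, 3)
inputs = torch.randn(8, 10)
targets = torch.randint(0, 3, (8,))
loss_fn = nn.CrossEntropyLoss()

# Adam with explicit betas, eps, and a small L2 penalty via weight_decay.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-4)

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()  # performs a single Adam update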
torch.optim.Adam source:
import math
from .optimizer import Optimizer

class Adam(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super(Adam, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = grad.new().resize_as_(grad).zero_()
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = grad.new().resize_as_(grad).zero_()
                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']
                state['step'] += 1
                # Optional L2 penalty: add weight_decay * p to the gradient
                if group['weight_decay'] != 0:
                    grad = grad.add(group['weight_decay'], p.data)
                # Decay the first and second moment running average coefficient
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
                denom = exp_avg_sq.sqrt().add_(group['eps'])
                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                # Bias-corrected step size; the update is lr * m_hat / (sqrt(v_hat) + eps)
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
                p.data.addcdiv_(-step_size, exp_avg, denom)
        return loss
Adam features:
1. It combines the strengths of AdaGrad, which handles sparse gradients well, and RMSprop, which handles non-stationary objectives well;
2. Its memory requirements are small;
3. It computes a different adaptive learning rate for each parameter;
4. It is well suited to most non-convex optimization problems, as well as to large datasets and high-dimensional parameter spaces.