torch.optim is a package that implements a variety of optimization algorithms. Most commonly used methods are already supported, and the interface is general enough that more sophisticated optimization algorithms can be integrated in the future.
To use torch.optim, you construct an optimizer object that holds the current state and updates the parameters based on the computed gradients.
To construct an optimizer, you give it an iterable containing the parameters to optimize (all of them must be Variables). You can then specify optimizer-specific options such as the learning rate, weight decay, and so on.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam([var1, var2], lr=0.0001)
self.optimizer_d_b = torch.optim.Adam(self.netD_B.parameters(), lr=opt.lr, betas=(opt.beta1, 0.999))
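The examples above assume an existing model, var1, var2, and opt. As a minimal self-contained sketch (the tiny model and the hyperparameter values here are illustrative, not taken from the original examples), construction typically looks like this:

import torch
import torch.nn as nn
import torch.optim as optim

# A small illustrative model; any nn.Module works the same way.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))

# SGD with momentum over all model parameters.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Adam over an explicit list of tensors that require gradients.
var1 = torch.randn(3, requires_grad=True)
var2 = torch.randn(3, requires_grad=True)
optimizer_adam = optim.Adam([var1, var2], lr=0.0001)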
An optimizer also supports specifying per-parameter options. To do this, pass in an iterable of dicts instead of an iterable of Variables. Each dict defines a separate parameter group and must contain a params key holding the list of parameters that belong to it. The other keys should match the keyword arguments accepted by the optimizer and are used as the optimization options for that group.
optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)
As above, model.base.parameters() will use a learning rate of 1e-2, model.classifier.parameters() will use a learning rate of 1e-3, and the momentum of 0.9 applies to all parameters. You can verify the per-group settings as sketched below.
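A small sketch of how to inspect the per-group settings, assuming torch.optim has been imported as optim and the model has base and classifier submodules as in the example above:

optimizer = optim.SGD([
    {'params': model.base.parameters()},
    {'params': model.classifier.parameters(), 'lr': 1e-3}
], lr=1e-2, momentum=0.9)

# Each entry in optimizer.param_groups is a dict holding that group's options.
for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'], group['momentum'])
# Group 0 uses lr=1e-2, group 1 uses lr=1e-3; both use momentum=0.9.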
Optimization steps:
All optimizers implement a step() method that updates the parameters. It can be called in two ways:
optimizer.step()
This is the simplified version supported by most optimizers. It should be called once the gradients have been computed, e.g. with backward(), as shown below:
for input, target in dataset:
    optimizer.zero_grad()
    output = model(input)
    loss = loss_fn(output, target)
    loss.backward()
    optimizer.step()
optimizer.step(closure)
Some optimization algorithms, such as Conjugate Gradient and LBFGS, need to re-evaluate the objective function multiple times, so you must pass a closure that allows them to recompute the model. The closure must clear the gradients, compute the loss, and return it. A concrete LBFGS example follows the generic loop below.
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
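As a self-contained sketch of the closure pattern with torch.optim.LBFGS (the model, data, and hyperparameters here are purely illustrative):

import torch
import torch.nn as nn

# Illustrative data and model for a tiny regression problem.
x = torch.randn(64, 5)
y = torch.randn(64, 1)
model = nn.Linear(5, 1)
loss_fn = nn.MSELoss()

# LBFGS re-evaluates the loss several times per step, so step() requires a closure.
optimizer = torch.optim.LBFGS(model.parameters(), lr=0.1)

def closure():
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return loss

for _ in range(10):
    optimizer.step(closure)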
Adam algorithm:
The Adam algorithm was introduced in the paper Adam: A Method for Stochastic Optimization.
Adam (Adaptive Moment Estimation) is essentially RMSprop with a momentum term. It dynamically adjusts the learning rate of each parameter using first-order and second-order moment estimates of the gradient. Its main advantage is that, after bias correction, the effective step size at each iteration lies within a definite range, which keeps the parameter updates stable. The update formulas are as follows:
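Reproducing the standard update rules from the cited paper (g_t is the gradient at step t, and β1, β2, η, ε correspond to the betas, lr, and eps arguments described below):

(1) m_t = β1 · m_{t-1} + (1 − β1) · g_t
(2) v_t = β2 · v_{t-1} + (1 − β2) · g_t²
(3) m̂_t = m_t / (1 − β1^t)
(4) v̂_t = v_t / (1 − β2^t)
(5) θ_t = θ_{t-1} − η · m̂_t / (√v̂_t + ε)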
Among them, the first two equations are the first-order and second-order moment estimates of the gradient, which can be regarded as estimates of the expectations E[g_t] and E[g_t²].
Equations 3 and 4 are bias corrections of the first- and second-order moment estimates, which can be viewed as approximately unbiased estimates of those expectations. The moment estimates are computed directly from the gradients, so they require no additional memory and adapt dynamically to the gradient. The last equation applies a dynamic constraint on the step size η, keeping it within a definite range.
class torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Parameters:
params (iterable): an iterable of parameters to optimize, or dicts defining parameter groups.
lr (float, optional): learning rate (default: 1e-3)
betas (Tuple[float, float], optional): coefficients used for computing running averages of the gradient and its square (default: (0.9, 0.999))
eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
step(closure=None): performs a single optimization step.
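A minimal usage sketch of this interface; the model, data, and hyperparameter values below are illustrative assumptions, not prescribed by the documentation above:

import torch
import torch.nn as nn

# Illustrative classifier and a single batch of fake data.
model = nn.Linear(10, 3)
inputs = torch.randn(8, 10)
targets = torch.randint(0, 3, (8,))
loss_fn = nn.CrossEntropyLoss()

# Adam with explicit betas, eps, and a small L2 penalty via weight_decay.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-4)

optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()  # performs a single Adam update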
torch.optim.Adam source:
import math
from .optimizer import Optimizer

class Adam(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0):
        defaults = dict(lr=lr, betas=betas, eps=eps, weight_decay=weight_decay)
        super(Adam, self).__init__(params, defaults)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                grad = p.grad.data
                state = self.state[p]
                # State initialization
                if len(state) == 0:
                    state['step'] = 0
                    # Exponential moving average of gradient values
                    state['exp_avg'] = grad.new().resize_as_(grad).zero_()
                    # Exponential moving average of squared gradient values
                    state['exp_avg_sq'] = grad.new().resize_as_(grad).zero_()
                exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
                beta1, beta2 = group['betas']
                state['step'] += 1
                # Optional L2 penalty: add weight_decay * p to the gradient
                if group['weight_decay'] != 0:
                    grad = grad.add(group['weight_decay'], p.data)
                # Decay the first and second moment running average coefficient
                exp_avg.mul_(beta1).add_(1 - beta1, grad)
                exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
                denom = exp_avg_sq.sqrt().add_(group['eps'])
                bias_correction1 = 1 - beta1 ** state['step']
                bias_correction2 = 1 - beta2 ** state['step']
                # Bias-corrected step size; the update is lr * m_hat / (sqrt(v_hat) + eps)
                step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1
                p.data.addcdiv_(-step_size, exp_avg, denom)
        return loss
Adam features:
1. It combines the strengths of AdaGrad, which handles sparse gradients well, and RMSprop, which handles non-stationary objectives well;
2. Its memory requirements are small;
3. It computes a different adaptive learning rate for each parameter;
4. It is well suited to most non-convex optimization problems, as well as to large datasets and high-dimensional parameter spaces.