Weight Update
During backpropagation we computed the partial derivatives of the loss with respect to each layer's weights W and biases B; the last step is to update the weights and biases.
In the earlier introduction to the BP algorithm, we gave the following update formula (reconstructed here from the surrounding text):

W = W - alpha * ∂L/∂W,    B = B - alpha * ∂L/∂B

Here alpha is the learning rate. In general the learning rate is not a constant, but a monotonically decreasing function of the number of training iterations. There are several reasons to use a changing learning rate:
1. Early in training, a large learning rate updates the network parameters quickly, so the parameters approach their target values faster. And because each update step is large, the network can "skip over" local minima in the early stage of training.
2. After the network has trained for a while, a large learning rate may no longer improve accuracy; training "stops moving". At that point we need to reduce the learning rate to continue training the network.
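To make the idea concrete, here is a minimal sketch of one common decreasing schedule, step decay, which cuts the rate by a factor gamma every fixed number of iterations. This is an illustrative helper, not Caffe code; the parameter names follow the solver fields used later in this article.

```cpp
#include <cassert>
#include <cmath>

// Step decay: the learning rate is multiplied by gamma (0 < gamma < 1)
// once every `step` iterations, so it decreases monotonically with iter.
double step_decay(double base_lr, double gamma, int step, int iter) {
  return base_lr * std::pow(gamma, std::floor(double(iter) / step));
}

// e.g. with base_lr = 0.01, gamma = 0.1, step = 1000:
//   iterations 0..999 use 0.01, iterations 1000..1999 use 0.001, and so on.
```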
In our network, four layers contain parameters that need to be updated: convolution layer 1, convolution layer 2, fully connected layer 1, and fully connected layer 2; each of these layers has both weights W and biases B to update. In practice, the weights, the biases, and the gradients we computed earlier are all stored linearly, so we can treat the entire update process as an operation on a one-dimensional array, without caring whether a given weight W is, say, an 800x500 matrix. The weight update and the bias update can therefore share one piece of code that operates on a one-dimensional array.

Weight update strategy in Caffe: Caffe updates the weights by stochastic gradient descent with momentum, using the following formulas (the momentum update):

Δhistory = momentum * Δhistory + alpha * ∂L/∂W
W = W - Δhistory

where Δhistory accumulates the gradients of multiple past iterations.
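Because the parameters and gradients are stored linearly, the momentum update above can be sketched as one loop over a flat one-dimensional array. This is a simplified stand-in for Caffe's implementation; the function name is illustrative.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Momentum update over a flat parameter array:
//   history = momentum * history + lr * gradient
//   weight  = weight - history
// `history` accumulates past gradients (the Δhistory term above); the same
// code serves for weights and biases since both are stored linearly.
void momentum_update(std::vector<double>& w,
                     const std::vector<double>& grad,
                     std::vector<double>& history,
                     double lr, double momentum) {
  for (std::size_t i = 0; i < w.size(); ++i) {
    history[i] = momentum * history[i] + lr * grad[i];
    w[i] -= history[i];
  }
}
```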
The Learning Rate Update Strategy in Caffe
In a comment in \src\caffe\solvers\sgd_solver.cpp, Caffe documents the following learning rate update policies:
// Return the current learning rate. The currently implemented learning rate
// policies are as follows:
//    - fixed: always return base_lr.
//    - step: return base_lr * gamma ^ (floor(iter / step))
//    - exp: return base_lr * gamma ^ iter
//    - inv: return base_lr * (1 + gamma * iter) ^ (- power)
//    - multistep: similar to step but it allows non uniform steps defined by
//      stepvalue
//    - poly: the effective learning rate follows a polynomial decay, to be
//      zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
//    - sigmoid: the effective learning rate follows a sigmod decay
//      return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.
As can be seen, the learning rate can be updated in several ways: fixed, step, exp, inv, multistep, poly and sigmoid; the formulas above make each implementation clear.
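As a sketch, several of these policies can be expressed with a single dispatch function. The following is an illustrative re-implementation of the formulas above on flat scalar parameters, not Caffe's actual GetLearningRate().

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Dispatch over a few of the learning-rate policies listed above.
// Parameter names (base_lr, gamma, power, step) mirror the solver fields.
double get_learning_rate(const std::string& policy, int iter,
                         double base_lr, double gamma, double power,
                         int step) {
  if (policy == "fixed")
    return base_lr;                                            // constant
  if (policy == "step")
    return base_lr * std::pow(gamma, std::floor(double(iter) / step));
  if (policy == "exp")
    return base_lr * std::pow(gamma, iter);
  if (policy == "inv")
    return base_lr * std::pow(1.0 + gamma * iter, -power);
  return base_lr;  // unknown policy: fall back to base_lr
}
```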
In practice our network uses the inv policy, i.e. learn_rate = base_lr * (1 + gamma * iter) ^ (-power).
Implementation of Weight Update in Caffe
The configuration file \examples\mnist\lenet_solver.prototxt stores the parameters used when the network is initialized. Let us first look at the parameters related to the learning rate:
# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
Based on the above parameters, the learning rate for each iteration is learn_rate = base_lr * (1 + gamma * iter) ^ (-power).
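With these solver values (base_lr = 0.01, gamma = 0.0001, power = 0.75), the per-iteration rate can be checked directly; the helper below is illustrative, not Caffe code.

```cpp
#include <cassert>
#include <cmath>

// inv policy with the lenet_solver.prototxt values:
//   learn_rate = base_lr * (1 + gamma * iter) ^ (-power)
double inv_rate(int iter) {
  const double base_lr = 0.01, gamma = 0.0001, power = 0.75;
  return base_lr * std::pow(1.0 + gamma * iter, -power);
}

// e.g. inv_rate(0) == 0.01 (the base rate), and at iteration 10000 the
// factor is (1 + 1)^(-0.75) = 2^(-0.75), giving roughly 0.00595 --
// the rate decays smoothly as training proceeds.
```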
After obtaining the learning rate, we need to use it to update the parameters in the network. \src\caffe\solvers\sgd_solver.cpp contains the function ApplyUpdate(), which performs the weight update; let us walk through it.
template <typename Dtype>
void SGDSolver<Dtype>::ApplyUpdate() {
  CHECK(Caffe::root_solver());
  // GetLearningRate() obtains the learning rate for this iteration
  Dtype rate = GetLearningRate();
  if (this->param_.display() && this->iter_ % this->param_.display() == 0) {
    LOG(INFO) << "Iteration " << this->iter_ << ", lr = " << rate;
  }
  ClipGradients();
  // Update the network: 4 layers, each with two parameter blobs (W and B),
  // so size = 8
  for (int param_id = 0; param_id < this->net_->learnable_params().size();
       ++param_id) {
    // Normalization; our network does not use this function
    Normalize(param_id);
    // Regularization
    Regularize(param_id);
    // Compute the gradient used in the update
    ComputeUpdateValue(param_id, rate);
  }
  // Update the parameters with the gradients computed by ComputeUpdateValue
  this->net_->Update();
}
The Regularize function mainly computes the regularization (weight decay) term and adds it to the gradient.
The ComputeUpdateValue function works in two steps: the first step updates the history value (history = momentum * history + local_rate * gradient),
and the history value is then copied back into the gradient buffer.
ComputeUpdateValue also uses the lr_mult learning rate multiplier, which appeared in the earlier configuration information; the weights and biases within the same layer may be updated at different learning rates, so they can have different lr_mult values. Finally, the this->net_->Update() function updates the parameters using the partial derivatives computed earlier by ComputeUpdateValue.
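The steps above (regularization, the two-step momentum update scaled by lr_mult, and the final subtraction) can be sketched on flat arrays as follows. The names mirror the Caffe functions discussed in this section, but this is a simplification assuming plain L2 weight decay, not the library's actual code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One learnable parameter blob, stored as flat 1-D arrays as discussed above.
struct Param {
  std::vector<double> data;     // weights (or biases)
  std::vector<double> diff;     // gradient from backpropagation
  std::vector<double> history;  // accumulated momentum history
  double lr_mult;               // per-parameter learning-rate multiplier
};

// Add the L2 weight-decay term to the gradient (assumed regularization).
void Regularize(Param& p, double weight_decay) {
  for (std::size_t i = 0; i < p.data.size(); ++i)
    p.diff[i] += weight_decay * p.data[i];
}

// Two steps: update the history value, then copy it back into the gradient.
void ComputeUpdateValue(Param& p, double rate, double momentum) {
  double local_rate = rate * p.lr_mult;  // global rate scaled by lr_mult
  for (std::size_t i = 0; i < p.data.size(); ++i) {
    p.history[i] = momentum * p.history[i] + local_rate * p.diff[i];
    p.diff[i] = p.history[i];
  }
}

// Subtract the prepared update from the weights: data = data - diff.
void Update(Param& p) {
  for (std::size_t i = 0; i < p.data.size(); ++i)
    p.data[i] -= p.diff[i];
}
```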