Deep Learning Note 9: Implementing the Weight Update

Weight Update

In the preceding backpropagation steps we computed the partial derivatives of each layer's weights W and biases b; the last step is to update the weights and biases with them.

In the earlier introduction of the BP algorithm, we gave the following update formula:

W = W - alpha * ∂L/∂W
b = b - alpha * ∂L/∂b
Here alpha is the learning rate. In general the learning rate is not a constant but a monotonically decreasing function of the number of training iterations. There are several reasons to use a changing learning rate:

1. A large initial learning rate updates the network parameters quickly, so they approach their target values faster. Because each update step is large, the network can also "skip over" local minima in the early stage of training.

2. After the network has been training for a while, a large learning rate may no longer improve its accuracy, i.e. training stalls; at that point we need to reduce the learning rate and continue training.

In our network, the layers that contain parameters are convolution layer 1, convolution layer 2, fully connected layer 1 and fully connected layer 2; in total 4 layers have parameters to update, and each layer has both weights W and biases b. In practice, both the parameters and the gradients we computed earlier are stored linearly, so we can treat the whole update as an operation on a one-dimensional array and need not care whether the weight W is an 800*500 matrix and so on; the weight update and the bias update can therefore share the same code that operates on a one-dimensional array.

Weight update strategy in Caffe: the weights are updated with stochastic gradient descent (SGD) with momentum, using the following formula (the momentum update):

W = W - Δhistory

where Δhistory is the accumulation of multiple gradients:

Δhistory = momentum * Δhistory + alpha * ∂L/∂W
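Since every learnable parameter blob is stored as a flat array, one helper can perform the momentum update for both weights and biases. Below is a minimal standalone C++ sketch of that idea (plain std::vector arrays and the hypothetical name momentum_sgd_update; it is only an illustration, not the actual Caffe code):

#include <cstddef>
#include <vector>

// Momentum update on a flat parameter array. "history" accumulates past
// gradients; the same routine serves weights and biases because both are
// stored linearly.
void momentum_sgd_update(std::vector<float>& param,
                         const std::vector<float>& grad,
                         std::vector<float>& history,
                         float lr, float momentum) {
  for (std::size_t i = 0; i < param.size(); ++i) {
    history[i] = momentum * history[i] + lr * grad[i];  // accumulate the update
    param[i] -= history[i];                             // apply the update
  }
}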


The learning rate update strategy in Caffe

In a comment in \src\caffe\solvers\sgd_solver.cpp, Caffe gives the following learning rate update policies:

// Return the current learning rate. The currently implemented learning rate
// policies are as follows:
//    - fixed: always return base_lr.
//    - step: return base_lr * gamma ^ (floor(iter / step))
//    - exp: return base_lr * gamma ^ iter
//    - inv: return base_lr * (1 + gamma * iter) ^ (- power)
//    - multistep: similar to step but it allows non uniform steps defined by
//      stepvalue
//    - poly: the effective learning rate follows a polynomial decay, to be
//      zero by the max_iter. return base_lr (1 - iter/max_iter) ^ (power)
//    - sigmoid: the effective learning rate follows a sigmoid decay
//      return base_lr ( 1/(1 + exp(-gamma * (iter - stepsize))))
//
// where base_lr, max_iter, gamma, step, stepvalue and power are defined
// in the solver parameter protocol buffer, and iter is the current iteration.

As can be seen, the learning rate can be updated in several ways: fixed, step, exp, inv, multistep, poly and sigmoid; the formulas above make each implementation clear.

In practice our network uses the inv policy, namely learn_rate = base_lr * (1 + gamma * iter) ^ (-power).
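To make the inv policy concrete, here is a small standalone C++ sketch of how such a rate could be computed from the solver parameters (the function name inv_learning_rate is hypothetical; this only illustrates the formula and is not Caffe's GetLearningRate()):

#include <cmath>

// inv policy: learn_rate = base_lr * (1 + gamma * iter) ^ (-power)
double inv_learning_rate(double base_lr, double gamma, double power, int iter) {
  return base_lr * std::pow(1.0 + gamma * iter, -power);
}

// Usage: inv_learning_rate(0.01, 0.0001, 0.75, 10000) returns roughly 0.00595.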


Implementation of the weight update in Caffe

The configuration file \examples\mnist\lenet_solver.prototxt stores the parameters used when the network is initialized. Let's first look at the parameters related to the learning rate.

# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75

Based on the above parameters, we can calculate the learning rate for each iteration: learn_rate = base_lr * (1 + gamma * iter) ^ (-power).
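As a quick check of the formula with these values (iteration 10000 is just an example value, not taken from the original text): at iteration 0 the rate is simply base_lr = 0.01, while at iteration 10000 we get learn_rate = 0.01 * (1 + 0.0001 * 10000) ^ (-0.75) = 0.01 * 2 ^ (-0.75) ≈ 0.00595, so the learning rate decays smoothly as training progresses.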


After obtaining the learning rate, we use it to update the parameters in the network. \src\caffe\solvers\sgd_solver.cpp contains the function ApplyUpdate(), which performs the weight update; let's walk through it.

template <typename Dtype>
void SGDSolver<Dtype>::ApplyUpdate() {
  CHECK(Caffe::root_solver());
  // GetLearningRate() obtains the learning rate for this iteration
  Dtype rate = GetLearningRate();
  if (this->param_.display() && this->iter_ % this->param_.display() == 0) {
    LOG(INFO) << "Iteration " << this->iter_ << ", lr = " << rate;
  }
  ClipGradients();
  // Update the whole network: 4 layers, each with W and b, so size = 8
  for (int param_id = 0; param_id < this->net_->learnable_params().size();
       ++param_id) {
    // Normalization; our network does not use this function
    Normalize(param_id);
    // Regularization
    Regularize(param_id);
    // Compute the update value used for this parameter
    ComputeUpdateValue(param_id, rate);
  }
  // Update the parameters with the values computed by ComputeUpdateValue
  this->net_->Update();
}

The Regularize function mainly computes the regularization (weight decay) term and adds it to the gradient; for L2 regularization this amounts to diff = diff + weight_decay * W.
The ComputeUpdateValue function is divided into two steps. The first step updates the history value (the accumulated update):

Δhistory = momentum * Δhistory + local_rate * ∂L/∂W

The history value is then copied into the parameter's diff, so that the subsequent Update() call subtracts it from the parameter data:

diff = Δhistory
The lr_mult learning rate multiplier is used in ComputeUpdateValue; it also appears in the network configuration seen earlier. The weights and the biases in the same layer may be updated with different learning rates, so each can have its own lr_mult, and the local rate is local_rate = rate * lr_mult. Finally, the this->net_->Update() function updates the parameters using the update values computed by ComputeUpdateValue.
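Putting the pieces together, the per-parameter logic described above can be sketched as follows. This is a simplified standalone illustration on flat arrays (the function name regularize_and_compute_update is hypothetical); the real Caffe implementation operates on Blob data/diff buffers and also has GPU paths:

#include <cstddef>
#include <vector>

// Simplified per-parameter update, mirroring the steps described above:
// 1) add L2 weight decay to the gradient,
// 2) update the history with momentum and the local rate (rate * lr_mult),
// 3) copy the history into the diff and subtract it from the parameter.
void regularize_and_compute_update(std::vector<float>& param,
                                   std::vector<float>& diff,
                                   std::vector<float>& history,
                                   float rate, float lr_mult,
                                   float momentum, float weight_decay) {
  const float local_rate = rate * lr_mult;
  for (std::size_t i = 0; i < param.size(); ++i) {
    diff[i] += weight_decay * param[i];                         // Regularize (L2)
    history[i] = momentum * history[i] + local_rate * diff[i];  // ComputeUpdateValue, step 1
    diff[i] = history[i];                                       // ComputeUpdateValue, step 2
    param[i] -= diff[i];                                        // net->Update()
  }
}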
