Caffe parameter configuration Solver.prototxt and optimization algorithm selection



This article mainly includes the following content:



- solver.prototxt introduction and process
- solver.prototxt optimization algorithm selection
- Batch gradient descent (BGD) and stochastic gradient descent (SGD)
- SGD: stochastic gradient descent
- Adagrad: adaptive gradient descent
- Adadelta: an extension of Adagrad
- RMSprop: a special case of Adadelta
- Adam: adaptive moment estimation (currently the best-performing gradient descent algorithm)
- NAG: Nesterov accelerated gradient



solver.prototxt Introduction and Process



The .proto file defines the parameters of a structure; a .prototxt file supplies concrete initialization data for that structure and is used to configure and initialize the net.
The solver configuration file defines the parameters that must be set when training a network model, such as the learning rate, the weight decay coefficient, the number of iterations, and whether to use the GPU or the CPU.



In the forward pass, the loss is computed; in the backward pass, the gradient is computed. The parameter update is then calculated from the loss gradient, the gradient of the regularization term, and any method-specific terms.



Solver workflow: define the objective to optimize, along with a training network for learning and a test network for evaluation (by referencing another .prototxt configuration file); update the parameters by iteratively optimizing with forward and backward passes; evaluate the test network at regular intervals (e.g. run one test after a set number of training iterations); and display the model and solver state during optimization.

# Set the network model, the path of the file should start from the root directory of caffe
net: "examples/mnist/lenet_train_test.prototxt"

# Combined with batch_size in the test layer: suppose there are 10,000 test samples in total. Executing all of them at once is inefficient, so the test data is split into batches of batch_size each. With batch_size 100, 100 iterations cover all 10,000 samples, so test_iter is set to 100 (100 * 100 = 10000). A sensible setting lets the test traverse every test sample exactly once
test_iter: 100

# Do not run a test before training begins; by default one test is performed before training starts
test_initialization: false

# Test interval: run one test every test_interval training iterations. A sensible setting lets each interval of training traverse all training samples
test_interval: 500

# The base learning rate, momentum and the weight decay of the network.
# base_lr is the basic learning rate
base_lr: 0.01
# Optimization algorithm selection; stochastic gradient descent is chosen here (SGD is the default, so this line can be omitted)
type: SGD
# Weight of the last gradient update
momentum: 0.9
# Weight decay term, a regularization parameter that helps prevent overfitting
weight_decay: 0.0005

# The learning rate policy
lr_policy: "inv"
# Parameters controlling how the learning rate decays (see lr_policy below)
gamma: 0.0001
power: 0.75

# Display results on screen every display training iterations
display: 100

# Maximum number of iterations. Too small and training will not converge, giving low accuracy; too large causes oscillation and wastes time
max_iter: 10000

# Snapshot settings: snapshot saves the training state every given number of iterations; snapshot_prefix sets the save path
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"

# Set the operating mode: CPU or GPU (default is GPU)
solver_mode: CPU

lr_policy: learning rate policies



lr_policy can be set to the following values; the corresponding learning rate is computed as follows:



- fixed:     keep base_lr unchanged.

- step:      requires stepsize; return base_lr * gamma ^ (floor(iter / stepsize)), where iter is the current iteration number.

- exp:       return base_lr * gamma ^ iter, where iter is the current iteration number.

- inv:       requires power; return base_lr * (1 + gamma * iter) ^ (-power).

- multistep: requires stepvalue. Similar to step, but where step changes at uniform, equal intervals, multistep changes at the iterations given by stepvalue.

- poly:      polynomial decay of the learning rate; return base_lr * (1 - iter / max_iter) ^ power.

- sigmoid:   sigmoid decay of the learning rate; return base_lr * (1 / (1 + exp(-gamma * (iter - stepsize)))).
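The formulas above can be sketched in code. This is an illustrative reimplementation, not Caffe's actual C++; parameter names mirror the solver.prototxt fields, and multistep is omitted since it depends on a list of stepvalue entries:

```python
import math

# Illustrative sketch of Caffe's lr_policy formulas.
def caffe_lr(policy, base_lr, it, gamma=None, power=None,
             stepsize=None, max_iter=None):
    if policy == "fixed":
        return base_lr
    if policy == "step":
        return base_lr * gamma ** (it // stepsize)
    if policy == "exp":
        return base_lr * gamma ** it
    if policy == "inv":
        return base_lr * (1 + gamma * it) ** (-power)
    if policy == "poly":
        return base_lr * (1 - it / max_iter) ** power
    if policy == "sigmoid":
        return base_lr / (1 + math.exp(-gamma * (it - stepsize)))
    raise ValueError("unknown lr_policy: " + policy)

# With the example solver above (inv, base_lr=0.01, gamma=0.0001, power=0.75),
# the rate decays smoothly from 0.01:
print(caffe_lr("inv", 0.01, 0, gamma=0.0001, power=0.75))      # 0.01
print(caffe_lr("inv", 0.01, 10000, gamma=0.0001, power=0.75))  # ~0.0059
```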




In the DIGITS visual interface, the solver.prototxt optimization algorithm can be chosen between the batch gradient descent method (BGD) and the stochastic gradient descent method (SGD).



Batch gradient descent (BGD) is the most common form of gradient descent: all samples (the entire training set) are used to update the parameters. (It can exhaust memory and tends to fall into local minima.)



Stochastic gradient descent (SGD), as used here, actually means the same thing as mini-batch gradient descent (MBGD): a group of samples is drawn at random and used as the basis for updating the parameters. (The added randomness makes it possible to jump out of local minima.)



Their pros and cons are pronounced. In terms of training speed, stochastic gradient descent is very fast, while batch gradient descent cannot keep up when the sample size is very large.



In terms of accuracy, stochastic gradient descent determines the gradient direction from only a mini-batch of samples, so the solution may not be optimal: the iteration direction changes greatly and convergence to a local optimum is slow.



Likewise, a relatively large batch size gives better results (it should be neither too small nor too large). On a GPU using matrix multiplication (parallel computation), computing one gradient takes about the same time whether the batch size is 1 or 10; therefore, with a batch size of 10, training runs almost 10 times faster than with a batch size of 1. Moreover, the randomness is smaller, the oscillation less pronounced, and convergence is faster.

SGD: stochastic gradient descent



In SGD, each iteration computes the gradient of a mini-batch and then updates the parameters, using a linear combination of the negative gradient and the previous weight update. (With traditional gradient descent it is easy to converge to a local optimum, and choosing a suitable learning rate is difficult.) The learning rate α is the weight of the negative gradient; the momentum μ is the weight of the previous update.
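The update just described can be sketched as follows (a minimal illustrative sketch; variable names are mine, not Caffe's):

```python
# SGD with momentum: v is the previous update, g the current mini-batch
# gradient, lr the learning rate (alpha), mu the momentum weight.
def sgd_momentum_step(w, v, g, lr=0.01, mu=0.9):
    v = mu * v - lr * g  # new update: momentum-weighted history minus scaled gradient
    w = w + v            # apply the update to the weight
    return w, v

# One step on a single scalar weight:
w, v = sgd_momentum_step(1.0, 0.0, g=0.5)  # v = -0.005, w = 0.995
```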






The learning parameters need some tuning to achieve the best results. In general, the learning rate α is initialized to 0.01; then, whenever the loss plateaus during training, α is divided by a constant, and this process is repeated several times. The momentum μ is set to 0.9; it smooths the weight updates (damping jitter) and makes learning more stable and fast.
By cross-validation, the momentum parameter is usually chosen from [0.5, 0.9, 0.95, 0.99]. Sometimes it is changed over time, from 0.5 up to 0.99. It indicates how much of the previous update direction is retained, between 0 and 1: at the start of training, when gradients may be large, an initial value of 0.5 is typical; once gradients shrink, it is raised to 0.9. The learning rate weights the current batch gradient, which affects the final update direction just as in plain SGD.
Features: early in the descent, the previous update is reused, and if the descent direction stays consistent, a larger μ accelerates learning well. In the middle and late stages, when the solution oscillates back and forth around a local minimum and the gradient approaches 0, μ increases the update magnitude, helping jump out of the trap; when the gradient changes direction, μ reduces the update. In a nutshell, momentum accelerates SGD in the relevant direction, suppresses oscillation, and speeds up convergence.

Adagrad: adaptive gradient descent



The adaptive gradient descent method constrains the learning rate per parameter: with the accumulated squared gradients n_t = n_{t-1} + g_t^2, the update is Δθ_t = -η · g_t / √(n_t + ε), so each parameter's step is the global rate divided by the root of its gradient history.
  
  
  
Features: early on, when g_t is small, the regularizer is large and can amplify the gradient; later, when g_t is large, the regularizer is small and constrains the gradient. It is well suited to handling sparse gradients.
The smoothing term eps (typically set between 1e-4 and 1e-8) prevents division by zero.
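A minimal sketch of the Adagrad rule as just described (illustrative names and defaults, not Caffe's implementation):

```python
# Adagrad: accumulate squared gradients, divide each step by their root.
def adagrad_step(w, cache, g, lr=0.01, eps=1e-8):
    cache = cache + g ** 2                  # running sum of squared gradients
    w = w - lr * g / (cache ** 0.5 + eps)   # per-parameter scaled step
    return w, cache
```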
Disadvantages: it still relies on a manually set global learning rate. If that rate is set too large, the regularizer becomes too sensitive and adjusts the gradient too strongly; in the late stage, the accumulated squared gradients in the denominator grow ever larger, driving the updates toward 0 and ending training prematurely.

Adadelta: an extension of Adagrad



Adadelta is an extension of Adagrad. Where Adagrad sums all previous squared gradients, Adadelta accumulates only a fixed-size window, and rather than storing those terms directly it approximates the corresponding running average.
  
  
Features: in the early and middle stages of training, the acceleration effect is good and progress is fast; late in training, it jitters repeatedly around the local minimum.



Also: Using the Adadelta algorithm, we don't even need to set the default learning rate.



$E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1-\gamma)\Delta\theta^2_t$
$RMS[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}$
$\Delta\theta_t = -\dfrac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t} g_t$
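The three equations can be sketched in code. This is an illustrative sketch; the gamma and eps defaults here are common choices, not Caffe's:

```python
# Adadelta step: running averages of squared gradients (eg2) and squared
# updates (edx2) replace Adagrad's unbounded sum; no global learning rate.
def adadelta_step(w, eg2, edx2, g, gamma=0.95, eps=1e-6):
    eg2 = gamma * eg2 + (1 - gamma) * g ** 2               # E[g^2]_t
    dx = -((edx2 + eps) ** 0.5 / (eg2 + eps) ** 0.5) * g   # RMS ratio * gradient
    edx2 = gamma * edx2 + (1 - gamma) * dx ** 2            # E[dx^2]_t
    return w + dx, eg2, edx2
```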



RMSprop: a special case of Adadelta


    cache = decay_rate * cache + (1 - decay_rate) * dx**2
    x += -learning_rate * dx / (np.sqrt(cache) + eps)


RMSprop is a very effective adaptive learning rate method. It modifies Adagrad in a very simple way, making it less aggressive and stopping the monotonic decrease of the learning rate: specifically, it uses a moving average of the squared gradients.



In the code, decay_rate is a hyperparameter, commonly set to one of [0.9, 0.99, 0.999]. RMSprop still adjusts each weight's learning rate based on the magnitude of its gradients, which also works well; but unlike Adagrad, its updates do not drive the learning rate down monotonically, so updates never grind to a halt.



Features: RMSprop depends on a global learning rate. It is a development of Adagrad and a variant of Adadelta, with behavior between the two; it is suitable for non-stationary objectives and works very well for RNNs.

Adam: adaptive moment estimation (currently the best-performing gradient descent algorithm)



Adam is essentially RMSprop with a momentum term: it dynamically adjusts each parameter's learning rate using first-order and second-order moment estimates of the gradients.
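A minimal sketch of the Adam rule (the beta and eps defaults follow the common convention; variable names are illustrative, not Caffe's):

```python
# Adam: bias-corrected running means of the gradient (m, first moment) and
# of its square (v, second moment) scale each parameter's step.
def adam_step(w, m, v, g, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g        # first-moment estimate
    v = b2 * v + (1 - b2) * g ** 2   # second-moment estimate
    m_hat = m / (1 - b1 ** t)        # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v
```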
  
  

