Optimization Methods in Caffe


In deep learning, the loss function is often non-convex and has no analytic solution, so it must be minimized with an iterative optimization method. Caffe reduces the loss by coordinating the forward propagation of the entire network with the backward propagation of gradients, which are then used to update the parameters.

Caffe encapsulates three optimization methods: stochastic gradient descent (SGD), adaptive gradient (AdaGrad), and Nesterov's accelerated gradient (NAG).


The Solver process:

1. Create the training network for learning and the test network for evaluation, and define the objective to be optimized.

2. Optimize by iteratively running forward and backward passes and updating the parameters.

3. Periodically evaluate the test network.

4. Display and snapshot the state of the model and the solver throughout the optimization.

What each iteration does (a minimal sketch of one iteration follows this list):

1. Compute the network's output and loss with a forward pass.

2. Compute the network's gradients with a backward pass.

3. Use the gradients to update the parameters, according to the solver method.

4. Update the solver's state according to the learning rate, history, and method.
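For illustration, here is a minimal NumPy sketch of one such iteration on a toy linear model. This is not Caffe's actual C++ solver; the model, loss, and all names below are illustrative assumptions.

import numpy as np

# Toy mini-batch and parameters (illustrative assumptions, not Caffe objects).
X = np.random.randn(32, 10)   # 32 instances with 10 features
y = np.random.randn(32)       # targets
W = np.zeros(10)              # model weights
lr = 0.01                     # learning rate

# 1. Forward: compute the output and the loss (mean squared error here).
pred = X @ W
loss = np.mean((pred - y) ** 2)

# 2. Backward: compute the gradient of the loss with respect to W.
grad = 2.0 * X.T @ (pred - y) / len(y)

# 3. Update the parameters according to the solver method (plain SGD here).
W -= lr * grad

# 4. A real solver would now also update its own state
#    (iteration count, learning-rate schedule, momentum history).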

Like Caffe models, Caffe solvers can be run in either CPU or GPU mode.
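For example, with the standard pycaffe interface the mode is typically selected before running a solver. This is a hedged sketch; 'solver.prototxt' is a placeholder path, not a file from the article.

import caffe

# Select GPU 0; call caffe.set_mode_cpu() instead on CPU-only machines.
caffe.set_device(0)
caffe.set_mode_gpu()

# Load a solver definition and run the full optimization it describes.
solver = caffe.SGDSolver('solver.prototxt')  # placeholder filename
solver.solve()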

1. Methods

The solver methods address the general problem of minimizing a loss function. For a dataset D, the objective to be optimized is the average loss over all |D| data instances in the dataset:

L(W) = \frac{1}{|D|} \sum_i f_W(X^{(i)}) + \lambda R(W)

where f_W(X^{(i)}) is the loss on data instance X^{(i)} and R(W) is a regularization term, weighted by \lambda, that weakens overfitting.

If this loss function were used directly, each iteration would have to compute over the entire dataset, which is very inefficient for very large datasets; this is the familiar (batch) gradient descent method.


In practice, a mini-batch of the dataset is used in each iteration, with N << |D| instances; the loss function then becomes:

L(W) \approx \frac{1}{N} \sum_i f_W(X^{(i)}) + \lambda R(W)
With this loss function, the loss and gradients can be computed iteratively to optimize the problem: in a neural network, the forward pass computes the loss and the backward pass computes the gradients.
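For concreteness, here is a small NumPy sketch of this mini-batch objective, using squared error on a linear model as f_W and 0.5*||W||^2 as R(W). These concrete choices, and the weight_decay value standing in for \lambda, are illustrative assumptions.

import numpy as np

def minibatch_loss(W, X, y, weight_decay=5e-4):
    # f_W(X^(i)): squared error of a linear model, one value per instance.
    per_instance = (X @ W - y) ** 2
    # (1/N) * sum_i f_W(X^(i)): average loss over the mini-batch.
    data_term = per_instance.mean()
    # lambda * R(W) with R(W) = 0.5 * ||W||^2.
    reg_term = weight_decay * 0.5 * np.sum(W ** 2)
    return data_term + reg_term

# Usage on a random mini-batch of N = 16 instances:
X, y, W = np.random.randn(16, 8), np.random.randn(16), np.zeros(8)
print(minibatch_loss(W, X, y))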

1.1 SGD (solver_type: SGD)

Stochastic gradient descent (SGD) updates the weights W with a linear combination of the negative gradient \nabla L(W) and the previous weight update V_t. The iteration formulas are:

V_{t+1} = \mu V_t - \alpha \nabla L(W_t)

W_{t+1} = W_t + V_{t+1}
Here the learning rate \alpha is the weight of the negative gradient, and the momentum \mu is the weight of the previous update. These two parameters need to be tuned for the best results, usually based on experience. If you do not know how to set them, refer to the rules of thumb below; for more parameter-setting techniques, see the paper Stochastic Gradient Descent Tricks [1].
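A minimal NumPy sketch of this update rule follows (an illustrative reimplementation of the formulas above, not Caffe's code; all names are assumptions):

import numpy as np

def sgd_momentum_update(W, V, grad, lr=0.01, momentum=0.9):
    # V_{t+1} = mu * V_t - alpha * grad L(W_t)
    V = momentum * V - lr * grad
    # W_{t+1} = W_t + V_{t+1}
    W = W + V
    return W, V

# Usage with dummy values:
W = np.zeros(4)            # current weights W_t
V = np.zeros(4)            # previous update V_t (the momentum state)
grad = np.random.randn(4)  # gradient of the mini-batch loss at W_t
W, V = sgd_momentum_update(W, V, grad)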


Rules of thumb for setting the learning rate and momentum

Example

base_lr: 0.01     # begin training at a learning rate of 0.01 = 1e-2
lr_policy: "step" # learning rate policy: drop the learning rate in "steps"
                  # by a factor of gamma every stepsize iterations
gamma: 0.1        # drop the learning rate by a factor of 10
                  # (i.e., multiply it by a factor of gamma = 0.1)
stepsize: 100000  # drop the learning rate every 100K iterations
max_iter: 350000  # train for 350K iterations in total
momentum: 0.9

A good strategy for using SGD in deep learning is to initialize the learning rate to about 0.01 and, whenever the loss begins to plateau during training, drop the learning rate by a constant factor (for example, 10), repeating this several times. In addition, momentum is generally set to 0.9; momentum makes deep learning with SGD both more stable and faster. These initial parameters follow the paper ImageNet Classification with Deep Convolutional Neural Networks [2].

In the example above, the learning rate is initialized to 0.01; after the first 100K iterations it is updated (multiplied by gamma) to 0.01 * 0.1 = 0.001 and used for iterations 100K-200K, and so on, until the maximum of 350K iterations is reached.
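This schedule corresponds to the rule lr = base_lr * gamma^(floor(iter / stepsize)). A plain-Python sketch, using the parameter names from the example config:

def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=100000):
    # Learning rate under the "step" policy.
    return base_lr * gamma ** (iteration // stepsize)

print(step_lr(50000))    # 0.01   (iterations 0 to 99,999)
print(step_lr(150000))   # 0.001  (iterations 100,000 to 199,999)
print(step_lr(250000))   # 0.0001 (iterations 200,000 to 299,999)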

Note that the momentum setting \mu effectively multiplies the size of your updates by a factor of 1/(1 - \mu) after many iterations of training, so if you increase \mu it is a good idea to decrease \alpha accordingly (and vice versa).

For example, with \mu = 0.9 the effective update size multiplier is 1/(1 - 0.9) = 10. If the momentum is increased to \mu = 0.99, the update size multiplier grows to 100, so \alpha (base_lr) should be dropped by a factor of 10.

The settings above are only guidelines; they do not guarantee the best results in every situation, and sometimes they do not work at all. If learning diverges (for example, you see very large, NaN, or inf loss values or outputs), lower base_lr (for example, to 0.001) and retrain, repeating this process until you find a base_lr that works.

1.2 AdaGrad (solver_type: ADAGRAD)

The adaptive gradient method (AdaGrad) [3] is a gradient-based optimization method (like SGD) that, in the authors' words, attempts to "find needles in haystacks in the form of very predictive but rarely seen features". Given the update information from all previous iterations, the update for the i-th component of W is:

(W_{t+1})_i = (W_t)_i - \alpha \frac{(\nabla L(W_t))_i}{\sqrt{\sum_{t'=1}^{t} (\nabla L(W_{t'}))_i^2}}
In practice, note that for weights W \in R^d, AdaGrad implementations (including the one in Caffe) need only O(d) extra storage for the historical gradient information (the accumulated sum of squared gradients per component), rather than the O(dt) storage that would be needed to keep every historical gradient individually.
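A minimal NumPy sketch of this update (an illustrative reimplementation, not Caffe's code) shows that only one extra array the size of W is kept; the eps constant is an implementation convention assumed here to avoid division by zero:

import numpy as np

def adagrad_update(W, grad, hist_sq_grad, lr=0.01, eps=1e-8):
    # Accumulate the squared gradient history component-wise (O(d) storage).
    hist_sq_grad += grad ** 2
    # Scale each component's step by the root of its accumulated squared gradients.
    W -= lr * grad / (np.sqrt(hist_sq_grad) + eps)
    return W, hist_sq_grad

# Usage with dummy values:
W = np.zeros(5)
hist = np.zeros(5)
grad = np.random.randn(5)
W, hist = adagrad_update(W, grad, hist)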

1.3 NAG (solver_type: NESTEROV)

Nesterov's accelerated gradient (NAG) [4] was an ideal method for convex optimization, achieving a convergence rate of O(1/t^2) rather than the O(1/t) of plain gradient descent. However, because the optimization problems in deep learning are usually non-smooth and non-convex, in practice NAG can still be a very effective optimization method for certain kinds of deep architectures, such as the deep MNIST autoencoders in [5].


The weight update is very similar to SGD:

V_{t+1} = \mu V_t - \alpha \nabla L(W_t + \mu V_t)

W_{t+1} = W_t + V_{t+1}
The difference is where the gradient is evaluated: in NAG the gradient is taken at the weights plus the momentum term, W_t + \mu V_t, whereas in SGD the gradient is simply taken at the current weights W_t.
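A minimal NumPy sketch of this update (illustrative only, not Caffe's implementation); grad_fn stands in for whatever computes \nabla L at a given point:

import numpy as np

def nag_update(W, V, grad_fn, lr=0.01, momentum=0.9):
    # Evaluate the gradient at the "looked-ahead" point W_t + mu * V_t.
    grad = grad_fn(W + momentum * V)
    # V_{t+1} = mu * V_t - alpha * grad;  W_{t+1} = W_t + V_{t+1}
    V = momentum * V - lr * grad
    W = W + V
    return W, V

# Usage on a toy quadratic loss L(W) = 0.5 * ||W - 1||^2, so grad L(W) = W - 1:
grad_fn = lambda W: W - 1.0
W, V = np.zeros(3), np.zeros(3)
for _ in range(200):
    W, V = nag_update(W, V, grad_fn)
print(W)  # moves toward the minimizer [1., 1., 1.]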

2. References

[1] L. Bottou. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade. Springer, 2012.
[2] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 2012.
[3] J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 2011.
[4] Y. Nesterov. A Method of Solving a Convex Programming Problem with Convergence Rate O(1/k^2). Soviet Mathematics Doklady, 1983.
[5] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. Proceedings of the 30th International Conference on Machine Learning, 2013.
[6] http://caffe.berkeleyvision.org/tutorial/solver.html
