Caffe Learning Series (8): Solver optimization method

Source: Internet
Author: User

As mentioned earlier, Caffe so far provides six optimization methods in total (see the configuration sketch after this list):

    • Stochastic Gradient Descent (type: "SGD"),
    • AdaDelta (type: "AdaDelta"),
    • Adaptive Gradient (type: "AdaGrad"),
    • Adam (type: "Adam"),
    • Nesterov's Accelerated Gradient (type: "Nesterov"), and
    • RMSprop (type: "RMSProp")
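
The method is selected with the type field of the solver definition file. As a hedged illustration only (it assumes pycaffe is installed, a Caffe build recent enough to support the string-valued type field, and an existing network definition file named train_val.prototxt; all file names and parameter values below are placeholders, not settings taken from this article), a solver that uses Adam could be created and loaded like this:

# Sketch only: assumes pycaffe is available and "train_val.prototxt" defines a network.
import caffe

solver_text = """
net: "train_val.prototxt"
type: "Adam"        # swap in "SGD", "AdaDelta", "AdaGrad", "Nesterov" or "RMSProp"
base_lr: 0.001
momentum: 0.9
momentum2: 0.999
lr_policy: "fixed"
max_iter: 10000
"""

with open("solver_adam.prototxt", "w") as f:
    f.write(solver_text)

caffe.set_mode_cpu()
solver = caffe.get_solver("solver_adam.prototxt")  # returns the solver matching "type"
solver.step(1)                                     # one forward/backward pass plus one update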

The solver is the optimization method used to minimize the loss. For a dataset D, the objective function to be optimized is the average loss over all the data in the dataset plus a regularization term:

    L(W) = (1/|D|) * Σ_i f_W(X_i) + λ * r(W)

where f_W(X_i) is the loss computed on a single data sample X_i (the per-sample losses are computed first, then summed, and finally averaged), and r(W) is the regularization term, weighted by λ (weight_decay), whose purpose is to reduce overfitting.

If this loss function is used directly, every iteration has to process the entire dataset, which is very inefficient when the dataset is very large; this is the classic (batch) gradient descent method.


In practice, the dataset is divided into batches (mini-batches), each containing N samples (batch_size), with N << |D|. The loss function then becomes:

    L(W) ≈ (1/N) * Σ_i f_W(X_i) + λ * r(W)

that is, the data loss is averaged over the N samples of the current mini-batch rather than over the whole dataset.
With this loss function in hand, the loss and the gradient can be computed iteratively to optimize the problem: in a neural network, the forward pass computes the loss and the backward pass computes the gradient.
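
As a plain NumPy sketch of these two steps (not Caffe code; a linear model with squared error stands in for f_W, and an L2 penalty stands in for r(W)):

# Illustrative only: squared error plays the role of f_W, L2 penalty the role of r(W).
import numpy as np

def minibatch_loss_and_grad(W, X, y, weight_decay):
    # forward pass: average the data loss over the N samples of the mini-batch
    residual = X.dot(W) - y
    data_loss = np.mean(0.5 * residual ** 2)        # (1/N) * Σ_i f_W(X_i)
    reg_loss = 0.5 * weight_decay * np.sum(W ** 2)  # λ * r(W)
    loss = data_loss + reg_loss
    # backward pass: gradient of the same objective with respect to W
    grad = X.T.dot(residual) / X.shape[0] + weight_decay * W
    return loss, grad

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))   # one mini-batch of N = 32 samples
y = rng.normal(size=32)
W = np.zeros(10)
loss, grad = minibatch_loss_and_grad(W, X, y, weight_decay=5e-4)
print(loss, grad.shape)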

In Caffe, stochastic gradient descent (SGD) is used by default. The other methods listed above are also gradient-based (like SGD), so this article only describes SGD in detail; interested readers can consult the original papers for the rest.

1. Stochastic gradient descent (SGD)

Stochastic gradient descent (SGD) is developed from gradient descent (also called the steepest descent method); Professor Andrew Ng explains the underlying principle in great detail in his "Machine Learning" course on NetEase Open Courses. SGD updates W with a linear combination of the negative gradient and the previous weight update V_t. The iteration formulas are:

    V_{t+1} = mu * V_t - alpha * ∇L(W_t)
    W_{t+1} = W_t + V_{t+1}

Here alpha is the learning rate (base_lr), which scales the negative gradient, and mu is the momentum, the weight given to the previous update value V_t; it balances the influence of the previous update direction against the current gradient direction. These two parameters need to be tuned to get the best results, usually based on experience. If you do not know how to set them, you can refer to the relevant papers.
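
A minimal Python/NumPy sketch of this update rule (alpha and mu stand in for base_lr and momentum; this is an illustration of the formula, not Caffe's internal implementation):

import numpy as np

def sgd_momentum_step(W, V, grad, alpha=0.01, mu=0.9):
    # V_{t+1} = mu * V_t - alpha * ∇L(W_t);  W_{t+1} = W_t + V_{t+1}
    V = mu * V - alpha * grad
    W = W + V
    return W, V

# usage: keep V (the previous update) between iterations,
# recomputing grad on each new mini-batch
W = np.zeros(10)
V = np.zeros_like(W)
grad = np.ones(10)   # placeholder gradient for illustration
W, V = sgd_momentum_step(W, V, grad)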

A good strategy for initializing these parameters when using SGD for deep learning is to set the learning rate to about 0.01 (base_lr: 0.01) and, during training, multiply the learning rate by a constant factor (gamma) whenever the loss levels off, repeating this process several times.

For momentum, typical values lie between 0.5 and 0.99; it is usually set to 0.9. Momentum makes SGD-based deep learning methods more stable and faster.

For more on momentum, refer to Hinton's "A Practical Guide to Training Restricted Boltzmann Machines".

Example:

base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 1000
max_iter: 3500
momentum: 0.9

With lr_policy set to "step", the learning rate changes according to the rule base_lr * gamma ^ floor(iter / stepsize).

That is, for the first 1000 iterations the learning rate is 0.01; for iterations 1001-2000 it is 0.001; for iterations 2001-3000 it is 0.0001; and for iterations 3001-3500 it is 0.00001 (10^-5).
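
This schedule can be checked with a few lines of Python (a sketch of the formula above, not Caffe's own scheduler):

import math

def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=1000):
    # base_lr * gamma ^ floor(iter / stepsize)
    return base_lr * gamma ** math.floor(iteration / stepsize)

for it in (0, 999, 1000, 2000, 3000, 3499):
    print(it, step_lr(it))
# prints 0.01 for the first stepsize iterations, then 0.001, 0.0001, and finally 1e-05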

The above settings are only guidelines; they do not guarantee the best results in every situation, and sometimes they do not work at all. If training diverges (for example, you see very large, NaN, or inf loss values or outputs at the very start), lower base_lr (for example, to 0.001), retrain, and repeat this process until you find a base_lr that works.

2. AdaDelta

AdaDelta is a "robust learning rate method" and, like SGD, is a gradient-based optimization method.

For a detailed introduction, see:

M. D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint, 2012.

3. AdaGrad

Adaptive Gradient (AdaGrad) is a gradient-based optimization method (like SGD).

For a detailed introduction, see:

J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 2011.

4. Adam

Adam is a gradient-based optimization method (like SGD).

For a detailed introduction, see:

D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, 2015.

5. NAG

Nesterov's Accelerated Gradient (NAG) is, in theory, an optimal method for convex optimization, and its convergence rate is very fast.

For a detailed introduction, see:

I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. Proceedings of the 30th International Conference on Machine Learning, 2013.

6. RMSProp

RMSProp was proposed by Tieleman in a lecture of the Coursera course "Neural Networks for Machine Learning", and is also a gradient-based optimization method (like SGD).

For a detailed introduction, see:

T. Tieleman and G. Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
