What do base_lr, weight_decay, lr_mult, and decay_mult mean in Caffe?

In machine learning and pattern recognition, overfitting can occur, and as a network starts to overfit, its weights tend to grow larger. To discourage this, a penalty term is added to the error function. A common penalty is the sum of the squares of all the weights multiplied by a decay constant; it punishes large weights.
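For instance (a made-up two-weight example with a decay constant of $\lambda = 0.0005$): for weights $w = (0.1,\ 2.0)$ the penalty $\frac{\lambda}{2}\sum_i w_i^2 = 0.00025 \times (0.01 + 4.0) \approx 0.001$, and almost all of it comes from the large weight, so it is the large weights that the penalty pushes down.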

The learning rate is a parameter that determines how much an update step influences the current value of the weights, while weight decay is an additional term in the weight update rule that causes the weights to decay exponentially toward zero if no other update is scheduled.
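Concretely (a small worked step using the learning rate $\eta$ and decay coefficient $\lambda$ that appear in the update rule derived below): if no gradient update is scheduled, the rule reduces to $w_i \leftarrow (1 - \eta\lambda)\, w_i$, so after $t$ steps $w_i^{(t)} = (1 - \eta\lambda)^t\, w_i^{(0)}$, which goes to zero for $0 < \eta\lambda < 1$.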

So let's say that we have a cost or error function $E(w)$ that we want to minimize. Gradient descent tells us to modify the weights $w$ in the direction of steepest descent in $E$:

$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$$
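As a minimal sketch of this update in Python (toy least-squares data and a hand-picked learning rate, nothing from the original post):

    import numpy as np

    # Toy error E(w) = 0.5 * ||X w - y||^2, chosen only for illustration
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = X @ np.array([1.0, -2.0, 0.5])

    w = np.zeros(3)
    eta = 0.05                      # learning rate (base_lr in Caffe terms)

    grad = X.T @ (X @ w - y)        # dE/dw for the toy error above
    w = w - eta * grad              # w_i <- w_i - eta * dE/dw_i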

where $\eta$ is the learning rate; if it is large you will have a correspondingly large modification of the weights $w_i$ (in general it shouldn't be too large, otherwise you'll overshoot the local minimum in your cost function).

In order to effectively limit the number of free parameters in your model and avoid overfitting, it is possible to regularize the cost function. An easy way to do this is to introduce a zero-mean Gaussian prior over the weights, which is equivalent to changing the cost function to $\tilde{E}(w) = E(w) + \frac{\lambda}{2} \sum_i w_i^2$. In practice this penalizes large weights and effectively limits the freedom of your model. The regularization parameter $\lambda$ determines how to trade off the original cost $E$ against the penalization of large weights.
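The extra term in the update rule below comes from differentiating the penalty (a one-line step filled in here for clarity):

$$\frac{\partial}{\partial w_i}\left(\frac{\lambda}{2} \sum_j w_j^2\right) = \lambda w_i, \qquad \text{so} \qquad \frac{\partial \tilde{E}}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda w_i.$$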

Applying gradient descent to the new cost function we obtain:

$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i} - \eta \lambda w_i$$

The new term $-\eta \lambda w_i$ coming from the regularization causes the weight to decay in proportion to its size.
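A minimal Python sketch of the regularized update (toy values chosen here for illustration, not from the original post); note that with a zero gradient each weight simply shrinks by a factor $(1 - \eta\lambda)$ per step, matching the exponential decay mentioned earlier:

    import numpy as np

    def sgd_step(w, grad, eta, lam):
        # w_i <- w_i - eta * dE/dw_i - eta * lam * w_i
        return w - eta * grad - eta * lam * w

    w = np.array([2.0, -3.0])
    for _ in range(3):
        # no gradient scheduled, so only the decay term acts
        w = sgd_step(w, grad=np.zeros_like(w), eta=0.1, lam=0.5)
    print(w)  # each weight has shrunk by a factor 0.95**3 ~= 0.857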

In your Caffe solver you likely have a base learning rate (base_lr) set, as well as a weight_decay. lr_mult indicates what to multiply the learning rate by for a particular layer. This is useful if you want to update some layers with a smaller learning rate (e.g. when finetuning some layers while training others from scratch), or if you do not want to update the weights of a layer at all (perhaps you keep all the conv layers fixed and just retrain the fully connected layers). decay_mult is the same, just for the weight decay.
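A rough Python sketch of how the solver settings and the per-layer multipliers combine (layer names and numbers are made up; the arithmetic follows the description above, i.e. each parameter blob gets base_lr * lr_mult and weight_decay * decay_mult):

    # Solver-level settings (illustrative values only)
    base_lr = 0.01
    weight_decay = 0.0005

    # Per-blob multipliers, as they would appear in a layer's param entries
    blobs = {
        "conv1 weights": {"lr_mult": 1,  "decay_mult": 1},
        "conv1 bias":    {"lr_mult": 2,  "decay_mult": 0},  # common: 2x lr, no decay on biases
        "frozen conv":   {"lr_mult": 0,  "decay_mult": 0},  # lr_mult = 0 leaves the blob untouched
        "fc (finetune)": {"lr_mult": 10, "decay_mult": 1},  # push the fully connected layer harder
    }

    for name, p in blobs.items():
        effective_lr = base_lr * p["lr_mult"]
        effective_decay = weight_decay * p["decay_mult"]
        print(f"{name}: lr = {effective_lr}, weight decay = {effective_decay}")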

Reference: http://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate

http://blog.csdn.net/u010025211/article/details/50055815

https://groups.google.com/forum/#!topic/caffe-users/8J_J8tc1ZHc
