In machine learning and pattern recognition, overfitting is a constant concern: as a network gradually overfits, its weights tend to grow large. To discourage this, a penalty term is added to the error function. The most common penalty is the sum of the squares of all the weights multiplied by a decay constant; it is used to punish large weights.
The learning rate is a parameter that determines how much an update step influences the current value of the weights, while weight decay is an additional term in the weight-update rule that causes the weights to decay exponentially toward zero when no other update is scheduled.
So let's say that we have a cost or error function $E(w)$ that we want to minimize. Gradient descent tells us to modify the weights $w$ in the direction of steepest descent in $E$:
$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i}$$
where $\eta$ is the learning rate; if it is large, you will have a correspondingly large modification of the weights $w_i$ (in general it shouldn't be too large, otherwise you'll overshoot the local minimum in your cost function).
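To make the update rule concrete, here is a minimal NumPy sketch; the quadratic cost and the starting weights are made-up values for illustration only.

```python
import numpy as np

# Toy quadratic cost E(w) = sum(w_i^2), so dE/dw_i = 2*w_i.
def grad_E(w):
    return 2.0 * w

w = np.array([5.0, -3.0])   # made-up starting weights
eta = 0.1                   # learning rate

for _ in range(50):
    w = w - eta * grad_E(w)  # w_i <- w_i - eta * dE/dw_i

print(w)  # both components end up very close to the minimum at 0
```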
In order to effectively limit the number of free parameters in your model so as to avoid over-fitting, it is possible to regularize the cost function. An easy way to do this is by introducing a zero-mean Gaussian prior over the weights, which is equivalent to changing the cost function to $\tilde{E}(w) = E(w) + \frac{\lambda}{2} w^2$. In practice this penalizes large weights and effectively limits the freedom in your model. The regularization parameter $\lambda$ determines how you trade off the original cost $E$ against the large-weights penalization.
Applying gradient descent to the new cost function we obtain:
$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i} - \eta \lambda w_i$$
The new term $-\eta\lambda w_i$ coming from the regularization causes the weight to decay in proportion to its size.
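Extending the same toy example, the decayed update below adds only the $-\eta\lambda w_i$ term; $\lambda$ is again an arbitrary illustrative value.

```python
import numpy as np

def grad_E(w):
    return 2.0 * w                # gradient of the same toy cost E(w) = sum(w_i^2)

w = np.array([5.0, -3.0])
eta, lam = 0.1, 0.01              # illustrative learning rate and decay strength

for _ in range(50):
    # w_i <- w_i - eta * dE/dw_i - eta * lambda * w_i
    w = w - eta * grad_E(w) - eta * lam * w

# With a zero data gradient the rule collapses to w <- (1 - eta*lam) * w,
# which is exactly the exponential decay toward zero mentioned earlier.
print(w)
```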
In your solver you likely have a learning rate set as well as a weight decay. lr_mult indicates what to multiply the learning rate by for a particular layer. This is useful if you want to update some layers with a smaller learning rate (e.g. when fine-tuning some layers while training others from scratch), or if you do not want to update the weights for one layer at all (perhaps you keep all the conv layers the same and just retrain the fully connected layers). decay_mult is the same, just for weight decay.
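As a rough sketch of how these four settings interact, the snippet below assumes plain SGD without momentum, taking the effective per-layer rate as base_lr * lr_mult and the effective decay as weight_decay * decay_mult (matching how Caffe's SGD solver applies L2 regularization); all numeric values are placeholders, not recommendations.

```python
import numpy as np

# Solver-level settings (base_lr and weight_decay in solver.prototxt).
base_lr = 0.01
weight_decay = 0.0005

# Layer-level multipliers (lr_mult and decay_mult in a layer's param block).
lr_mult = 10.0     # e.g. train a freshly initialized fc layer faster
decay_mult = 1.0

def sgd_step(w, grad):
    """One plain-SGD update for this layer's weight blob."""
    local_rate = base_lr * lr_mult            # effective learning rate
    local_decay = weight_decay * decay_mult   # effective weight decay
    return w - local_rate * (grad + local_decay * w)

w = np.array([0.5, -0.2])      # made-up weights
grad = np.array([0.1, 0.05])   # made-up data gradient
print(sgd_step(w, grad))

# lr_mult: 0 freezes a layer entirely; decay_mult: 0 (often used for
# biases) turns weight decay off for that particular parameter blob.
```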
Reference: http://stats.stackexchange.com/questions/29130/difference-between-neural-net-weight-decay-and-learning-rate
http://blog.csdn.net/u010025211/article/details/50055815
https://groups.google.com/forum/#!topic/caffe-users/8J_J8tc1ZHc
What do base_lr, weight_decay, lr_mult, and decay_mult mean in Caffe?