This article is based on the third chapter of "Neural Networks and Deep Learning" and describes how to choose initial values for the hyper-parameters of a machine learning algorithm. (This article will continue to be updated.)
Learning rate (learning rate, η)
When optimizing with the gradient descent algorithm, the weight update rule multiplies the gradient by a coefficient; that coefficient is the learning rate η. The following discusses strategies for choosing η during training.
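For reference, the update rule being described is the standard gradient descent step, written here informally, one line per parameter:

    w → w' = w − η · ∂C/∂w
    b → b' = b − η · ∂C/∂b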
- A fixed learning rate. If the learning rate is too small, convergence will be too slow; if it is too large, the cost function will oscillate, as the cost curves in the book's figure show. A good strategy, then, is to set the learning rate to 0.25 at first and change it to 0.025 when training reaches the 20th epoch.
As for why too large a learning rate causes oscillation, the book's figure makes it clear: a green ball marks the current position and an arrow the gradient direction, and the larger the learning rate, the further we move along that arrow. If the rate is too large, the step jumps straight across the valley and lands on the other side, the proverbial "step too big, stepping right over the valley".
In practice, how do we roughly determine a good learning rate? Apparently only by trying, as sketched below. You can set the learning rate to 0.01 and observe the trend of the training cost: if the cost is decreasing, gradually increase the learning rate, trying 0.1, 1.0, ...; if the cost is increasing, reduce it, trying 0.001, 0.0001, ... After a few such attempts you can roughly determine a suitable value for the learning rate.
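A minimal sketch of this trial-and-error search, using a made-up 1-D quadratic cost instead of a real network so that it is self-contained:

    # Toy illustration of the trial-and-error search on a 1-D quadratic
    # cost C(w) = w^2 (gradient 2w).  A real network is far noisier, but
    # the logic is the same: keep the largest learning rate for which the
    # training cost still decreases, and back off once it starts to grow
    # or oscillate.

    def run(learning_rate, steps=20, w0=1.0):
        w = w0
        costs = []
        for _ in range(steps):
            w -= learning_rate * 2 * w      # gradient descent step on C(w) = w^2
            costs.append(w ** 2)            # "training cost" after the step
        return costs

    for lr in (0.0001, 0.001, 0.01, 0.1, 1.0, 10.0):
        costs = run(lr)
        trend = "decreasing" if costs[-1] < costs[0] else "oscillating/diverging"
        print(f"learning rate {lr}: final cost {costs[-1]:.3g} ({trend})")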
Why determine the learning rate from the training cost rather than from validation accuracy? Here is a passage quoted directly from the book, for those who are interested:
This all seems quite straightforward. However, using the training cost to pick η appears to contradict what I said earlier in this section, namely, that we'd pick hyper-parameters by evaluating performance using our held-out validation data. In fact, we'll use validation accuracy to pick the regularization hyper-parameter, the mini-batch size, and network parameters such as the number of layers and hidden neurons, and so on. Why do things differently for the learning rate? Frankly, this choice is my personal aesthetic preference, and is perhaps somewhat idiosyncratic. The reasoning is that the other hyper-parameters are intended to improve the final classification accuracy on the test set, and so it makes sense to select them on the basis of validation accuracy. However, the learning rate is only incidentally meant to impact the final classification accuracy. Its primary purpose is really to control the step size in gradient descent, and monitoring the training cost is the best way to detect if the step size is too big. With that said, this is a personal aesthetic preference. Early on during learning, the training cost usually only decreases if the validation accuracy improves, and so in practice it's unlikely to make much difference which criterion you use.
Early stopping
So-called early stopping means computing the accuracy on the validation data at the end of each epoch (an epoch is one full pass over all the training data) and stopping training when that accuracy no longer improves. This is a natural thing to do: if accuracy no longer improves, further training is useless. It also helps prevent overfitting.
So when do we consider validation accuracy to have "stopped improving"? It is not the moment validation accuracy drops, because accuracy may dip for an epoch or two and then rise again, so we cannot declare "no improvement" after just one or two consecutive decreases. The right approach is to record the best validation accuracy seen so far during training; when no new best has been reached for 10 consecutive epochs (or more), we conclude that accuracy is "no longer improving" and apply early stopping. This strategy is called "no-improvement-in-n", where n is the number of epochs and can be set to 10, 20, 30, ... depending on the situation.
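A minimal sketch of the no-improvement-in-n rule, assuming the caller supplies train_one_epoch() and evaluate_validation_accuracy() callables (both hypothetical names standing in for your own training and evaluation routines):

    # "No-improvement-in-n" early stopping: track the best validation
    # accuracy seen so far, and stop only after `patience` consecutive
    # epochs without a new best (a single dip does not stop training).

    def train_with_early_stopping(train_one_epoch, evaluate_validation_accuracy,
                                  patience=10, max_epochs=1000):
        best_accuracy = 0.0
        epochs_without_improvement = 0
        for epoch in range(max_epochs):
            train_one_epoch()
            accuracy = evaluate_validation_accuracy()
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                epochs_without_improvement = 0      # new best: reset the counter
            else:
                epochs_without_improvement += 1     # a dip does not stop us yet
            if epochs_without_improvement >= patience:
                print(f"no improvement in {patience} epochs, stopping at epoch {epoch}")
                break
        return best_accuracy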
Variable learning rate
Earlier we discussed how to find a good learning rate by repeated trial. At the start we can set it a bit larger so that the weights change faster; from the direction of the cost curve (going up or down) you can then decide whether to increase or decrease the learning rate.
But the problem is that, after finding a suitable learning rate, we simply used that single value for the entire training run. This is clearly not ideal: during optimization the learning rate should gradually decrease, and the closer we get to the "valley", the smaller the "steps" should be.
In the earlier cost graph we said we could set the learning rate to 0.25 and change it to 0.025 at the 20th epoch. But that was a manual adjustment, made only after plotting the cost curve. Can the program decide automatically, during training, when to reduce the learning rate?
The answer is yes, and there are many ways to do it. A simple and effective approach: when validation accuracy satisfies the no-improvement-in-n rule, instead of stopping early we halve the learning rate and let the program continue running. The next time validation accuracy satisfies the no-improvement-in-n rule, we halve the learning rate again (now one quarter of the original), and so on. We continue this process until the learning rate has fallen to 1/1024 of its original value, and then terminate the program. (1/1024, 1/512, or some other threshold can be chosen according to the actual situation.) PS: you can also choose to divide the learning rate by 10 each time instead of by 2.
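A sketch of that schedule, again with train_one_epoch and evaluate_validation_accuracy as hypothetical stand-ins for your own routines:

    # Variable learning rate: every time the no-improvement-in-n rule
    # fires, halve the learning rate instead of stopping, and terminate
    # once it has fallen to 1/1024 of its original value.

    def train_with_lr_halving(train_one_epoch, evaluate_validation_accuracy,
                              initial_lr=0.5, patience=10, final_fraction=1.0 / 1024):
        lr = initial_lr
        best_accuracy = 0.0
        epochs_without_improvement = 0
        while lr > initial_lr * final_fraction:
            train_one_epoch(lr)
            accuracy = evaluate_validation_accuracy()
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                lr /= 2.0                           # halve instead of stopping
                epochs_without_improvement = 0
        return best_accuracy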
A readable recent paper which demonstrates the benefits of variable learning rates in attacking MNIST is "Deep, Big, Simple Neural Nets Excel on Handwritten Digit Recognition".
Regularization coefficient (regularization parameter, λ)
There does not seem to be a good guideline for the initial value of the regularization coefficient. The recommended procedure is to first set λ to 0 and determine a good learning rate. Then, with the learning rate fixed, give λ a value (such as 1.0) and, based on validation accuracy, increase or decrease λ by a factor of 10. This factor-of-10 search is the coarse adjustment; once you have found the right order of magnitude, say λ = 0.01, you can fine-tune further, for example trying 0.02, 0.03, 0.009, ...
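A rough sketch of the coarse search, where validation_accuracy_for(lam) is a hypothetical helper that trains the network with regularization coefficient lam (at the already-chosen learning rate) and returns its validation accuracy:

    # Coarse search for lambda over several orders of magnitude; keep the
    # value with the best validation accuracy.  A finer search around the
    # winner (e.g. 0.02, 0.03, ...) can follow the same pattern.

    def coarse_search_lambda(validation_accuracy_for,
                             candidates=(0.0, 0.001, 0.01, 0.1, 1.0, 10.0)):
        results = {lam: validation_accuracy_for(lam) for lam in candidates}
        best_lambda = max(results, key=results.get)
        return best_lambda, results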
Chapter 3 of Neural Networks: Tricks of the Trade, "A Simple Trick for Estimating the Weight Decay Parameter", discusses how to estimate the weight decay coefficient; interested readers can take a look.
Mini-batch Size
Let's start with the weight update rule when using mini-batches. For example, with a mini-batch size of 100, the weight update rule is (informally, for each weight w):

    w → w' = w − (η/100) · Σ_x ∂C_x/∂w

where the sum runs over the 100 samples x in the mini-batch. In other words, the average gradient over the 100 samples replaces the single-sample gradient used in the online learning method:

    w → w' = w − η · ∂C_x/∂w
When using a mini-batch, we can put all the samples of the batch into one matrix and use a linear algebra library to accelerate the gradient computation; this is an engineering optimization in the implementation.
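A small self-contained illustration of that point (the layer sizes and data below are made up purely for demonstration):

    import numpy as np

    # Stacking the mini-batch into one matrix lets the linear algebra
    # library (here NumPy) compute all per-sample quantities in one call.

    rng = np.random.default_rng(0)
    batch_size, n_in, n_out = 100, 784, 30
    X = rng.standard_normal((batch_size, n_in))   # one mini-batch, one row per sample
    W = rng.standard_normal((n_in, n_out))
    b = np.zeros(n_out)

    # Looping over samples (slow): one matrix-vector product per sample.
    outputs_loop = np.stack([x @ W + b for x in X])

    # Whole batch at once (fast): a single matrix-matrix product.
    outputs_batch = X @ W + b

    assert np.allclose(outputs_loop, outputs_batch)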
So how big should the mini-batch size be? A larger batch makes full use of the matrices and the linear algebra library to accelerate computation, while with a smaller batch the speed-up is less pronounced. Of course, bigger is not always better: if the batch is too large, weight updates become infrequent and the optimization process takes too long. So the mini-batch size is not fixed; choose it according to the size of your data set and the computing power of your hardware.
The way to go is therefore to use some acceptable (but not necessarily optimal) values for the other hyper-parameters, and then trial a number of different mini-batch sizes, scaling η as above. Plot the validation accuracy versus time (as in, real elapsed time, not epoch!), and choose whichever mini-batch size gives you the most rapid improvement in performance. With the mini-batch size chosen you can then proceed to optimize the other hyper-parameters.
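A sketch of the procedure described in the quoted passage, with train_one_epoch and evaluate_validation_accuracy again as hypothetical stand-ins (each candidate batch size should start from a freshly initialized network):

    import time

    # For each candidate mini-batch size, record validation accuracy
    # against real elapsed time (not epochs), scaling the learning rate
    # with the batch size, then pick the size whose curve rises fastest.

    def compare_batch_sizes(train_one_epoch, evaluate_validation_accuracy,
                            base_lr=0.1, base_batch=10,
                            candidate_sizes=(10, 32, 100, 320, 1000),
                            epochs=30):
        curves = {}
        for batch_size in candidate_sizes:
            lr = base_lr * batch_size / base_batch   # scale eta with batch size
            start = time.time()
            points = []                              # (elapsed seconds, accuracy)
            for _ in range(epochs):
                train_one_epoch(batch_size=batch_size, learning_rate=lr)
                points.append((time.time() - start, evaluate_validation_accuracy()))
            curves[batch_size] = points
        return curves                                # plot accuracy vs. elapsed time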
More Information
LeCun's 1998 paper "Efficient BackProp".
Bengio's 2012 paper, "Practical Recommendations for Gradient-Based Training of Deep Architectures", gives suggestions on gradient descent and detailed advice on selecting hyper-parameters.
Both of the above papers are collected in the 2012 book Neural Networks: Tricks of the Trade, which contains many other tricks as well.
Please credit the source when reprinting: http://blog.csdn.net/u012162613/article/details/44265967