See above
-Collect high-quality labeled data.
-Normalize input and output data to prevent numerical problems, e.g., by zero-centering and scaling each feature, or with PCA whitening.
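A minimal sketch of the normalization above, using NumPy; the data matrix `X` is made up for illustration, and the PCA-whitening step is just one possible choice:

```python
import numpy as np

# Hypothetical raw input batch (rows = examples, columns = features).
X = np.array([[10.0, 200.0],
              [12.0, 185.0],
              [ 8.0, 220.0],
              [14.0, 160.0]])

# Zero-center each feature and scale it to unit variance.
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# Optional PCA whitening: rotate onto principal components and
# equalize their variances (small epsilon avoids division by zero).
cov = np.cov(X_norm, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
X_white = X_norm @ eigvecs / np.sqrt(eigvals + 1e-5)
```

After this, every feature is on a comparable scale, which keeps the early gradient steps numerically well behaved.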
-Parameter initialization is important. If the weights start too small, the parameters barely move at all. A Gaussian with mean 0 and standard deviation 0.01 is a near-universal choice for the weights; don't be tempted to go larger. Initialize all biases to 0.
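The initialization recipe above can be sketched as follows; `init_layer` and the layer sizes are illustrative names, not from the original:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out):
    """Weights ~ Gaussian(mean=0, std=0.01); biases all zero."""
    W = rng.normal(loc=0.0, scale=0.01, size=(fan_in, fan_out))
    b = np.zeros(fan_out)
    return W, b

W, b = init_layer(256, 128)
```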
-Use SGD with a minibatch size of 128. Smaller sizes also work, but throughput drops and computation becomes less efficient.
-Use SGD with momentum; second-order methods are not worth the trouble.
-The step size (learning rate) of the gradient update is important. 0.1 is a good universal starting value. Tuning it can improve results; the usual practice is manual supervision: watch the error rate on a held-out validation set, and once it stops dropping, cut the step size in half or more.
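The halve-on-plateau schedule described above might look like this; the function name and the `patience` parameter are illustrative, not from the original:

```python
def halve_on_plateau(lr, history, patience=1):
    """Halve the learning rate when validation error has stopped improving.

    history: validation error rates per epoch, most recent last.
    """
    if len(history) > patience and min(history[-patience:]) >= min(history[:-patience]):
        return lr / 2
    return lr

lr = 0.1
errors = [0.30, 0.25, 0.26]   # latest epoch did not improve
lr = halve_on_plateau(lr, errors)
```

In practice this is done by a human watching the validation curve, as the text says; the function just makes the rule explicit.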
-Gradient normalization: divide the summed gradient by the minibatch size, so the step size does not depend explicitly on the minibatch size.
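A minimal sketch of that normalization; the per-example gradient array is made up for illustration:

```python
import numpy as np

def mean_gradient(per_example_grads):
    """Average per-example gradients so the effective step size
    is independent of the minibatch size."""
    return np.sum(per_example_grads, axis=0) / len(per_example_grads)

g64  = mean_gradient(np.ones((64, 10)))    # batch of 64
g128 = mean_gradient(np.ones((128, 10)))   # batch of 128
```

With this averaging, doubling the batch size leaves the gradient magnitude (and hence a fixed learning rate) unchanged.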
-Cap the norm of the weight parameters to keep them from blowing up. A maximum L2 norm of 2 to 4 per unit is typical; if a weight vector exceeds it, rescale it down to that value.
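The max-norm constraint above can be sketched as a projection applied after each update; treating each column as one unit's incoming weights is an assumption of this sketch:

```python
import numpy as np

def max_norm(W, c=3.0):
    """Rescale each column (one unit's incoming weights) so its
    L2 norm is at most c; columns already within the cap are untouched."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    return W * np.minimum(1.0, c / np.maximum(norms, 1e-12))

W = np.full((16, 4), 2.0)   # every column has norm sqrt(16 * 4) = 8
W = max_norm(W, c=3.0)      # each column shrinks to norm 3
```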
-Each gradient update should change the parameters by roughly one part in a thousand; if the ratio deviates far from that, adjust the learning rate.
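The update-to-parameter ratio can be monitored with a one-liner like the following; the example values are illustrative:

```python
import numpy as np

def update_ratio(params, update):
    """Ratio of update magnitude to parameter magnitude; aim for ~1e-3."""
    return np.linalg.norm(update) / np.linalg.norm(params)

params = np.ones(1000)
update = -0.001 * np.ones(1000)   # a well-scaled step: ratio is exactly 1e-3
r = update_ratio(params, update)
```

If `r` is much above 1e-3 the learning rate is probably too high; much below, too low.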
-Dropout must be used.
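A minimal dropout sketch; this uses the "inverted" variant that scales survivors at train time so test time is a no-op, which is equivalent to, but not identical with, the test-time scaling in the original Dropout paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, train=True):
    """Inverted dropout: zero each activation with probability p at train
    time and scale survivors by 1/(1-p); at test time, pass through."""
    if not train:
        return x
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

x = np.ones(100_000)
y = dropout(x, p=0.5)   # roughly half zeros, survivors scaled to 2.0
```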
-ReLU must be used.
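For completeness, ReLU itself is a one-liner:

```python
import numpy as np

def relu(x):
    """Rectified linear unit: max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

out = relu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0]))
```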
It is better to teach someone to fish than to give them a fish. For CNNs, the single best reference is the NIPS 2012 paper on CNNs for ImageNet; the Dropout paper is the best supplement.
Get your CNN running: these are all the secrets of tuning it.