Solutions to vanishing and exploding gradients
First of all, the root cause of vanishing and exploding gradients is backpropagation (the BP algorithm) itself.
With a sigmoid activation, the error term propagated backward through each layer is multiplied by the sigmoid derivative, which is at most 1/4.
In general, the update step for W and b is proportional to the learning rate. As the error propagates backward, it is multiplied at every layer by that layer's W and by the activation derivative, so the lower (earlier) the layer, the more its W and b update steps are distorted: they are either driven toward zero (vanishing gradient) or blown up (exploding gradient). A deep network then performs no better than a shallow one: the later layers learn reasonably well, while the shallow layers learn almost nothing. In a sigmoid network the gradient vanishes exponentially with depth.
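A minimal numeric sketch of the point above, using nothing beyond the Python standard library: the sigmoid derivative never exceeds 1/4, so the product of one such factor per layer shrinks exponentially with depth.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25 when x = 0

# Backprop multiplies in one sigmoid derivative per layer, so even in the
# best case (every pre-activation exactly 0) the factor per layer is 0.25
# and the product shrinks exponentially with depth.
n_layers = 10
grad = 1.0
for _ in range(n_layers):
    grad *= sigmoid_grad(0.0)

print(grad)  # 0.25**10, about 9.5e-07
```

Real networks also multiply in the layer weights, which can push the product the other way (explosion), but the exponential-in-depth behavior is the same.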
There are roughly the following strategies for this.
Let each layer of the network learn at a different learning rate
Replace the activation function
Using the ReLU activation function simplifies computation and solves the vanishing-gradient problem (its derivative is 1 for positive inputs). Because some neurons output exactly 0, the network becomes sparse, which reduces interdependence between parameters and mitigates overfitting.
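A small sketch of ReLU and its gradient (numpy; the example values are arbitrary):

```python
import numpy as np

def relu(x):
    # max(0, x): cheap to compute, and its gradient is exactly 1 for
    # positive inputs, so backprop factors do not shrink toward zero
    return np.maximum(0.0, x)

def relu_grad(x):
    return (x > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # negative inputs map to 0 -> sparse activations
print(relu_grad(z))  # gradient is 0 or 1, never a small fraction
```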
Using Batch Normalization
The data should be normalized before training the network. The reason: a neural network essentially learns the data distribution. If the training data and test data have different distributions, the network's generalization ability drops; and if each batch of training data has a different distribution, the network must keep adapting, which slows training.
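As a sketch of what batch normalization does to each mini-batch — numpy, per-feature statistics; the learnable scale gamma and shift beta are shown as plain arguments here, not as trained parameters:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch to mean 0 / variance 1,
    then rescale with learnable gamma and shift with beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.array([[1.0, 50.0],
                  [2.0, 60.0],
                  [3.0, 70.0]])
out = batch_norm_forward(batch, gamma=np.ones(2), beta=np.zeros(2))
print(out.mean(axis=0))  # ~[0, 0]
print(out.std(axis=0))   # ~[1, 1]
```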
Batch normalization helps with vanishing gradients by keeping the scale of the weight updates more consistent across layers, and it also speeds up training. It is applied to each layer's pre-activation values, i.e. after the linear mapping and before the activation function.

What to do about local optima
Simulated annealing
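A toy sketch of simulated annealing; the test function, cooling schedule, and step distribution are all assumptions for illustration, not from the original notes:

```python
import math
import random

def anneal(f, x0, t0=5.0, cooling=0.95, steps=500, seed=0):
    """Simulated annealing: accept uphill moves with probability
    exp(-delta/T), so the search can leave a local minimum while the
    temperature T is still high; cooling gradually turns it greedy."""
    rng = random.Random(seed)
    x, fx, t = x0, f(x0), t0
    best_x, best_f = x, fx
    for _ in range(steps):
        cand = x + rng.uniform(-1.0, 1.0)  # random neighbor
        fc = f(cand)
        if fc < fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling
    return best_x, best_f

def f(x):
    # two basins: a local minimum near x = +1, the global one near x = -1
    return (x * x - 1.0) ** 2 + 0.3 * x

best_x, best_f = anneal(f, x0=1.0)  # start inside the worse basin
print(best_x, best_f)
```

Because uphill moves are sometimes accepted, the run can cross the barrier between the two basins, which plain gradient descent started at x = 1 never would.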
Add momentum

Not converging: what causes it and how to fix it
Too little data
Learning rate too large
This may cause training to diverge from the very start (each layer's W becomes very large), or the running loss to suddenly blow up (often seen when the earlier layers use ReLU as the activation function and the last layer uses Softmax for classification).
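A quick demonstration, on the assumed toy loss f(w) = w², of how too large a learning rate diverges while a small one converges (the specific rates are illustrative):

```python
def gd(lr, steps=30, w0=1.0):
    # Gradient descent on f(w) = w**2 (gradient 2w): each step multiplies
    # w by (1 - 2*lr), so |1 - 2*lr| > 1 means the iterates blow up.
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

small = gd(0.1)  # factor 0.8 per step: converges toward 0
big = gd(1.1)    # factor -1.2 per step: diverges
print(abs(small), abs(big))
```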
Bad network structure
Switch to a different optimization algorithm
I ran into this in a test once: Adam did not converge, but the simplest SGD did. The specific reason is unknown.
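Swapping the optimizer is usually a one-line change. A numpy sketch of the plain SGD and Adam update rules on an assumed toy quadratic loss (not the network from the anecdote):

```python
import numpy as np

def grad(w):
    # gradient of the toy loss f(w) = 0.5 * ||w||**2
    return w

def sgd_step(w, g, state, lr=0.1):
    return w - lr * g, state  # plain SGD ignores the optimizer state

def adam_step(w, g, state, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * g      # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g * g  # second-moment estimate
    m_hat = m / (1 - b1 ** t)      # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

def train(step, w0, steps=200):
    w, state = w0.copy(), (np.zeros_like(w0), np.zeros_like(w0), 0)
    for _ in range(steps):
        w, state = step(w, grad(w), state)
    return w

w0 = np.array([3.0, -2.0])
print(np.linalg.norm(train(sgd_step, w0)))
print(np.linalg.norm(train(adam_step, w0)))
```

On this toy loss both reach the neighborhood of the minimum; the point is that the training loop is identical and only the step function changes.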
Normalize the data
That is, standardize the inputs to mean 0 and variance 1, and use batch normalization (BN) inside the network, and so on.
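A minimal sketch of input standardization, fitting the statistics on the training set only and reusing them on the test set (the data here is made up):

```python
import numpy as np

def standardize(x_train, x_test, eps=1e-8):
    # Fit mean/std on the training set only, then apply the same
    # transform to the test set so both share one distribution.
    mu = x_train.mean(axis=0)
    sigma = x_train.std(axis=0) + eps
    return (x_train - mu) / sigma, (x_test - mu) / sigma

x_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
x_test = np.array([[2.0, 250.0]])
tr, te = standardize(x_train, x_test)
print(tr.mean(axis=0))  # ~[0, 0]
print(tr.std(axis=0))   # ~[1, 1]
```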
Modify the initialization scheme

Overfitting: how to solve it
Increase the amount of data in the training set
For images, the training set can be enlarged by translation, flipping, and adding noise
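A sketch of those three augmentations with numpy, treating a small array as a stand-in image; the shift size and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, shift=1, noise_std=0.05):
    """Cheap augmentations for image data: horizontal flip,
    a small translation, and additive Gaussian noise."""
    flipped = img[:, ::-1]                 # mirror left-right
    shifted = np.roll(img, shift, axis=1)  # translate by `shift` pixels
    noisy = img + rng.normal(0.0, noise_std, img.shape)
    return flipped, shifted, noisy

img = np.arange(9, dtype=float).reshape(3, 3)  # stand-in 3x3 grayscale image
flipped, shifted, noisy = augment(img)
print(flipped[0])  # first row reversed
```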
Use the ReLU activation function
Dropout
On each training iteration, a random subset of nodes is selected for training and weight updates while the remaining weights stay unchanged. At test time, the full network is used, which approximates averaging over the ensemble of thinned networks.
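A sketch of inverted dropout, the common variant in which the surviving units are rescaled during training so that the test-time network needs no extra scaling:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, train=True):
    """Inverted dropout: during training, zero each unit with probability p
    and scale the survivors by 1/(1-p); at test time the layer is the
    identity, which matches averaging over the thinned networks."""
    if not train:
        return x
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)

x = np.ones(10000)
out = dropout(x, p=0.5, train=True)
print(out.mean())                      # ~1.0: scaling keeps the expectation
print(dropout(x, train=False).mean())  # exactly 1.0 at test time
```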
Regularization
This also simplifies the network; add an L2-norm penalty to the loss.
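A sketch of an L2 penalty added to a squared-error loss (ridge-style; the toy data, learning rate, and lambda are assumptions):

```python
import numpy as np

def loss_and_grad(w, x, y, lam=0.01):
    """Squared error plus an L2 penalty lam * ||w||^2. The penalty adds
    2 * lam * w to the gradient, shrinking weights toward zero (weight
    decay), which effectively simplifies the fitted network."""
    err = x @ w - y
    loss = (err ** 2).mean() + lam * (w ** 2).sum()
    grad = 2 * x.T @ err / len(y) + 2 * lam * w
    return loss, grad

# Penalized training pulls the weights toward zero compared with lam = 0,
# whose exact solution on this data is w = [1, 2].
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.zeros(2)
for _ in range(500):
    _, g = loss_and_grad(w, x, y, lam=0.1)
    w -= 0.1 * g
print(w)  # shrunk relative to the unpenalized solution [1, 2]
```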
Terminate training early

How to improve results / the model may be underfitting
Increase the number of features
For example, when the original input was only the coordinate position, the model would not fit; after adding a color feature it did.
Reduce the regularization term; revisit the parameter initialization
Optimization methods: http://m.blog.csdn.net/shwan_ma/article/details/76257967
http://www.cnblogs.com/wuxiangli/p/7258510.html