1 Nonlinear activation functions
Sigmoid: $\frac{1}{1+e^{-x}}$; Tanh: $\tanh(x)$; Softplus: $\log(1+e^{x})$; ReLU: $\max(x, 0)$
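For reference, here is a minimal NumPy sketch of the four functions above (my own illustration, not code from the original post):

```python
# Illustrative sketch only: standalone NumPy implementations of the
# four activation functions defined above.
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^{-x}), squashes input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # tanh(x), squashes input into (-1, 1)
    return np.tanh(x)

def softplus(x):
    # log(1 + e^x), a smooth approximation of ReLU
    return np.log1p(np.exp(x))

def relu(x):
    # max(x, 0), zero for all negative inputs
    return np.maximum(x, 0.0)

if __name__ == "__main__":
    x = np.linspace(-5, 5, 5)
    for f in (sigmoid, tanh, softplus, relu):
        print(f.__name__, np.round(f(x), 3))
```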
1.1 Effects
Without a nonlinear activation function, each layer's output is a linear function of its input. It is easy to verify that, no matter how many layers the network has, the overall output is then just a linear combination of the input, so a single-layer network could achieve the same thing; this is the original perceptron. The sketch below checks this numerically.
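A small numerical check of this claim (an illustrative NumPy sketch with arbitrary shapes and values): two stacked linear layers collapse into a single equivalent linear layer.

```python
# Illustrative sketch only: without a nonlinearity, two stacked linear
# layers are equivalent to one layer with W = W2 @ W1, b = W2 @ b1 + b2.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2          # layer2(layer1(x))
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)    # the single equivalent layer

print(np.allclose(two_layers, one_layer))     # True
```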
Introducing a nonlinear activation function makes a deep neural network genuinely more expressive and able to model more complicated functions.
1.2 ReLU vs. sigmoid/tanh
With sigmoid or tanh as the activation function, the amount of computation is large, and evaluating the derivatives during back-propagation of the error gradient is also expensive.
Sigmoid and tanh saturate easily, so the gradient vanishes: near convergence the parameters change too slowly and information is lost.
ReLU sets the output of some neurons to 0, producing sparsity. This not only eases overfitting but is also closer to how real neurons activate, and it narrows the gap with pre-trained networks.
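The following sketch (again my own NumPy illustration with arbitrarily chosen inputs) makes the saturation and sparsity points concrete: the sigmoid and tanh derivatives are nearly zero for large |x|, while ReLU zeroes out roughly half of zero-mean pre-activations.

```python
# Illustrative sketch only, not the post's code:
# 1) sigmoid/tanh gradients vanish for large |x| (saturation);
# 2) ReLU zeroes out negative pre-activations, giving sparse outputs.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))   # ~0 at |x| = 10
tanh_grad = 1.0 - np.tanh(x) ** 2                # ~0 at |x| = 10
print("sigmoid'(x):", np.round(sigmoid_grad, 5))
print("tanh'(x):   ", np.round(tanh_grad, 5))

pre_activations = np.random.default_rng(0).normal(size=1000)
relu_out = np.maximum(pre_activations, 0.0)
print("fraction of zeros after ReLU:", np.mean(relu_out == 0.0))  # roughly 0.5
```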
2 About pre-training in deep learning
2.1 Why pre-training
Deep networks have the following drawbacks:
- The deeper the network, the more training samples are needed. Purely supervised training requires a large number of labeled samples, and small sample sets easily lead to overfitting. (A deep network means more features; when facing many features, machine learning needs: 1. more samples, 2. regularization, 3. feature selection.)
- Optimizing the parameters of a multi-layer neural network is a high-order non-convex optimization problem, and convergence to a poor local solution is common.
- Gradient diffusion: the gradient computed by the BP algorithm shrinks as it propagates back toward the earlier layers, so the front layers' parameters contribute little and update slowly (see the sketch after this list).
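A rough numerical caricature of gradient diffusion (a sketch under strong simplifying assumptions: each layer multiplies the back-propagated gradient by a random weight magnitude times the maximum sigmoid derivative, 0.25):

```python
# Illustrative sketch only: the back-propagated gradient is a product of
# per-layer factors; with sigmoid units (derivative <= 0.25) that product
# shrinks rapidly with depth.
import numpy as np

rng = np.random.default_rng(0)
depth = 20
grad = 1.0
for layer in range(depth):
    w = rng.normal(scale=1.0)        # a typical weight
    local_sigmoid_grad = 0.25        # maximum possible sigmoid derivative
    grad *= abs(w) * local_sigmoid_grad
    if layer in (0, 4, 9, 19):
        print(f"after {layer + 1:2d} layers, gradient magnitude ~ {grad:.2e}")
```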
Solution: greedy layer-wise training. Unsupervised pre-training first trains the network's first hidden layer, then the second, and so on; the values of these layer-wise trained parameters are then used as the initial values of the overall network parameters. Unsupervised learning $\rightarrow$ initial parameter values; supervised learning $\rightarrow$ fine-tuning, i.e., training on labeled samples. Pre-training makes it possible to reach a better local optimum. A minimal sketch of the procedure follows.
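Below is a minimal sketch of greedy layer-wise pre-training using tiny tied-weight autoencoders in NumPy (my own illustration; a plain autoencoder stands in for the RBM / sparse / denoising variants listed in 2.2 below, the toy data, layer sizes, and hyperparameters are arbitrary, and the supervised fine-tuning step is omitted):

```python
# Illustrative sketch only: greedy layer-wise unsupervised pre-training
# with tiny tied-weight autoencoders; toy data and hyperparameters.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layer(data, n_hidden, epochs=200, lr=0.1):
    """Train a tied-weight autoencoder on `data`; return encoder parameters."""
    n_visible = data.shape[1]
    W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
    b_h = np.zeros(n_hidden)   # hidden (encoder) bias
    b_v = np.zeros(n_visible)  # visible (decoder) bias
    for _ in range(epochs):
        h = sigmoid(data @ W + b_h)        # encode
        recon = sigmoid(h @ W.T + b_v)     # decode with tied weights
        err = recon - data                 # reconstruction error
        # backprop of the squared reconstruction loss
        d_recon = err * recon * (1.0 - recon)
        d_h = (d_recon @ W) * h * (1.0 - h)
        grad_W = data.T @ d_h + d_recon.T @ h   # encoder + decoder terms
        W -= lr * grad_W / len(data)
        b_h -= lr * d_h.mean(axis=0)
        b_v -= lr * d_recon.mean(axis=0)
    return W, b_h

# toy unlabeled data: 200 samples, 8 features in [0, 1]
X = rng.uniform(size=(200, 8))

# greedy layer-wise pre-training: train layer 1, then layer 2 on layer 1's codes
layer_sizes = [6, 4]
inputs, init_weights = X, []
for n_hidden in layer_sizes:
    W, b = pretrain_layer(inputs, n_hidden)
    init_weights.append((W, b))
    inputs = sigmoid(inputs @ W + b)       # codes become the next layer's input

# `init_weights` would initialize the deep network before supervised
# fine-tuning on labeled samples (fine-tuning itself is omitted here).
print([W.shape for W, _ in init_weights])  # [(8, 6), (6, 4)]
```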
2.2 Common pre-training methods
Stacked RBM
Stacked sparse autoencoder
Stacked denoising autoencoder
2.3 Why does unsupervised pre-training help deep learning?
(This part is compiled from the paper by D. Erhan et al.: http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_ErhanCBV10.pdf)
2.3.1 The effect of pre-training
1. Test error, which reflects generalization ability (statistics gathered over many randomly chosen initial points):
| With pre-training | Without pre-training |
| --- | --- |
| Small test error | Large test error |
| Better robustness and smaller variance as depth increases | Robustness degrades as depth increases, and the chance of a poor local optimum grows |
2. Feature perspective
After fine-tuning, the change in the network weights is very small; the weights appear trapped in a local region. The first layer changes the least, the second a little more, and the last layer the most. This suggests that the shallow-layer weights constrain the overall parameters to a certain range, i.e., the shallow weights have a large influence on the result; yet because the BP gradient vanishes on its way back, the shallow-layer weights are hard to change.
3. Model Trajectory
(The weights are visualized with the ISOMAP embedding; as the number of iterations increases, the color goes from dark blue to cyan.)
There are many local optima. Without pre-training, different initial values converge to different local points and the convergence points are scattered; with pre-training, the trajectories are biased toward certain regions and the convergence points are more concentrated.
2.3.2 The role of pre-training in deep network learning
Hypothesis 1: Pre-training puts the weight parameters in the neighborhood of good parameters, so fine-tuning only adjusts them within that neighborhood. (Initializing with random weights that merely share the statistical characteristics of the pre-trained parameters does not reproduce the pre-training result.)
Hypothesis 2: Pre-training makes the optimization process more efficient. (This is assessed by measuring the training error over iterations and comparing against the network trained without pre-training.)
Hypothesis 3: Pre-training acts like a regularizer on the weights. (In terms of test error, pre-training is more effective for networks with more nodes and more layers.)
However, this regularization differs from classical regularizers (L1/L2): the more training samples there are, the better the result with pre-training becomes.
The paper also shows that deep learning networks based on pre-training work best with large data and deep networks with many nodes.
References
http://blog.sina.com.cn/s/blog_628b77f80102v41z.html
http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_ErhanCBV10.pdf
https://www.researchgate.net/publication/259399568_Why_does_the_Unsupervised_pretraining_encourage_moderate-sparseness