Nonlinear activation functions and unsupervised pre-training in deep learning

1 Nonlinear activation functions

Sigmoid: $\frac{1}{1+e^{-x}}$
Tanh: $\tanh(x)$
Softplus: $\log(1+e^{x})$
ReLU: $\max(x, 0)$
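A minimal NumPy sketch of these four functions (purely illustrative; the sample inputs are made-up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softplus(x):
    return np.log1p(np.exp(x))   # log(1 + e^x); log1p improves numerical accuracy

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-3.0, 3.0, 7)    # made-up sample inputs
for name, f in [("sigmoid", sigmoid), ("tanh", tanh), ("softplus", softplus), ("relu", relu)]:
    print(name, np.round(f(x), 3))
```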

1.1 Effects

Without a nonlinear activation function, each layer's output is simply a linear function of its input. It is easy to verify that, no matter how many layers the network has, the output is then just a linear combination of the input, so the whole network could be replaced by a single layer, i.e., the original perceptron.
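A minimal numeric sketch of this point (the weight matrices and sizes below are made-up): stacking several purely linear layers is equivalent to one linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for three stacked *linear* layers (no activation function).
W1, W2, W3 = (rng.normal(size=(4, 4)) for _ in range(3))
x = rng.normal(size=4)

deep_output = W3 @ (W2 @ (W1 @ x))       # three "layers" applied in sequence
single_layer = (W3 @ W2 @ W1) @ x        # one equivalent linear layer

print(np.allclose(deep_output, single_layer))  # True: depth adds nothing without nonlinearity
```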
Introducing a nonlinear activation function makes a deep neural network meaningful: it can model far more complicated functions.

1.2 ReLU vs. sigmoid/tanh

With sigmoid or tanh as the activation function, the computation is expensive, and computing the derivatives during backpropagation of the error gradient is also costly.
Sigmoid and tanh saturate easily, so the gradient vanishes; near convergence the updates become very slow and information is lost.
ReLU drives some neurons to output exactly 0, which yields sparsity; this not only eases overfitting but is also closer to how real neurons activate.

All of this bridges to the next topic: pre-training.
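Before moving on, here is a minimal NumPy sketch of the saturation and sparsity points above (the input values are made-up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])      # made-up pre-activations

# Sigmoid saturates: its derivative s*(1-s) is close to 0 for large |x|,
# which is what makes the error gradient vanish during backpropagation.
s = sigmoid(x)
print("sigmoid derivative:", s * (1 - s))         # ~0 at x = -10 and x = 10

# ReLU: derivative is 1 where x > 0 and 0 elsewhere; the exact zeros give sparsity.
r = np.maximum(x, 0.0)
print("relu output:", r)
print("fraction of inactive (zero) units:", np.mean(r == 0))
```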

2 Pre-training in deep learning

2.1 Why pre-train?

Deep networks have the following drawbacks:

The deeper the network, the more training samples it needs. Purely supervised training therefore requires a large number of labeled samples, and small sample sets easily lead to overfitting. (A deeper network means more features, and machine learning with many features calls for (1) more samples, (2) regularization, (3) feature selection.)
Optimizing the parameters of a multi-layer neural network is a high-order non-convex optimization problem, and convergence to a poor local solution is common.
Gradient diffusion. The gradient computed by the BP algorithm shrinks as it propagates toward the earlier layers, so the front layers contribute little and their parameters update slowly (see the sketch after this list).
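A minimal sketch of the gradient-diffusion point, assuming a toy fully connected sigmoid network (the depth, width, and weight scale below are made-up):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth, width = 8, 32
# Weights drawn at the common 1/sqrt(width) scale (an assumption for this toy example).
weights = [rng.normal(0, width ** -0.5, (width, width)) for _ in range(depth)]

# Forward pass, keeping each layer's activation for backpropagation.
h = rng.random(width)
activations = [h]
for W in weights:
    h = sigmoid(W @ h)
    activations.append(h)

# Backward pass from an arbitrary error signal at the output.
grad = rng.normal(size=width)
for layer in range(depth - 1, -1, -1):
    a = activations[layer + 1]
    grad = weights[layer].T @ (grad * a * (1 - a))   # chain rule through the sigmoid
    print(f"layer {layer}: gradient norm = {np.linalg.norm(grad):.2e}")
# The norm typically shrinks layer by layer, so the front layers receive tiny updates.
```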

Solution: greedy layer-wise training. Unsupervised pre-training first trains the first hidden layer of the network, then the second, and so on; finally, the values of these trained parameters are used as the initial values of the whole network's parameters. In short: unsupervised learning $\rightarrow$ initial parameter values; supervised learning $\rightarrow$ fine-tuning, i.e., training on labeled samples. Pre-training tends to reach a better local optimum.

2.2 Common pre-training methods

Stacked RBMs
Stacked sparse autoencoders
Stacked denoising autoencoders
(a minimal sketch of the layer-wise procedure follows)
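The sketch below illustrates greedy layer-wise pre-training with plain tied-weight autoencoders on made-up data; it only shows the procedure (a real stack would add denoising or sparsity terms, mini-batches, and the supervised fine-tuning stage on top):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_autoencoder(X, n_hidden, lr=0.5, epochs=200):
    """Train one tied-weight autoencoder to reconstruct X; return its encoder parameters."""
    n_in = X.shape[1]
    W = rng.normal(0, 0.1, (n_in, n_hidden))   # encoder weights; the decoder reuses W.T
    b = np.zeros(n_hidden)                     # encoder bias
    c = np.zeros(n_in)                         # decoder bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b)                 # encode
        R = sigmoid(H @ W.T + c)               # decode (tied weights)
        dR = (R - X) * R * (1 - R)             # grad of squared error w.r.t. decoder pre-activation
        dH = (dR @ W) * H * (1 - H)            # backpropagated to encoder pre-activation
        W -= lr * (X.T @ dH + dR.T @ H) / len(X)
        b -= lr * dH.mean(axis=0)
        c -= lr * dR.mean(axis=0)
    return W, b

# Unsupervised pre-training: layer 1 on the raw data, layer 2 on layer 1's codes.
X = rng.random((500, 20))                      # made-up unlabeled data
W1, b1 = pretrain_autoencoder(X, n_hidden=10)
H1 = sigmoid(X @ W1 + b1)
W2, b2 = pretrain_autoencoder(H1, n_hidden=5)

# Supervised fine-tuning would now start from (W1, b1, W2, b2) as the initial
# values of the full network's parameters and continue on labeled samples.
print("pre-trained layer shapes:", W1.shape, W2.shape)
```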

2.3 Why does unsupervised pre-training help deep learning?

(This part is compiled from the paper by D. Erhan et al.: http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_ErhanCBV10.pdf)

2.3.1 The effect of pre-training

1. Test error, i.e., generalization ability (statistics collected over many random choices of the initial point):

With pre-training: small test error; as depth increases, robustness is better and variance is smaller.
Without pre-training: large test error; as depth increases, robustness is poor and the probability of converging to a bad local optimum grows.

2. From the feature perspective:
After fine-tuning, the change in the neural network weights is very small; the weights seem to be trapped in a local region. Moreover, the first layer changes the least, the second layer somewhat more, and the last layer the most. This suggests that the shallow-layer weights confine the whole parameter set to a certain region, i.e., the shallow weights have a large influence on the result; but under the BP algorithm the gradient vanishes in the shallow layers, so their weights are hard to change.

3. Model trajectory:
(The weights are projected with the ISOMAP embedding algorithm; as the number of iterations increases, the color goes from dark blue to cyan.)
There are many local optima. Without pre-training, different initial values converge to different local points, and the convergence points are scattered; with pre-training, the trajectories are biased toward certain regions and the convergence points are more concentrated.
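A minimal sketch of this kind of trajectory visualization, assuming flattened weight vectors have been recorded at each iteration (the snapshots below are a made-up random walk standing in for real training runs):

```python
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)

# Made-up stand-in: 200 snapshots of a 1000-dimensional weight vector,
# one snapshot per training iteration.
weight_snapshots = np.cumsum(rng.normal(size=(200, 1000)), axis=0)

# Project the high-dimensional trajectory to 2-D for plotting
# (dark blue = early iterations, cyan = late iterations in the original figure).
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(weight_snapshots)
print("first point:", embedding[0], "last point:", embedding[-1])
```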

2.3.2 The role of pre-training in deep network learning

Hypothesis 1: Pre-training places the weights in the neighborhood of good parameters, and fine-tuning stays nearby (tested by initializing with random parameters that have the same statistical characteristics as the pre-trained ones; the results are not as good as with actual pre-training).
Hypothesis 2: Pre-training makes the optimization process more effective (evaluated via the training error; in fact, as iterations proceed, the pre-trained network does not reach a lower training error).
Hypothesis 3: Pre-training acts like a regularizer on the weights (judged by test error, pre-training is more effective for networks with more nodes and deeper architectures).

However, pre-training as a regularizer differs from the classical regularizers (L1/L2): the more training data there is, the better the pre-trained result.
The paper also shows that pre-training-based deep networks perform best when the data is large and the network is deep with many nodes.

References

http://blog.sina.com.cn/s/blog_628b77f80102v41z.html
http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2010_ErhanCBV10.pdf
https://www.researchgate.net/publication/259399568_Why_does_the_Unsupervised_pretraining_encourage_moderate-sparseness
