Neural Network Activation Functions, the Dropout Principle, and Batch Normalization: Code Implementation

Activation Functions of Neural Networks (Activation Function)

This blog is only for the author to keep notes, so there are inevitably many incorrect details.

I hope readers will forgive these; criticism and corrections are welcome.

For more related posts, see: http://blog.csdn.net/cyh_24

If you wish to repost this article, please include a link to the original: http://blog.csdn.net/cyh_24/article/details/50593400

In daily coding we naturally use activation functions such as Sigmoid, ReLU, and so on. But I seem to have forgotten to ask myself a few things: Why do we need activation functions? What activation functions are there? What do they look like? What are their pros and cons? How should we choose an activation function?

This article is organized around these questions; criticism and corrections are welcome.


Why use activation functions?

The activation function usually has the following properties:

• Nonlinearity: when the activation function is nonlinear, a two-layer neural network can approximate essentially any function. If the activation function is the identity (i.e. f(x) = x), this property does not hold: an MLP that uses only the identity activation is equivalent to a single-layer linear network (a minimal sketch of this collapse follows this list).
• Differentiability: this property is necessary when the optimization method is gradient-based.
• Monotonicity: when the activation function is monotonic, a single-layer network is guaranteed to correspond to a convex function.
• f(x) ≈ x: when the activation function satisfies this property near the origin and the parameters are initialized to small random values, training the neural network is very efficient; if the property is not satisfied, the initial values must be set very carefully.
• Range of output values: when the output of the activation function is bounded, gradient-based optimization tends to be more stable, because the feature representation is affected more strongly by the finite weights; when the output is unbounded, training can be more efficient, but a smaller learning rate is generally required.
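The nonlinearity point is easy to verify numerically. Below is a minimal NumPy sketch (my own illustration, not code from the original post): two linear layers with an identity activation collapse into a single linear layer, while inserting a nonlinearity such as tanh breaks the collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with identity activation: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses into a single linear layer:
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: no extra expressive power

# With a nonlinearity (e.g. tanh) between the layers, the collapse no longer holds.
nonlinear = W2 @ np.tanh(W1 @ x + b1) + b2
print(np.allclose(nonlinear, one_layer))  # False in general
```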

These properties are precisely why we use activation functions.

Activation Functions

Sigmoid

Sigmoid is a commonly used non-linear activation function, and its mathematical form is as follows:
f(x) = \frac{1}{1 + e^{-x}}

As mentioned above, it "compresses" a continuous real-valued input into the range between 0 and 1.
In particular, if the input is a very large negative number, the output is close to 0; if it is a very large positive number, the output is close to 1.
The sigmoid function used to be used a lot, but in recent years fewer people use it, mainly because of its disadvantages:

Sigmoids saturate and kill gradients. Sigmoid has a very serious drawback: when the input is very large or very small (saturation), the gradient of these neurons is close to 0, as the flat tails of its curve show. You therefore need to pay particular attention to the initial values of the parameters to avoid saturation as much as possible. If the initial weights are large, most neurons may be saturated and the gradients killed, which makes the network very hard to learn.

The output of sigmoid is not zero-mean. This is undesirable, because it causes neurons in the next layer to receive inputs whose mean is not 0.
One consequence is that if the data entering a neuron is always positive (e.g. x > 0 elementwise in f = w^T x + b), then during backpropagation the gradients on w will be either all positive or all negative (depending on the sign of the gradient flowing into f).
Of course, if you train with mini-batches, different samples within a batch can provide gradients of different signs, so the problem is mitigated. Although the non-zero-mean issue does have some bad effects, it is therefore much less severe than the killed-gradient problem described above.
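To make the saturation and non-zero-mean points concrete, here is a small NumPy sketch (my own addition, not from the original post) of sigmoid and its derivative:

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x)): squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """d sigma / dx = sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-20.0, -5.0, 0.0, 5.0, 20.0])
print(sigmoid(x))       # every output is positive, so the mean cannot be 0
print(sigmoid_grad(x))  # ~0 at both ends: saturated neurons "kill" the gradient
```

For the large-magnitude inputs the derivative is essentially 0, which is exactly the killed-gradient problem; and since every output is positive, the signal passed to the next layer cannot be zero-mean.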

Tanh

Tanh looks very similar to sigmoid; in fact, tanh is a rescaled sigmoid:

\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1
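As a quick check of this identity, the following NumPy snippet (my own addition) builds tanh out of sigmoid and compares it with np.tanh:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_via_sigmoid(x):
    """tanh(x) = 2 * sigmoid(2x) - 1, i.e. a sigmoid rescaled to (-1, 1)."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

x = np.linspace(-5.0, 5.0, 11)
print(np.allclose(tanh_via_sigmoid(x), np.tanh(x)))  # True
print(np.tanh(x).min(), np.tanh(x).max())            # outputs lie in (-1, 1)
```

Unlike sigmoid, the output range is (-1, 1) and is zero-centered, which is one reason tanh is often preferred over sigmoid in practice.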
