Neural Network Activation Functions, the Dropout Principle, and Batch Normalization: Code Implementation

Activation Functions of Neural Networks (Activation Function)

This blog is only for the author to keep notes, so there are inevitably many incorrect details.

I hope readers will forgive these; criticism and corrections are welcome.

For more related posts, see: http://blog.csdn.net/cyh_24

If you wish to repost this article, please include a link to the original: http://blog.csdn.net/cyh_24/article/details/50593400

In daily coding we naturally use activation functions such as Sigmoid, ReLU, and so on. But I seem to have forgotten to ask myself a few things: Why do we need activation functions? What activation functions are there? What do they look like? What are their pros and cons? How should we choose an activation function?

This article is organized around these questions; criticism and corrections are welcome.


Why use activation functions?

The activation function usually has the following properties:

• Nonlinearity: when the activation function is nonlinear, a two-layer neural network can approximate essentially any function. If the activation function is the identity (i.e. f(x) = x), this property does not hold: an MLP that uses only the identity activation is equivalent to a single-layer linear network (a minimal sketch of this collapse follows this list).
• Differentiability: this property is necessary when the optimization method is gradient-based.
• Monotonicity: when the activation function is monotonic, a single-layer network is guaranteed to correspond to a convex function.
• f(x) ≈ x: when the activation function satisfies this property near the origin and the parameters are initialized to small random values, training the neural network is very efficient; if the property is not satisfied, the initial values must be set very carefully.
• Range of output values: when the output of the activation function is bounded, gradient-based optimization tends to be more stable, because the feature representation is affected more strongly by the finite weights; when the output is unbounded, training can be more efficient, but a smaller learning rate is generally required.
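The nonlinearity point is easy to verify numerically. Below is a minimal NumPy sketch (my own illustration, not code from the original post): two linear layers with an identity activation collapse into a single linear layer, while inserting a nonlinearity such as tanh breaks the collapse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with identity activation: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping collapses into a single linear layer:
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: no extra expressive power

# With a nonlinearity (e.g. tanh) between the layers, the collapse no longer holds.
nonlinear = W2 @ np.tanh(W1 @ x + b1) + b2
print(np.allclose(nonlinear, one_layer))  # False in general
```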

These properties are precisely why we use activation functions.

Activation Functions

Sigmoid

Sigmoid is a commonly used non-linear activation function, and its mathematical form is as follows:
f(x) = \frac{1}{1 + e^{-x}}

As mentioned above, it "compresses" a continuous real-valued input into the range between 0 and 1.
In particular, if the input is a very large negative number, the output is close to 0; if it is a very large positive number, the output is close to 1.
The sigmoid function used to be used a lot, but in recent years fewer people use it, mainly because of its disadvantages:

Sigmoids saturate and kill gradients. Sigmoid has a very serious drawback: when the input is very large or very small (saturation), the gradient of these neurons is close to 0, as the flat tails of its curve show. You therefore need to pay particular attention to the initial values of the parameters to avoid saturation as much as possible. If the initial weights are large, most neurons may be saturated and the gradients killed, which makes the network very hard to learn.

The output of sigmoid is not zero-mean. This is undesirable, because it causes neurons in the next layer to receive inputs whose mean is not 0.
One consequence is that if the data entering a neuron is always positive (e.g. x > 0 elementwise in f = w^T x + b), then during backpropagation the gradients on w will be either all positive or all negative (depending on the sign of the gradient flowing into f).
Of course, if you train with mini-batches, different samples within a batch can provide gradients of different signs, so the problem is mitigated. Although the non-zero-mean issue does have some bad effects, it is therefore much less severe than the killed-gradient problem described above.
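To make the saturation and non-zero-mean points concrete, here is a small NumPy sketch (my own addition, not from the original post) of sigmoid and its derivative:

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x)): squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """d sigma / dx = sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-20.0, -5.0, 0.0, 5.0, 20.0])
print(sigmoid(x))       # every output is positive, so the mean cannot be 0
print(sigmoid_grad(x))  # ~0 at both ends: saturated neurons "kill" the gradient
```

For the large-magnitude inputs the derivative is essentially 0, which is exactly the killed-gradient problem; and since every output is positive, the signal passed to the next layer cannot be zero-mean.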

Tanh

Tanh looks very similar to sigmoid; in fact, tanh is a rescaled sigmoid:

\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1
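As a quick check of this identity, the following NumPy snippet (my own addition) builds tanh out of sigmoid and compares it with np.tanh:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_via_sigmoid(x):
    """tanh(x) = 2 * sigmoid(2x) - 1, i.e. a sigmoid rescaled to (-1, 1)."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

x = np.linspace(-5.0, 5.0, 11)
print(np.allclose(tanh_via_sigmoid(x), np.tanh(x)))  # True
print(np.tanh(x).min(), np.tanh(x).max())            # outputs lie in (-1, 1)
```

Unlike sigmoid, the output range is (-1, 1) and is zero-centered, which is one reason tanh is often preferred over sigmoid in practice.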
