Why should I introduce an activation function?
If you do not use an activation function (which is equivalent to using the identity f(x) = x), then the output of every layer is just a linear function of the input from the layer above it. It is easy to verify that no matter how many layers the network has, the final output is still a linear combination of the inputs: the hidden layers have no effect, and you are back to the original perceptron.
For this reason we introduce a nonlinear function as the activation (excitation) function, so that a deep neural network actually becomes meaningful: its output is no longer just a linear combination of the inputs, and it can approximate essentially arbitrary functions. The earliest choices were the sigmoid and tanh functions, whose outputs are bounded and therefore easy to feed into the next layer as input. In short, the role of the activation function is to add nonlinearity to the model. Without it, each layer is equivalent to a matrix multiplication, and however many layers you stack, the result is still nothing more than a single matrix multiplication; without a nonlinearity there is really no "neural network" at all, as the sketch below makes concrete.
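As a quick illustration (a minimal numpy sketch with made-up weights and shapes, not code from the original), stacking several layers without an activation collapses into a single matrix multiplication:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three "layers" with no activation function: each is just a matrix multiply.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 6))
W3 = rng.normal(size=(6, 3))

x = rng.normal(size=(1, 4))            # a single input row vector

# Passing the input through the three linear layers one after another...
deep_output = x @ W1 @ W2 @ W3

# ...is exactly the same as one linear layer whose weight matrix is W1 @ W2 @ W3.
collapsed_W = W1 @ W2 @ W3
single_output = x @ collapsed_W

print(np.allclose(deep_output, single_output))   # True: the extra depth added nothing
```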
The sigmoid function "squashes" any continuous real-valued input into the interval (0, 1): a very large negative input gives an output close to 0, and a very large positive input gives an output close to 1.
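A minimal sketch of this squashing behavior, assuming the standard definition sigmoid(x) = 1 / (1 + e^(-x)):

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)): squashes any real input into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# ≈ [4.5e-05, 0.5, 0.99995]: large negative inputs go to ~0, large positive to ~1
```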
The sigmoid function used to be very widely used, but in recent years far fewer people use it, mainly because of a few shortcomings:
When the input is very large or very small, the sigmoid saturates and the gradient of these neurons is close to 0 (the sketch after this list shows the trend);
The output of the sigmoid is not zero-centered, so the neurons in the next layer receive a signal with a non-zero mean as their input. (Why does that matter? If all the inputs to a neuron are positive, the gradients of all its weights share the same sign during backpropagation, which forces inefficient zig-zag updates.)
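Both shortcomings are easy to see numerically. The sketch below uses the standard derivative sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)); the sample data is made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), never larger than 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(np.array([0.0, 5.0, 10.0])))
# ≈ [0.25, 0.0066, 4.5e-05]: the gradient dies off quickly away from 0

x = np.random.default_rng(0).normal(size=10000)
print(sigmoid(x).mean())   # ≈ 0.5, never 0: the output is not zero-centered
```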
Tanh is a rescaled sigmoid; unlike sigmoid, its output is zero-centered, so in practice tanh usually works better than sigmoid.
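A small sketch of the relationship and of the zero-centering difference, using the standard identity tanh(x) = 2 * sigmoid(2x) - 1 (the sample data is made up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))   # True

# Its output lies in (-1, 1) and is centered at 0, unlike sigmoid's (0, 1)
samples = np.random.default_rng(0).normal(size=10000)
print(np.tanh(samples).mean(), sigmoid(samples).mean())        # ≈ 0.0 vs ≈ 0.5
```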
ReLU function
Since ReLU(x) = max(0, x), a negative input produces an output of 0, and a non-negative input is passed through unchanged. This gives ReLU several advantages (a minimal implementation follows this list):
It solves the vanishing-gradient problem, at least in the positive interval;
It is very cheap to compute: you only need to check whether the input is greater than 0;
In practice it converges much faster than sigmoid and tanh.
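A minimal numpy implementation of the forward and backward behavior described above (the function names are my own):

```python
import numpy as np

def relu(x):
    # forward pass: ReLU(x) = max(0, x), a single element-wise comparison
    return np.maximum(0.0, x)

def relu_grad(x):
    # backward pass: the gradient is 1 where x > 0 and 0 elsewhere, so it never
    # shrinks the upstream gradient in the positive interval
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))   # [0. 0. 0. 1. 1.]
```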
ReLU also has a couple of issues that require special attention:
1. The output of ReLU is not zero-centered either.
2. The dead ReLU problem: some neurons may never be activated again, so their parameters are never updated. There are two main causes: (1) a very unlucky parameter initialization, which is rare; (2) a learning rate that is too high, so that a parameter update during training is too large and knocks the neuron into this state. Workarounds are to use Xavier initialization, to avoid setting the learning rate too high, or to let an adaptive method such as Adagrad adjust the learning rate automatically. The toy example below shows how a dead neuron gets stuck.
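A toy illustration of a dead neuron, with deliberately extreme made-up numbers (a bias of -100 stands in for the effect of one oversized update):

```python
import numpy as np

# A toy "dead" ReLU neuron: a huge negative bias keeps the pre-activation
# negative for every plausible input.
rng = np.random.default_rng(0)
W, b = rng.normal(size=3), -100.0

for _ in range(5):
    x = rng.normal(size=3)
    z = W @ x + b                     # pre-activation is always << 0
    grad_mask = 1.0 if z > 0 else 0.0
    # The gradient w.r.t. W and b is the upstream gradient times grad_mask,
    # i.e. 0 every step, so W and b can never recover: the neuron stays inactive.
    print(z, grad_mask)
```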
Why introduce ReLU?
First: with sigmoid and similar functions, evaluating the activation itself requires an exponential, which is expensive, and backpropagating the error gradient involves a division, which adds further cost. With ReLU, the whole forward and backward computation is much cheaper.
Second: for deep networks, sigmoid easily causes gradients to vanish during backpropagation. Near its saturation region the sigmoid changes very slowly and its derivative tends to 0, so information is lost and the training of a deep network cannot be completed (the sketch after this list makes the effect concrete).
Third: ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence between parameters, and helps alleviate overfitting.
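A rough numerical sketch of the second point, under the simplifying assumption that backpropagation multiplies one activation derivative per layer (the depth and pre-activations are made up, and the weight matrices are ignored):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

depth = 30
rng = np.random.default_rng(0)
z = rng.normal(size=depth)          # one made-up pre-activation per layer

# Backprop multiplies one activation derivative per layer.  Sigmoid's derivative
# is at most 0.25, so the product shrinks exponentially with depth; ReLU's
# derivative is exactly 1 for every positive pre-activation.
sig_grads = sigmoid(z) * (1.0 - sigmoid(z))
relu_grads = (z > 0).astype(float)

print(np.prod(sig_grads))           # vanishingly small after 30 layers
print(np.prod(relu_grads[z > 0]))   # 1.0 along the active path
```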
There are, of course, improved variants of ReLU, such as PReLU and Randomized ReLU, which can bring some improvement in training speed or accuracy on different data sets.
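A minimal sketch of this "leaky" family (the slope value 0.25 is just an example: in Leaky ReLU it is fixed, in PReLU it is learned, and in Randomized ReLU it is sampled during training):

```python
import numpy as np

def prelu(x, alpha=0.25):
    # Like ReLU for x > 0, but leaks alpha * x for x <= 0, so negative inputs
    # still carry a small gradient and the neuron cannot die completely.
    return np.where(x > 0, x, alpha * x)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(prelu(x))   # [-1.   -0.25  0.    1.    4.  ]
```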
The mainstream practice today is to rely more on batch normalization, to keep the input distribution of every layer as stable as possible. Recent work on networks with bypass (skip) connections has also found that changing where batch normalization is placed can give better results.
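A simplified sketch of the batch-normalization idea (training-time statistics only; the names gamma, beta, and eps are my own, and a real implementation also tracks running statistics for inference):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature over the batch to zero mean and unit variance,
    # then rescale (gamma) and shift (beta) so the layer keeps some flexibility.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

batch = np.random.default_rng(0).normal(loc=3.0, scale=5.0, size=(64, 16))
out = batch_norm(batch)
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])   # ≈ 0 and ≈ 1 per feature
```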
The basic principle of deep learning is the artificial neural network: a signal enters a neuron, passes through a nonlinear activation function, is sent to the next layer of neurons, passes through that layer's activation, and so on, repeating until the output layer. It is precisely this repeated stacking of nonlinear functions that gives a neural network enough capacity to capture complex patterns and to obtain state-of-the-art results in a wide range of fields. Clearly, activation functions are one of the most important ingredients of deep learning and an active area of research.
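Putting it together, a minimal forward pass through a few layers (layer sizes and weights are made up) just repeats "affine transform, then nonlinearity":

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
layer_sizes = [8, 16, 16, 4]                       # input, two hidden layers, output

# One weight matrix and bias vector per layer.
weights = [rng.normal(size=(m, n)) * 0.1 for m, n in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

signal = rng.normal(size=(1, 8))                   # the input "signal"
for i, (W, b) in enumerate(zip(weights, biases)):
    signal = signal @ W + b                        # affine transform
    if i < len(weights) - 1:                       # hidden layers get the activation
        signal = relu(signal)
print(signal)                                      # output of the final layer
```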
What exactly is the activation function in a neural network, and why is ReLU better than tanh and sigmoid?