An important reason for introducing activation functions in neural networks is to introduce nonlinearity.

1.sigmoid
Mathematically, the nonlinear sigmoid function has a large signal gain in its central region and a small signal gain on both sides. From the point of view of neuroscience, the central region resembles the excited state of a neuron, while the two sides resemble the inhibited state, so in neural network training the key features can be pushed toward the central region and the non-key features toward the two sides.
The sigmoid function has the form f(x) = 1 / (1 + e^(-x)). Its advantage is that the output range is (0, 1), so it can be used in the output layer, with the output value interpreted as a probability. It is also called the logistic function; there is a binary-classification application called logistic regression that uses the sigmoid function to obtain a probability value. Its derivative is also very convenient to compute: f'(x) = f(x)(1 - f(x)). The figure below shows sigmoid and its derivative:
We can see that when x >> 0 the value of sigmoid approaches 1, and when x << 0 it approaches 0. In addition, the gradient at both ends of the function is small, which is the main disadvantage of sigmoid: at these x values the gradient saturates easily, so the parameters cannot be updated or are updated very slowly.
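As a small illustration (not from the original post), here is a minimal NumPy sketch of sigmoid and its derivative; it also shows numerically how the gradient vanishes for large |x|:

```python
import numpy as np

def sigmoid(x):
    # f(x) = 1 / (1 + e^(-x)), output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # f'(x) = f(x) * (1 - f(x)), largest at x = 0 (value 0.25)
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # approaches 0 on the left and 1 on the right
print(sigmoid_grad(x))  # gradient is tiny at both ends (saturation)
```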
2.tanh

Tanh has the form f(x) = (e^x - e^(-x)) / (e^x + e^(-x)). Its basic properties are much the same as sigmoid's, but it maps values to the interval (-1, 1). Although it is also nonlinear, gradient saturation still occurs, just later than with the sigmoid function. Its function image is as follows:
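A hedged sketch comparing tanh with sigmoid (my own illustration, not from the post): tanh maps values to (-1, 1) and, like sigmoid, saturates for large |x|, only somewhat later.

```python
import numpy as np

def tanh(x):
    # f(x) = (e^x - e^(-x)) / (e^x + e^(-x)); np.tanh computes the same thing
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(tanh(x))                # values lie in (-1, 1)
print(1.0 - np.tanh(x) ** 2)  # derivative 1 - tanh(x)^2 also shrinks at both ends
```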
3.ReLU
ReLU, the rectified linear unit, is a piecewise-linear activation function. It was proposed to eliminate the gradient saturation described above, and its gradient is also very easy to compute. Nowadays neural networks use ReLU as the default activation function. It is written as f(x) = max(0, x). Its function image is:
It has the characteristic of one-sided suppression: inputs below 0 are suppressed to 0, while everywhere else the input passes through and the unit is activated.
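For completeness, a one-line NumPy sketch of ReLU (illustrative only):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): inputs below 0 are suppressed, positive inputs pass through
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))  # [0.  0.  0.  0.5 3. ]
```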
4.maxout

Maxout is actually a new type of activation function. In a feedforward neural network, the output of a maxout unit is the maximum over a group of pre-activations in that layer; in a convolutional neural network, a maxout feature map is obtained by taking the maximum over several feature maps.
Maxout has very strong fitting ability: it can fit any convex function. However, just as dropout needs a drop probability, maxout needs a hyperparameter k to be set.
For ease of understanding, assume a neural network in which layer i has 2 nodes and layer (i+1) has 1 node.
The activation value is out = f(w·x + b), where f is the activation function and '·' denotes the inner product.
When we apply maxout (with k = 5) to layer (i+1) before producing its output, the situation changes.
The network now takes the form shown above; expressed as formulas:
z1 = w1·x + b1
z2 = w2·x + b2
z3 = w3·x + b3
z4 = w4·x + b4
z5 = w5·x + b5
out = max(z1, z2, z3, z4, z5)
That is, layer (i+1) computes 5 pre-activation values, but we only need 1 activation value, so what should we do? The formulas above already give the answer: take the maximum of the 5 values as the final result.
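A minimal NumPy sketch of this k = 5 example (the dimensions are assumed for illustration: x has 2 components, matching the 2 nodes of layer i):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2)        # the 2 nodes of layer i

# k = 5 groups of parameters for the single node of layer (i+1)
W = rng.normal(size=(5, 2))   # w1 ... w5
b = rng.normal(size=5)        # b1 ... b5

z = W @ x + b                 # z1 ... z5, five candidate pre-activations
out = np.max(z)               # the maxout activation keeps only the largest
print(z, out)
```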
Summing up, maxout obviously increases the amount of computation in the network, and the number of parameters in a layer that uses maxout grows k-fold: where originally 1 group of parameters was enough, with maxout k groups are needed.
Next, consider a slightly more complex network that applies maxout; the network diagram is as follows:
To describe the figure above: layer i has 3 nodes (the red dots), and layer (i+1) has 4 nodes (the colored dots); maxout with k = 3 is used in layer (i+1). We can see that each node in layer (i+1) computes 3 candidate values, and the maximum of those 3 values is the final activation of the corresponding node. The main purpose of this example is to illustrate that the activation value is determined per node, not per layer.
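To make the per-node point concrete, here is a hedged sketch of a full maxout layer: layer i has 3 inputs, layer (i+1) has 4 nodes, and each of the 4 nodes takes the maximum over its own k = 3 candidates. The shapes and random values are my own assumptions, chosen to mirror the figure.

```python
import numpy as np

def maxout_layer(x, W, b):
    # x: (d_in,), W: (k, d_out, d_in), b: (k, d_out)
    # Each output node j gets k candidates z[:, j]; its activation is their maximum,
    # so the max is taken per node, not once for the whole layer.
    z = np.einsum('kji,i->kj', W, x) + b  # shape (k, d_out)
    return z.max(axis=0)                  # shape (d_out,)

rng = np.random.default_rng(0)
d_in, d_out, k = 3, 4, 3                  # 3 input nodes, 4 output nodes, k = 3
x = rng.normal(size=d_in)
W = rng.normal(size=(k, d_out, d_in))     # k groups of parameters: k times as many as usual
b = rng.normal(size=(k, d_out))
print(maxout_layer(x, W, b))              # 4 activation values, one per node
```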