The ICML 2016 paper [Noisy Activation Functions] gives a definition of an activation function: an activation function is a map h: R → R that is differentiable almost everywhere.
The main role of an activation function in a neural network is to give the network nonlinear modeling capacity; unless stated otherwise, an activation function is generally a nonlinear function. Suppose a neural network contained only linear convolution and fully connected operations. Then it could only express linear mappings; no matter how deep it was, the whole network would still be a linear mapping, and it could hardly model the nonlinearly distributed data found in real environments. With (nonlinear) activation functions added, a deep neural network gains the ability to learn layered nonlinear mappings.
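As a quick numerical illustration of this point (a minimal sketch with arbitrary example weights, written in the same MATLAB style as the code below): two stacked linear layers collapse into a single equivalent linear layer, while inserting a sigmoid between them breaks that equivalence.

% Two linear layers with no activation are equivalent to one linear layer.
W1 = [1 2; 3 4];  b1 = [1; 1];   % arbitrary example weights and biases
W2 = [0 1; 1 0];  b2 = [0; 2];
x  = [0.5; -1];                  % arbitrary example input

y_two_layers = W2*(W1*x + b1) + b2;     % "deep" network with two linear layers
W = W2*W1;  b = W2*b1 + b2;             % the single equivalent linear layer
y_one_layer  = W*x + b;
disp(norm(y_two_layers - y_one_layer)); % prints 0: the mappings are identical

% Adding a (nonlinear) sigmoid between the layers removes this equivalence.
sigmoid = @(z) 1./(1 + exp(-z));
y_nonlinear = W2*sigmoid(W1*x + b1) + b2;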
1. Sigmoid function
The sigmoid is one of the most widely used activation functions; it has an exponential shape. It is formally defined as:
f(x) = 1 / (1 + e^(-x))
Code:
x = -10:0.001:10;
% sigmoid and its derivative
sigmoid = 1./(1 + exp(-x));
sigmoidDer = exp(-x)./((1 + exp(-x)).^2);
figure;
plot(x, sigmoid, 'r', x, sigmoidDer, 'b--');
axis([-10 10 -1 1]);
grid on;
title('sigmoid function (solid line) and its derivative (dashed line)');
legend('sigmoid', 'sigmoid derivative');
set(gcf, 'NumberTitle', 'off');
set(gcf, 'Name', 'sigmoid function (solid line) and its derivative (dashed line)');
Output:
[Figure: sigmoid function (solid line) and its derivative (dashed line)]
As the plot shows, the sigmoid is differentiable everywhere on its domain, and its derivative approaches 0 as the input goes to either negative or positive infinity, i.e.:
lim(x→−∞) f'(x) = 0   and   lim(x→+∞) f'(x) = 0
Professor Bengio defines an activation function with this property as a soft-saturating activation function. By analogy with one-sided limits, soft saturation is further divided into left soft saturation and right soft saturation:
Left soft saturation:
lim(x→−∞) f'(x) = 0
Right soft saturation:
lim(x→+∞) f'(x) = 0
In contrast to soft saturation, a hard-saturating activation function satisfies f'(x) = 0 when |x| > c, where c is a constant.
Similarly, hard saturation is divided into left hard saturation and right hard saturation. The common ReLU is a left hard-saturating activation function.
The soft saturation of the sigmoid is an important reason why deep neural networks were difficult to train effectively for twenty to thirty years, and it hindered the development of neural networks. Specifically, during backpropagation the gradient passed downward contains a factor of f'(x) (the derivative of the sigmoid with respect to its input). Once the input falls into the saturation region, f'(x) approaches 0, so the gradient reaching the lower layers becomes very small and the network parameters can no longer be trained effectively. This phenomenon is called the vanishing gradient. In general, a sigmoid network exhibits vanishing gradients within about 5 layers [Understanding the difficulty of training deep feedforward neural networks]. The vanishing gradient problem still exists, but it has been effectively alleviated by newer optimization methods, with layer-wise pre-training in DBNs, per-layer normalization via Batch Normalization, and Xavier and MSRA weight initialization as representative techniques.
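A minimal numerical sketch of the effect (assuming, purely for illustration, that every layer contributes one sigmoid-derivative factor at the same pre-activation value): since the sigmoid derivative never exceeds 0.25, the product of these factors shrinks rapidly with depth.

% How a chain of sigmoid-derivative factors shrinks the backpropagated gradient.
sigmoidDer = @(x) exp(-x)./((1 + exp(-x)).^2);   % maximum value is 0.25 at x = 0

x = 2;          % example pre-activation at every layer (illustrative value)
g = 1;          % gradient arriving at the top layer
for layer = 1:5
    g = g * sigmoidDer(x);   % each layer multiplies in one f'(x) factor
    fprintf('after layer %d: gradient factor = %.6f\n', layer, g);
end
% With x = 2, sigmoidDer(x) is about 0.105, so after 5 layers the factor is
% already on the order of 1e-5: the gradient has effectively vanished.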
Although the saturation of the sigmoid causes gradients to vanish, it also has its beneficial side. For example, it is the activation closest to biological neurons in a physical sense. Its output in (0, 1) can also be interpreted as a probability, or used to normalize inputs; a representative example is the sigmoid cross-entropy loss function.
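As a sketch of that last point (with a made-up score z and binary label t): the sigmoid cross-entropy loss interprets sigmoid(z) as the probability of the positive class.

% Sigmoid cross-entropy: treat sigmoid(z) as P(label = 1 | z).
sigmoid = @(z) 1./(1 + exp(-z));

z = 0.8;        % example score (logit); hypothetical value
t = 1;          % example binary label
p = sigmoid(z);                            % predicted probability of class 1
loss = -(t*log(p) + (1 - t)*log(1 - p));   % cross-entropy loss
grad = p - t;   % derivative of the loss with respect to z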
2. Tanh function
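For reference, the function and derivative plotted by the code below are:

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)),   tanh'(x) = 1 - tanh(x)^2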
Code:
x = -10:0.001:10;
% tanh and its derivative (named tanhVal to avoid shadowing the built-in tanh)
tanhVal = (exp(x) - exp(-x))./(exp(x) + exp(-x));
tanhDer = 1 - tanhVal.^2;
figure;
plot(x, tanhVal, 'r', x, tanhDer, 'b--');
grid on;
title('tanh function (solid line) and its derivative (dashed line)');
legend('tanh', 'tanh derivative');
set(gcf, 'NumberTitle', 'off');
set(gcf, 'Name', 'tanh function (solid line) and its derivative (dashed line)');
Output:
[Figure: tanh function (solid line) and its derivative (dashed line)]
Tanh also exhibits soft saturation. [Backpropagation applied to handwritten zip code recognition] notes that a tanh network converges faster than a sigmoid network. Because the output mean of tanh is closer to 0 than that of the sigmoid, SGD behaves more like the natural gradient [Natural gradient works efficiently in learning] (a second-order optimization technique), which reduces the number of iterations required.
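A small sketch of the zero-centered argument (using randomly generated example inputs, so the exact numbers will vary): over inputs spread around 0, tanh outputs average near 0 while sigmoid outputs average near 0.5.

% Compare the output means of tanh and sigmoid on roughly zero-mean inputs.
rng(0);                          % fix the seed so the example is repeatable
z = randn(1, 10000);             % example pre-activations, roughly zero-mean
sigmoid = @(x) 1./(1 + exp(-x));

fprintf('mean of tanh outputs:    %.3f\n', mean(tanh(z)));     % close to 0
fprintf('mean of sigmoid outputs: %.3f\n', mean(sigmoid(z)));  % close to 0.5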
3. Softsign function
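For reference, the function and derivative plotted by the code below are:

softsign(x) = x / (1 + |x|),   softsign'(x) = 1 / (1 + |x|)^2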
Code:
x = -10:0.001:10;
% softsign and its derivative
softsign = x./(1 + abs(x));
softsignDer = 1./(1 + abs(x)).^2;
figure;
plot(x, softsign, 'r', x, softsignDer, 'b--');
axis([-10 10 -1 1]);
grid on;
title('softsign function x/(1+|x|) (solid line) and its derivative (dashed line)');
legend('softsign', 'softsign derivative');
set(gcf, 'NumberTitle', 'off');
set(gcf, 'Name', 'softsign function x/(1+|x|) (solid line) and its derivative (dashed line)');
Output:
[Figure: softsign function x/(1+|x|) (solid line) and its derivative (dashed line)]
4. ReLU function
Defined as:
f(x) = max(0, x)
Code:
x = -10:0.001:10;
% ReLU and its derivative
relu = max(0, x);
reluDer = 0.*(x < 0) + 1.*(x >= 0);
figure;
plot(x, relu, 'r', x, reluDer, 'b--');
title('ReLU function max(0,x) (solid line) and its derivative 0/1 (dashed line)');
legend('ReLU', 'ReLU derivative');
set(gcf, 'NumberTitle', 'off');
set(gcf, 'Name', 'ReLU function (solid line) and its derivative (dashed line)');
Output:
[Figure: ReLU function (solid line) and its derivative (dashed line)]
As can be seen, ReLU is hard-saturating for x < 0. Since its derivative is 1 for x > 0, ReLU keeps the gradient from decaying for x > 0, which alleviates the vanishing gradient problem. However, as training progresses, some inputs fall into the hard saturation region, so the corresponding weights are never updated again. This phenomenon is known as "neuron death".
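A minimal single-neuron sketch of this "death" (with made-up weights, input, and learning rate): once the pre-activation stays negative, the gradient through the ReLU is exactly 0 and the weight never changes again.

% "Dying ReLU": a neuron whose pre-activation stays negative receives no updates.
reluDer = @(z) double(z > 0);    % derivative of max(0, z)

w = -2;  b = -1;   % hypothetical weights after a bad update
x = 1.5;           % example input (positive)
lr = 0.1;          % example learning rate
for step = 1:3
    z = w*x + b;                       % pre-activation, negative here
    grad_w = 1 * reluDer(z) * x;       % upstream gradient assumed to be 1
    w = w - lr*grad_w;                 % reluDer(z) = 0, so w never changes
    fprintf('step %d: z = %.2f, grad_w = %.2f, w = %.2f\n', step, z, grad_w, w);
end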
Another problem for which ReLU is often criticized is the bias shift of its output [7], i.e., its output mean is always greater than 0. The bias shift phenomenon and neuron death together can affect the convergence of the network.
There are also other activation functions, summarized in the table below:
[Table: other common activation functions]
http://mp.weixin.qq.com/s?__biz=MzI1NTE4NTUwOQ==&mid=2650325236&idx=1&sn=7bd8510d59ddc14e5d4036f2acaeaf8d&scene=23&srcid=0801glltvomapzbi0xvx9ys7#rd
http://blog.csdn.net/memray/article/details/51442059
This article is from the "IT Technology Learning and Communication" blog; please do not reprint.