Neural network activation functions and their derivatives


The ICML 2016 paper [Noisy Activation Functions] gives the following definition of an activation function: an activation function is a map h: R → R that is differentiable almost everywhere.

The main role of the activation function in a neural network is to provide nonlinear modeling capacity; unless stated otherwise, an activation function is generally a nonlinear function. Suppose a neural network contains only linear convolution and fully connected operations. Such a network can only express linear mappings: no matter how deep it is, it is still equivalent to a single linear mapping, which makes it hard to model the nonlinearly distributed data found in real environments. With the addition of (nonlinear) activation functions, a deep neural network gains the ability to learn layered nonlinear mappings.
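
To make this concrete, here is a minimal MATLAB sketch (not from the original text; the layer sizes are arbitrary illustrative choices) showing that two stacked linear layers with no activation in between collapse into one equivalent linear layer:

W1 = randn(4, 3); b1 = randn(4, 1);                  % first linear layer
W2 = randn(2, 4); b2 = randn(2, 1);                  % second linear layer
x  = randn(3, 1);                                    % an arbitrary input
yDeep   = W2 * (W1 * x + b1) + b2;                   % "deep" network without activations
W = W2 * W1;  b = W2 * b1 + b2;                      % fold both layers into one
ySingle = W * x + b;
max(abs(yDeep - ySingle))                            % ~0 up to floating-point rounding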

1. sigmoid function

The sigmoid is the most widely used class of activation functions and has an exponential shape. It is formally defined as:

f(x) = 1 / (1 + e^(-x))

Code:

x = -10:0.001:10;                                    % sigmoid and its derivative
sigmoid = 1 ./ (1 + exp(-x));
sigmoidDer = exp(-x) ./ (1 + exp(-x)).^2;
figure;
plot(x, sigmoid, 'r', x, sigmoidDer, 'b--');
axis([-10 10 -1 1]); grid on;
title('sigmoid function (solid line) and its derivative (dashed line)');
legend('sigmoid function', 'sigmoid derivative');
set(gcf, 'NumberTitle', 'off');
set(gcf, 'Name', 'sigmoid function (solid line) and its derivative (dashed line)');

Output:

[Figure: sigmoid function (solid line) and its derivative (dashed line)]

It can be seen that the sigmoid is differentiable everywhere on its domain, and its derivative approaches 0 on both sides, namely:

lim(x→−∞) f'(x) = 0 and lim(x→+∞) f'(x) = 0

Professor Bengio defines an activation function with this property as a soft-saturating activation function. In analogy with one-sided limits, soft saturation is further divided into left soft saturation and right soft saturation:

Left soft saturation:

lim(x→−∞) f'(x) = 0

Right soft saturation:

lim(x→+∞) f'(x) = 0

In contrast to soft saturation, a hard-saturating activation function is one for which f'(x) = 0 when |x| > c, where c is a constant.

Similarly, hard saturation is divided into left hard saturation and right hard saturation. The commonly used ReLU is a left hard-saturating activation function.

The soft saturation of the sigmoid is an important reason why deep neural networks were difficult to train effectively for two or three decades, and it hindered the development of neural networks. Specifically, during backpropagation the gradient flowing through a sigmoid picks up a factor f'(x) (the derivative of the sigmoid with respect to its input). Once the input falls into the saturation region, f'(x) becomes close to 0, so only a very small gradient is passed down to the lower layers and the network parameters can hardly be trained effectively. This phenomenon is called gradient vanishing. In general, a sigmoid network with as few as five layers already exhibits gradient vanishing [Understanding the difficulty of training deep feedforward neural networks]. The gradient vanishing problem still exists, but it is effectively alleviated by newer techniques such as layer-wise pre-training in DBNs, Batch Normalization, and Xavier and MSRA weight initialization.
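
As a rough numeric illustration of the f'(x) factors (my own sketch, not from the original text), note that the sigmoid derivative never exceeds 0.25, so even in the best case, where every pre-activation sits exactly at 0, each extra layer multiplies the backpropagated gradient by 0.25:

sigmoidDer = @(x) exp(-x) ./ (1 + exp(-x)).^2;       % f'(x); its maximum is 0.25 at x = 0
depth = 1:10;
gradFactor = sigmoidDer(0) .^ depth;                 % one f'(x) factor per layer
disp([depth' gradFactor'])                           % decays as 0.25^depth: the gradient vanishes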

Although the saturation of the sigmoid causes gradients to vanish, it also has benefits. For example, it is the activation function closest to a biological neuron in a physical sense. Its output, which lies in (0, 1), can also be interpreted as a probability or used to normalize inputs; a representative use is the sigmoid cross-entropy loss function.
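
As an illustration of the probability interpretation (a hedged sketch, not code from the original post), the sigmoid cross-entropy loss for a single logit z and label y follows directly from treating sigmoid(z) as a probability:

sigmoid = @(z) 1 ./ (1 + exp(-z));
xent = @(z, y) -(y .* log(sigmoid(z)) + (1 - y) .* log(1 - sigmoid(z)));
xent(2, 1)                                           % ≈ 0.13: sigmoid(2) ≈ 0.88 is close to label 1
xent(2, 0)                                           % ≈ 2.13: the same prediction is far from label 0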


2. Tanh function

Code:

x = -10:0.001:10;                                    % tanh and its derivative
tanhVal = (exp(x) - exp(-x)) ./ (exp(x) + exp(-x));  % equivalent to tanh(x)
tanhDer = 1 - tanhVal.^2;
figure;
plot(x, tanhVal, 'r', x, tanhDer, 'b--');
grid on;
title('tanh function (solid line) and its derivative (dashed line)');
legend('tanh function', 'tanh derivative');
set(gcf, 'NumberTitle', 'off');
set(gcf, 'Name', 'tanh function (solid line) and its derivative (dashed line)');

Output:

[Figure: tanh function (solid line) and its derivative (dashed line)]

Tanh is also soft-saturating. [Backpropagation applied to handwritten zip code recognition] mentions that tanh networks converge faster than sigmoid networks. Because the output mean of tanh is closer to 0 than that of the sigmoid, SGD behaves more like the natural gradient [Natural gradient works efficiently in learning] (a second-order optimization technique), thus reducing the number of iterations required.
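
A quick check of the output-mean claim (a minimal sketch, assuming zero-mean, unit-variance inputs):

x = randn(1, 1e5);                                   % zero-mean inputs
mean(1 ./ (1 + exp(-x)))                             % ≈ 0.5: the sigmoid output mean is far from 0
mean(tanh(x))                                        % ≈ 0:   the tanh output mean stays near 0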


3. softsign function

Code:

x = -10:0.001:10;
softsign = x ./ (1 + abs(x));
% A piecewise function can be written with logical masks, e.g.:
% y = sqrt(x).*(x>=0 & x<4) + 2*(x>=4 & x<6) + (5-x/2).*(x>=6 & x<8) + 1*(x>=8);
softsignDer = (1 ./ (1 + x).^2) .* (x >= 0) + (1 ./ (1 - x).^2) .* (x < 0);  % = 1/(1+|x|)^2
figure;
plot(x, softsign, 'r', x, softsignDer, 'b--');
axis([-10 10 -1 1]);
grid on;                                             % add after the first plot
title('softsign function x/(1+|x|) (solid line) and its derivative (dashed line)');
legend('softsign function', 'softsign derivative');
set(gcf, 'NumberTitle', 'off');
set(gcf, 'Name', 'softsign function x/(1+|x|) (solid line) and its derivative (dashed line)');

Output:

[Figure: softsign function x/(1+|x|) (solid line) and its derivative (dashed line)]


4. ReLU

Defined as:

f(x) = max(0, x), i.e. f(x) = x for x ≥ 0 and f(x) = 0 for x < 0

Code:

x = -10:0.001:10;
relu = max(0, x);
reluDer = 0 .* (x < 0) + 1 .* (x >= 0);              % derivative: 0 for x < 0, 1 for x >= 0
figure;
plot(x, relu, 'r', x, reluDer, 'b--');
title('ReLU function max(0,x) (solid line) and its derivative (dashed line)');
legend('ReLU function', 'ReLU derivative');
set(gcf, 'NumberTitle', 'off');
set(gcf, 'Name', 'ReLU function (solid line) and its derivative (dashed line)');

Output:

[Figure: ReLU function max(0,x) (solid line) and its derivative (dashed line)]

As can be seen, ReLU is hard-saturating for x < 0. Since its derivative for x > 0 is 1, ReLU propagates the gradient without decay for x > 0, which alleviates the gradient vanishing problem. However, as training progresses, some inputs may fall into the hard saturation region, so the corresponding weights are never updated again. This phenomenon is known as "neuron death".
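
A small illustrative sketch of neuron death (not from the original post; the weight and bias values are made up): if a ReLU neuron's pre-activation is negative for every input, the gradient with respect to its incoming weight is zero for all samples, so that weight is never updated again:

w = -2; b = -1;                                      % hypothetical weight and bias
x = rand(1, 1000);                                   % inputs in [0, 1]
z = w * x + b;                                       % z < 0 for every input: hard saturation region
reluDer = double(z >= 0);                            % ReLU derivative at each z (all zeros here)
max(abs(reluDer .* x))                               % gradient w.r.t. w is 0 for every sample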

Another problem for which ReLU is often "criticized" is the output offset phenomenon [7], i.e. its output mean is always greater than 0. The offset phenomenon and neuron death together can affect the convergence of the network.
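
A minimal sketch of the offset phenomenon, assuming zero-mean pre-activations: after ReLU, the output mean is clearly above 0.

x = randn(1, 1e5);                                   % zero-mean pre-activations
mean(max(0, x))                                      % ≈ 0.4 > 0: the ReLU output mean is biased above 0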

There are a number of other activation functions as well, summarized in the following table:

[Figure: table of other common activation functions]

http://mp.weixin.qq.com/s?__biz=MzI1NTE4NTUwOQ==&mid=2650325236&idx=1&sn=7bd8510d59ddc14e5d4036f2acaeaf8d&scene=23&srcid=0801glltvomapzbi0xvx9ys7#rd

http://blog.csdn.net/memray/article/details/51442059



This article is from the "IT Technology Learning and Communication" blog; reproduction is declined.
