Common activation function comparison

The structure of this article:
    1. What is an activation function
    2. Why use an activation function
    3. What activation functions are there?
    4. Comparison of Sigmoid, ReLU and Softmax
    5. How to choose
1. What is an activation function

In a neuron, the inputs are weighted and summed, and the result is then passed through one more function; that function is the activation function.

2. Why use an activation function

Without an activation function, the output of each layer is a linear function of the input from the layer above, so no matter how many layers the neural network has, the output is just a linear combination of the inputs.

The activation function introduces nonlinearity into the neuron, so the neural network can approximate nonlinear functions arbitrarily well and can therefore be applied to many nonlinear models.
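A minimal sketch of this point (using NumPy, with made-up layer sizes): two stacked layers with no activation collapse to a single linear map, and inserting a nonlinearity breaks the collapse.

    # Sketch: stacking linear layers without an activation collapses to one linear map.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 3))          # 5 samples, 3 features (illustrative sizes)
    W1 = rng.normal(size=(3, 4)); b1 = rng.normal(size=4)
    W2 = rng.normal(size=(4, 2)); b2 = rng.normal(size=2)

    # Two "layers" with no activation function...
    h = x @ W1 + b1
    y = h @ W2 + b2

    # ...are exactly one linear layer with combined weights.
    W = W1 @ W2
    b = b1 @ W2 + b2
    assert np.allclose(y, x @ W + b)

    # Inserting a nonlinearity (e.g. ReLU) between the layers breaks this collapse.
    y_nonlinear = np.maximum(0, h) @ W2 + b2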

3. What activation functions are there?

(1) sigmoid function

Formula:

    σ(x) = 1 / (1 + e^(-x))

Curve: (figure omitted; an S-shaped curve that rises from 0 to 1)

Also called the logistic function; used for the output of hidden-layer neurons.

Range of values (0, 1)

It maps a real number into the interval (0, 1), so it can be used for binary classification.

It works well when the features are fairly complex or the differences between them are not particularly large.

The following explains why the gradient vanishes:

In the backpropagation algorithm, we need the derivative of the activation function; for sigmoid the derivative is:

    σ'(x) = σ(x) · (1 − σ(x))

The sigmoid function and its derivative look as follows (figure omitted):

The derivative peaks at 0.25 (at x = 0) and quickly approaches 0 as |x| grows, so repeated multiplication by it during backpropagation causes the "vanishing gradient" phenomenon.
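A small sketch of that behaviour, assuming a plain NumPy implementation of sigmoid; the sample inputs are illustrative only:

    # Sketch: sigmoid and its derivative; the gradient is tiny for inputs
    # of even moderate magnitude.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    for x in [0.0, 2.0, 5.0, 10.0]:
        print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  derivative={sigmoid_grad(x):.5f}")
    # x=  0.0  sigmoid=0.50000  derivative=0.25000
    # x=  2.0  sigmoid=0.88080  derivative=0.10499
    # x=  5.0  sigmoid=0.99331  derivative=0.00665
    # x= 10.0  sigmoid=0.99995  derivative=0.00005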

(2) Tanh function

Formula:

    tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))

Curve: (figure omitted; an S-shaped curve that rises from −1 to 1, centered at 0)

Also known as the hyperbolic tangent function.

Value range: (−1, 1)

Tanh works very well when the differences between features are distinct, and it keeps amplifying the feature effect during the iterative process.

Unlike sigmoid, tanh is zero-centered (its output has zero mean), so in practice tanh usually works better than sigmoid.
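A quick sketch of the zero-centering point, using the identity tanh(x) = 2·sigmoid(2x) − 1 (a standard relationship, not stated in the original article):

    # Sketch: tanh is a rescaled, zero-centered sigmoid.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.linspace(-3, 3, 7)
    assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

    # Outputs are centered around 0, unlike sigmoid whose outputs are all positive.
    print(np.tanh(x).mean())   # ~0.0 on a symmetric input range
    print(sigmoid(x).mean())   # ~0.5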

(3) ReLU

Rectified Linear Unit (ReLU): used for hidden-layer neuron output.

Formula:

    ReLU(x) = max(0, x)

Curve: (figure omitted; 0 for x < 0, then the identity line for x ≥ 0)

When the input signal is < 0, the output is 0; when the input is > 0, the output equals the input.

Krizhevsky et al. found that SGD converges much faster when using ReLU than when using sigmoid/tanh.

Disadvantages of ReLU:

Training can be "fragile": ReLU neurons can easily "die".

For example, if a very large gradient flows through a ReLU neuron and the parameter update pushes its pre-activation below zero for every input, the neuron will never activate again on any data, so its gradient will be 0 from then on.

If the learning rate is large, it is quite possible that 40% of the neurons in the network end up "dead".
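A minimal sketch of the "dying ReLU" effect, with a made-up weight vector and a deliberately large negative bias standing in for the result of a bad update:

    # Sketch: ReLU and its gradient. If a neuron's pre-activation is negative for
    # every input, the gradient through it is 0 everywhere, so it never recovers.
    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        return (x > 0).astype(float)

    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(relu(x))       # [0.  0.  0.  0.5 3. ]
    print(relu_grad(x))  # [0. 0. 0. 1. 1.]

    # A "dead" neuron: bias so negative that the pre-activation is < 0 for all inputs.
    w, b = np.array([0.1, 0.2]), -100.0
    inputs = np.random.default_rng(0).normal(size=(1000, 2))
    pre_activation = inputs @ w + b
    print(relu_grad(pre_activation).sum())  # 0.0 -> no gradient ever reaches w or b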

(4) Softmax function

Softmax: used for the output of multi-class classification neural networks.

Formula:

    softmax(z)_j = e^(z_j) / Σ_k e^(z_k),   j = 1, …, K

Take an example to see what the formula means (example figure omitted; see the numeric sketch below):

If one z_j is much larger than the other z values, the corresponding component of the mapping approaches 1 and the others approach 0. The main application is multi-class classification.

The first reason for taking the exponent is to imitate the behaviour of max: the larger values are made even more dominant.

The second reason is that we need a differentiable function.
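A small sketch of the formula, using the usual numerically stable trick of subtracting the maximum before exponentiating (the example logits are made up):

    # Sketch: a numerically stable softmax; shifting by the max does not change
    # the result but avoids overflow.
    import numpy as np

    def softmax(z):
        z = np.asarray(z, dtype=float)
        e = np.exp(z - z.max())          # shift for numerical stability
        return e / e.sum()

    print(softmax([3.0, 1.0, -3.0]))     # [0.8789, 0.1189, 0.0022] -> sums to 1
    print(softmax([10.0, 1.0, -3.0]))    # the largest logit dominates: [0.99987, ...]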

4. Comparison of Sigmoid, ReLU and Softmax

Comparison of Sigmoid and ReLU:

Sigmoid suffers from the vanishing-gradient problem; the derivative of ReLU does not have this problem. Its expression is:

    ReLU(x) = max(0, x)

Curve: (figure omitted; the same ReLU curve as above)

Compared with sigmoid-type functions, the main changes of ReLU are:

    1. One-sided suppression
    2. A relatively wide excitation boundary
    3. Sparse activation (illustrated in the sketch below)
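The sparse-activation point can be checked directly; this sketch assumes zero-centered, normally distributed pre-activations:

    # Sketch: after ReLU, roughly half of the activations from a zero-centered
    # input distribution are exactly 0, whereas sigmoid never outputs exactly 0.
    import numpy as np

    rng = np.random.default_rng(0)
    pre_activations = rng.normal(size=10_000)

    relu_out = np.maximum(0.0, pre_activations)
    sigmoid_out = 1.0 / (1.0 + np.exp(-pre_activations))

    print((relu_out == 0).mean())     # ~0.5 of the units are exactly zero
    print((sigmoid_out == 0).mean())  # 0.0 -> no unit is ever exactly zero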

The difference between sigmoid and Softmax:

Softmax is a generalization of the logistic function that "squashes" (maps) a k-dimensional vector z of arbitrary real values to a k-dimensional vector σ(z) of real values in the range (0, 1) that add up to 1.

Sigmoid maps a real value into the interval (0, 1) and is used for binary classification.

Softmax, on the other hand, maps a k-dimensional real-valued vector (a1, a2, a3, a4, ...) to a vector (b1, b2, b3, b4, ...) where each bi lies between 0 and 1 and the outputs sum to 1.0, so they can be interpreted as probabilities; multi-class classification can then be carried out according to the size of each bi.

For binary classification problems, sigmoid and Softmax are equivalent and both use the cross-entropy loss; Softmax can additionally be used for multi-class problems.

Softmax is an extension of sigmoid: when the number of classes is k = 2, Softmax regression reduces to logistic regression. Specifically, when k = 2, the hypothesis function of Softmax regression is:

    h_θ(x) = [ e^(θ1ᵀx), e^(θ2ᵀx) ] / ( e^(θ1ᵀx) + e^(θ2ᵀx) )

Using the parameter-redundancy property of Softmax regression, we subtract the vector θ1 from both parameter vectors to get:

    h_θ(x) = [ 1, e^((θ2−θ1)ᵀx) ] / ( 1 + e^((θ2−θ1)ᵀx) )

Finally, writing θ′ for θ2 − θ1, the above formula says that the probability Softmax regression assigns to one of the categories is

    1 / (1 + e^(θ′ᵀx))

and the probability of the other category is

    e^(θ′ᵀx) / (1 + e^(θ′ᵀx)) = 1 − 1 / (1 + e^(θ′ᵀx))

This is consistent with logistic regression.
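A numerical check of this reduction (a sketch, not from the original article): a two-class softmax gives the same probability as a sigmoid applied to the difference of the two logits.

    # Sketch: softmax over two classes equals sigmoid of the logit difference.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    rng = np.random.default_rng(0)
    z1, z2 = rng.normal(size=2)              # two arbitrary class logits

    p_softmax = softmax(np.array([z1, z2]))  # [P(class 1), P(class 2)]
    p_sigmoid = sigmoid(z2 - z1)             # logistic regression on the difference

    assert np.isclose(p_softmax[1], p_sigmoid)
    assert np.isclose(p_softmax[0], 1.0 - p_sigmoid)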

Softmax models a multinomial distribution, while the logistic function is based on the Bernoulli distribution.
  
Stacking multiple logistic regressions can also achieve multi-class classification, but in the multi-class classification done by Softmax regression the classes are mutually exclusive: one input can only be assigned to a single class. With multiple logistic regressions, the output categories are not mutuallyexclusive; for example, the word "apple" can belong both to the "fruit" category and to the "3C" category.

5. How to Choose

The choice is based on the advantages and disadvantages of each function, for example:

If you use ReLU, set the learning rate carefully and watch that the network does not end up with many "dead" neurons; if this is hard to avoid, you can try Leaky ReLU, PReLU or Maxout instead, as sketched below.
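For reference, a minimal sketch of Leaky ReLU; the negative slope of 0.01 is a common default, assumed here purely for illustration:

    # Sketch: Leaky ReLU keeps a small slope for negative inputs, so the gradient
    # is never exactly 0 and neurons cannot "die". Slope 0.01 is an assumed default.
    import numpy as np

    def leaky_relu(x, negative_slope=0.01):
        return np.where(x > 0, x, negative_slope * x)

    def leaky_relu_grad(x, negative_slope=0.01):
        return np.where(x > 0, 1.0, negative_slope)

    x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
    print(leaky_relu(x))       # [-0.05 -0.01  0.    1.    5.  ]
    print(leaky_relu_grad(x))  # [0.01 0.01 0.01 1.   1.  ] -> never zero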
