Common activation function comparison

The structure of this article:
    1. What is an activation function
    2. Why use an activation function
    3. What activation functions are there?
    4. Comparison of Sigmoid, ReLU and Softmax
    5. How to choose
1. What is an activation function

In a neuron, the inputs are weighted and summed, and the result is then passed through one more function; that function is the activation function.

2. Why use an activation function

Without an activation function, the output of each layer is a linear function of the input from the layer above, so no matter how many layers the neural network has, the output is just a linear combination of the inputs.

The activation function introduces nonlinearity into the neuron, so the neural network can approximate nonlinear functions arbitrarily well and can therefore be applied to many nonlinear models.
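A minimal sketch of this point (using NumPy, with made-up layer sizes): two stacked layers with no activation collapse to a single linear map, and inserting a nonlinearity breaks the collapse.

    # Sketch: stacking linear layers without an activation collapses to one linear map.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 3))          # 5 samples, 3 features (illustrative sizes)
    W1 = rng.normal(size=(3, 4)); b1 = rng.normal(size=4)
    W2 = rng.normal(size=(4, 2)); b2 = rng.normal(size=2)

    # Two "layers" with no activation function...
    h = x @ W1 + b1
    y = h @ W2 + b2

    # ...are exactly one linear layer with combined weights.
    W = W1 @ W2
    b = b1 @ W2 + b2
    assert np.allclose(y, x @ W + b)

    # Inserting a nonlinearity (e.g. ReLU) between the layers breaks this collapse.
    y_nonlinear = np.maximum(0, h) @ W2 + b2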

3. What activation functions are there?

(1) sigmoid function

Formula:

    σ(x) = 1 / (1 + e^(-x))

Curve: (figure omitted; an S-shaped curve that rises from 0 to 1)

Also called the logistic function; used for the output of hidden-layer neurons.

Range of values (0, 1)

It maps a real number into the interval (0, 1), so it can be used for binary classification.

It works well when the features are fairly complex or the differences between them are not particularly large.

The following explains why the gradient vanishes:

In the backpropagation algorithm, we need the derivative of the activation function; for sigmoid the derivative is:

    σ'(x) = σ(x) · (1 − σ(x))

The sigmoid function and its derivative look as follows (figure omitted):

The derivative peaks at 0.25 (at x = 0) and quickly approaches 0 as |x| grows, so repeated multiplication by it during backpropagation causes the "vanishing gradient" phenomenon.
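A small sketch of that behaviour, assuming a plain NumPy implementation of sigmoid; the sample inputs are illustrative only:

    # Sketch: sigmoid and its derivative; the gradient is tiny for inputs
    # of even moderate magnitude.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_grad(x):
        s = sigmoid(x)
        return s * (1.0 - s)

    for x in [0.0, 2.0, 5.0, 10.0]:
        print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  derivative={sigmoid_grad(x):.5f}")
    # x=  0.0  sigmoid=0.50000  derivative=0.25000
    # x=  2.0  sigmoid=0.88080  derivative=0.10499
    # x=  5.0  sigmoid=0.99331  derivative=0.00665
    # x= 10.0  sigmoid=0.99995  derivative=0.00005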

(2) Tanh function

Formula:

    tanh(x) = (e^x − e^(-x)) / (e^x + e^(-x))

Curve: (figure omitted; an S-shaped curve that rises from −1 to 1, centered at 0)

Also known as the hyperbolic tangent function.

Value range: (−1, 1)

Tanh works very well when the differences between features are distinct, and it keeps amplifying the feature effect during the iterative process.

Unlike sigmoid, tanh is zero-centered (its output has zero mean), so in practice tanh usually works better than sigmoid.
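A quick sketch of the zero-centering point, using the identity tanh(x) = 2·sigmoid(2x) − 1 (a standard relationship, not stated in the original article):

    # Sketch: tanh is a rescaled, zero-centered sigmoid.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.linspace(-3, 3, 7)
    assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

    # Outputs are centered around 0, unlike sigmoid whose outputs are all positive.
    print(np.tanh(x).mean())   # ~0.0 on a symmetric input range
    print(sigmoid(x).mean())   # ~0.5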

(3) ReLU

Rectified Linear Unit (ReLU): used for hidden-layer neuron output.

Formula:

    ReLU(x) = max(0, x)

Curve: (figure omitted; 0 for x < 0, then the identity line for x ≥ 0)

When the input signal is < 0, the output is 0; when the input is > 0, the output equals the input.

Krizhevsky et al. found that SGD converges much faster when using ReLU than when using sigmoid/tanh.

Disadvantages of ReLU:

Training can be "fragile": ReLU neurons can easily "die".

For example, if a very large gradient flows through a ReLU neuron and the parameter update pushes its pre-activation below zero for every input, the neuron will never activate again on any data, so its gradient will be 0 from then on.

If the learning rate is large, it is quite possible that 40% of the neurons in the network end up "dead".
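A minimal sketch of the "dying ReLU" effect, with a made-up weight vector and a deliberately large negative bias standing in for the result of a bad update:

    # Sketch: ReLU and its gradient. If a neuron's pre-activation is negative for
    # every input, the gradient through it is 0 everywhere, so it never recovers.
    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def relu_grad(x):
        return (x > 0).astype(float)

    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    print(relu(x))       # [0.  0.  0.  0.5 3. ]
    print(relu_grad(x))  # [0. 0. 0. 1. 1.]

    # A "dead" neuron: bias so negative that the pre-activation is < 0 for all inputs.
    w, b = np.array([0.1, 0.2]), -100.0
    inputs = np.random.default_rng(0).normal(size=(1000, 2))
    pre_activation = inputs @ w + b
    print(relu_grad(pre_activation).sum())  # 0.0 -> no gradient ever reaches w or b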

(4) Softmax function

Softmax: used for the output of multi-class classification neural networks.

Formula:

    softmax(z)_j = e^(z_j) / Σ_k e^(z_k),   j = 1, …, K

Take an example to see what the formula means (example figure omitted; see the numeric sketch below):

If one z_j is much larger than the other z values, the corresponding component of the mapping approaches 1 and the others approach 0. The main application is multi-class classification.

The first reason for taking the exponent is to imitate the behaviour of max: the larger values are made even more dominant.

The second reason is that we need a differentiable function.
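A small sketch of the formula, using the usual numerically stable trick of subtracting the maximum before exponentiating (the example logits are made up):

    # Sketch: a numerically stable softmax; shifting by the max does not change
    # the result but avoids overflow.
    import numpy as np

    def softmax(z):
        z = np.asarray(z, dtype=float)
        e = np.exp(z - z.max())          # shift for numerical stability
        return e / e.sum()

    print(softmax([3.0, 1.0, -3.0]))     # [0.8789, 0.1189, 0.0022] -> sums to 1
    print(softmax([10.0, 1.0, -3.0]))    # the largest logit dominates: [0.99987, ...]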

4. Comparison of Sigmoid, ReLU and Softmax

Comparison of Sigmoid and ReLU:

Sigmoid suffers from the vanishing-gradient problem; the derivative of ReLU does not have this problem. Its expression is:

    ReLU(x) = max(0, x)

Curve: (figure omitted; the same ReLU curve as above)

Compared with sigmoid-type functions, the main changes of ReLU are:

    1. One-sided suppression
    2. A relatively wide excitation boundary
    3. Sparse activation (illustrated in the sketch below)
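The sparse-activation point can be checked directly; this sketch assumes zero-centered, normally distributed pre-activations:

    # Sketch: after ReLU, roughly half of the activations from a zero-centered
    # input distribution are exactly 0, whereas sigmoid never outputs exactly 0.
    import numpy as np

    rng = np.random.default_rng(0)
    pre_activations = rng.normal(size=10_000)

    relu_out = np.maximum(0.0, pre_activations)
    sigmoid_out = 1.0 / (1.0 + np.exp(-pre_activations))

    print((relu_out == 0).mean())     # ~0.5 of the units are exactly zero
    print((sigmoid_out == 0).mean())  # 0.0 -> no unit is ever exactly zero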

The difference between sigmoid and Softmax:

Softmax is a generalization of the logistic function that "squashes" (maps) a k-dimensional vector z of arbitrary real values to a k-dimensional vector σ(z) of real values in the range (0, 1) that add up to 1.

Sigmoid maps a real value into the interval (0, 1) and is used for binary classification.

Softmax, on the other hand, maps a k-dimensional real-valued vector (a1, a2, a3, a4, ...) to a vector (b1, b2, b3, b4, ...) where each bi lies between 0 and 1 and the outputs sum to 1.0, so they can be interpreted as probabilities; multi-class classification can then be carried out according to the size of each bi.

For binary classification problems, sigmoid and Softmax are equivalent and both use the cross-entropy loss; Softmax can additionally be used for multi-class problems.

Softmax is an extension of sigmoid: when the number of classes is k = 2, Softmax regression reduces to logistic regression. Specifically, when k = 2, the hypothesis function of Softmax regression is:

    h_θ(x) = [ e^(θ1ᵀx), e^(θ2ᵀx) ] / ( e^(θ1ᵀx) + e^(θ2ᵀx) )

Using the parameter-redundancy property of Softmax regression, we subtract the vector θ1 from both parameter vectors to get:

    h_θ(x) = [ 1, e^((θ2−θ1)ᵀx) ] / ( 1 + e^((θ2−θ1)ᵀx) )

Finally, writing θ′ for θ2 − θ1, the above formula says that the probability Softmax regression assigns to one of the categories is

    1 / (1 + e^(θ′ᵀx))

and the probability of the other category is

    e^(θ′ᵀx) / (1 + e^(θ′ᵀx)) = 1 − 1 / (1 + e^(θ′ᵀx))

This is consistent with logistic regression.
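A numerical check of this reduction (a sketch, not from the original article): a two-class softmax gives the same probability as a sigmoid applied to the difference of the two logits.

    # Sketch: softmax over two classes equals sigmoid of the logit difference.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    rng = np.random.default_rng(0)
    z1, z2 = rng.normal(size=2)              # two arbitrary class logits

    p_softmax = softmax(np.array([z1, z2]))  # [P(class 1), P(class 2)]
    p_sigmoid = sigmoid(z2 - z1)             # logistic regression on the difference

    assert np.isclose(p_softmax[1], p_sigmoid)
    assert np.isclose(p_softmax[0], 1.0 - p_sigmoid)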

Softmax models a multinomial distribution, while the logistic function is based on the Bernoulli distribution.
  
Stacking multiple logistic regressions can also achieve multi-class classification, but in the multi-class classification done by Softmax regression the classes are mutually exclusive: one input can only be assigned to a single class. With multiple logistic regressions, the output categories are not mutuallyexclusive; for example, the word "apple" can belong both to the "fruit" category and to the "3C" category.

5. How to Choose

The choice is based on the advantages and disadvantages of each function, for example:

If you use ReLU, set the learning rate carefully and watch that the network does not end up with many "dead" neurons; if this is hard to avoid, you can try Leaky ReLU, PReLU or Maxout instead, as sketched below.
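For reference, a minimal sketch of Leaky ReLU; the negative slope of 0.01 is a common default, assumed here purely for illustration:

    # Sketch: Leaky ReLU keeps a small slope for negative inputs, so the gradient
    # is never exactly 0 and neurons cannot "die". Slope 0.01 is an assumed default.
    import numpy as np

    def leaky_relu(x, negative_slope=0.01):
        return np.where(x > 0, x, negative_slope * x)

    def leaky_relu_grad(x, negative_slope=0.01):
        return np.where(x > 0, 1.0, negative_slope)

    x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
    print(leaky_relu(x))       # [-0.05 -0.01  0.    1.    5.  ]
    print(leaky_relu_grad(x))  # [0.01 0.01 0.01 1.   1.  ] -> never zero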
