ReLU (Rectified Linear Units) activation function
Paper reference: Deep Sparse Rectifier Neural Networks (an interesting paper)
Origins: traditional activation functions, studies of neuron firing frequency, sparse activation

Traditional sigmoid-family activation functions
The two most commonly used activation functions in traditional neural networks, the sigmoid family (logistic-sigmoid and tanh-sigmoid), have long been regarded as the core of neural networks.
Mathematically, the nonlinear sigmoid function has large signal gain in its central region and small gain on the two sides, which works well for mapping signals into the feature space.
From a neuroscience point of view, the central region resembles the excited state of a neuron and the two sides resemble the inhibited state, so when training a neural network, key features can be pushed toward the central region and non-key features toward the two sides.
Either way, this seems considerably smarter than the early linear activation function (y = x) and the step activation function (-1/1 or 0/1).
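For concreteness, the two sigmoid-family functions and their gain (the derivative, largest at x = 0 and vanishing in the tails) are the standard:

```latex
\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}, \qquad
\mathrm{sigmoid}'(x) = \mathrm{sigmoid}(x)\bigl(1 - \mathrm{sigmoid}(x)\bigr) \le \tfrac{1}{4}
\\[4pt]
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}, \qquad
\tanh'(x) = 1 - \tanh^{2}(x) \le 1
```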
Approximate biological neuron activation functions: Softplus & ReLU
In 2001, the neuroscientists Dayan and Abbott, working from a biological perspective, modeled a more accurate activation function for how a neuron responds to incoming signals, as shown in the figure on the left:
Compared with the sigmoid family, this model differs in three main ways: ① one-sided suppression; ② a relatively wide excitation boundary; ③ sparse activation (the key point: the front-end state in the red box is completely inactive).
In the same year, Charles Dugas and others happened to use the softplus function in a paper on regression with positive-valued targets; softplus is the antiderivative of the logistic-sigmoid function.
$\mathrm{softplus}(x) = \log(1 + e^{x})$
According to the paper, they originally wanted to use an exponential function (naturally positive) as the activation function for the regression, but in the later stages the gradient became too large and training was difficult, so a log was applied to damp the upward trend.
Adding the 1 keeps the result non-negative. In the same NIPS paper, Charles Dugas and others also joked that softplus can be regarded as a smooth version of the forced non-negative rectifier function max(0, x).
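As a quick numerical check (a minimal NumPy sketch of my own, not code from the paper), the two curves nearly coincide away from zero and differ most near the origin:

```python
import numpy as np

def softplus(x):
    # softplus(x) = log(1 + e^x), the smooth version of the rectifier
    return np.log1p(np.exp(x))

def relu(x):
    # the forced non-negative rectifier: max(0, x)
    return np.maximum(0.0, x)

x = np.linspace(-5, 5, 11)
print(np.round(softplus(x), 3))  # rises smoothly from ~0 toward x
print(relu(x))                   # exactly 0 for x <= 0, exactly x for x > 0
# The gap is largest at the origin: softplus(0) = log(2) ~ 0.693 vs relu(0) = 0;
# for large |x| the two functions are nearly identical.
```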
Incidentally, also in 2001, the softplus/rectifier activation functions proposed in the machine learning field turned out to closely resemble the neuron firing-frequency function from neuroscience, which spurred research on new activation functions.
Sparse activation in biological neurons
In neuroscience, besides the new firing-frequency function, neuroscientists have also discovered that neuron activation is sparse.
Also in 2001, Attwell and others, based on observations of the brain's energy consumption, speculated that neurons encode information in a sparse and distributed manner.
In 2003, Lennie and others estimated that only about 1–4% of the brain's neurons are active at the same time, further indicating the sparsity of neural activity.
From a signal point of view, a neuron responds selectively to only a small portion of its input signals at any one time, while a large number of signals are deliberately blocked; this improves learning precision and lets sparse features be extracted better and faster.
Seen this way, after the weights W are initialized with the usual empirical rules, the traditional sigmoid-family functions leave almost half of the neurons activated, which does not match the neuroscience findings and causes serious problems for deep network training.
Softplus satisfies the first two properties of the new model but lacks sparse activation; the rectifier function max(0, x) thus became the big winner as an approximation of the model.
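A small sketch of this contrast (my own illustration with a made-up layer, not an experiment from the paper): under a typical zero-mean random initialization, roughly half of a sigmoid layer's units sit in the "excited" half of their range, while the rectifier switches roughly half the units fully off:

```python
import numpy as np

rng = np.random.default_rng(0)
# Pre-activations of one hidden layer under a zero-mean random initialization.
z = rng.standard_normal((1000, 500)) * 0.1

sigmoid_out = 1.0 / (1.0 + np.exp(-z))
relu_out = np.maximum(0.0, z)

print((sigmoid_out > 0.5).mean())  # ~0.5: half the sigmoid units are in the excited half
print((relu_out == 0.0).mean())    # ~0.5: the rectifier silences about half the units outright
```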
Part I: A view of sparsity
A disruptive line of research in machine learning is sparse features; the deep learning branch itself grew out of research on data-driven sparse features.
The concept of sparsity was first introduced by Olshausen and Field in 1997 in their research on sparse coding of signal data, and it was first put to serious use in convolutional neural networks.
In recent years, research on sparsity has been active not only in computational neuroscience and machine learning, but also in signal processing and statistics.
In summary, sparsity offers roughly three contributions:
1.1 Information disentangling
At present, an explicit goal of deep learning is to disentangle the key factors of variation hidden in the data. Raw data (most of it natural data) usually comes with highly dense, entangled features. The reason is that these feature dimensions are interrelated: a small change in one key factor can drag a whole group of features along with it, a bit like the butterfly effect.
Traditional, mathematically grounded machine learning methods have a fatal weakness when it comes to disentangling these correlated features.
However, if the complex entanglement between features can be untied and converted into sparse features, the resulting features are robust (irrelevant noise is removed).
1.2 Linear separability
Sparse features are more likely to be linearly separable, or at least depend less on a nonlinear mapping mechanism, because sparse features live in a high-dimensional feature space (the mapping there is automatic).
From the manifold-learning point of view (see the denoising autoencoder), sparse features are pushed onto a purer low-dimensional manifold surface.
Linear separability can also be seen in naturally sparse text data, which can often be separated quite well even without any hidden layers.
1.3 Densely distributed but sparse
Densely, entangled-distributed features are the most information-rich; in terms of latent potential, they are often more valuable than features carried by a few local points.
Sparse features extracted from this dense, entangled region therefore have great potential value.
1.4 What a sparse activation function contributes:
Different inputs may contain different numbers of key features, so a variable-size data structure is a more flexible container for them.
If neuron activation is sparse, then different activation paths can differ in two ways: in how many neurons fire (selective non-activation) and in which neurons fire (distributed activation).
Activation paths generated by these two optimizable structures can better learn relatively sparse features along the informative dimensions of the data, yielding an automatic disentangling effect.
Part II: The rectified activation function based on sparsity
2.1 The non-saturating linear end
Setting sparse activation aside, the rectifier function max(0, x) also differs markedly from the softplus function at the excited end (linear versus nonlinear).
Over several decades of machine learning, we have built up the notion that nonlinear activation functions are more advanced than linear ones.
Especially with BP neural networks full of sigmoid functions and SVM networks full of radial basis functions, one easily gets the illusion that the nonlinear function is what gives the network its nonlinear power.
This illusion is even stronger for SVMs: the form of the kernel function is not really the main hero that lets an SVM handle nonlinear data (the support vectors play the role of the hidden layer).
So in a deep network, the reliance on nonlinearity can be relaxed. Moreover, as noted in the previous section, sparse features do not require the network to have a strong mechanism for handling linearly non-separable data.
Combining these two points, a simple, fast linear activation function may be more appropriate in a deep learning model.
Once the connections between neurons become linear, the only nonlinearity in the network comes from the partial, selective activation of neurons.
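Here is a tiny NumPy sketch of that claim (my own illustration, with arbitrary random weights): once the set of active ReLU units is fixed, the network collapses to a single linear map, so all the nonlinearity comes from which units are switched on:

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)

def forward(x):
    h = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer
    return W2 @ h + b2

x = rng.standard_normal(3)
mask = (W1 @ x + b1) > 0               # which hidden units are active at this input

# With that activation pattern frozen, the network is one fixed affine map.
W_eff = W2 @ (W1 * mask[:, None])
b_eff = W2 @ (b1 * mask) + b2
print(np.allclose(forward(x), W_eff @ x + b_eff))  # True
```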
2.2 Vanishing Gradient problem
Another reason to favor a linear activation function is the vanishing gradient problem that arises when gradient-based methods are used to train deep networks.
Anyone who has gone through the backpropagation derivation knows that when the error is propagated backward from the output layer, the gradient at each layer is multiplied by the current layer's input neuron value and by the first derivative of the activation function.
That is, $\mathrm{grad} = \mathrm{error} \cdot \mathrm{sigmoid}'(x) \cdot x$. Using the doubly saturating (i.e., bounded-range) sigmoid-family functions causes two problems:
① $\mathrm{sigmoid}'(x) \in (0, 1)$: scaling by the derivative
② $x \in (0, 1)$ or $x \in (-1, 1)$: scaling by the saturated activation value
In this way the error is attenuated at every layer it passes through; once the multiplication recurses through many layers of backpropagation, the gradient keeps decaying until it vanishes, and network learning slows to a crawl.
The rectifier activation function has gradient 1 on its active side and saturates at only one end, so the gradient flows very well during backpropagation and training speed improves dramatically.
Softplus is slightly slower, since $\mathrm{softplus}'(x) = \mathrm{sigmoid}(x) \in (0, 1)$, but it too saturates at only one end, so it is still faster than the sigmoid-family functions.
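A rough sketch of how those per-layer factors compound with depth (my own illustration, using pretend pre-activations rather than a real network):

```python
import numpy as np

rng = np.random.default_rng(2)
depth = 20
z = rng.standard_normal(depth)               # pretend pre-activations along one backprop path

sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
sig_prime = sigmoid(z) * (1.0 - sigmoid(z))  # sigmoid'(z) <= 0.25 everywhere
relu_prime = (z > 0).astype(float)           # relu'(z) is exactly 0 or exactly 1

print(np.prod(sig_prime))          # vanishingly small after 20 layers: the gradient dies out
print(np.prod(relu_prime[z > 0]))  # exactly 1.0: active units pass the gradient through unchanged
```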
Part III: A potential problem: is forcing in sparse zeros reasonable?
Admittedly, sparsity has many advantages. However, excessive forced sparsification reduces the model's effective capacity: too many features are masked out, and the model can no longer learn effective features.
The paper experiments with how much sparsity to introduce and finds the ideal sparsity (forced-zero) ratio to be about 70%–85%. Above 85%, network capacity becomes a problem and the error rate rises sharply.
Compared with the roughly 95% sparsity of the working brain, there is still a large gap between today's computational neural networks and biological neural networks.
Fortunately, ReLU zeroes out only negative values, so the amount of sparsity it introduces can be regulated by training and changes dynamically.
As long as gradient training keeps pushing the network toward lower error, the network automatically controls its sparsity ratio and keeps a reasonable number of non-zero values along each activation chain.
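One might monitor that ratio during training with something like the following (a hypothetical helper of my own; the layer sizes and data are made up):

```python
import numpy as np

def relu_sparsity(pre_activations):
    # Fraction of hidden units that the rectifier forces to exactly zero.
    return float((np.maximum(0.0, pre_activations) == 0.0).mean())

# Hypothetical pre-activations: a batch of 128 examples, 256 hidden units.
rng = np.random.default_rng(3)
z = rng.standard_normal((128, 256)) - 0.5    # a small negative shift pushes sparsity above 50%
print(f"sparsity: {relu_sparsity(z):.2%}")   # ~69%, near the paper's 70%-85% sweet spot
```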
Part IV: ReLU's contributions
4.1 Narrowing the gap between using and not using unsupervised pre-training
Using ReLU lets the network introduce sparsity on its own, which is in effect equivalent to unsupervised pre-training.
Of course, the effect is not quite as good as real pre-training. The data given in the paper show that without pre-training, a ReLU-activated network is far ahead of networks with other activation functions, and in some curious cases it even beats ordinary activation functions that were pre-trained. With pre-training, ReLU still has room to improve further.
In this respect, ReLU narrows the gap between unsupervised and supervised learning. And, of course, training is faster.
4.2 Faster Feature learning
On MNIST with LeNet4, the ReLU+tanh combination brings the validation-set error rate down to 1.05% by epoch 50, whereas the all-tanh network only reaches 1.37% by epoch 150, a result the ReLU+tanh network already matches by epoch 17.
The figure from the AlexNet paper comparing ReLU with an ordinary sigmoid-family function shows that using ReLU greatly shortens the learning cycle. Weighing both speed and efficiency, ReLU should be the activation function of choice for most deep learning work.
Part V: Implementing ReLU in Theano
ReLU can be implemented directly as T.maximum(0, x); do not use T.max(0, x), which cannot be differentiated the way you want (in Theano, T.max is a reduction over an axis, not an element-wise maximum).
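A minimal sketch of that in Theano (variable names are mine; assumes Theano is installed):

```python
import theano
import theano.tensor as T

x = T.matrix('x')
y = T.maximum(0., x)        # element-wise max(0, x); differentiable with respect to x
g = T.grad(y.sum(), x)      # 1 where x > 0, 0 where x < 0

relu_fn = theano.function([x], [y, g])
```

Newer Theano releases also ship a built-in T.nnet.relu that wraps the same idea.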
Part VI: ReLU training tips
See the CIFAR-10 training tips.