"Noisy Activation function" noise activation functions (I.)

This series of articles is written by @yhl_leo; please credit the source when reposting.
Article link: http://blog.csdn.net/yhl_leo/article/details/51736830

Noisy Activation Functions is a paper on activation functions published at ICML 2016. It analyzes earlier activation functions in depth and proposes a new method of injecting noise during training, which works well and strikes me as very meaningful; I expect it to have a considerable impact on the deep learning field. I have therefore translated the original paper and added a few comments (I plan to write two blog posts; this one covers the abstract through Section 3). I hope it helps your understanding; if there are mistakes, please point them out.

Paper URL: http://arxiv.org/pdf/1603.00391v3.pdf

Abstract

The common nonlinear activation functions (NAFs) used in neural networks can make training difficult because of their saturation behavior ("saturation" here means that the derivative tends to 0; with a fixed learning rate, the closer a unit gets to saturation the smaller each update becomes, which hampers convergence), and this can make the loss insensitive to vanilla SGD (which uses only first-order gradient information). The paper proposes injecting appropriate noise so that gradients become more pronounced (compared with a noise-free activation function, whose gradient can be exactly 0). When a large amount of noise is injected, the noise dominates the noise-free gradient (that is, it changes the magnitude and direction the gradient would have without noise), so stochastic gradient descent (SGD) explores more during convergence. We add noise only in the problematic parts of the activation function (as will be seen later, the noise is added where the derivative of the activation function is zero), encouraging the optimization process to explore the boundary between the degenerate/saturated part and the well-behaved part of the activation function. Annealing the amount of noise makes hard objective functions easier to optimize, and we establish a connection to simulated annealing. Experiments show that replacing traditional saturating activation functions with their noisy counterparts helps training in many settings and yields very good results on different datasets and tasks, especially when training appears difficult, for example in curriculum learning (Bengio et al., 2009).

1. Introduction

The introduction of piecewise linear activation functions such as ReLU and Maxout (Goodfellow et al., 2013) has had a far-reaching effect on deep learning and has been a major catalyst in making it possible to train deeper neural networks. Thanks to ReLU, purely supervised deep networks could be trained for the first time (Glorot et al., 2011), whereas the earlier tanh nonlinearity could only train shallow networks. It seems reasonable to attribute the recent surge of interest in these piecewise linear activation functions to the fact that they are easier to optimize with SGD and back-propagation than smooth activation functions such as sigmoid and tanh. The recent successes of piecewise linear functions in computer vision have made ReLU the standard choice for the activation layers of convolutional networks.

We propose a new technique for training neural networks whose activation functions saturate hard when the input is large in magnitude. The idea is to inject noise into the saturated part of the activation function and to learn the scale of that noise. With this approach, we find it feasible to train neural networks with a much wider range of activation functions. Adding noise to ReLU units had been proposed before (Bengio et al., 2013; Nair & Hinton, 2010) for feed-forward networks and Boltzmann machines, to encourage units to explore more and to ease the optimization process.

There has recently been a resurgence of interest in complex gated architectures such as LSTMs (Hochreiter & Schmidhuber, 1997) and GRUs (Cho et al., 2014), as well as in neural attention mechanisms (Desimone et al., 1995), giving rise to Neural Turing Machines (Graves et al., 2014), Memory Networks (Weston et al., 2014), image captioning (Xu et al., 2015), video caption generation (Yao et al., 2015), and a wider range of applications (LeCun et al., 2015). A common thread running through these works is the use of soft-saturating nonlinear functions, such as sigmoid and softmax, to emulate the hard decisions of logic circuits. Despite some successes, two issues cannot be overlooked:

    1. Because these nonlinear functions saturate, gradients vanish when flowing across the "gate";
    2. Because the nonlinear functions only soft-saturate, they cannot make truly hard decisions.

Because the gates usually operate in the soft-saturation regime (Karpathy et al., 2015; Bahdanau et al., 2014; Hermann et al., 2015), they never fully open or close. We take a novel approach to these problems. Our approach addresses the second issue by using hard-saturating nonlinear functions, which allow a gate to be completely open or completely closed when saturated. Since the gate can be fully open or closed, there is no information loss from the leakage that a soft-gating architecture suffers (a soft gate's decision is only an approximation).

With hard-saturating nonlinear functions, however, the gradient-flow problem becomes worse, because the gradient is exactly 0 in saturation rather than merely tending to 0. Injecting noise into the activation function, with a magnitude that varies with the degree of saturation, can nevertheless encourage stochastic exploration. (My reading of these two sentences: using a hard-saturating nonlinearity alone means the gradient is 0 in saturation, so optimization cannot proceed; the hard nonlinearity gains the advantage of hard decisions but is unfavorable for training. To avoid getting stuck in saturation, the second sentence points out that if appropriate noise is introduced according to the degree of saturation, the gradient is no longer 0 and methods such as SGD can keep exploring.)

At test time, the noise in the activation function can be removed or replaced with its expected value. According to our experiments, the method outperforms soft-saturating functions on a variety of tasks involving gated networks, and simply swapping the nonlinearity in existing training code is enough to obtain striking improvements.

The proposed technique addresses the optimization problem, allows gate units to be hard-activated at test time, and amounts to a simulated-annealing-style method applied to neural networks.

Hannun et al. (2014) and Le et al. (2015) used ReLU activation functions in simple RNNs. In this paper, we successfully show that piecewise linear activation functions can also be used in recurrent networks with gated architectures (for example, LSTMs and GRUs).

2. Saturating Activation Functions

Definition 2.1: Activation function. An activation function is a function h: R → R that is differentiable almost everywhere. (R is the set of real numbers; both the input and the output of the mapping are real numbers.)

Definition 2.2: Saturation. An activation function h(x) with derivative h′(x) is said to right (left) saturate if h′(x) tends to 0 as x → +∞ (x → −∞). An activation function saturates if it both left and right saturates. Expressed as formulas:

    lim_{x→+∞} h′(x) = 0   (right saturation),    lim_{x→−∞} h′(x) = 0   (left saturation)

Most of the common activation functions used in recurrent networks (for example sigmoid and tanh) saturate. Moreover, they only soft-saturate: they reach true saturation only in the limit.
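
To make soft saturation concrete, here is a quick numerical check (a small illustrative sketch of mine, not from the paper): the derivatives of sigmoid and tanh shrink toward 0 as |x| grows, but only reach 0 in the limit.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def d_sigmoid(x):
        s = sigmoid(x)
        return s * (1.0 - s)          # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))

    def d_tanh(x):
        return 1.0 - np.tanh(x) ** 2  # tanh'(x) = 1 - tanh(x)^2

    for x in [0.0, 2.0, 5.0, 10.0]:
        # Both derivatives decay toward 0 as x grows, but never equal 0 exactly.
        print(f"x={x:5.1f}  sigmoid'={d_sigmoid(x):.2e}  tanh'={d_tanh(x):.2e}")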

Definition 2.3: Hard and soft saturation. If there is a constant c such that h′(x) = 0 for all x > c, and a constant c′ such that h′(x) = 0 for all x < c′, the activation function is said to hard-saturate. Functions whose derivative only reaches 0 in the limit, as described above, are said to soft-saturate. (For ease of understanding, I plotted the two kinds of curves in MATLAB; see Figure 1.)



Figure 1 Hard/soft saturation

A hard-saturating function can be constructed from a soft-saturating one by taking its Taylor expansion around 0, keeping only the terms up to first order, and clipping the result.

Taking the sigmoid and tanh functions as examples and expanding around 0:

    u(x) ≈ sigmoid(0) + sigmoid′(0)·x = 0.25·x + 0.5
    u_t(x) ≈ tanh(0) + tanh′(0)·x = x

Clipping these linear approximations gives the hard-saturating versions:

    hard-sigmoid(x) = max(min(u(x), 1), 0)
    hard-tanh(x) = max(min(u_t(x), 1), −1)

(I also plotted hard-sigmoid(x) and hard-tanh(x); see Figure 2.)



Figure 2 Hard-sigmoid and hard-tanh
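
Written as code, the two clipped first-order expansions above might look like the following minimal sketch (using NumPy; the slopes 0.25 and 1 and the clipping ranges [0, 1] and [−1, 1] come directly from the expansions of sigmoid and tanh at 0):

    import numpy as np

    def hard_sigmoid(x):
        # First-order Taylor expansion of sigmoid around 0, u(x) = 0.25*x + 0.5,
        # clipped to [0, 1]; the derivative is exactly 0 for |x| >= 2.
        return np.clip(0.25 * x + 0.5, 0.0, 1.0)

    def hard_tanh(x):
        # First-order Taylor expansion of tanh around 0, u_t(x) = x,
        # clipped to [-1, 1]; the derivative is exactly 0 for |x| >= 1.
        return np.clip(x, -1.0, 1.0)

    x = np.linspace(-4.0, 4.0, 9)
    print("x           :", x)
    print("hard_sigmoid:", hard_sigmoid(x))
    print("hard_tanh   :", hard_tanh(x))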

The motivation for this construction is to make the function linear around 0, so that the gradient flows well while the unit is not saturated, and hard decisions are obtained in the saturated part.

The hard-sigmoid and hard-tanh functions yield hard decisions, but at the cost of a zero gradient in the saturated regions. This can cause training difficulties: a small but non-infinitesimal change of the pre-activation may be needed to escape saturation, yet such a change produces no change in the gradient at all.

In the remainder of the paper, we use h(x) to denote a generic activation function and u(x) to denote the linear function obtained above by keeping the first-order term of the Taylor expansion at 0. (See Figure 2.) The hard-sigmoid saturates on the intervals x ≤ −2 and x ≥ 2, while hard-tanh saturates on x ≤ −1 and x ≥ 1. We call x_t the saturation threshold of the function; in absolute value, x_t = 2 for hard-sigmoid and x_t = 1 for hard-tanh.

Note that hard-sigmoid(x) and hard-tanh(x) are both contractive mappings; the contraction only applies where the absolute value of the input exceeds the threshold mentioned above. An important difference between these activation functions lies in their fixed points (fixed points and attracting fixed points are described in a separate blog post). The fixed point of hard-sigmoid(x) is x = 2/3, while the fixed point of sigmoid(x) is approximately 0.66. Every real number between −1 and 1 is a fixed point of hard-tanh(x), whereas tanh(x) has a single fixed point at x = 0; the behaviour around these fixed points (whether they act as attracting fixed points under repeated application) also differs between the soft functions and their hard counterparts. These mathematical differences between the saturating activation functions have a larger impact in RNNs and deep networks.
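
The fixed-point claims are easy to check numerically. The self-contained sketch below (illustrative only) iterates x ← sigmoid(x), which settles at sigmoid's fixed point, and verifies that 2/3 is a fixed point of hard-sigmoid while every point of [−1, 1] is left unchanged by hard-tanh:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hard_sigmoid(x):
        return np.clip(0.25 * x + 0.5, 0.0, 1.0)

    def hard_tanh(x):
        return np.clip(x, -1.0, 1.0)

    # Iterate x <- sigmoid(x); the sequence settles at the fixed point (about 0.659).
    x = 0.0
    for _ in range(100):
        x = sigmoid(x)
    print("fixed point of sigmoid  :", round(x, 4))

    # 2/3 is a fixed point of hard-sigmoid: 0.25 * (2/3) + 0.5 = 2/3.
    print("hard_sigmoid(2/3) - 2/3 :", hard_sigmoid(2.0 / 3.0) - 2.0 / 3.0)

    # Inside [-1, 1], hard-tanh is the identity, so every such point is fixed.
    pts = np.array([-1.0, -0.3, 0.0, 0.7, 1.0])
    print("hard_tanh(pts) == pts   :", np.all(hard_tanh(pts) == pts))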

In some applications, because the gradient-descent trajectory is steep and uneven, the learned parameters may drive a unit's activations into a zero-gradient region, and once there the unit is very likely to get stuck, unable to leave that zero-gradient region.

When a unit saturates and its gradient vanishes, the way an algorithm usually recovers is by devoting more training data and more computation to compensate.



Figure 1 from the paper: derivatives of the different activation functions.

3. Annealing with Noisy Activation Functions

Consider a noisy activation function φ(x, ξ) into which we inject independent and identically distributed (i.i.d.) noise ξ, to replace saturating nonlinearities such as the hard-sigmoid and hard-tanh introduced above. The next section describes the proposed noisy activation functions in detail; here we first introduce this family of noisy activation functions.

The noise ξ has mean 0 and variance σ². We want to describe what happens as the noise is gradually annealed, that is, as we go from a large amount of noise to no noise at all.

Furthermore, we assume that φ is such that, as the amount of noise grows very large, the magnitude of its derivative with respect to x also grows without bound:

    lim_{σ→∞} |∂φ(x, ξ)/∂x| = ∞

The zero-noise limit φ(x, 0) is the ordinary deterministic nonlinear activation function discussed earlier; in our experiments it is piecewise linear and can be learned so as to represent the complex functions we need. Figure 2 of the paper illustrates the idea: when the amount of noise tends to infinity, back-propagation obtains very large gradients because the derivative of φ is large, and the noise overwhelms the signal (as in the example, the gradient when the variance is huge is far larger than when the variance is 0, in which case the input can be treated as the true signal). As a result, SGD moves the model parameters around essentially at random, exploring everywhere, and because the gradient is dominated by the noise it cannot sense any consistent descent direction.
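
To make the family φ(x, ξ) more tangible, here is a minimal sketch of one possible member: zero-mean Gaussian noise of scale σ injected only where hard-tanh saturates (i.e. where its gradient is exactly 0). This is an illustrative assumption of mine, not the paper's exact construction, which is given in the next section; the name noisy_hard_tanh and the way the noise is gated by a saturation indicator are my own choices for this example.

    import numpy as np

    rng = np.random.default_rng(0)

    def hard_tanh(x):
        return np.clip(x, -1.0, 1.0)

    def noisy_hard_tanh(x, sigma):
        # Inject i.i.d. noise (mean 0, variance sigma^2) only in the saturated
        # region |x| >= 1, where the deterministic gradient would be exactly 0.
        xi = rng.standard_normal(np.shape(x))
        saturated = (np.abs(x) >= 1.0).astype(float)
        return hard_tanh(x) + sigma * saturated * xi

    x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
    for sigma in [1.0, 0.1, 0.0]:   # annealing: from a lot of noise to none
        print(f"sigma={sigma:3.1f}:", np.round(noisy_hard_tanh(x, sigma), 3))
    # With sigma = 0 we recover the deterministic hard_tanh(x) exactly.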


Figure 2 from the paper: a one-dimensional non-convex objective function, on which a simple gradient-descent method performs poorly. With a large amount of noise, SGD can explore and escape saddle points and poor local minima. As the amount of noise is annealed toward 0, SGD eventually converges to a local minimum x*.

The annealing procedure is related to the signal-to-noise ratio (SNR), which here can be defined as the ratio of the signal level to the noise level:

    SNR = σ_signal / σ_noise

If the SNR tends to 0, the model explores purely at random (there is no meaningful descent direction to follow). As annealing proceeds, the SNR gradually increases (the noise variance decreases), and once the noise variance has shrunk to 0, the only remaining source of noise during training is the Monte Carlo noise of the stochastic gradient estimate.

Earlier methods such as simulated annealing (Kirkpatrick et al., 1983) and continuation methods (Allgower & Georg, 1980) are closely related and are very helpful for the kind of difficult non-convex objectives discussed above. With a lot of noise, SGD is free to explore a larger part of the parameter space. As the noise is reduced, it tends to stay where the signal is strong enough to be sensed by SGD: given a finite number of SGD steps, the noise, although spread widely, dominates only part of the range, so SGD spends more time in globally better regions of the parameter space. As the noise approaches 0, we are effectively fine-tuning the solution and converging to a minimum of the noise-free objective function.
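
As a toy illustration of this annealing argument (in the spirit of the one-dimensional non-convex objective of the paper's Figure 2), the sketch below runs gradient descent with additive gradient noise whose scale is annealed linearly to 0. The objective, the schedule, and all parameters are invented here purely for illustration; depending on the random seed the noisy run may settle in either basin, but with enough early noise it has a chance to escape the poor basin, which the noise-free run never does.

    import numpy as np

    rng = np.random.default_rng(1)

    # Double-well objective (for illustration only): the minimum near x = -1.04
    # is deeper than the one near x = +0.96.
    def f(x):
        return (x**2 - 1.0)**2 + 0.3 * x

    def grad_f(x):
        return 4.0 * x * (x**2 - 1.0) + 0.3

    def noisy_gd(x0, steps=3000, lr=0.01, sigma0=25.0):
        x = x0
        for t in range(steps):
            sigma = sigma0 * (1.0 - t / steps)   # anneal the noise level toward 0
            x -= lr * (grad_f(x) + sigma * rng.standard_normal())
        return x

    x0 = 1.0                                      # start inside the shallower basin
    x_plain = noisy_gd(x0, sigma0=0.0)            # no noise: stays in that basin
    x_noisy = noisy_gd(x0, sigma0=25.0)           # annealed noise: can escape it
    print(f"plain GD    -> x = {x_plain:+.3f}, f = {f(x_plain):+.3f}")
    print(f"annealed GD -> x = {x_noisy:+.3f}, f = {f(x_noisy):+.3f}")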

The later sections will be covered in the next post, to be continued ~ ~ ~

"Noisy Activation function" noise activation functions (I.)

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.