Initialization of deep networks


Gustav Larsson

As we all know, the solution found by a non-convex optimization algorithm (like stochastic gradient descent) depends on the initial values of the parameters. This post is about choosing initialization parameters for deep networks and how this choice affects convergence. We will also discuss the related topic of vanishing gradients.

First, let's go back to the time of sigmoidal activation functions and initialization of parameters using IID Gaussian or uniform distributions with fairly arbitrarily set variances. Building deep networks was difficult because of exploding or vanishing activations and gradients. Let's take activations first: if all your parameters are too small, the variance of your activations will drop in each layer. This is a problem if your activation function is sigmoidal, since it is approximately linear close to 0. That is, you gradually lose your non-linearity, which means there is no benefit to having multiple layers. If, on the other hand, your activations become larger and larger, then they will saturate and become meaningless, with gradients approaching 0.
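To make this concrete, here is a small numpy sketch (my own toy example; the layer width and the two weight scales are arbitrary choices, not taken from any of the papers discussed below) that pushes a random input through a stack of tanh layers and prints the activation variance at each depth:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256                                   # width of every layer (illustrative)
x = rng.standard_normal((1000, n))        # zero-mean, unit-variance input

for std in (0.01, 0.2):                   # "too small" and "too large" init scales
    h = x
    variances = []
    for layer in range(10):
        W = rng.normal(0.0, std, size=(n, n))
        h = np.tanh(h @ W)                # sigmoidal non-linearity
        variances.append(h.var())
    print(f"std={std}: variances per layer {np.round(variances, 4)}")

# With std=0.01 the activation variance collapses towards 0 (the network stays
# in tanh's linear regime, so depth buys no extra non-linearity); with std=0.2
# the pre-activations blow up, tanh saturates near ±1, and its gradients
# approach 0.
```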

Let us consider one layer and forget about the bias. Note that the following analysis and conclusion are taken from Glorot and Bengio [1]. Consider a weight matrix \(W \in \mathbb{R}^{m \times n}\), where each element is drawn from an IID Gaussian with variance \(\mathrm{Var}(W)\). Note that we are being a bit abusive with notation, letting \(W\) denote both a matrix and a univariate random variable. We also assume there is no correlation between our input and our weights, and that both are zero-mean. If we consider one filter (row) in \(W\), say \(\mathbf{w}\) (a random vector), then the variance of the output signal over the input signal is:

\[
\frac{\mathrm{Var}(\mathbf{w}^\top \mathbf{x})}{\mathrm{Var}(x)}
= \frac{\sum_{i=1}^{n} \mathrm{Var}(w_i x_i)}{\mathrm{Var}(x)}
= \frac{n \, \mathrm{Var}(W) \, \mathrm{Var}(x)}{\mathrm{Var}(x)}
= n \, \mathrm{Var}(W)
\]

As we build a deep network, we want the variance of the signal going forward in the network to remain the same, so it would be advantageous if \(n \, \mathrm{Var}(W) = 1\). The same argument can be made for the gradients, the signal going backward in the network, and the conclusion is that we would also like \(m \, \mathrm{Var}(W) = 1\). Unless \(n = m\), it is impossible to satisfy both of these conditions. In practice, it works well if both are approximately satisfied. One thing that has never been clear to me is why it is only necessary to satisfy these conditions when picking the initialization values of \(W\). It would seem that we have no guarantee that the conditions remain true as the network is trained.
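As a quick sanity check, here is a minimal sketch of the usual compromise between the two conditions, \(\mathrm{Var}(W) = 2/(n+m)\); the layer sizes and the Gaussian (rather than uniform) draw are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 512, 256                              # fan-in and fan-out (illustrative)

# Compromise between n*Var(W) = 1 and m*Var(W) = 1: Var(W) = 2 / (n + m)
W = rng.normal(0.0, np.sqrt(2.0 / (n + m)), size=(m, n))

x = rng.standard_normal((10000, n))          # zero-mean, unit-variance input
y = x @ W.T                                  # forward signal through the layer

print("input variance :", x.var())           # ~1.0
print("output variance:", y.var())           # ~n*Var(W) = 2n/(n+m), ~1.33 here
```

Since \(n \neq m\) in this example, neither condition holds exactly, but the variance stays on the same order instead of shrinking or growing geometrically with depth.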

Nevertheless, this Xavier initialization (after Glorot's first name) is a neat trick that works well in practice. However, along came rectified linear units (ReLU), a non-linearity that is scale-invariant around 0 and does not saturate at large input values. This seemingly solved both of the problems the sigmoid function had; or were they just alleviated? I am unsure of how widely used Xavier initialization is, but if it isn't, perhaps that is because ReLU seemingly eliminated this problem.

However, take the most competitive network as of recently, VGG [2]. They do not use this kind of initialization, although they report that it was tricky to get their networks to converge. They say that they first trained their most shallow architecture and then used it to help initialize the second one, and so forth. They presented 6 networks, so it seems like an awfully complicated training process to get to the deepest one.

A recent paper by He et al. [3] presents a pretty straightforward generalization of ReLU and leaky ReLU. What is more interesting is their emphasis on the benefits of Xavier initialization even for ReLU. They re-did the derivations for ReLUs and discovered that the conditions were the same up to a factor 2. The difficulty Simonyan and Zisserman had training VGG is apparently avoidable, simply by using Xavier initialization (or better yet the ReLU-adjusted version). Using this technique, He et al. reportedly trained a whopping 30-layer deep network to convergence in one go.
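Here is a sketch of that ReLU-adjusted condition (again my own toy example): because a ReLU zeroes out roughly half of a zero-mean signal, the forward condition picks up a factor 2, i.e. \(\mathrm{Var}(W) = 2/n\) instead of \(1/n\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                                       # layer width (illustrative)
h = rng.standard_normal((10000, n))

for layer in range(20):
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))   # ReLU-adjusted: Var(W) = 2/n
    h = np.maximum(h @ W.T, 0.0)                          # ReLU

# The mean-square activation stays on the order of 1 instead of collapsing
# or exploding, even after 20 layers.
print("mean square after 20 ReLU layers:", (h ** 2).mean())
```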

Another recent paper tackling the signal scaling problem is by Ioffe and Szegedy [4]. They call the scale drift internal covariate shift and claim that it forces learning rates to be unnecessarily small. They suggest that if all layers have the same scale and remain so throughout training, a much higher learning rate becomes practically viable. You cannot just standardize the signals, since you would lose expressive power (the bias disappears, and in the case of sigmoids we would be constrained to the linear regime). They solve this by re-introducing two parameters per layer, a scale and a bias, applied again after the standardization. The training reportedly becomes about 6 times faster, and they present state-of-the-art results on ImageNet. However, I'm not certain this is the solution that will stick.
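Here is a minimal sketch of that idea, standardization followed by a learnable scale and bias; the shapes and epsilon are illustrative, and a real implementation would also track running statistics for use at test time:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)                  # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # learnable scale and bias, re-introduced

rng = np.random.default_rng(0)
x = rng.normal(3.0, 5.0, size=(64, 128))   # badly scaled activations
gamma = np.ones(128)                       # starts as plain standardization
beta = np.zeros(128)

y = batch_norm(x, gamma, beta)
print(y.mean(), y.var())                   # ~0 and ~1 until gamma and beta adapt
```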

I reckon we'll see a lot more work on this frontier in the next few years. Especially since it also relates to the, right now wildly popular, recurrent neural network (RNN), which connects output signals back as inputs. The way to train such a network is to unroll the time axis, treating the result as an extremely deep feedforward network. This greatly exacerbates the vanishing gradient problem. A popular solution, called Long Short-Term Memory (LSTM), is to introduce memory cells, which are a type of teleport that allows a signal to jump ahead many time steps. This means that the gradient is retained for all those time steps and can be propagated back to a much earlier time without vanishing.
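As a toy illustration of why unrolling hurts (my own example, with random recurrent weights, and ignoring the non-linearity's derivative, which only makes things worse), repeatedly back-propagating a gradient through the same weight matrix shrinks it geometrically:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64                                                # hidden state size (illustrative)
W = rng.normal(0.0, 0.5 / np.sqrt(n), size=(n, n))    # recurrent weight matrix
grad = rng.standard_normal(n)                         # gradient at the last time step

for t in range(1, 51):
    grad = W.T @ grad                                 # one step of backprop through time
    if t % 10 == 0:
        print(f"{t:2d} steps back: gradient norm = {np.linalg.norm(grad):.2e}")

# The norm shrinks geometrically with the number of unrolled steps; an LSTM's
# memory cell lets the error signal skip across time steps instead of being
# squeezed through this product at every step.
```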

This is far from solved, and until then I think I'll be sticking to Xavier initialization. If you are using Caffe, the one take-away of this post is to use the following on all your layers:

weight_filler {
  type: "xavier"
}

References
    1. X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.

    2. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv:1409.1556 [cs], 2014.

    3. K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," arXiv:1502.01852 [cs], Feb. 2015.

    4. S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167 [cs], Feb. 2015.

Related Posts
    • Creating an LMDB database in Python
    • Local Torch Installation
    • Python dictionary to HDF5

Initialization of deep networks

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.