Deep Learning Basics Series (vi) | Selection of weight initialization

Source: Internet
Author: User
Tags: keras

A deep network needs a good weight initialization scheme to reduce the risk of exploding and vanishing gradients. Let's first explain where exploding and vanishing gradients come from. Assume the following forward propagation path:

a1 = w1·x + b1

z1 = σ(a1)

a2 = w2·z1 + b2

z2 = σ(a2)

...

an = wn·zn-1 + bn

zn = σ(an)

For simplicity, set every b to 0; this gives:

zn = σ(wn·σ(wn-1·σ( ... σ(w1·x) ... )))

If we simplify further and take a linear activation, z = σ(a) = a, then:

zn = wn · wn-1 · ... · w1 · x

Now consider the choice of the weights w. If every w is 1.5, zn grows exponentially: the deeper the network, the larger the later values become, i.e. a trend toward explosion. Conversely, if every w is 0.5, zn decays exponentially: the deeper the network, the smaller the later values become, i.e. a trend toward vanishing.
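To make the trend concrete, here is a minimal sketch (my own illustration, not from the original post) that pushes a single value through 30 layers with the identity activation, once with w = 1.5 and once with w = 0.5:

# Iterate z = w * z through 30 layers with the identity activation.
x = 1.0
for w in (1.5, 0.5):
    z = x
    for layer in range(30):
        z = w * z
    print("w =", w, "-> z after 30 layers:", z)
# w = 1.5 -> z after 30 layers: about 1.9e5   (explodes)
# w = 0.5 -> z after 30 layers: about 9.3e-10 (vanishes)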

If z = σ(a) = sigmoid(a), and a = ∑wi·xi + b where the sum runs over the n input parameters, then when there are many inputs, |a| is very likely to be greater than 1. For the sigmoid function, a large |a| lands on the flat part of the curve, so z is pushed close to 1 or 0 and the derivative is tiny, which also causes gradients to vanish.
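A small sketch of that saturation effect (my own addition, using the standard sigmoid σ(a) = 1/(1 + e^(-a)) and its derivative σ'(a) = σ(a)(1 - σ(a))): once |a| is large, the output is pinned near 0 or 1 and the derivative is nearly zero, which is what starves the gradient.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for a in (0.5, 5.0, 16.0):
    s = sigmoid(a)
    # The derivative of the sigmoid is s * (1 - s); it collapses as |a| grows.
    print(f"a={a:5.1f}  sigmoid={s:.6f}  derivative={s * (1 - s):.2e}")
# a=  0.5  sigmoid=0.622459  derivative=2.35e-01
# a=  5.0  sigmoid=0.993307  derivative=6.65e-03
# a= 16.0  sigmoid=1.000000  derivative=1.13e-07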

So if we can give w a suitable value when initializing the weights of each layer, can we reduce the chance of gradients exploding or vanishing? Let's look at how to choose.

One: uniformly distributed weights

In Keras the corresponding function is K.random_uniform_variable(). Let's visualize its data distribution; first the code:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow.keras.backend as K

# Sample 10,000 weights and 10,000 inputs uniformly from [-1, 1]
w = K.eval(K.random_uniform_variable(shape=(1, 10000), low=-1, high=1))
w = w.reshape(-1)
print("w:", w)
x = K.eval(K.random_uniform_variable(shape=(1, 10000), low=-1, high=1))
x = x.reshape(-1)
print("x:", x)

# a = sum of w_i * x_i, i.e. the pre-activation of one neuron
a = np.dot(w, x)
print("a:", a)

# Histogram of the weight distribution
n, bins, patches = plt.hist(w, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Data Range')
plt.ylabel('Probability')
plt.axis([-2, 2, 0, 1])
plt.grid(True)
plt.show()

The image is:

The plot shows that the 10,000 points drawn by the random function are constrained to the range -1 to 1 and that their probability distribution is essentially uniform.

The output is:

w: [-0.3033681   0.95340157  0.76744485 ...  0.24013376  0.5394962  -0.23630977]
x: [-0.19380212  0.86640644  0.6185038  ... -0.66250014 -0.2095201   0.23459053]
a: 16.111116

From this result, if the input has 10,000 features, then a = ∑wi·xi + b over those 10,000 terms, and the probability that |a| > 1 is very high (here a = 16.111116). With no activation function, or with ReLU, gradient explosion is possible; with the sigmoid activation, the gradient will tend to vanish.
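A rough back-of-the-envelope check (my own addition, assuming w and x are independent and uniformly distributed on [-1, 1]) shows why |a| > 1 is almost guaranteed here:

import numpy as np

n = 10000
var_uniform = (1 - (-1)) ** 2 / 12       # variance of U(-1, 1) is 1/3
var_a = n * var_uniform * var_uniform    # Var(sum of w_i * x_i) for independent zero-mean terms
print("theoretical std of a:", np.sqrt(var_a))   # about 33.3, so |a| is rarely below 1

The observed a = 16.111116 is well within one standard deviation of this estimate.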

Two: normally distributed weights

In Keras the corresponding functions are K.random_normal_variable() and K.truncated_normal(). Let's visualize their data distributions, starting with the K.random_normal_variable() code:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow.keras.backend as K

# Weights drawn from a standard normal distribution (mean 0, stddev 1)
w = K.eval(K.random_normal_variable(shape=(1, 10000), mean=0, scale=1))
w = w.reshape(-1)
print("w:", w)
x = K.eval(K.random_uniform_variable(shape=(1, 10000), low=-1, high=1))
x = x.reshape(-1)
print("x:", x)

a = np.dot(w, x)
print("a:", a)

n, bins, patches = plt.hist(w, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Data Range')
plt.ylabel('Probability')
plt.axis([-5, 5, 0, 0.6])
plt.grid(True)
plt.show()

The image is:

The result is:

w: [-1.8685548   1.501203    1.1083876  ... -0.93544585  0.08100258  0.4771947 ]
x: [ 0.40333223  0.7284522  -0.40256715 ...  0.79942155 -0.915035    0.50783443]
a: -46.02679

And now look at the code for K.truncated_normal():

import numpy as np
import matplotlib.pyplot as plt
import tensorflow.keras.backend as K

# Truncated normal: values more than two standard deviations from the mean are redrawn
w = K.eval(K.truncated_normal(shape=(1, 10000), mean=0, stddev=1))
w = w.reshape(-1)
print("w:", w)
x = K.eval(K.random_uniform_variable(shape=(1, 10000), low=-1, high=1))
x = x.reshape(-1)
print("x:", x)

a = np.dot(w, x)
print("a:", a)

n, bins, patches = plt.hist(w, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Data Range')
plt.ylabel('Probability')
plt.axis([-5, 5, 0, 0.6])
plt.grid(True)
plt.show()

The image is:

 

The result is:

w: [ 1.0354282  -0.9385183   0.57337016 ... -0.3302136  -0.10443623  0.9371711 ]
x: [-0.7896631  -0.01105547  0.778579   ...  0.7932384  -0.17074609  0.60096693]
a: -18.191553

Comparing the two plots, both show a normal distribution; the only difference is that K.truncated_normal() discards values that fall more than two standard deviations from the mean (here, greater than 2 or less than -2), keeping only the remaining data.
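A quick empirical check of that truncation (my own addition, reusing the same backend call as the listing above):

import tensorflow.keras.backend as K

# Every sampled value should lie within two standard deviations of the mean,
# because values outside that range are redrawn.
w_trunc = K.eval(K.truncated_normal(shape=(1, 10000), mean=0, stddev=1)).reshape(-1)
print("min:", w_trunc.min(), "max:", w_trunc.max())   # expected to stay within about [-2, 2]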

From these results, if the input has 10,000 features, then a = ∑wi·xi + b over those 10,000 terms. Although the distribution is symmetric and its overall mean is 0, there is still a large probability that |a| > 1, so vanishing and exploding gradients remain possible.

Three: a narrower normal distribution

Our goal is to keep |a1| roughly within 1, so that whether the activation function is sigmoid or ReLU, the output of each layer neither grows too large nor shrinks too much. Building on the normal distribution, we can make it narrower and sharper by setting wi = wi/√n, where n is the number of input parameters of the layer. With 10,000 input feature points, for example, wi = wi/√10000, so a1 = ∑wi·xi + b1 stays roughly in the range -1 to 1. Here is the code:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow.keras.backend as K

# Normal distribution scaled by 1/sqrt(n), with n = 10000 inputs
w = K.eval(K.random_normal_variable(shape=(1, 10000), mean=0, scale=1 / np.sqrt(10000)))
w = w.reshape(-1)
print("w:", w)
x = K.eval(K.random_uniform_variable(shape=(1, 10000), low=-1, high=1))
x = x.reshape(-1)
print("x:", x)

a = np.dot(w, x)
print("a:", a)

n, bins, patches = plt.hist(w, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Data Range')
plt.ylabel('Probability')
plt.axis([-0.1, 0.1, 0, 50])
plt.grid(True)
plt.show()

The image is:

The result is:

w: [ 0.00635913 -0.01406644 -0.00843588 ... -0.00573074  0.00345371 -0.01102492]
x: [ 0.3738377  -0.01633143  0.21199775 ... -0.78332734 -0.96384525 -0.3478613 ]
a: -0.4904538

The plot shows that the range of values has been compressed to roughly -0.025 to 0.025, and the peak of the density is above 40: the distribution has become narrow and sharp.
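Those two numbers can be checked analytically (my own sanity check, assuming w ~ N(0, 0.01)): the peak density of a normal distribution is 1/(stddev·√(2π)), and about 99% of its mass lies within 2.5 standard deviations of the mean.

import numpy as np

stddev = 1 / np.sqrt(10000)                                  # 0.01
print("peak density:", 1 / (stddev * np.sqrt(2 * np.pi)))    # about 39.9
print("2.5-sigma range:", (-2.5 * stddev, 2.5 * stddev))     # about (-0.025, 0.025)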

The result also shows that we have successfully compressed |a| to within 1. This is friendly to both the sigmoid and ReLU activation functions: it reduces the risk of exploding and vanishing gradients and also helps speed up training.
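In practice you rarely build the weight matrices by hand; a Keras layer accepts an initializer directly. Here is a minimal sketch (my own example, not from the original post; the layer size and activation are arbitrary) that mirrors the 1/√n scaling above with tf.keras.initializers:

import numpy as np
import tensorflow as tf

n_inputs = 10000
# Normal initializer scaled by 1/sqrt(n_inputs), as derived above.
scaled_normal = tf.keras.initializers.RandomNormal(mean=0.0, stddev=1.0 / np.sqrt(n_inputs))
layer = tf.keras.layers.Dense(64, activation="sigmoid", kernel_initializer=scaled_normal)

# Built-in initializers such as "glorot_uniform" (the Keras default) and
# "he_normal" apply similar fan-in based scaling automatically.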
