Deep Learning Basics Series (vi) | Selection of weight initialization

Source: Internet
Author: User
Tags: keras

A deep network needs a good weight initialization scheme to reduce the risk of exploding and vanishing gradients. Let's first explain where exploding and vanishing gradients come from. Assume the following forward propagation path:

a1 = w1·x + b1

z1 = σ(a1)

a2 = w2·z1 + b2

z2 = σ(a2)

...

an = wn·zn-1 + bn

zn = σ(an)

For simplicity, set every b to 0; this gives:

zn = σ(wn·σ(wn-1·σ( ... σ(w1·x) ... )))

If we simplify further and take a linear activation, z = σ(a) = a, then:

zn = wn · wn-1 · ... · w1 · x

Now consider the choice of the weights w. If every w is 1.5, zn grows exponentially: the deeper the network, the larger the later values become, i.e. a trend toward explosion. Conversely, if every w is 0.5, zn decays exponentially: the deeper the network, the smaller the later values become, i.e. a trend toward vanishing.
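To make the trend concrete, here is a minimal sketch (my own illustration, not from the original post) that pushes a single value through 30 layers with the identity activation, once with w = 1.5 and once with w = 0.5:

# Iterate z = w * z through 30 layers with the identity activation.
x = 1.0
for w in (1.5, 0.5):
    z = x
    for layer in range(30):
        z = w * z
    print("w =", w, "-> z after 30 layers:", z)
# w = 1.5 -> z after 30 layers: about 1.9e5   (explodes)
# w = 0.5 -> z after 30 layers: about 9.3e-10 (vanishes)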

If z = σ(a) = sigmoid(a), and a = ∑wi·xi + b where the sum runs over the n input parameters, then when there are many inputs, |a| is very likely to be greater than 1. For the sigmoid function, a large |a| lands on the flat part of the curve, so z is pushed close to 1 or 0 and the derivative is tiny, which also causes gradients to vanish.
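A small sketch of that saturation effect (my own addition, using the standard sigmoid σ(a) = 1/(1 + e^(-a)) and its derivative σ'(a) = σ(a)(1 - σ(a))): once |a| is large, the output is pinned near 0 or 1 and the derivative is nearly zero, which is what starves the gradient.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for a in (0.5, 5.0, 16.0):
    s = sigmoid(a)
    # The derivative of the sigmoid is s * (1 - s); it collapses as |a| grows.
    print(f"a={a:5.1f}  sigmoid={s:.6f}  derivative={s * (1 - s):.2e}")
# a=  0.5  sigmoid=0.622459  derivative=2.35e-01
# a=  5.0  sigmoid=0.993307  derivative=6.65e-03
# a= 16.0  sigmoid=1.000000  derivative=1.13e-07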

So if we can give w a suitable value when initializing the weights of each layer, can we reduce the chance of gradients exploding or vanishing? Let's look at how to choose.

One: uniformly distributed weights

In Keras the corresponding function is K.random_uniform_variable(). Let's visualize its data distribution; first the code:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow.keras.backend as K

# Sample 10,000 weights and 10,000 inputs uniformly from [-1, 1]
w = K.eval(K.random_uniform_variable(shape=(1, 10000), low=-1, high=1))
w = w.reshape(-1)
print("w:", w)
x = K.eval(K.random_uniform_variable(shape=(1, 10000), low=-1, high=1))
x = x.reshape(-1)
print("x:", x)

# a = sum of w_i * x_i, i.e. the pre-activation of one neuron
a = np.dot(w, x)
print("a:", a)

# Histogram of the weight distribution
n, bins, patches = plt.hist(w, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Data Range')
plt.ylabel('Probability')
plt.axis([-2, 2, 0, 1])
plt.grid(True)
plt.show()

The image is:

The plot shows that the 10,000 points drawn by the random function are constrained to the range -1 to 1 and that their probability distribution is essentially uniform.

The output is:

w: [-0.3033681   0.95340157  0.76744485 ...  0.24013376  0.5394962  -0.23630977]
x: [-0.19380212  0.86640644  0.6185038  ... -0.66250014 -0.2095201   0.23459053]
a: 16.111116

From this result, if the input has 10,000 features, then a = ∑wi·xi + b over those 10,000 terms, and the probability that |a| > 1 is very high (here a = 16.111116). With no activation function, or with ReLU, gradient explosion is possible; with the sigmoid activation, the gradient will tend to vanish.
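A rough back-of-the-envelope check (my own addition, assuming w and x are independent and uniformly distributed on [-1, 1]) shows why |a| > 1 is almost guaranteed here:

import numpy as np

n = 10000
var_uniform = (1 - (-1)) ** 2 / 12       # variance of U(-1, 1) is 1/3
var_a = n * var_uniform * var_uniform    # Var(sum of w_i * x_i) for independent zero-mean terms
print("theoretical std of a:", np.sqrt(var_a))   # about 33.3, so |a| is rarely below 1

The observed a = 16.111116 is well within one standard deviation of this estimate.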

Two: normally distributed weights

In Keras the corresponding functions are K.random_normal_variable() and K.truncated_normal(). Let's visualize their data distributions, starting with the K.random_normal_variable() code:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow.keras.backend as K

# Weights drawn from a standard normal distribution (mean 0, stddev 1)
w = K.eval(K.random_normal_variable(shape=(1, 10000), mean=0, scale=1))
w = w.reshape(-1)
print("w:", w)
x = K.eval(K.random_uniform_variable(shape=(1, 10000), low=-1, high=1))
x = x.reshape(-1)
print("x:", x)

a = np.dot(w, x)
print("a:", a)

n, bins, patches = plt.hist(w, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Data Range')
plt.ylabel('Probability')
plt.axis([-5, 5, 0, 0.6])
plt.grid(True)
plt.show()

The image is:

The result is:

w: [-1.8685548   1.501203    1.1083876  ... -0.93544585  0.08100258  0.4771947 ]
x: [ 0.40333223  0.7284522  -0.40256715 ...  0.79942155 -0.915035    0.50783443]
a: -46.02679

And now look at the code for K.truncated_normal():

import numpy as np
import matplotlib.pyplot as plt
import tensorflow.keras.backend as K

# Truncated normal: values more than two standard deviations from the mean are redrawn
w = K.eval(K.truncated_normal(shape=(1, 10000), mean=0, stddev=1))
w = w.reshape(-1)
print("w:", w)
x = K.eval(K.random_uniform_variable(shape=(1, 10000), low=-1, high=1))
x = x.reshape(-1)
print("x:", x)

a = np.dot(w, x)
print("a:", a)

n, bins, patches = plt.hist(w, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Data Range')
plt.ylabel('Probability')
plt.axis([-5, 5, 0, 0.6])
plt.grid(True)
plt.show()

The image is:

 

The result is:

w: [ 1.0354282  -0.9385183   0.57337016 ... -0.3302136  -0.10443623  0.9371711 ]
x: [-0.7896631  -0.01105547  0.778579   ...  0.7932384  -0.17074609  0.60096693]
a: -18.191553

Comparing the two plots, both show a normal distribution; the only difference is that K.truncated_normal() discards values that fall more than two standard deviations from the mean (here, greater than 2 or less than -2), keeping only the remaining data.
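A quick empirical check of that truncation (my own addition, reusing the same backend call as the listing above):

import tensorflow.keras.backend as K

# Every sampled value should lie within two standard deviations of the mean,
# because values outside that range are redrawn.
w_trunc = K.eval(K.truncated_normal(shape=(1, 10000), mean=0, stddev=1)).reshape(-1)
print("min:", w_trunc.min(), "max:", w_trunc.max())   # expected to stay within about [-2, 2]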

From these results, if the input has 10,000 features, then a = ∑wi·xi + b over those 10,000 terms. Although the distribution is symmetric and its overall mean is 0, there is still a large probability that |a| > 1, so vanishing and exploding gradients remain possible.

Three: a narrower normal distribution

Our goal is to keep |a1| roughly within 1, so that whether the activation function is sigmoid or ReLU, the output of each layer neither grows too large nor shrinks too much. Building on the normal distribution, we can make it narrower and sharper by setting wi = wi/√n, where n is the number of input parameters of the layer. With 10,000 input feature points, for example, wi = wi/√10000, so a1 = ∑wi·xi + b1 stays roughly in the range -1 to 1. Here is the code:

import numpy as np
import matplotlib.pyplot as plt
import tensorflow.keras.backend as K

# Normal distribution scaled by 1/sqrt(n), with n = 10000 inputs
w = K.eval(K.random_normal_variable(shape=(1, 10000), mean=0, scale=1 / np.sqrt(10000)))
w = w.reshape(-1)
print("w:", w)
x = K.eval(K.random_uniform_variable(shape=(1, 10000), low=-1, high=1))
x = x.reshape(-1)
print("x:", x)

a = np.dot(w, x)
print("a:", a)

n, bins, patches = plt.hist(w, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Data Range')
plt.ylabel('Probability')
plt.axis([-0.1, 0.1, 0, 50])
plt.grid(True)
plt.show()

The image is:

The result is:

w: [ 0.00635913 -0.01406644 -0.00843588 ... -0.00573074  0.00345371 -0.01102492]
x: [ 0.3738377  -0.01633143  0.21199775 ... -0.78332734 -0.96384525 -0.3478613 ]
a: -0.4904538

The plot shows that the range of values has been compressed to roughly -0.025 to 0.025, and the peak of the density is above 40: the distribution has become narrow and sharp.
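Those two numbers can be checked analytically (my own sanity check, assuming w ~ N(0, 0.01)): the peak density of a normal distribution is 1/(stddev·√(2π)), and about 99% of its mass lies within 2.5 standard deviations of the mean.

import numpy as np

stddev = 1 / np.sqrt(10000)                                  # 0.01
print("peak density:", 1 / (stddev * np.sqrt(2 * np.pi)))    # about 39.9
print("2.5-sigma range:", (-2.5 * stddev, 2.5 * stddev))     # about (-0.025, 0.025)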

The result also shows that we have successfully compressed |a| to within 1. This is friendly to both the sigmoid and ReLU activation functions: it reduces the risk of exploding and vanishing gradients and also helps speed up training.
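In practice you rarely build the weight matrices by hand; a Keras layer accepts an initializer directly. Here is a minimal sketch (my own example, not from the original post; the layer size and activation are arbitrary) that mirrors the 1/√n scaling above with tf.keras.initializers:

import numpy as np
import tensorflow as tf

n_inputs = 10000
# Normal initializer scaled by 1/sqrt(n_inputs), as derived above.
scaled_normal = tf.keras.initializers.RandomNormal(mean=0.0, stddev=1.0 / np.sqrt(n_inputs))
layer = tf.keras.layers.Dense(64, activation="sigmoid", kernel_initializer=scaled_normal)

# Built-in initializers such as "glorot_uniform" (the Keras default) and
# "he_normal" apply similar fan-in based scaling automatically.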
