Model Training Tips

Neural network model design and training process

Figure 1-1 Neural network model design process

After designing and training a neural network, we first need to check whether the model performs well on the training set; the purpose of this step is to determine whether the model is under-fitted. If it fits the training set well, we then validate it on the test set. If the training results are poor, we redesign the model; if the training results are good but the test results are poor, we increase regularization or add more training data.

Under-fitting processing strategy

When the model does not perform well on the training set, after ruling out problems with the data set and the training process itself, the following methods can be used.

Replacing the activation function

Sigmoid activation function

The sigmoid function has the form shown in (1), σ(x) = 1 / (1 + e^(−x)), and its shape is shown in Figure 1-2.

Figure 1-2 sigmoid function

However, using the sigmoid function as an activation function leads to the vanishing gradient phenomenon: when the number of hidden layers of the neural network exceeds about 3, the updates to the parameters in the lower layers are almost 0.

ReLU (Rectified Linear Unit) activation function

The ReLU function has the form shown in (2), f(x) = max(0, x), and its shape is shown in Figure 1-3.

Figure 1-3 ReLU function

The reasons for using ReLU as an activation function are: 1) the computation is simpler than that of the sigmoid function; 2) ReLU is equivalent to the superimposed effect of infinitely many sigmoid functions with different biases; 3) ReLU alleviates the vanishing gradient problem. Because of the structure of the ReLU function, when a neuron's output is 0 (Figure 1-4), that neuron plays no role in the network and can be removed from it, leaving a thinner, purely linear network (Figure 1-5).

Figure 1-4 Neurons in the neural network with output of 0

Figure 1-5 "Elongated linear" neural network
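As a quick illustration of the definition above, here is a minimal NumPy sketch of ReLU (not from the original text; the function name is ours):

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x): negative inputs are clipped to 0 (the neuron
    # is effectively removed), positive inputs pass through with slope 1.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # [0.  0.  0.  1.5 3. ]
```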

Leaky ReLU activation function

When the input to ReLU is negative, the corresponding neuron plays no role at all. The improvement of leaky ReLU is that for negative input the output is no longer 0 but a small value proportional to the input. The leaky ReLU function has the form shown in (3), f(x) = x for x > 0 and f(x) = αx for x ≤ 0, where α usually has to be assigned by hand; the shape of the function is shown in Figure 1-6.

Figure 1-6 Leaky ReLU activation function
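A corresponding NumPy sketch of leaky ReLU (illustrative only; the default α = 0.01 is a common choice, not a value given in the text):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # For x > 0 this behaves like ReLU; for x <= 0 it outputs alpha * x
    # instead of 0, so the neuron still passes a small signal and gradient.
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))  # [-0.02  -0.005  0.     1.5  ]
```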

Parametric ReLU activation function

Because leaky ReLU requires α to be assigned by hand, some prior knowledge is needed to choose a good value. In parametric ReLU, therefore, α is a trainable parameter, and each neuron can even have a different α.

Figure 1-7 Parametric ReLU activation function

α is trained in the same way as the ordinary parameters, using a gradient update with momentum, Δα ← μΔα + ε ∂L/∂α, where μ is the momentum and ε is the learning rate.
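A minimal NumPy sketch of a per-neuron trainable α updated with a momentum-style step, under the assumptions above (all names and the toy upstream gradient are illustrative):

```python
import numpy as np

def prelu(x, alpha):
    # Parametric ReLU: identity for x > 0, alpha * x otherwise.
    return np.where(x > 0, x, alpha * x)

def dprelu_dalpha(x):
    # Derivative of the output with respect to alpha: x where x <= 0, else 0.
    return np.where(x > 0, 0.0, x)

alpha = np.full(3, 0.25)          # one trainable alpha per neuron
velocity = np.zeros_like(alpha)   # momentum buffer for alpha
mu, lr = 0.9, 0.01                # momentum coefficient and learning rate

x = np.array([-1.0, 0.5, -2.0])
upstream = np.ones_like(x)        # pretend dLoss/doutput = 1 (toy example)
grad_alpha = upstream * dprelu_dalpha(x)

velocity = mu * velocity - lr * grad_alpha   # momentum update for alpha
alpha += velocity
print(prelu(x, alpha), alpha)
```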

Maxout (learnable activation function)

Maxout is a learnable activation function that can, among other things, learn the shape of the ReLU function; ReLU is therefore a special case of maxout. As shown in the maxout structure in Figure 1-8, after the inputs are multiplied by the weights, the results are not fed into an activation function; instead they are collected into groups (the number of elements per group is set in advance), and the maximum value of each group is taken as the output.

Figure 1-8 Maxout Structure

In the example of Figure 1-9, when one of the inputs is fixed to 1, the resulting activation function shown in Figure 1-10 can be obtained. Depending on how many elements are grouped together, maxout can learn any piecewise-linear convex activation function.

Figure 1-9 Maxout Example

Figure 1-10 Maxout Training activation function
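A minimal NumPy sketch of the maxout operation for a single layer (group size and values are illustrative):

```python
import numpy as np

def maxout(z, group_size):
    # z holds the pre-activation values w.x + b of one layer; its length
    # must be a multiple of group_size. Elements are grouped and the
    # maximum of each group becomes that group's output.
    return z.reshape(-1, group_size).max(axis=1)

z = np.array([1.0, -0.5, 3.0, 2.0, 0.1, -4.0])
print(maxout(z, group_size=2))  # [1.  3.  0.1]
```

With a group of two elements where one pre-activation is forced to 0, max(w·x + b, 0) reproduces ReLU exactly, which is the special case mentioned above.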

Adaptive learning rate

Adagrad

In Adagrad, the learning rate of each parameter is scaled by the square root of the sum of all of its previous squared partial derivatives. Taking a single parameter w as an example, the update is

w^(t+1) = w^t − (η / √(Σ_{i=0}^{t} (g^i)²)) · g^t

where g^i denotes the partial derivative ∂L/∂w at step i. Equivalently, the squared partial derivatives of all previous steps are accumulated, their mean is taken and then the square root, giving σ^t; dividing the time-decaying learning rate η^t = η/√(t+1) by σ^t cancels the averaging terms and yields the simplified form above.
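A minimal sketch of the Adagrad update for one parameter, using a toy quadratic loss (the loss, learning rate, and ε term are illustrative choices, not values from the text):

```python
import numpy as np

w, lr, eps = 0.0, 0.1, 1e-8
sum_sq_grad = 0.0                      # accumulated squared gradients

for t in range(100):
    g = 2 * (w - 3.0)                  # gradient of the toy loss (w - 3)^2
    sum_sq_grad += g ** 2              # accumulate the squared history
    w -= lr / (np.sqrt(sum_sq_grad) + eps) * g   # step shrinks over time

print(w)  # has moved from 0 toward the minimum at w = 3
```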

RMSprop

RMSprop's calculation is shown in (7):

σ^t = √(α·(σ^(t−1))² + (1 − α)·(g^t)²),   w^(t+1) = w^t − (η/σ^t)·g^t    (7)

As can be seen from the formula, when the parameter is updated, not only the current gradient is considered but also the history of previous gradients. α is a constant that can be set by hand; when α is small, more weight is placed on the current gradient.
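A minimal sketch of the RMSprop update in (7) for one parameter, again with a toy loss and hand-picked constants:

```python
import numpy as np

w, lr, alpha, eps = 0.0, 0.01, 0.9, 1e-8
sigma_sq = 0.0                         # running average of squared gradients

for t in range(100):
    g = 2 * (w - 3.0)                  # gradient of the toy loss (w - 3)^2
    sigma_sq = alpha * sigma_sq + (1 - alpha) * g ** 2   # weight history vs. current
    w -= lr / (np.sqrt(sigma_sq) + eps) * g

print(w)  # has moved from 0 toward the minimum at w = 3
```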

Momentum

Momentum's idea comes from a real-life scene: when we roll a ball down a bumpy surface, the momentum it gains from the loss of gravitational potential energy means it does not necessarily stop in the first depression; it may roll over the next bump and reach the global lowest point.

Figure 1-11 Momentum: real-life analogy

Therefore, unlike plain gradient descent, where the direction of movement depends only on the gradient, momentum also takes the previous direction of movement into account. The specific formula is (8):

v^t = λ·v^(t−1) − η·g^(t−1),   w^t = w^(t−1) + v^t    (8)

where λ controls how much of the previous movement is retained.
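A minimal sketch of the momentum update in (8) for one parameter, with a toy loss (values are illustrative):

```python
w, v = 0.0, 0.0
lr, lam = 0.01, 0.9        # learning rate and momentum coefficient

for t in range(200):
    g = 2 * (w - 3.0)      # gradient of the toy loss (w - 3)^2
    v = lam * v - lr * g   # previous movement plus the current gradient step
    w += v

print(w)  # approaches the minimum at w = 3
```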

Over-fitting processing strategy

Early stopping

The idea of early stopping is that while the model's error on the training set keeps decreasing, its error on the test set may start to increase, as shown in Figure 1-12. Therefore, a trade-off between training error and test error is required: training is stopped around the point where the test error starts to rise.

Figure 1-12 Training error and test error
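A minimal sketch of early stopping with a patience counter. The validation-loss sequence below is made-up data for illustration; in practice it comes from evaluating the model on a held-out set after each epoch:

```python
# Simulated validation losses, one per epoch (illustrative numbers only).
val_losses = [1.00, 0.80, 0.65, 0.60, 0.58, 0.59, 0.61, 0.64, 0.70]

best, patience, bad_epochs, stop_epoch = float("inf"), 2, 0, None
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best:
        best, bad_epochs = val_loss, 0   # new best: reset the counter
    else:
        bad_epochs += 1                  # validation error is rising
        if bad_epochs >= patience:
            stop_epoch = epoch
            break                        # stop before the error grows further

print(best, stop_epoch)  # 0.58 6: keep the model from the best epoch
```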

regularization (regularization)

The purpose of adding regularization is to increase the smoothness of the model; it is usually done by adding a term that depends on the parameters to the existing loss function.

L2 regularization

Assume that the loss function is defined as L(θ). L2 regularization adds a term, giving L′(θ) the form shown in (9):

L′(θ) = L(θ) + λ·(1/2)·Σ_i w_i²    (9)

After adding the L2 term, the parameter update takes the new form shown in (10):

w^(t+1) = (1 − ηλ)·w^t − η·∂L/∂w    (10)

The gradient term in (10) is the same as without the L2 regularization term; the difference is that, after adding L2, the parameter w is always multiplied by a factor (1 − ηλ) slightly less than 1 before the update, so the magnitude of w is continually reduced. This calculation process is called weight decay. The effect of L2 is that the parameters are pulled ever closer to 0; since we usually initialize the parameters close to 0 and the gradient updates move them away from 0, the effect of L2 is somewhat similar to that of early stopping.
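A tiny numeric sketch of one update step with and without the weight-decay factor from (10) (the learning rate, λ, weight, and gradient are made-up values):

```python
lr, lam = 0.1, 0.01       # learning rate and L2 strength (illustrative)
w, grad = 2.0, 0.5        # current weight and dL/dw (illustrative)

w_plain = w - lr * grad                    # ordinary gradient step
w_decay = (1 - lr * lam) * w - lr * grad   # same step, weight first multiplied by (1 - lr*lam)

print(w_plain, w_decay)   # ~1.95 vs ~1.948: w is shrunk a little on every update
```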

L1 regularization

L1 regularization is very similar to L2 regularization, except that L2 uses the sum of squares while L1 uses the sum of absolute values, as in (11):

L′(θ) = L(θ) + λ·Σ_i |w_i|    (11)

When the L1 term is added, the parameter update takes the form

w^(t+1) = w^t − η·∂L/∂w − ηλ·sgn(w^t)

Therefore, when w > 0, sgn(w) = 1 and a positive quantity ηλ is subtracted, so the value of w decreases; conversely, when w < 0, a positive quantity is added, which increases the value of w; in both cases w is pushed toward 0. Since L2 multiplies w by a factor less than 1 at each step, a large w decreases quickly, while L1 only subtracts a fixed value each time, so a large w descends more slowly; for a small w the situation is reversed. As a result, among the finally trained weights, the parameters trained with L2 are generally small, while with L1 some parameters can be driven very close to 0 while others remain relatively large, giving a sparser solution.
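A small NumPy sketch comparing the two update rules on a large and a small weight (the learning rate, λ, and the zero data gradient are illustrative choices):

```python
import numpy as np

lr, lam = 0.1, 0.1

def l2_step(w, grad):
    return (1 - lr * lam) * w - lr * grad         # multiplicative shrink toward 0

def l1_step(w, grad):
    return w - lr * grad - lr * lam * np.sign(w)  # fixed-size pull toward 0

# With the data gradient set to 0, watch how a large and a small weight shrink.
for w0 in (10.0, 0.05):
    w_l2, w_l1 = w0, w0
    for _ in range(5):
        w_l2, w_l1 = l2_step(w_l2, 0.0), l1_step(w_l1, 0.0)
    print(w0, round(w_l2, 4), round(w_l1, 4))
# Large weight: L2's proportional shrink removes far more than L1's fixed 0.01 per step.
# Small weight: L2 barely moves it, while L1 pulls it essentially all the way to 0.
```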

Dropout

The dropout approach is: for a given neural network model, each time the parameters are updated, every neuron in the original network is sampled to decide whether it is dropped; each neuron has a p% chance of being thrown away.

Figure 1-13 Dropout sampling process

Figure 1-14 The NN structure after dropout sampling

During training the model is sampled with dropout, but at test time no neurons are dropped; instead each weight is multiplied by (1 − p)%. As shown in Figure 1-15, assuming a dropout rate of 50%, half of the neurons are discarded during training. At test time, in order to keep the test output as close as possible to the training output, each weight must be multiplied by (1 − p)% to maintain the balance of the output values (right image of Figure 1-15).

Figure 1-15 Dropout Test weight handling
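A minimal NumPy sketch of the two regimes: random dropping during training, and (1 − p) scaling at test time (function names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(a, p):
    # Each activation is kept with probability (1 - p) and zeroed otherwise.
    mask = rng.random(a.shape) >= p
    return a * mask

def dropout_test(a, p):
    # No sampling at test time; scaling by (1 - p) keeps the expected
    # output comparable to the training-time output.
    return a * (1 - p)

a = np.array([1.0, 2.0, 3.0, 4.0])
print(dropout_train(a, p=0.5))  # some entries zeroed at random
print(dropout_test(a, p=0.5))   # [0.5 1.  1.5 2. ]
```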

Explanation of the dropout principle

Dropout can be seen as a form of ensemble learning. The practice of ensemble learning is roughly to sample several data sets from the training set and to train a different model on each of them (the structures of the models can differ). The test set is then predicted with the multiple trained models, and the final result is their average (Figure 1-16).

Figure 1-16 Ensemble learning approach

Suppose the designed neural network has M neurons, and each neuron may either be dropped or kept. Each neuron therefore has 2 choices, so M neurons give 2^M combinations, corresponding to 2^M possible model structures. Training one model with dropout is therefore equivalent to training many models at once. Any given weight in the model is shared among the different dropout sub-networks.

Figure 1-17 Dropout Training process

However, after training we need to make predictions, and it is not feasible to store so many models and run each of them separately. To solve this problem, all the weights of the full (no-dropout) network are multiplied by (1 − p)%.

Figure 1-18 Dropout weight processing

Dropout behaves better with a linear activation function. The reason is that when the activation function is linear, the output of the full network with all weights multiplied by (1 − p)% is closer to the averaged output of the ensemble of dropout networks.
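A small check of that statement for a single linear unit y = w·x with p = 50%: averaging the outputs of all 2^M dropout masks gives exactly the same result as the (1 − p) weight-scaling shortcut (the weights and input below are arbitrary illustrative numbers):

```python
import numpy as np
from itertools import product

w = np.array([0.5, -1.0, 2.0])   # weights of one linear unit (illustrative)
x = np.array([1.0, 3.0, -2.0])   # its input (illustrative)
p = 0.5                          # dropout rate

masks = list(product([0, 1], repeat=len(w)))                # all 2^3 sub-networks
ensemble_avg = np.mean([np.dot(w * np.array(m), x) for m in masks])
scaled_output = np.dot(w * (1 - p), x)

print(ensemble_avg, scaled_output)  # identical for a linear activation
```

With a non-linear activation the two quantities no longer match exactly, which is why the weight-scaling shortcut is only an approximation to the ensemble in that case.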

Sigmoid gradient vanishing analysis

As noted above, using the sigmoid function as an activation function causes the vanishing gradient phenomenon: when the number of hidden layers exceeds about 3, the updates to the lower-layer parameters are almost 0. The reason is that the derivative of the sigmoid is s′(x) = s(x)(1 − s(x)); at x = 0, s(x) = 0.5, so the maximum value of s′(x) is 0.25. When computing the gradients of the lower-layer parameters, we must multiply by the derivatives of all the layers above, i.e. multiply together many numbers no larger than 0.25; the more factors there are, the smaller the product becomes, resulting in the vanishing gradient phenomenon. Since the slope of the ReLU function is 1 (for positive inputs), multiplying these derivatives together does not cause this problem.
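A quick numeric illustration of that argument (the depths are arbitrary examples):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

best = sigmoid_prime(0.0)   # 0.25, the largest possible sigmoid derivative
for depth in (3, 5, 10, 20):
    # Even in the best case, the product of one derivative per layer
    # shrinks geometrically with depth; ReLU's slope of 1 would leave it at 1.
    print(depth, best ** depth)   # 0.015625, ~9.8e-04, ~9.5e-07, ~9.1e-13
```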

