Model Training Tips
Neural network model design and training process
Figure 1-1 Neural network model design process
After designing and training a neural network, we first check how well the model performs on the training set. This step determines whether the model is under-fitted. If the model does not fit the training set well, the model or the training procedure has to be revised. Once it fits the training set well, it is validated on the test set. If the test results are poor, the model is over-fitted, and the remedies are to redesign the model, add regularization, or increase the amount of training data.
Strategies for handling under-fitting
When the model does not perform well on the training set, and problems in the data set and the training procedure have been ruled out, the following methods can be used.
Replace activation function
Sigmoid activation function
The form of the sigmoid function is given in (1), and its shape is shown in Figure 1-2.
$$\sigma(x) = \frac{1}{1 + e^{-x}} \quad (1)$$
Figure 1-2 Sigmoid function
However, using the sigmoid function as the activation function leads to the vanishing gradient problem: when the network has more than about 3 hidden layers, the updates to the parameters of the layers closest to the input are almost 0.
ReLU (Rectified Linear Unit) activation function
The form of the ReLU function is given in (2), and its shape is shown in Figure 1-3.
$$f(x) = \max(0, x) \quad (2)$$
Figure 1-3 ReLU function
ReLU is used as an activation function for three reasons: 1) it is cheaper to compute than the sigmoid function; 2) it is equivalent to the superposition of infinitely many sigmoid functions with different biases; 3) it alleviates the vanishing gradient problem. Because of the shape of the ReLU function, a neuron whose output is 0 (Figure 1-4) plays no role in the network for that input, so such neurons can be removed from the network, leaving a thinner linear network (Figure 1-5).
Figure 1-4 Neurons in the neural network with an output of 0
Figure 1-5 "Thinner linear" neural network
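As a minimal sketch of the point about inactive neurons, the snippet below applies ReLU to a few hidden units and checks that dropping the units whose output is 0 leaves the output unchanged. The pre-activations and output weights are made-up values chosen only for illustration.

```python
def relu(x):
    """ReLU: pass positive inputs through, clamp everything else to 0."""
    return max(0.0, x)

# Hypothetical pre-activations of four hidden neurons for one input.
pre_activations = [1.5, -0.3, 0.0, 2.0]
# Hypothetical weights from the hidden neurons to a single output unit.
weights_out = [0.5, 4.0, -2.0, 1.0]

hidden = [relu(z) for z in pre_activations]

# Output computed with every hidden neuron.
full_output = sum(w * h for w, h in zip(weights_out, hidden))

# Output computed with only the neurons whose ReLU output is non-zero,
# i.e. the "thinner" network for this particular input.
active = [(w, h) for w, h in zip(weights_out, hidden) if h > 0]
thin_output = sum(w * h for w, h in active)

print(hidden, full_output, thin_output)
```

Which neurons are inactive depends on the input, so the thinner network is different for different inputs.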
Leaky ReLU activation function
With ReLU, when the input is less than 0 the output is 0, so the corresponding neuron plays no role. Leaky ReLU improves on this: for inputs less than 0 the output is no longer 0 but a small value. The form of the leaky ReLU function is given in (3); the slope $\alpha$ of the negative part usually has to be set by hand. The shape of the function is shown in Figure 1-6.
$$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & \text{otherwise} \end{cases} \quad (3)$$
Figure 1-6 Leaky ReLU activation function
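A short sketch of leaky ReLU and its slope follows. The default negative-side slope of 0.01 is an assumption made for the example (a commonly used value), not a value taken from the text.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for x > 0, small slope alpha otherwise.

    alpha=0.01 is an assumed default, chosen by hand.
    """
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Slope of leaky ReLU: 1 for x > 0, alpha otherwise (never exactly 0)."""
    return 1.0 if x > 0 else alpha

for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Because the slope is never exactly 0, a neuron with a negative input still receives a small gradient.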
Parametric ReLU activation function
Because the negative-side slope $\alpha$ of leaky ReLU has to be set by hand, choosing it requires some prior knowledge. Parametric ReLU therefore treats $\alpha$ as a trainable parameter, and every neuron can even have its own $\alpha$.
Figure 1-7 Parametric Relu activation function
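The slope parameter $\alpha$ is trained in the same way as the other parameters, using an update rule with momentum:
$$\Delta\alpha \leftarrow \mu\,\Delta\alpha - \epsilon\,\frac{\partial L}{\partial \alpha}, \qquad \alpha \leftarrow \alpha + \Delta\alpha \quad (4)$$
where $\mu$ is the momentum and $\epsilon$ is the learning rate.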
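The following is a minimal sketch of training the parametric ReLU slope with a momentum update, under stated assumptions: a single neuron, a squared-error loss, and hand-picked values for the input, target, momentum `mu` and learning rate `eps`. None of these values come from the text.

```python
def prelu(z, a):
    """Parametric ReLU: identity for z > 0, learnable slope a otherwise."""
    return z if z > 0 else a * z

def prelu_grad_a(z):
    """Derivative of prelu with respect to the slope a."""
    return 0.0 if z > 0 else z

# Hypothetical toy problem: for input z = -2 the neuron should output -0.5,
# which corresponds to a slope of 0.25.
z, target = -2.0, -0.5
a = 0.0          # initial slope
delta = 0.0      # running update (momentum term)
mu, eps = 0.9, 0.05

for _ in range(300):
    out = prelu(z, a)
    # Gradient of the loss 0.5 * (out - target)**2 with respect to a.
    grad = (out - target) * prelu_grad_a(z)
    delta = mu * delta - eps * grad   # momentum update of the step
    a += delta                        # apply the step to the slope

print(a)  # approaches 0.25
```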
Maxout: a learnable activation function
Maxout is a learnable activation function, and one of the forms it can learn is ReLU. ReLU is therefore a special case of maxout. As shown in the maxout structure in Figure 1-8, after the inputs are multiplied by the weights, the resulting values are not passed through an activation function. Instead, they are divided into groups (the number of elements per group is set in advance), and the maximum value of each group is taken as the output.
Figure 1-8 Maxout structure
Take Figure 1-9 as an example: when one of the inputs is fixed to 1, the activation functions that can be obtained are shown in Figure 1-10. Maxout can learn any piecewise linear convex function, and the number of pieces is determined by how many elements are placed in each group.
Figure 1-9 Maxout example
Figure 1-10 Activation functions learned by maxout
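A small sketch of a single maxout unit on a scalar input may make this concrete. Each group element is an affine function `w * x + b`, and the unit outputs the largest one. The weight and bias values are illustrative assumptions, set by hand here rather than learned.

```python
def maxout(x, pieces):
    """One maxout unit: pieces is a list of (w, b) pairs forming a group;
    the output is the maximum of w * x + b over the group."""
    return max(w * x + b for w, b in pieces)

# Group of two elements, one of them constant 0: max(x, 0), i.e. ReLU.
relu_like = [(1.0, 0.0), (0.0, 0.0)]
# Group of two elements: max(x, -x), i.e. the absolute value.
abs_like = [(1.0, 0.0), (-1.0, 0.0)]
# Group of three elements: a convex function with three linear pieces.
three_piece = [(-1.0, -1.0), (0.0, 0.0), (2.0, -2.0)]

for x in (-3.0, 0.0, 3.0):
    print(x, maxout(x, relu_like), maxout(x, abs_like), maxout(x, three_piece))
```

With two elements per group the unit can represent ReLU or the absolute value; a third element adds a third linear piece.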
Adaptive learning rate
Adagrad
Adagrad divides the learning rate by the root mean square of all previous partial derivatives. Taking a single parameter $w$ as an example, the calculation is as follows:
$$w^{t+1} = w^{t} - \frac{\eta}{\sigma^{t}}\, g^{t} \quad (5)$$
$$\sigma^{t} = \sqrt{\frac{1}{t+1} \sum_{i=0}^{t} \left(g^{i}\right)^{2}} \quad (6)$$
where $g^{t}$ is the partial derivative of the loss with respect to $w$ at step $t$, and $\sigma^{t}$ is obtained by accumulating the squares of all partial derivatives of this parameter up to step $t$, taking their mean, and then taking the square root.
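The sketch below follows the description above literally: the step for one parameter is the learning rate divided by the root mean square of all its gradients so far, times the current gradient. The function name, the learning rate `eta` and the two toy loss functions are assumptions for illustration. Note that the commonly published form of Adagrad accumulates the sum of squares rather than the mean; the two differ only by a factor of sqrt(t + 1) in the learning rate.

```python
import math

def adagrad_update(w, grads_so_far, eta=0.1):
    """One Adagrad-style step for a single parameter.

    grads_so_far holds g^0 .. g^t; the last entry is the current gradient.
    """
    g_t = grads_so_far[-1]
    sigma_t = math.sqrt(sum(g * g for g in grads_so_far) / len(grads_so_far))
    if sigma_t == 0.0:
        return w  # all gradients so far are zero: nothing to update
    return w - eta / sigma_t * g_t

# Two hypothetical parameters whose gradients differ by a factor of 10000.
steep_history, flat_history = [], []
w_steep, w_flat = 1.0, 1.0
for _ in range(3):
    steep_history.append(200.0 * w_steep)  # gradient of 100 * w**2
    flat_history.append(0.02 * w_flat)     # gradient of 0.01 * w**2
    w_steep = adagrad_update(w_steep, steep_history)
    w_flat = adagrad_update(w_flat, flat_history)

print(w_steep, w_flat)  # nearly identical despite the different scales
```

Dividing by the root mean square makes the step size insensitive to the overall scale of a parameter's gradients, so each parameter effectively gets its own learning rate.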
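RMSProp
The RMSProp update is given in (7). As the formula shows, when a parameter is updated, not only the current gradient but also the history of previous gradients is taken into account.
$$w^{t+1} = w^{t} - \frac{\eta}{\sigma^{t}}\, g^{t}, \qquad \sigma^{t} = \sqrt{\alpha \left(\sigma^{t-1}\right)^{2} + (1 - \alpha)\left(g^{t}\right)^{2}}, \qquad \sigma^{0} = \left|g^{0}\right| \quad (7)$$
Here $\alpha$ is a constant that can be chosen freely; a small $\alpha$ means that the current gradient is trusted more than the gradient history.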
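As a hedged sketch, the code below implements an RMSProp-style step in which the running value `sigma` mixes the previous `sigma` and the current gradient with a constant `alpha`, and the first `sigma` is initialised to the magnitude of the first gradient. The symbol names, default values and the gradient sequence are assumptions for the example.

```python
import math

def rmsprop_step(w, grad, sigma_prev, eta=0.01, alpha=0.9):
    """One RMSProp-style step; sigma_prev is None on the first step."""
    if sigma_prev is None:
        sigma = abs(grad)
    else:
        sigma = math.sqrt(alpha * sigma_prev ** 2 + (1.0 - alpha) * grad ** 2)
    if sigma > 0.0:
        w = w - eta / sigma * grad
    return w, sigma

def final_sigma(grads, alpha):
    """Run the steps over a gradient sequence and return the last sigma."""
    w, sigma = 0.0, None
    for g in grads:
        w, sigma = rmsprop_step(w, g, sigma, alpha=alpha)
    return sigma

# Hypothetical gradients: small for a while, then suddenly large.
grads = [1.0, 1.0, 1.0, 10.0]
print(final_sigma(grads, alpha=0.1))  # dominated by the latest gradient
print(final_sigma(grads, alpha=0.9))  # dominated by the history
```

With a small `alpha` the denominator reacts quickly to the sudden large gradient; with a large `alpha` it changes slowly.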
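Momentum
The idea of momentum comes from a real-life scene: when a ball rolls down an uneven slope, the speed it has gained from its gravitational potential energy means that it does not necessarily stop in the first dip. It may roll over the first bump and reach the global lowest point.
Figure 1-11 Momentum in a real-life scene
Therefore, unlike plain gradient descent, where the movement is determined only by the gradient direction, momentum also takes the previous movement into account. The formula is given in (8):
$$v^{t} = \lambda v^{t-1} - \eta \nabla L(\theta^{t-1}), \qquad \theta^{t} = \theta^{t-1} + v^{t}, \qquad v^{0} = 0 \quad (8)$$
where $v^{t}$ is the movement at step $t$, $\lambda$ weights the previous movement, and $\eta$ is the learning rate.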
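A minimal sketch of a momentum step follows: the new movement is a weighted copy of the previous movement minus the learning rate times the current gradient. The names `lam` and `eta` and the gradient sequence are assumptions chosen for illustration.

```python
def momentum_step(theta, v, grad, eta=0.1, lam=0.9):
    """One momentum step: combine the previous movement v with the gradient."""
    v = lam * v - eta * grad
    return theta + v, v

theta, v = 0.0, 0.0
positions = []
# Hypothetical gradients; the last one is 0, as at a flat spot or local minimum.
for grad in [1.0, 1.0, 0.0]:
    theta, v = momentum_step(theta, v, grad)
    positions.append(theta)

print(positions)
```

On the last step the gradient is 0, so plain gradient descent would stop moving, while the momentum term keeps the parameter moving in the previous direction.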
Strategies for handling over-fitting
Early stopping
The idea of early stopping is that as training continues, the training error on the training set keeps decreasing, while the test error on the test set may start to increase, as shown in Figure 1-12. A trade-off between training error and test error is therefore required: training is stopped at the point where the error on held-out data (in practice, a validation set) is lowest.
Figure 1-12 Training error and test error
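One common way to implement this trade-off is sketched below: track the held-out error per epoch and stop once it has not improved for a fixed number of epochs. The `patience` parameter and the error values are assumptions made for the example, not part of the text.

```python
def early_stopping_epoch(val_errors, patience=2):
    """Return the index of the best epoch seen before stopping.

    Training stops once the held-out error has failed to improve for
    `patience` consecutive epochs (patience is an assumed hyperparameter).
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Hypothetical held-out errors: they fall at first, then rise again.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(val_errors))  # index of the lowest error before stopping
```

The parameters saved at the returned epoch are the ones to keep.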
Regularization
The purpose of regularization is to make the model smoother. It is usually done by adding a term that depends on the parameters to the existing loss function.
L2 regularization
Assume the loss function is defined as $L(\theta)$. L2 regularization adds a term to $L(\theta)$, giving the form in (9):
$$L'(\theta) = L(\theta) + \frac{\lambda}{2} \lVert \theta \rVert_{2}^{2}, \qquad \lVert \theta \rVert_{2}^{2} = w_{1}^{2} + w_{2}^{2} + \cdots \quad (9)$$
After adding the L2 term, the parameter update takes the form in (10):
$$w^{t+1} = w^{t} - \eta \left( \frac{\partial L}{\partial w} + \lambda w^{t} \right) = (1 - \eta\lambda)\, w^{t} - \eta \frac{\partial L}{\partial w} \quad (10)$$
In (10), the term $\eta\, \partial L / \partial w$ is the same as without the L2 term. With the L2 term, the parameter $w$ is multiplied by $(1 - \eta\lambda)$, a number slightly less than 1, before every update, so the magnitude of $w$ is reduced each time. This process is called weight decay. The effect of L2 is that the parameters are pulled toward 0. Parameters are usually initialized with values close to 0, and updates move them away from 0, so the effect of L2 is somewhat similar to that of early stopping.
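The weight decay view can be checked with a short sketch. Assuming the penalty is (lam / 2) * w**2 with learning rate `eta` (both names and values are assumptions for the example), a gradient step on the penalised loss is the same as first shrinking `w` by the factor (1 - eta * lam) and then applying the usual gradient step.

```python
def l2_step(w, grad, eta=0.1, lam=0.01):
    """Gradient step on L(w) + (lam / 2) * w**2, written as weight decay."""
    return (1.0 - eta * lam) * w - eta * grad

def l2_step_explicit(w, grad, eta=0.1, lam=0.01):
    """The same step, written with the penalty gradient lam * w added."""
    return w - eta * (grad + lam * w)

# With a zero data gradient, the weight only decays toward 0.
w = 1.0
for _ in range(100):
    w = l2_step(w, grad=0.0)

print(w)  # 0.999 ** 100, roughly 0.905
print(l2_step(2.0, 0.3), l2_step_explicit(2.0, 0.3))
```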
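L1 regularization
L1 regularization is very similar to L2 regularization, except that L2 uses the sum of squares of the parameters while L1 uses the sum of their absolute values, as in (11):
$$L'(\theta) = L(\theta) + \lambda \lVert \theta \rVert_{1}, \qquad \lVert \theta \rVert_{1} = |w_{1}| + |w_{2}| + \cdots \quad (11)$$
When the L1 term is added, the parameter update takes the form
$$w^{t+1} = w^{t} - \eta \frac{\partial L}{\partial w} - \eta\lambda\, \operatorname{sgn}(w^{t}) \quad (12)$$
Therefore, when $w > 0$, $\operatorname{sgn}(w)$ is positive and a positive number is subtracted, so the value of $w$ decreases; conversely, when $w < 0$, a positive number is added, so the value of $w$ increases. In both cases $w$ moves toward 0. Since L2 multiplies $w$ by a factor less than 1 each time, large values of $w$ shrink quickly, while L1 subtracts a fixed value each time, so large values shrink more slowly. In the final trained parameters, L2 therefore tends to give values that are all fairly small, whereas L1 can leave some values large while pushing many others very close to 0.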
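To compare the two penalties, the sketch below applies only the regularization part of the update (the data gradient is set to 0) to one large and one small weight. The values of `eta`, `lam` and the starting weights are assumptions chosen for illustration.

```python
def sign(w):
    """Return 1 for positive w, -1 for negative w, 0 for w == 0."""
    return (w > 0) - (w < 0)

def l1_step(w, grad, eta=0.1, lam=0.01):
    """L1: subtract a fixed amount eta * lam in the direction of sign(w)."""
    return w - eta * grad - eta * lam * sign(w)

def l2_step(w, grad, eta=0.1, lam=0.01):
    """L2: shrink w by the factor (1 - eta * lam) before the gradient step."""
    return (1.0 - eta * lam) * w - eta * grad

def run(step, w, n_steps=5):
    for _ in range(n_steps):
        w = step(w, 0.0)  # zero data gradient: regularization only
    return w

l1_large, l2_large = run(l1_step, 10.0), run(l2_step, 10.0)
l1_small, l2_small = run(l1_step, 0.01), run(l2_step, 0.01)

print(l1_large, l2_large)  # L2 shrinks the large weight more
print(l1_small, l2_small)  # L1 shrinks the small weight more
```

The large weight loses more under L2, while the small weight is pushed toward 0 faster under L1, which is why L1 tends to produce sparse parameters.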
Dropout
Dropout works as follows: for a neural network whose structure has been fixed, every neuron of the original model is sampled each time the parameters are updated to decide whether it is discarded, and each neuron has a p% chance of being dropped.
Figure 1-13 Dropout sampling process
Figure 1-14 The NN structure after dropout sampling
During training, dropout sampling is applied at every update. At test time no sampling is done, and every weight is multiplied by (1 - p%). As shown in Figure 1-15, with a dropout rate of 50%, about half of the neurons are discarded during training. At test time all neurons are kept, so to make the test output as close as possible to the training output, each weight is multiplied by (1 - p%), which keeps the output values balanced (Figure 1-15, right).
Figure 1-15 Weight handling at test time with dropout
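A hedged sketch of the training and test behaviour follows. It uses a single linear output unit with made-up activations and weights, writes the drop probability `p` as a fraction (0.5 for 50%), and compares the average training-time output over many random masks with the test-time output computed from scaled weights.

```python
import random

def dropout_train(activations, p, rng):
    """Training: drop each activation independently with probability p."""
    return [0.0 if rng.random() < p else a for a in activations]

def scale_weights_for_test(weights, p):
    """Testing: keep every neuron and multiply every weight by (1 - p)."""
    return [(1.0 - p) * w for w in weights]

def weighted_sum(weights, activations):
    return sum(w * a for w, a in zip(weights, activations))

p = 0.5
activations = [1.0, 1.0, 1.0, 1.0]  # hypothetical hidden activations
weights = [1.0, 1.0, 1.0, 1.0]      # hypothetical outgoing weights
rng = random.Random(0)              # fixed seed so the run is repeatable

n_samples = 20000
train_mean = sum(
    weighted_sum(weights, dropout_train(activations, p, rng))
    for _ in range(n_samples)
) / n_samples

test_output = weighted_sum(scale_weights_for_test(weights, p), activations)

print(train_mean, test_output)  # the two values should be close
```

Without the scaling, the test-time output would be about twice the typical training-time output in this example.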
Explanation of the dropout principle
Dropout can be seen as a form of ensemble learning. Ensemble learning roughly works as follows: several subsets of data are sampled from the training set, and a different model is trained on each of them (the model structures may differ). At test time, each trained model makes a prediction, and the final result is the average of the predictions (Figure 1-16).
Figure 1-16 Ensemble learning procedure
Suppose the designed neural network has M neurons, each of which may or may not be dropped. Each neuron therefore has 2 choices, and M neurons have $2^M$ choices, which correspond to $2^M$ possible network structures. Training one model with dropout is therefore equivalent to training many models. Each weight in the model is shared among the different dropout networks.
Figure 1-17 Dropout Training process
After training, predictions have to be made, but it is not feasible to store so many models and run each of them separately. To solve this problem, all weights of the full model are multiplied by (1 - p%) instead.
Figure 1-18 Dropout weight processing
Dropout works better with linear activation functions. The reason is that when the activation function is linear, the output of the model after all weights are multiplied by (1 - p%) is closer to the averaged output of the ensemble; for a single linear layer the two are exactly equal.
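This claim can be checked on a tiny example. The sketch below enumerates every dropout mask for one unit with three inputs, averages the unit's output weighted by the probability of each mask, and compares the result with the output obtained from weights multiplied by (1 - p). The inputs, weights and drop probability are made-up values, and `p` is written as a fraction.

```python
import math
from itertools import product

def ensemble_average(x, w, p, activation):
    """Average output over all 2**M dropout masks, weighted by probability."""
    total = 0.0
    for mask in product([0, 1], repeat=len(x)):  # 1 = kept, 0 = dropped
        prob = 1.0
        for m in mask:
            prob *= (1.0 - p) if m else p
        z = sum(wi * xi * m for wi, xi, m in zip(w, x, mask))
        total += prob * activation(z)
    return total

def scaled_weight_output(x, w, p, activation):
    """Output of the full network with every weight multiplied by (1 - p)."""
    z = sum((1.0 - p) * wi * xi for wi, xi in zip(w, x))
    return activation(z)

def identity(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = [1.0, 2.0, -1.0]   # hypothetical inputs
w = [0.5, -1.0, 2.0]   # hypothetical weights
p = 0.3                # drop probability

linear_gap = abs(ensemble_average(x, w, p, identity)
                 - scaled_weight_output(x, w, p, identity))
sigmoid_gap = abs(ensemble_average(x, w, p, sigmoid)
                  - scaled_weight_output(x, w, p, sigmoid))

print(linear_gap, sigmoid_gap)
```

For the linear unit the gap is zero up to rounding error; for the sigmoid unit the scaled weights only approximate the ensemble average.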
Analysis of the vanishing gradient with sigmoid
Using the sigmoid function as the activation function leads to the vanishing gradient problem: when the network has more than about 3 hidden layers, the updates to the parameters of the layers closest to the input are almost 0. The reason is that the derivative of the sigmoid is $s'(x) = s(x)(1 - s(x))$; at $x = 0$, $s(x) = 0.5$ and the derivative reaches its maximum, $\max s'(x) = 0.25$. To compute the gradient for the parameters of a lower layer, we have to multiply by the slopes of the activations in the layers above it, that is, by several numbers no larger than 0.25. The more such factors are multiplied together, the smaller the value becomes, which causes the gradient to vanish. Since the slope of the ReLU function is 1 for positive inputs, multiplying its derivatives together does not cause this problem.
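The size of the effect can be estimated with a few lines of code. The sketch below computes the sigmoid derivative and the best-case product of n such derivatives, ignoring the weights, which also enter the real gradient; the chosen layer counts are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

max_grad = sigmoid_grad(0.0)  # the largest possible slope of the sigmoid

# Upper bound on the product of n sigmoid slopes (weights not included).
bounds = {n: max_grad ** n for n in (1, 3, 5, 10)}

for n, bound in bounds.items():
    print(n, bound)
```

Even in the best case, ten sigmoid layers scale the gradient by less than one millionth, whereas ten active ReLU units contribute a factor of 1.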
References
[1] Machine learning-Li Hongyi