Deep Learning Model Optimization and Hyperparameter Tuning

Pay attention to the validation set. Training loss can usually be driven down continuously, but validation loss typically falls for a while and then starts to rise; that is the point where the model begins to overfit the training set. Focus on changes in validation loss: validation accuracy may jump around, while the loss measures the overall objective. Model predictions also tell you how well the model has learned: if the softmax outputs are close to 0 or 1, the model is confident and in reasonable shape; if they hover around 0.5, the model still needs work. Hyperparameter tuning is only about finding suitable values, and values that work well on a small dataset are usually not far off on a large one, so you can tune on a sampled subset of the data to speed things up and try more settings in limited time.

Learning Rate (important)

Plotting the loss curve is a good way to tune the learning rate. If the learning rate is too large, the loss curve may rise, oscillate, or fail to decrease steadily: the large steps make the parameters hover in the vicinity of the optimum, so the loss stays fairly small but never reaches the minimum, and the model can end up stuck bouncing around a local minimum. If the learning rate is too small, the loss curve falls too slowly. A good learning rate gives a smooth, steadily decreasing loss curve.
The learning rate is the step size of each parameter update. Smaller steps are less likely to skip over good points, but they waste time and make it easier to get trapped in a local optimum (once inside a wide valley, the model cannot climb out), while steps that are too large may jump right over the global optimum. If the learning rate is too large, gradients can explode and the loss becomes NaN. Try values on a logarithmic scale: 1, 0.1, 0.01, 0.001, ... down to 1e-6; 0.1 is a common starting point. Choose the learning rate by evaluating on the validation set: if the current rate no longer improves validation performance, try dividing it by 2 or 5.
Decay: when the training loss stops falling and just oscillates within a range (the loss keeps jumping from one side of a minimum to the other because the step is too large to descend further), decay the learning rate gradually over training, e.g. by a factor of 0.5. For example, when validation loss meets a no-improvement rule that would normally trigger early stopping, instead of stopping, halve the learning rate and continue, and repeat as needed. Common schedules: decay to a fixed minimum learning rate, linear decay, exponential decay, or a 2-10x reduction each time validation loss stagnates. When fine-tuning, newly added layers can use a higher learning rate, while reused (pre-trained) layers should use a relatively low one.
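As a concrete illustration, here is a minimal sketch of the "halve the learning rate when validation loss stagnates" rule using Keras callbacks; the model, the data, and the factor/patience values are placeholders rather than recommendations from the original text.

from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

# Halve the learning rate whenever val_loss stops improving for a few epochs;
# stop training only once further reductions no longer help.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3, min_lr=1e-6)
early_stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)

# `model`, `x_train` and `y_train` are assumed to be defined elsewhere.
model.fit(x_train, y_train, validation_split=0.1, epochs=100,
          callbacks=[reduce_lr, early_stop])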

Parameter Initialization
Initialization must break the symmetry between different units: if two units receive the same input and start from identical parameters, the model will always update them in exactly the same way. Larger initial weights break symmetry more easily and help preserve the signal during forward and backward propagation, but weights that are too large easily produce exploding values (gradient explosion can be mitigated with gradient clipping); in CNNs they also make the model highly sensitive to its input, so a deterministic forward pass behaves almost randomly, and they easily push activation functions into saturation, causing vanishing gradients. Do not use:
Initializing everything to 0: all weights are identical, the model is completely symmetric, and it cannot be updated meaningfully.
Initializing to very small random numbers (close to 0 but not 0): the model trains poorly because the gradient signal vanishes during propagation. Recommended:
Biases are generally initialized to 0, although 1 is sometimes used in RNNs (e.g. the LSTM forget gate).
Random_uniform
Random_normal
Glorot_normal
Glorot_uniform (Xavier initialization; a safe default. It keeps the variance of each layer's output as equal as possible so that information flows well through the network.)
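A minimal Keras sketch of these recommendations (the layer size is a placeholder):

from tensorflow.keras import layers, initializers

# Xavier/Glorot initialization for the weights, zeros for the bias.
dense = layers.Dense(256, activation="relu",
                     kernel_initializer=initializers.GlorotUniform(),
                     bias_initializer=initializers.Zeros())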

Activation Functions
ReLU is the general-purpose activation and helps prevent vanishing gradients; be careful about using ReLU as the activation of the last layer. Sigmoid's differentiability made it the classical choice for traditional neural networks, but in deep networks it brings vanishing gradients and non-zero-centered outputs. Except for gates and other places where the output must be limited to 0-1, try not to use sigmoid: it only has a usable gradient roughly in the range -4 to 4, outside of which the gradient is close to 0 and easily causes vanishing gradients, and even for zero-mean input its output is not zero-mean. Tanh ranges over (-1, 1) and is zero-centered, so it is better than sigmoid, but it still saturates and its gradient can vanish. ReLU is better than sigmoid and tanh: its derivative is trivial to compute, it converges quickly, and it does not saturate; the only problem is that the gradient is 0 for x < 0, which can cause many neurons to die, so pay special attention to the learning rate. Leaky ReLU and other ReLU variants, as well as Maxout, are worth trying. In practice, use ReLU and its variants; PReLU and RReLU are quite effective; tanh is worth trying, but avoid sigmoid.
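For reference, a small NumPy sketch of ReLU and Leaky ReLU (the 0.01 slope is just a common default, not a value from the original text):

import numpy as np

def relu(x):
    # Gradient is zero for x < 0, which is what can leave neurons "dead".
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # A small slope for x < 0 keeps some gradient flowing.
    return np.where(x > 0, x, alpha * x)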


Maxout adds a large number of parameters, as the comparison below shows:

A normal layer:
z = w*x + b
out = f(z)

A maxout layer, where the number of z's is user-defined:
z1 = w1*x + b1
z2 = w2*x + b2
z3 = w3*x + b3
z4 = w4*x + b4
z5 = w5*x + b5
out = max(z1, z2, z3, z4, z5)

Model
If the input is a fixed-size vector, consider a fully connected feed-forward network; if the input has a two-dimensional structure such as an image, consider a convolutional network; if the input is a sequence, consider a recurrent neural network. BN (batch normalization) improves performance, speeds up training, and can sometimes replace dropout; note that BN does not perform well with very small batch sizes. Prefer more hidden units and small filters, use non-linearities, and control overfitting with regularization. First verify that the model itself is sound: take a small dataset and a fairly deep model and check whether the model can fit that training set well, even if test accuracy is low. Batch size is usually a power of 2 and usually has little impact; 32, 64, and 128 are common, and larger batches generally give a better hardware speed-up. If time is not a concern, batch size = 1 can act as a regularizer. Concatenating feature maps of different sizes exploits information at different scales. ResNet's shortcut connections help, and the shortcut branch should be an identity mapping. The Inception approach can be used to extract higher-order features at different levels of abstraction. Gradient normalization: after computing the gradient, divide by the mini-batch size. Good performance can be obtained from a pre-trained model plus fine-tuning.
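A minimal Keras sketch of a convolutional block along these lines (small 3x3 filters, batch normalization, ReLU, filter count doubling); the sizes and counts are placeholders:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (3, 3), padding="same", input_shape=(32, 32, 3)),
    layers.BatchNormalization(),   # BN after the convolution
    layers.Activation("relu"),
    layers.Conv2D(64, (3, 3), padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),   # 2x2 pooling
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])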

Optimization Functions
Adam converges faster and can be used without much thought. SGD + momentum can end up better than Adam but is slightly slower; momentum is usually 0.5, 0.9, 0.95, or 0.99. Gradient clipping: limit the maximum gradient, or set a threshold and force larger gradients down to 5, 10, or 20. Convolution and pooling strides: use small values, usually 1 or 2, and double the number of filters as you go deeper (counts of 2^n, where n grows with depth; the first layer should not have too few). Transposed (de-)convolutions work in reverse. Kernel size: small kernels (3x3) are popular; note that for large targets, a receptive field that is too small may hurt performance, especially for FCNs (fully connected layers, by contrast, always see the whole input). Pooling: 2x2.
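A minimal sketch of these optimizer choices in Keras, with gradient clipping (the learning rate, momentum, and clipping threshold are just example values):

from tensorflow.keras import optimizers

# SGD with momentum; clipnorm caps the gradient norm to avoid explosions.
sgd = optimizers.SGD(learning_rate=0.1, momentum=0.9, clipnorm=5.0)

# Adam usually converges faster and needs little tuning.
adam = optimizers.Adam(learning_rate=1e-3)

# `model` is assumed to be defined elsewhere.
model.compile(optimizer=sgd, loss="categorical_crossentropy", metrics=["accuracy"])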

Dataset, Input Preprocessing and Output

How to tell whether you should collect more data:
First check how the model performs on the training set. If training-set performance is already poor, focus on improving the model: add layers or hidden units. If a larger model still does not help, consider whether the dataset is of poor quality (too much noise, etc.) or whether the model has a more fundamental error.
If performance on the training set is acceptable but performance on the test set is poor, collecting more data is worth considering, but weigh the feasibility and cost. If the cost is too high, alternatives are to reduce the model size, strengthen regularization, and tune hyperparameters. When you do add data, grow the dataset on a logarithmic scale.
  
1. Obtain as much data as possible (millions of samples and up) and remove bad data (noise, mislabeled samples, null values, etc.; NaN values in the data will turn the model's loss into NaN).
2. Use data augmentation when the data is not sufficient. Images can be flipped horizontally, randomly cropped, rotated, distorted, scaled, stretched, or shifted in hue and saturation (HSV), and these operations can be combined at random. Check whether a given change (e.g. a vertical flip) makes sense for the task and whether it destroys important features. AlexNet randomly cropped 224x224 patches from 256x256 images, producing 2048 different samples per image, and with mirroring the dataset grew by a factor of 2048*2 = 4096. Although heavy resampling makes the samples correlated, it helps avoid overfitting and is at least better than feeding the same image for many epochs. AlexNet also used a fancy PCA-based color augmentation. (A minimal augmentation sketch is given after the normalization snippet below.)
3. The training set must be shuffled. Note that Keras' shuffle='batch' option only shuffles within batch-sized chunks.
4. Normalize the input features: zero-center and normalize them. PCA whitening is generally not required. Scaling to [-1, 1] or [0, 1] is also common.
5. Normalize the prediction targets (labels) as well. For example, in a regression problem where label values differ greatly in scale (0.1 vs. 1000), normalization puts them on a common scale.
6. Class imbalance: over-sample or under-sample; use data augmentation to generate more samples of the minority classes; weight the loss per class; or split the dataset by class, train first on the classes with many samples, and then on the classes with few.
7. Do not only augment the training set; it is often better to augment the test set as well, so that the two distributions stay consistent.
8. Label smoothing can reduce overfitting, though it does not always help: new_labels = (1.0 - label_smoothing) * one_hot_labels + label_smoothing / num_classes

# Zero-center and normalize; x is the array of input features
import numpy as np

x -= np.mean(x, axis=0)
x /= np.std(x, axis=0)
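As referenced in the augmentation item above, here is a minimal sketch using Keras' ImageDataGenerator; the specific ranges are placeholders, not recommendations from the original text:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random horizontal flips, shifts, rotations, zooms and channel shifts.
datagen = ImageDataGenerator(horizontal_flip=True,
                             rotation_range=15,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             zoom_range=0.1,
                             channel_shift_range=20.0)

# `x_train`, `y_train` and `model` are assumed to be defined elsewhere.
model.fit(datagen.flow(x_train, y_train, batch_size=64), epochs=50)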
Objective Function
In multi-task settings, try to keep each task's loss on the same order of magnitude; initially you can focus on the loss of a single task. Focal loss may help.
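A small NumPy sketch of the binary focal loss idea (the gamma and alpha values are the usual defaults from the focal loss paper, used here purely as an illustration):

import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    # Down-weights easy examples so training focuses on the hard ones.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    p_t = np.where(y_true == 1, y_pred, 1.0 - y_pred)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    return np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t))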


Regularization
How to judge overfitting vs. underfitting: if training accuracy is consistently much higher than validation accuracy, the model is overfitting, and you can increase regularization strength, e.g. a larger L2 penalty or a higher dropout rate. If training accuracy is only slightly higher than validation accuracy, there is only mild overfitting. If training and test accuracy are comparable, the model is somewhat underfitting and has not learned the features well; you can increase the model's width or depth.
During training, the L2 norm pushes the weight components to be balanced, i.e. dense, with many non-zero components, while the L1 norm (and the L0 norm) pushes the weights to be sparse, with few non-zero components.
Sparsity provides automatic feature selection. Among the input features, many have little effect on the output and can be treated as unimportant; the regularization term penalizes their parameters so that some feature weights become 0 or close to 0, automatically selecting the main variables or features.
If a neuron's output is close to 1 we consider it activated, and if it is close to 0 we consider it suppressed; the constraint that neurons are suppressed most of the time is called a sparsity constraint.
A suggested procedure: first set the regularization coefficient λ to 0 and determine a good learning rate; then fix that learning rate, give λ an initial value (such as 1.0), scale λ up or down by factors of 10 according to validation accuracy, and finally fine-tune.
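For example, a minimal Keras sketch of adding an L2 penalty and dropout (the coefficient 1e-4 and the rate 0.5 are only common starting points):

from tensorflow.keras import layers, regularizers

# L2 weight penalty on a dense layer, plus dropout on its output.
hidden = layers.Dense(256, activation="relu",
                      kernel_regularizer=regularizers.l2(1e-4))
drop = layers.Dropout(0.5)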
  
1. Unless the dataset is very large (tens of millions of samples), it is best to use at least moderate regularization from the outset.
2. Dropout: a simple and effective way to prevent overfitting. Typical rates are 0.3, 0.5 (recommended; 0.5 yields the largest number of sub-network combinations), and 0.7.
3. L2: a commonly used regularization method; it adds a weight-dependent term to the objective function.
4. L1: a commonly used regularization method; it adds a weight-dependent term to the objective function. It can be combined with L2.
5. Max-norm constraints: because they bound the weight magnitudes, networks trained with this constraint generally do not "explode".

Ensemble
Same hyperparameters, different initializations: use cross-validation to find the best hyperparameters, then train several models with different parameter initializations. Different hyperparameters: use cross-validation to select the best group, or the top-k best-performing groups. Same hyperparameters, different training stages: i.e. models saved at different numbers of iterations. Linear fusion of different kinds of models, for example an RNN together with traditional models. Models trained on different training sets, so that they capture different characteristics.

Simple approaches: voting, averaging, weighted averaging.
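A minimal sketch of simple averaging (soft voting) over the predicted class probabilities of several trained models; `models` and `x_test` are assumed to be defined elsewhere:

import numpy as np

# Average each model's predicted probabilities, then take the argmax.
probs = np.mean([m.predict(x_test) for m in models], axis=0)
ensemble_pred = np.argmax(probs, axis=1)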
Stacking method:


In the first level you have models M1 and M3. Train M1 with 5-fold cross-validation to get 5 sub-models; concatenating their out-of-fold predictions on the training data gives the second-level training feature F1, and averaging the 5 sub-models' predictions on the test set gives the second-level test feature T1. The first-level model M3 produces F3 and T3 in the same way. The second-level model M2 is then trained on the feature set (F1, F3) and makes its predictions on the test features (T1, T3).
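A hedged sketch of that out-of-fold procedure for one first-level model, using scikit-learn's KFold; `make_model` (which returns a fresh, compiled single-output model), `x_train`, `y_train` and `x_test` are placeholders:

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof_pred = np.zeros(len(x_train))        # F1: out-of-fold predictions on the training set
test_pred = np.zeros((5, len(x_test)))   # each fold's sub-model predictions on the test set

for i, (tr_idx, val_idx) in enumerate(kf.split(x_train)):
    m = make_model()                     # a fresh first-level model for this fold
    m.fit(x_train[tr_idx], y_train[tr_idx])
    oof_pred[val_idx] = m.predict(x_train[val_idx]).ravel()
    test_pred[i] = m.predict(x_test).ravel()

f1 = oof_pred                 # second-level training feature
t1 = test_pred.mean(axis=0)   # T1: average over the 5 sub-models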
Blending method:
Split the data into train and test, then split train into two disjoint parts, train_1 and train_2.
Train different models on train_1 and use them to predict train_2 and test, generating one 1-D prediction vector per model; with k models, this yields k-dimensional features.
In the second layer, use the previous models' predictions on train_2, together with the labels, as a new training set; train a new model (LR or another model) on it and use it to predict on the vectors generated for test.

Visualization for Tuning
Visualize the activation layers and the parameter (weight) layers. Visualize output errors: inspect the samples the model gets wrong and analyze why. Plot histograms of activation values and gradients: hidden-unit activations show how saturated the units are; rapidly growing or vanishing gradients are bad for optimization; in a mini-batch update, the parameter update should ideally be about 1% of the parameter's magnitude, not 50% or 0.001%. Note that if the data is sparse (such as natural language), some parameters may be updated only rarely. TensorBoard is useful for all of this.

Other
Early stopping.
Sample imbalance: over-sampling (random over-sampling, SMOTE, RAMO, Random Balance, cluster-based over-sampling, DataBoost-IM, class-aware sampling); under-sampling (random removal, EasyEnsemble, one-sided selection, data cleaning); threshold moving; cost-sensitive learning (focal loss, OHEM, etc.).
Automatic hyperparameter tuning: random search, grid search, Bayesian optimization.
Reducing memory: use larger strides; use 1x1 convolutions for linear dimensionality reduction; use pooling to reduce dimensionality; reduce the mini-batch size; change the data type from 32-bit to 16-bit; use small convolution kernels.
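As an illustration of random search with the learning rate sampled on a log scale (the ranges and the `train_and_evaluate` helper are hypothetical placeholders):

import numpy as np

rng = np.random.default_rng(0)
best = None
for _ in range(20):
    # Sample the learning rate log-uniformly and the dropout rate uniformly.
    lr = 10 ** rng.uniform(-6, 0)
    dropout = rng.uniform(0.3, 0.7)
    # `train_and_evaluate` is a placeholder that trains a model and returns
    # its validation accuracy.
    val_acc = train_and_evaluate(lr=lr, dropout=dropout)
    if best is None or val_acc > best[0]:
        best = (val_acc, lr, dropout)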

References
Must Know Tips/Tricks in Deep Neural Networks
Model tuning: a summary of hyperparameter tuning tips
