Deep Learning Network Tuning Skills _02

Reprinted from Alchemy Laboratory: https://zhuanlan.zhihu.com/p/24720954

I have previously written an article about deep learning training experience, which included some tuning tips: Deep learning training experience. However, since deep learning experiments generally take much longer than ordinary machine learning tasks, tuning skills are especially important. In my own practice I have also picked up some new insights, so I am writing a separate article to talk about my understanding of tuning deep learning networks. If you have other tricks, I would welcome the exchange.

A good experimental environment is half the battle

Because deep learning experiments take a long time, an experimental environment whose code is well organized makes manual or automatic tuning much less laborious. A few points are worth noting. First, keep all parameter settings in one place; if they are scattered throughout the code, changing them becomes very painful. Second, output the model's loss value and its accuracy on the training set and the validation set. Third, consider writing a training sub-program that, given a set of parameters, launches training, monitors it, and periodically saves the evaluation results, together with a main program that assigns parameters and launches a series of such sub-programs in parallel; a minimal sketch of this layout follows.
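As a rough illustration of that layout (all names here, such as train_and_evaluate and run_experiment, are hypothetical and not from the original article), a main script could look like this:

```python
# Minimal sketch: all tunable parameters in one place, one sub-program per
# parameter setting, launched in parallel by the main program.
import itertools
import json
from multiprocessing import Pool

BASE_CONFIG = {
    "learning_rate": 0.1,
    "num_layers": 1,
    "hidden_size": 128,
    "dropout": 0.5,
    "batch_size": 128,
    "max_epochs": 20,
}

def train_and_evaluate(config):
    # Placeholder for the real training loop: build the model from `config`,
    # train it, and return per-epoch loss / train accuracy / validation accuracy.
    return {"loss": [], "train_acc": [], "val_acc": []}

def run_experiment(config):
    history = train_and_evaluate(config)
    tag = "_".join(f"{k}={v}" for k, v in sorted(config.items()))
    with open(f"results_{tag}.json", "w") as f:   # save the evaluation results of this run
        json.dump({"config": config, "history": history}, f)
    return config, history

if __name__ == "__main__":
    # The main program assigns parameter combinations and launches sub-programs in parallel.
    overrides = [{"learning_rate": lr, "dropout": d}
                 for lr, d in itertools.product([1.0, 0.1, 0.01], [0.3, 0.5, 0.7])]
    configs = [{**BASE_CONFIG, **o} for o in overrides]
    with Pool(processes=4) as pool:
        for config, history in pool.imap_unordered(run_experiment, configs):
            print(config)
```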

Plotting

Plotting is a good habit. Typically, after each pass over the training data, output the accuracy on the training set and the validation set, and plot the curves. If after training for a while the model still has not converged, you can stop the run and try other parameters, which saves time.
If at the end of training the accuracy on both the training set and the test set is very low, the model is probably underfitting. The subsequent tuning direction is then to strengthen the model's fitting capacity: for example, increase the number of layers, increase the number of nodes per layer, reduce the dropout rate, reduce the L2 regularization strength, and so on.
If training-set accuracy is high but test-set accuracy is much lower, the model is probably overfitting, and tuning should instead aim at improving the model's generalization ability.
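As a minimal plotting sketch (matplotlib assumed; the accuracy values below are made up for illustration):

```python
# Plot per-epoch training and validation accuracy to spot under/overfitting early.
import matplotlib.pyplot as plt

train_acc = [0.55, 0.68, 0.74, 0.79, 0.82]   # accuracy after each epoch (example values)
val_acc   = [0.54, 0.65, 0.70, 0.71, 0.71]

epochs = range(1, len(train_acc) + 1)
plt.plot(epochs, train_acc, label="train accuracy")
plt.plot(epochs, val_acc, label="validation accuracy")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.savefig("accuracy_curves.png")   # a widening gap between the curves suggests overfitting
```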

Tuning in stages, from coarse to fine

In practice, first search over a coarse range of parameters, and then, based on where the good results appear, narrow the range for a finer-grained search. It is recommended to consult relevant papers and use the parameters reported there as initial values; at the very least those parameters give a decent result. If there is nothing to refer to, you can only try things yourself. Start with the parameters that are comparatively important and have the largest impact on the results, keeping the others fixed, until you get a roughly reasonable result; then adjust the remaining parameters on top of it. For example, the learning rate is generally more important than the regularization coefficient and the dropout rate: an inappropriate learning rate may not only make results worse, the model may not converge at all. If you cannot find any set of parameters that makes the model converge, check whether something else is wrong, such as the model implementation or the data; you can refer to my earlier article on deep learning network debugging skills.
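A coarse-to-fine search over the learning rate might be sketched like this (evaluate is a hypothetical stand-in for a shortened training run, not part of the article):

```python
# Coarse stage first, then a finer search around the best coarse value.
import numpy as np

def evaluate(learning_rate):
    # Hypothetical placeholder: train briefly with this learning rate and
    # return validation accuracy; replace with a real (shortened) training run.
    return float(np.random.rand())

coarse = [1.0, 0.1, 0.01, 0.001]                            # widely spaced candidates
best_coarse = max(coarse, key=evaluate)

fine = best_coarse * np.array([0.3, 0.5, 1.0, 2.0, 3.0])    # narrower range, train longer here
best = max(fine, key=evaluate)
print("selected learning rate:", best)
```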

Improving speed

Tuning is only about finding suitable parameters, not about producing the final model. Parameters that work well on a small dataset generally will not be too bad on the large dataset, so you can slim the data down to increase speed and try more parameters within a limited amount of time. Subsample the training data: for example, if there are originally 1,000,000 examples, first sample them down to 10,000 and experiment on that. Reduce the number of training classes: for example, handwritten digit recognition originally has 10 classes; you can first train on 2 classes and see how the results look.
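Two small helpers along those lines (numpy only; X and y are hypothetical feature and label arrays):

```python
# Shrink the data for quick tuning runs: subsample examples, or keep only a few classes.
import numpy as np

rng = np.random.default_rng(0)

def subsample(X, y, n=10_000):
    """Randomly keep n examples (e.g. 1,000,000 -> 10,000) for fast experiments."""
    idx = rng.choice(len(X), size=min(n, len(X)), replace=False)
    return X[idx], y[idx]

def keep_classes(X, y, classes=(0, 1)):
    """Reduce a 10-class task (e.g. digits) to a 2-class task for a first pass."""
    mask = np.isin(y, classes)
    return X[mask], y[mask]
```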

Hyperparameter ranges

It is recommended to search hyperparameters on a logarithmic scale. Typical examples are the learning rate and the regularization coefficient, which can be tried as 0.001, 0.01, 0.1, 1, 10, i.e., in steps of a factor of 10, because their effect on training is multiplicative. Some parameters, however, are better searched on a linear scale, such as the dropout rate: 0.3, 0.5, 0.7.
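For instance, with numpy the two kinds of candidate grids could be built as follows (values purely illustrative):

```python
# Log-scale candidates for multiplicative parameters, linear scale for dropout.
import numpy as np

learning_rates = np.logspace(-3, 1, num=5)      # 0.001, 0.01, 0.1, 1, 10
l2_strengths   = np.logspace(-3, 1, num=5)      # same multiplicative spacing
dropouts       = np.linspace(0.3, 0.7, num=3)   # 0.3, 0.5, 0.7 on a linear scale
```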

Empirical parameter values

Below are some empirical values for common parameters, so that you are not completely without a clue when starting out.
Learning rate: 1, 0.1, 0.01, 0.001; generally start trying from 1. Learning rates greater than 10 are rarely seen. The learning rate is usually decayed during training, with a decay factor of around 0.5; the decay can be triggered when validation accuracy stops improving, or after a fixed number of training epochs.
That said, adaptive-gradient optimizers such as Adam, Adadelta, and RMSProp are more recommended; the default values given in the relevant papers generally work, and they spare you the difficulty of tuning the learning rate. For RNNs there is a rule of thumb: if the sequences to be processed are long, or the RNN has many layers, a smaller learning rate is generally better, otherwise the results may fail to converge or even turn into NaN.
Number of network layers: start from 1 layer.
Nodes per layer: 16, 32, 128; more than 1000 is relatively rare, and more than 10,000 I have never seen.
Batch size: start from around 128. Increasing the batch size does increase training speed, but the converged result may become worse. If memory allows, consider starting from a larger value, because an overly large batch size generally does not affect the result much, whereas an overly small one can make the result much worse.
Gradient clipping: limit the maximum gradient norm, i.e. value = sqrt(w1^2 + w2^2 + ...); if value exceeds the threshold, compute a decay coefficient so that value becomes equal to the threshold. Typical thresholds: 5, 10, 15. (A sketch of this rule follows the list.)
Dropout: 0.5.
L2 regularization: 1.0; values above 10 are very rare.
Word embedding size: 128 or 256.
Positive/negative sample ratio: a very often overlooked, but in many classification problems very important, parameter. Many people simply use the default positive/negative ratio of the training data; when the data is very unbalanced, the model tends to favor the class with more examples, which affects the final result. Besides training with the default ratio, it is worth oversampling the under-represented class, for example by replication, to raise its proportion, and seeing how that works; the same applies to multi-class problems.
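A sketch of that clipping rule (framework-agnostic, numpy only; the function name is hypothetical):

```python
# Scale all gradients down so their global L2 norm does not exceed the threshold.
import numpy as np

def clip_gradients(grads, threshold=5.0):
    norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))   # value = sqrt(w1^2 + w2^2 + ...)
    if norm > threshold:
        scale = threshold / norm                          # decay coefficient
        grads = [g * scale for g in grads]
    return grads
```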
When training with mini-batches, also try to make each batch balanced across the classes; this matters a great deal in multi-class tasks such as image recognition (a small sampling sketch is given below).
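One way to draw roughly class-balanced batches (numpy only; X and y are hypothetical arrays; sampling with replacement stands in for the replication-based oversampling mentioned above):

```python
# Sample roughly the same number of examples from each class for one mini-batch.
import numpy as np

rng = np.random.default_rng(0)

def balanced_batch(X, y, batch_size=128):
    classes = np.unique(y)
    per_class = batch_size // len(classes)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=per_class, replace=True)
        for c in classes
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]
```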

Automatic tuning

Staring at experiments by hand is, after all, exhausting, and there is a lot of research on automatic hyperparameter tuning. A few of the more practical methods:
Grid search. The most common. Pick a few candidate values for each parameter and then, like a grid, try every combination of them. The advantage is that it is simple and brute-force; if everything can be traversed, the result is fairly reliable. The disadvantage is that it takes too long, so for something like a neural network one generally cannot try many parameter combinations.
Random search. Bengio argues in "Random Search for Hyper-Parameter Optimization" that random search is more efficient than grid search. In practice, one usually first defines the candidate values grid-search style and then randomly samples a combination from them for each training run; a minimal sketch is given after the summary below.
Bayesian optimization. Bayesian optimization takes into account the results observed at previously tried parameter settings, which saves time; compared with grid search, it is like the difference between an old ox cart and a sports car. For the principle, see the paper "Practical Bayesian Optimization of Machine Learning Algorithms". Two Python libraries for Bayesian tuning are worth recommending: jaberg/hyperopt, which is relatively simple, and fmfn/BayesianOptimization, which is more involved and supports parallel tuning.

Summary

Do sanity checks to make sure the model, the data, and everything else are in order. During training, track the loss value and the accuracy on the training set and the validation set. Search for the best hyperparameters with random search, in stages from coarse (a larger parameter range, shorter training runs) to fine (a smaller parameter range, longer training runs).
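A minimal random-search sketch over grid-style candidates (train_and_evaluate is again a hypothetical stand-in for a full training run):

```python
# Randomly sample parameter combinations from predefined candidate values.
import random

candidates = {
    "learning_rate": [1.0, 0.1, 0.01, 0.001],
    "dropout": [0.3, 0.5, 0.7],
    "hidden_size": [16, 32, 128],
}

def train_and_evaluate(config):
    # Placeholder: train a model with `config` and return validation accuracy.
    return random.random()

best_config, best_acc = None, -1.0
for _ in range(20):                        # budget: 20 random trials
    config = {k: random.choice(v) for k, v in candidates.items()}
    acc = train_and_evaluate(config)
    if acc > best_acc:
        best_config, best_acc = config, acc
print(best_config, best_acc)
```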

References

Below are some materials on parameter tuning; if you have time, they are worth reading further.
Practical Recommendations for Gradient-Based Training of Deep Architectures, by Yoshua Bengio.
Efficient BackProp, by Yann LeCun, Léon Bottou, Geneviève Orr, and Klaus-Robert Müller.
Neural Networks: Tricks of the Trade, edited by Grégoire Montavon, Geneviève Orr, and Klaus-Robert Müller.
