Deep Learning (RNN, CNN) tuning experience?


Original link: https://www.zhihu.com/question/41631631
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.

I have been tuning RNNs for almost a year, and I deeply feel that deep learning is an experimental science. Below are some of my alchemy experiences; I will keep adding to this list. If anything is wrong, please correct me.
  1. Parameter initialization. Pick any one of the following methods at random; the results are almost the same.

      1. Uniform:
        W = np.random.uniform(low=-scale, high=scale, size=shape)
      2. Glorot uniform:
        scale = np.sqrt(6. / (shape[0] + shape[1]))
        W = np.random.uniform(low=-scale, high=scale, size=shape)
      3. Gaussian initialization:
        w = np.random.randn(n) / sqrt(n), where n is the number of inputs.
        If the activation function is ReLU, the recommended form is
        w = np.random.randn(n) * sqrt(2.0 / n)
      4. SVD initialization: works better for RNNs and can effectively improve the convergence rate (see the sketch below).
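
      One common way to realize the SVD initialization in item 4 is as an orthogonal matrix taken from the SVD of a random Gaussian matrix. A minimal NumPy sketch; the function name and gain parameter are my own choices, not from the original answer:

        import numpy as np

        def svd_orthogonal_init(shape, gain=1.0):
            # Draw a random Gaussian matrix and keep an orthogonal factor of its SVD.
            a = np.random.randn(*shape)
            u, _, vt = np.linalg.svd(a, full_matrices=False)
            # u is (rows, k) and vt is (k, cols) with k = min(rows, cols);
            # keep whichever factor matches the requested shape.
            q = u if u.shape == tuple(shape) else vt
            return gain * q

      For an RNN this is typically applied to the hidden-to-hidden matrix, e.g. W_hh = svd_orthogonal_init((hidden_size, hidden_size)).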
  2. Data preprocessing methods

    1. Zero-centering, which is quite common:
      X -= np.mean(X, axis=0)  # zero-center
      X /= np.std(X, axis=0)   # normalize
    2. PCA whitening, which is used relatively rarely (a sketch follows below).
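
    In case it is useful, a minimal NumPy sketch of PCA whitening; the eps value is an assumption to avoid division by zero:

      import numpy as np

      def pca_whiten(X, eps=1e-5):
          # X: (num_samples, num_features) float array, rows are examples.
          X = X - X.mean(axis=0)             # zero-center first
          cov = X.T.dot(X) / X.shape[0]      # feature covariance matrix
          U, S, _ = np.linalg.svd(cov)       # eigenvectors (U) and eigenvalues (S)
          X_rot = X.dot(U)                   # decorrelate: rotate into the eigenbasis
          return X_rot / np.sqrt(S + eps)    # scale each dimension to unit variance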
  3. Training skills

    1. Normalize the gradient: divide the computed gradient by the minibatch size (see the first sketch after this list).
    2. Gradient clipping: limit the maximum gradient norm. Compute value = sqrt(w1^2 + w2^2 + ...) over all gradients; if value exceeds a threshold, multiply the gradient by a coefficient so that its norm equals the threshold. Typical thresholds: 5, 10, 15.
    3. Dropout works well for preventing overfitting on small data; the rate is generally set to 0.5, and dropout + SGD works better on small datasets. The placement of dropout matters: for RNNs, it is recommended to put it at the input->RNN and RNN->output positions. For how to use dropout in RNNs, see this paper: http://arxiv.org/abs/1409.2329
    4. Adam and Adadelta: on small data, my experiments here show they are not as good as SGD. If you use SGD, start with a learning rate of 1.0 or 0.1, check the validation set after a while, and halve the learning rate if the cost does not fall. I have seen this in many papers, and it has worked very well in my own experiments. Of course, you can also run the Ada family first to converge quickly, and then switch to SGD to continue training; that usually brings a further improvement.
    5. Except for places like gates, where the output must be limited to 0-1, try not to use sigmoid; use activation functions such as tanh or ReLU instead.
    6. The hidden dim and embedding size of an RNN are generally tuned starting from 128. Batch size is generally tuned starting from around 128. An appropriate batch size matters most; bigger is not necessarily better.
    7. Initializing with word2vec embeddings can, on small data, not only improve convergence speed but also improve the results.
    8. Try to shuffle the data.
    9. LSTM's forget gate bias: initializing it to 1.0 or a larger value can give better results, following this paper: http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf. In my experiments here it is set to 1.0, which improves convergence speed. In practice, different tasks may require trying different values (see the second sketch after this list).
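
    For items 1 and 2, a minimal NumPy sketch of gradient normalization plus global-norm clipping; the threshold of 5.0 is just one of the suggested values:

      import numpy as np

      def normalize_and_clip(grads, minibatch_size, threshold=5.0):
          # Item 1: divide the accumulated gradients by the minibatch size.
          grads = [g / minibatch_size for g in grads]
          # Item 2: compute the global L2 norm over all parameters ...
          total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
          # ... and if it exceeds the threshold, rescale so the norm equals the threshold.
          if total_norm > threshold:
              grads = [g * (threshold / total_norm) for g in grads]
          return grads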
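
    For item 9, a sketch of setting the forget-gate slice of a fused LSTM bias vector to 1.0; the i, f, g, o gate ordering is an assumption that depends on your LSTM implementation:

      import numpy as np

      def init_lstm_bias(hidden_size, forget_bias=1.0):
          # One bias vector for the four stacked gates, assumed order: i, f, g, o.
          b = np.zeros(4 * hidden_size)
          b[hidden_size:2 * hidden_size] = forget_bias   # the forget-gate slice
          return b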
  4. Ensemble: the ultimate weapon for boosting paper results. In deep learning it is usually done in the following ways:

    1. Same parameters, different initialization methods.
    2. Different parameters, selecting the best few groups by cross-validation.
    3. Same parameters, models taken from different stages of training.
    4. Linear fusion of different models, for example an RNN plus a traditional model (a prediction-averaging sketch follows after this list).
      One more note: Adam converges fast, but the solution it finds is often not as good as what SGD + momentum reaches; if time cost is not a concern, just use SGD.
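
      The simplest form of ensembling is to average predicted probabilities. A minimal sketch; the predict_proba method is a hypothetical interface, not something from the original answer:

        import numpy as np

        def ensemble_predict(models, x):
            # Each model is assumed to expose a hypothetical predict_proba(x)
            # returning class probabilities of the same shape.
            probs = [m.predict_proba(x) for m in models]
            return np.mean(probs, axis=0)   # average, then take argmax downstream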
      Another RNN trick, again ignoring time cost: batch size = 1 is a very good regularizer, at least on some tasks. This may be one of the reasons many people cannot reproduce Alex Graves' results, since he always sets the batch size to 1... I have recently been watching Karpathy's cs231n; I have not finished it yet, but along the way I have noted down some of the techniques he mentions:

      About parameters:
        • Typically, Adam works well as the default parameter-update method.
        • If you can load all the data at once (full-batch updates), you can use L-BFGS.

      Model ensembles:
        • Train multiple models and average their results at test time; this can give about a 2% boost.
        • When training a single model, averaging the checkpoints from different stages also brings an improvement.
        • At test time you can also use a running average of the parameters kept during training (see the sketch below).
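
      A minimal sketch of that last point, keeping a moving average of the training-time parameters for use at test time; the decay of 0.999 and the array-based parameter handling are assumptions:

        class WeightAverage:
            # Running exponential average of the parameters; evaluate with the
            # averaged copies instead of the raw training weights.
            def __init__(self, params, decay=0.999):
                self.decay = decay
                self.shadow = [p.copy() for p in params]   # params: list of NumPy arrays

            def update(self, params):
                # Call after every training step with the current parameter arrays.
                for s, p in zip(self.shadow, params):
                    s *= self.decay
                    s += (1.0 - self.decay) * p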


      1. Whether CNN or RNN, batch normalization is useful; it will not necessarily gain you a few points, but convergence is much faster.
      2. Normalizing the data at the start is worthwhile and sometimes directly adds 2 points; for example on CIFAR-10, convert to YUV, normalize, and then apply SCN.
      3. When the loss stops dropping, divide the learning rate by 10 (see the sketch below).
      4. Google's Inception series can never be reproduced exactly as described in the papers.
      I do RNN/text-related research and have always used Adam, which makes the loss drop quickly. After seeing others' papers I also tried Adadelta and Adagrad, but whether because I used them incorrectly or for other reasons, the results were poor. My batch size is set to 50 (the training set has about 170,000 examples) and dropout to 0.5. I also found that the L2 coefficient has a large effect on my results: I had been following others' papers and using 0.001, but after changing it to 0.0001 I found roughly a 1% improvement.
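
      A minimal sketch of the "divide the learning rate by 10 when the loss stops dropping" rule from item 3 (the same idea as the halving schedule mentioned earlier); the patience and min_delta values are assumptions:

        def reduce_lr_on_plateau(lr, val_losses, factor=0.1, patience=3, min_delta=1e-4):
            # val_losses: validation losses recorded at each check, most recent last.
            if len(val_losses) <= patience:
                return lr
            best_before = min(val_losses[:-patience])
            recent_best = min(val_losses[-patience:])
            # If the last `patience` checks brought no meaningful improvement, cut the rate.
            if recent_best > best_before - min_delta:
                return lr * factor
            return lr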

