Original link: https://www.zhihu.com/question/41631631
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
I have been tuning RNNs for almost a year, and I have come to feel deeply that deep learning is an experimental science. Below are some of my "alchemy" tips; I will keep adding to them. If anything is wrong, please correct me.
Parameter initialization: pick any of the following methods; the results are about the same (a sketch of each appears after this list).
- Uniform
W = np.random.uniform(low=-scale, high=scale, size=shape)
- Glorot_uniform
scale = np.sqrt(6. / (shape[0] + shape[1]))
W = np.random.uniform(low=-scale, high=scale, size=shape)
- Gaussian initialization:
w = np.random.randn(n) / np.sqrt(n), where n is the number of parameters
If the activation function is ReLU, the recommended form is
w = np.random.randn(n) * np.sqrt(2.0 / n)
- SVD initialization works better for RNNs and can effectively speed up convergence.
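A minimal NumPy sketch of these initializers (the function names are my own; the orthogonal variant stands in for the SVD-based initialization mentioned for RNNs):

import numpy as np

def uniform_init(shape, scale=0.05):
    # plain uniform initialization in [-scale, scale]
    return np.random.uniform(low=-scale, high=scale, size=shape)

def glorot_uniform_init(shape):
    # Glorot/Xavier uniform: scale depends on fan-in + fan-out
    scale = np.sqrt(6. / (shape[0] + shape[1]))
    return np.random.uniform(low=-scale, high=scale, size=shape)

def gaussian_init(shape, relu=False):
    # Gaussian scaled by fan-in (shape[0]); the ReLU variant uses sqrt(2/n)
    n = shape[0]
    std = np.sqrt(2.0 / n) if relu else 1.0 / np.sqrt(n)
    return np.random.randn(*shape) * std

def orthogonal_init(shape):
    # SVD-based (orthogonal) initialization for a 2-D weight matrix,
    # often used for RNN recurrent weights
    a = np.random.randn(*shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == shape else vt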
Data preprocessing methods
- Zero-centering: this is quite common.
X -= np.mean(X, axis=0)  # zero-center
X /= np.std(X, axis=0)   # normalize
- PCA whitening: this is used relatively rarely (a sketch of both steps follows this list).
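A sketch of both preprocessing steps, assuming X is an (n_samples, n_features) array; the whitening part follows the standard PCA recipe rather than any particular library:

import numpy as np

def zero_center_normalize(X):
    # subtract the per-feature mean, then divide by the per-feature std
    X = X - np.mean(X, axis=0)
    return X / (np.std(X, axis=0) + 1e-8)

def pca_whiten(X, eps=1e-5):
    # zero-center, rotate onto the principal axes, then rescale each axis
    X = X - np.mean(X, axis=0)
    cov = np.dot(X.T, X) / X.shape[0]
    U, S, _ = np.linalg.svd(cov)
    X_rot = np.dot(X, U)              # decorrelate
    return X_rot / np.sqrt(S + eps)   # whiten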
Training skills
- Gradient normalization: divide the computed gradients by the minibatch size.
- Gradient clipping: limit the maximum gradient norm. Compute value = sqrt(w1^2 + w2^2 + ...); if value exceeds a threshold, multiply the gradient by a decay coefficient so that its norm equals the threshold. Typical thresholds: 5, 10, 15 (see the clipping sketch after this list).
- Dropout works well for preventing overfitting on small datasets; the rate is usually set to 0.5, and on small data dropout + SGD works even better. The placement of dropout matters: for RNNs, it is recommended to apply it at the input->RNN and RNN->output positions. For how to use dropout with RNNs, see this paper: http://arxiv.org/abs/1409.2329
- Adam and Adadelta: on small data, in my experiments they were not as good as SGD. If you use SGD, start with a learning rate of 1.0 or 0.1; after a while, check the validation set, and if the cost is not decreasing, halve the learning rate. Many papers do this, and my own experiments also turned out well. Of course, you can also run the Ada family first for fast initial convergence and then switch to SGD to continue training; that also brings improvements.
- Except at gates and other places where the output must be limited to 0-1, try not to use sigmoid; use activation functions such as tanh or ReLU instead.
- For RNNs, the hidden dim and embedding size are generally tuned starting from around 128. Batch size is also generally tuned starting from around 128. Getting the batch size right is what matters most; bigger is not necessarily better.
- Initializing with word2vec embeddings on small data not only speeds up convergence but also improves the results.
- Try to shuffle the data.
- LSTM forget gate bias: initializing it to 1.0 or a larger value can give better results, per this paper: http://jmlr.org/proceedings/papers/v37/jozefowicz15.pdf. In my experiments, setting it to 1.0 improved convergence speed. In practice, different tasks may need different values, so try a few (see the bias sketch below).
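A sketch of the norm-based gradient clipping described above; grads is assumed to be a list of gradient arrays already divided by the minibatch size, and threshold is the 5/10/15 value mentioned:

import numpy as np

def clip_gradients(grads, threshold=5.0):
    # rescale the gradients if their global L2 norm exceeds the threshold,
    # so that the clipped norm equals the threshold
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > threshold:
        grads = [g * (threshold / (total_norm + 1e-8)) for g in grads]
    return grads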
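And a sketch of the forget-gate bias trick for a hand-rolled LSTM layer, assuming the four gate biases are kept as separate vectors (frameworks that concatenate them need the forget-gate slice set instead):

import numpy as np

def init_lstm_biases(hidden_size, forget_bias=1.0):
    # input, forget, cell, and output gate biases; only the forget gate
    # starts at 1.0 (or larger) so the cell state is remembered by default
    b_i = np.zeros(hidden_size)
    b_f = np.full(hidden_size, forget_bias)
    b_c = np.zeros(hidden_size)
    b_o = np.zeros(hidden_size)
    return b_i, b_f, b_c, b_o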
Ensembles: the ultimate nuclear weapon for pushing up paper results. In deep learning they are usually built in the following ways:
- Same parameters, different initialization methods
- Different parameters: select the best few sets by cross-validation
- Same parameters, but models from different stages of training
- Different models, combined by linear fusion, e.g. an RNN plus traditional models. One more note: Adam converges fast, but the solution it finds is often not as good as SGD + momentum; if time cost is not a concern, just use SGD (a prediction-averaging sketch follows this list).
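A minimal sketch of the simplest fusion, averaging predicted class probabilities across models; the (n_samples, n_classes) layout is my assumption for illustration:

import numpy as np

def ensemble_average(prob_list):
    # prob_list: one (n_samples, n_classes) probability array per model
    # (different inits, checkpoints, or model families)
    avg = np.mean(np.stack(prob_list, axis=0), axis=0)
    return np.argmax(avg, axis=1)

Linear fusion just replaces the plain mean with a weighted one, with the weights chosen on a validation set.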
One more RNN trick, again ignoring time cost: batch size = 1 is a very good regularizer, at least on some tasks. This may be one reason many people cannot reproduce Alex Graves' results, because he always sets the batch size to 1... I have recently been watching Karpathy's cs231n; I have not finished it yet, but I am summarizing some of the techniques he mentions as I go:
About parameters:
- In most cases, Adam works well as the default parameter update method.
- If you can load all the data at once (full-batch updates), you can use L-BFGS.
Model Ensembles:
- Train multiple models and average their outputs at test time; this can give about a 2% boost.
- When training a single model, averaging checkpoints from different stages can also bring an improvement.
- At test time, you can combine the test-time parameters with the parameters from training (see the sketch below).
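The last point is usually implemented by keeping a running (exponential moving) average of the weights during training and evaluating with it at test time; a sketch under that assumption, with the decay value chosen arbitrarily:

def update_ema(ema_params, params, decay=0.999):
    # blend the running average toward the current training parameters
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]

# after each update step: ema_params = update_ema(ema_params, params)
# at test time: evaluate the model with ema_params instead of params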
1. Whether CNN or RNN, batch normalization is useful; it does not necessarily gain you a few points, but convergence is much faster.
2. Normalize the data properly at the start; sometimes that alone adds 2 points, e.g. for CIFAR-10, convert to YUV, normalize, then apply SCN.
3. When the loss stops decreasing, divide the LR by 10 (see the sketch after this list).
4. Google's Inception series can never be reproduced by following exactly what the papers say.
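A sketch of point 3, dropping the learning rate by 10x when the validation loss stops improving; the patience window is my own addition, since the answer only says the loss "does not drop":

def maybe_decay_lr(lr, val_losses, patience=3, factor=10.0):
    # divide the learning rate by `factor` if the best validation loss of the
    # last `patience` epochs is no better than the best loss before that window
    if len(val_losses) > patience and min(val_losses[-patience:]) >= min(val_losses[:-patience]):
        return lr / factor
    return lr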
I do RNN text-related research and have been using Adam; it converges fast. After seeing it in other people's papers I also tried Adadelta and Adagrad, but either I was using them wrong or for some other reason the results were poor. My batch size is set to 50 (training set of about 170k examples) and dropout to 0.5. I also found that the L2 coefficient has a big influence on my results: I had been following other papers and using 0.001, and after switching to 0.0001 I saw about a 1% improvement (a sketch of the L2 term follows).
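A sketch of the L2 term that answer refers to, with the coefficient as the hyperparameter being tuned (0.001 vs. 0.0001 above):

import numpy as np

def l2_penalty(weights, l2=1e-4):
    # add l2 * (sum of squared weights) to the loss; the corresponding
    # gradient contribution for each weight matrix W is 2 * l2 * W
    return l2 * sum(np.sum(W ** 2) for W in weights)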
Original question: What tuning experience do you have for deep learning (RNN, CNN)?