Haha, I've finally gotten around to writing notes for each lecture. The content of this lecture is relatively light, perhaps to leave free time for Problem Set 1.
Without further ado, let's get to it.
This video mainly goes through the recommended papers in a bit more detail, and then gives a brief opening introduction to the RNN model.
1. Multi-task learning / weight sharing
This first part was actually already explained in NLP (Almost) from Scratch. The idea is that in a DL model the features learned by the lower layers are the same (or nearly the same) across tasks, so those lower layers only need to be modeled once and shared, while a different model is built on top for each task. This greatly reduces the complexity of the models, and also makes maximal use of the available data to extract more features.
The slides then list a few tables showing that multi-task learning does help, letting the facts speak for themselves.
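To make the weight-sharing idea concrete, here is a minimal numpy sketch (not from the lecture; the layer sizes and the two example tasks are made up): one shared lower layer feeds two task-specific output layers.

```python
import numpy as np

# A minimal sketch of weight sharing: one shared lower layer,
# with a separate task-specific layer on top for each task.
rng = np.random.RandomState(0)

d_in, d_hidden = 50, 100          # hypothetical sizes
n_pos_tags, n_ner_tags = 45, 9    # e.g. POS tagging and NER as the two tasks

W_shared = rng.uniform(-0.01, 0.01, (d_hidden, d_in))     # shared across tasks
W_pos = rng.uniform(-0.01, 0.01, (n_pos_tags, d_hidden))  # task-specific head
W_ner = rng.uniform(-0.01, 0.01, (n_ner_tags, d_hidden))  # task-specific head

def forward(x, W_task):
    h = np.tanh(W_shared.dot(x))   # shared feature extractor (lower layer)
    return W_task.dot(h)           # task-specific scores

x = rng.randn(d_in)                # a fake input representation
pos_scores = forward(x, W_pos)     # same lower layer, different heads
ner_scores = forward(x, W_ner)
```

Both tasks read from the same W_shared, so gradients from both tasks update the shared lower layer during training.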
2. Non-linearities
Next comes the non-linearity applied to the data at each layer. The main choices are:
1. sigmoid function
2. hyperbolic tangent (tanh) function
3. hard tanh function
4. soft sign
5. rectified linear function (ReLU)
These five functions are chosen flexibly according to the actual situation. Exactly how to choose was not covered in class, but it is discussed in the recommended reading; when you run into a concrete problem, consulting that paper should sort everything out.
This part is on page 14 (lower-left corner) of the recommended reading; each paper has a link and is easy to find.
Here are the expressions for the five common non-linearities and their corresponding graphs:
In fact, tanh is just the sigmoid (logistic) function rescaled and shifted:
tanh(z) = 2 * logistic(2z) - 1
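For reference, here is a small numpy sketch of the five non-linearities listed above, plus a sanity check of the tanh/logistic identity (this is my own code, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)

def softsign(z):
    return z / (1.0 + np.abs(z))

def relu(z):                       # rectified linear
    return np.maximum(0.0, z)

# tanh itself is np.tanh; sanity-check the identity tanh(z) = 2*logistic(2z) - 1
z = np.linspace(-3, 3, 7)
assert np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)
```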
3. Gradient checks
Gradient checks are really powerful. But be sure to use the high-precision centered-difference formula, (f(x + ε) - f(x - ε)) / (2ε), and not the one-sided (f(x + ε) - f(x)) / ε, because the latter has lower precision.
The choice of epsilon is also quite particular: it should be neither too big nor too small. In the paper's words: "Contrary to naive expectations, the relative difference could grow if we choose an epsilon that is too small, i.e., the error should first decrease as epsilon is decreased and then could worsen when numerical precision kicks in, due to non-linearities."
In the end, based on repeated trials and experience, epsilon around 10^-4 turns out to be the best choice.
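Here is a minimal sketch of a centered-difference gradient check, assuming f(theta) returns a (loss, gradient) pair and using epsilon = 1e-4 as suggested above (the tolerance and the toy example are my own choices):

```python
import numpy as np

def gradient_check(f, theta, epsilon=1e-4, tol=1e-5):
    """Compare the analytic gradient of f against the centered
    difference (f(x+eps) - f(x-eps)) / (2*eps), one parameter at a time."""
    _, grad = f(theta)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + epsilon
        loss_plus, _ = f(theta)
        theta.flat[i] = old - epsilon
        loss_minus, _ = f(theta)
        theta.flat[i] = old                       # restore the parameter
        numeric = (loss_plus - loss_minus) / (2 * epsilon)
        if abs(numeric - grad.flat[i]) > tol:
            print("gradient check FAILED at index", i)
            return False
    print("gradient check passed")
    return True

# Toy example: loss = sum(theta**2), so the gradient is 2*theta.
gradient_check(lambda th: (np.sum(th ** 2), 2 * th), np.random.randn(5))
```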
What if the gradient check fails?
Step 1: Simplify the model until you know it has no bugs.
Step 2: Add back a hidden layer.
...
That is, increase the complexity of your model step by step; when the check fails at some step, you know the bug was introduced there. Fix it, add more complexity, debug again, and keep going until you are back at the original model.
4. Parameter initialization
Parameter initialization also matters a lot. Because a DL model has so many nodes, its loss surface is very complicated, so it is easy to accidentally optimize into a local optimum. It is also easy for the initial values to land in a large "plateau" region, where optimization takes a very long time. And the parameters cannot be symmetric, otherwise the outputs of a layer would all be too similar to each other to be optimized.
Given the above, practical experience suggests initializing the parameters from a uniform distribution uniform(-r, r), with r = sqrt(6 / (fan-in + fan-out)) when the non-linearity is tanh, and r = 4 * sqrt(6 / (fan-in + fan-out)) when the non-linearity is sigmoid, where fan-in and fan-out are the sizes of the layer before and the layer after the current weight matrix.
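A small sketch of this initialization rule in numpy (the layer sizes below are made up):

```python
import numpy as np

def init_weights(fan_in, fan_out, activation="tanh"):
    # uniform(-r, r), with r depending on the non-linearity as described above
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if activation == "sigmoid":
        r *= 4.0
    return np.random.uniform(-r, r, (fan_out, fan_in))

W1 = init_weights(fan_in=100, fan_out=50, activation="tanh")
```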
5. Learning rates
The learning rate should be neither too big nor too small: too big and it is easy to overshoot, too small and training becomes too slow.
One approach is to anneal it over time: epsilon_t = epsilon_0 * tau / max(t, tau), so the rate stays at epsilon_0 until t exceeds tau, and then epsilon_t decreases over time.
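A tiny worked example of this schedule (epsilon_0 and tau are arbitrary values I picked):

```python
def annealed_lr(epsilon_0, t, tau):
    # constant for t <= tau, then decays like 1/t
    return epsilon_0 * tau / max(t, tau)

for t in (1, 100, 200, 400):
    print(t, annealed_lr(0.1, t, tau=100))   # 0.1, 0.1, 0.05, 0.025 (roughly)
```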
Another method is more interesting: different parameters get different learning rates. If a parameter's previous gradients have been large, its learning rate becomes small, and vice versa.
The way to do this is to keep, for each parameter, a running record of its previous gradients; then learning rate = fixed constant / sqrt(sum of the squares of all previous gradients).
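This is the AdaGrad idea; here is a minimal sketch (the base learning rate, the small eps for numerical stability, and the toy loss are my own choices):

```python
import numpy as np

def adagrad_update(theta, grad, cache, base_lr=0.01, eps=1e-8):
    cache += grad ** 2                                 # accumulate squared gradients
    theta -= base_lr * grad / (np.sqrt(cache) + eps)   # per-parameter step size
    return theta, cache

theta = np.random.randn(10)
cache = np.zeros_like(theta)
for _ in range(100):
    grad = 2 * theta                 # gradient of the toy loss sum(theta**2)
    theta, cache = adagrad_update(theta, grad, cache)
```

Parameters that keep receiving large gradients accumulate a large cache and take smaller steps, while rarely-updated parameters keep a relatively large effective learning rate.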
6. Preventing overfitting
There are four ways to prevent overfitting:
1. Reduce the complexity of the model: reduce the number of units per layer, or directly reduce the number of layers.
2. Use weight decay; the common choices are L1 and L2 regularization on the weights.
3. Early stopping: stop training at some point and use the parameters with the best validation error.
4. Sparsity constraints on the hidden activations. I have seen this method before in auto-encoders.
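As a small illustration of method 2, here is a sketch of L2 weight decay: it adds lambda * ||W||^2 to the loss and therefore 2 * lambda * W to the gradient of W (the function name and the regularization strength are my own):

```python
import numpy as np

def l2_regularized(loss, grad_W, W, reg_lambda=1e-4):
    # L2 weight decay: penalize large weights and adjust the gradient accordingly
    loss = loss + reg_lambda * np.sum(W ** 2)
    grad_W = grad_W + 2 * reg_lambda * W
    return loss, grad_W

W = np.random.randn(5, 5)
loss, grad_W = l2_regularized(loss=1.0, grad_W=np.zeros_like(W), W=W)
```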
7. Recurrent neural network language model
The introduction to RNNs in this lecture is very brief. The model is called recurrent because its parameters are reused at every time step; what the advantages are was not covered.
Looking forward to the detailed treatment in the next lecture.
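To show what "the parameters are reused" means, here is a minimal sketch of the recurrence (all sizes are made up; this is not the full language model from the course):

```python
import numpy as np

d_h, d_x, T = 50, 100, 10                           # made-up sizes
W_hh = np.random.uniform(-0.1, 0.1, (d_h, d_h))     # reused at every step
W_hx = np.random.uniform(-0.1, 0.1, (d_h, d_x))     # reused at every step

h = np.zeros(d_h)
for t in range(T):
    x_t = np.random.randn(d_x)                      # word vector at step t
    h = np.tanh(W_hh.dot(h) + W_hx.dot(x_t))        # same parameters each step
```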
CS224D Lecture 6 Notes