CS224D Lecture 6 Notes

Haha, I've finally gotten around to writing notes for each lecture. The content of this lecture is relatively light, perhaps to leave free time for problem set 1.

Without further ado, let's write it up.

This lecture mainly gives a more detailed explanation of parts of the recommended papers, and then opens up the topic of the RNN model at the end.

1. Multi-task learning / weight sharing

This first part is also explained in "NLP (Almost) from Scratch". The idea is that in a DL model the features in the lower layers are the same or similar across tasks, so those lower layers only need to be modeled once and shared, and then a separate model is built on top of them for each task. This greatly reduces the complexity of the model, and it also makes maximal use of the existing data, extracting richer features.
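As a rough sketch of the weight-sharing idea (the layer sizes and the two task heads below are illustrative, not from the lecture), the lower layer is shared while each task gets its own output layer:

```python
# A minimal sketch of weight sharing for multi-task learning (numpy only).
import numpy as np

rng = np.random.default_rng(0)

# Shared lower layer: every task reuses the same W_shared / b_shared.
W_shared = rng.uniform(-0.1, 0.1, size=(50, 100))   # hidden_dim x input_dim
b_shared = np.zeros(50)

# Task-specific output layers built on top of the shared representation.
W_pos = rng.uniform(-0.1, 0.1, size=(10, 50))        # e.g. a POS-tagging head
W_ner = rng.uniform(-0.1, 0.1, size=(5, 50))         # e.g. an NER head

def shared_features(x):
    return np.tanh(W_shared @ x + b_shared)

def task_scores(x, task):
    h = shared_features(x)                            # same features for every task
    return (W_pos if task == "pos" else W_ner) @ h

x = rng.standard_normal(100)
print(task_scores(x, "pos").shape, task_scores(x, "ner").shape)
```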

The slides then list a few tables showing that multi-task learning works well, letting the facts speak for themselves.

2. Non-linearities

The next part is the non-linearity applied to the data in each layer. The main choices are: 1. the sigmoid function, 2. the hyperbolic tangent (tanh) function, 3. the hard tanh function, 4. the soft sign function, 5. the rectified linear function (ReLU). These five functions are chosen flexibly according to the situation. The lecture did not say exactly how to choose, but the recommended reading is cited; when you run into a specific problem, consulting that paper should sort things out.

The recommended reading for this section is in the lower-left corner of page 14 of the slides; each paper has a link and can easily be found.

Here are the expressions for the five common non-linearities and their corresponding graphs (shown in the slides).

In fact, tanh is just the sigmoid rescaled and shifted:

tanh(z) = 2 * logistic(2z) - 1
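Written out in numpy as a small sketch, the five non-linearities and the sigmoid/tanh relation look like this:

```python
# The five non-linearities mentioned above, written out in numpy.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def hard_tanh(z):
    return np.clip(z, -1.0, 1.0)       # linear in [-1, 1], clipped outside

def softsign(z):
    return z / (1.0 + np.abs(z))

def relu(z):
    return np.maximum(0.0, z)

# Sanity check of the relation above: tanh(z) = 2 * logistic(2z) - 1
z = np.linspace(-3, 3, 7)
print(np.allclose(tanh(z), 2 * sigmoid(2 * z) - 1))   # True
```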

3. Gradient checks

Gradient checks are very powerful. But be sure to use the higher-precision centered formula, (f(x + epsilon) - f(x - epsilon)) / (2 * epsilon), not (f(x + epsilon) - f(x)) / epsilon, because the latter is less accurate.

The choice of epsilon is also subtle: neither too big nor too small. In the words of the paper: "Contrary to naive expectations, the relative difference could grow if we choose an epsilon that is too small, i.e., the error should first decrease as epsilon is decreased and then could worsen when numerical precision kicks in, due to non-linearities."

Finally, based on repeated trials and experience, epsilon around 10^-4 is found to be the best choice.
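A minimal sketch of such a centered-difference gradient check, assuming the function under test returns both its value and its analytic gradient:

```python
# Centered-difference gradient check for a function f(x) -> (loss, grad).
import numpy as np

def gradient_check(f, x, eps=1e-4):
    """Compare the analytic gradient of f at x with a numerical estimate."""
    _, grad = f(x)
    num_grad = np.zeros_like(x)
    for i in range(x.size):
        x[i] += eps
        f_plus, _ = f(x)
        x[i] -= 2 * eps
        f_minus, _ = f(x)
        x[i] += eps                                     # restore x[i]
        num_grad[i] = (f_plus - f_minus) / (2 * eps)    # centered difference
    rel_err = np.abs(num_grad - grad) / np.maximum(1e-8, np.abs(num_grad) + np.abs(grad))
    return rel_err.max()

# Example: f(x) = sum(x^2), whose gradient is 2x.
f = lambda x: (np.sum(x ** 2), 2 * x)
print(gradient_check(f, np.random.randn(5)))            # should be tiny
```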

What if the gradient check fails?

Step one: simplify the model until you know it has no bugs.

Step two: add a hidden layer.

...

The idea is to increase the complexity of your model step by step; the step at which the gradient check starts failing tells you where the bug was introduced. Fix it, add more complexity, check again, and repeat until you are back at the original model.

4. Parameter initialization

Parameter initialization is also quite subtle. Because a DL model has so many nodes, its loss surface is very complex, so it is easy to end up in a local optimum. It is also easy for the initial point to sit on a large "plateau", where optimization takes a very long time. Parameters cannot be initialized symmetrically either, otherwise the outputs of a layer's units will be too similar to be optimized apart.

Considering the above, practical experience suggests initializing parameters from a uniform distribution Uniform(-r, r), with r = sqrt(6 / (fan-in + fan-out)) when the non-linearity is tanh and r = 4 * sqrt(6 / (fan-in + fan-out)) when it is sigmoid, where fan-in and fan-out are the sizes of the layer before and the layer after the current weight matrix.
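A small sketch of this initialization rule (the layer sizes below are illustrative):

```python
# Uniform(-r, r) initialization with r depending on the non-linearity.
import numpy as np

def init_weights(fan_in, fan_out, nonlinearity="tanh", seed=0):
    rng = np.random.default_rng(seed)
    r = np.sqrt(6.0 / (fan_in + fan_out))
    if nonlinearity == "sigmoid":
        r *= 4.0                     # wider range for sigmoid units
    return rng.uniform(-r, r, size=(fan_out, fan_in))

W1 = init_weights(100, 50, "tanh")
W2 = init_weights(50, 10, "sigmoid")
print(W1.shape, W1.min(), W1.max())
```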

5. Learning rates

The learning rate should not be too big: too big and you easily overshoot; too small and training is too slow.

One approach is to let the rate decay over time: epsilon_t = epsilon_0 * period / max(t, period), so epsilon_t stays at epsilon_0 until t exceeds period and then decreases over time.

Another method is more interesting: different parameters get different learning rates. If the sum of a parameter's past (squared) gradients is large, its learning rate is small, and vice versa.

The way to do this is to record, for each parameter, all of its past gradients; the learning rate at this step is then a fixed constant divided by sqrt(sum of the squared past gradients). This is essentially Adagrad.
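A small sketch of both ideas, the annealed learning rate and an Adagrad-style per-parameter rate (variable names such as period and eps0 are illustrative):

```python
# Annealed learning rate and an Adagrad-style per-parameter update.
import numpy as np

def annealed_lr(eps0, t, period):
    # Constant for t <= period, then decays roughly as 1/t.
    return eps0 * period / max(t, period)

class AdagradStep:
    def __init__(self, shape, eps0=0.01):
        self.eps0 = eps0
        self.sum_sq = np.zeros(shape)            # running sum of squared gradients

    def update(self, theta, grad):
        self.sum_sq += grad ** 2
        theta -= self.eps0 / np.sqrt(self.sum_sq + 1e-8) * grad
        return theta
```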

6. Preventing overfitting

There are four ways to prevent overfitting: 1. reduce the complexity of the model, either by reducing the number of units per layer or by reducing the number of layers directly; 2. use weight decay, most commonly L1 or L2 regularization on the weights; 3. early stopping: stop training at some point and use the parameters with the best validation error; 4. sparsity constraints on the hidden activations, a method I have also seen in auto-encoders.
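As a small sketch of items 2 and 3 above (the train_step and validate callables are hypothetical placeholders): L2 weight decay simply adds a term to the gradient, and early stopping keeps the best-so-far parameters on the validation set:

```python
# L2 weight decay on the gradient, and early stopping on validation error.
import numpy as np

def l2_regularized_grad(grad, W, lam=1e-4):
    # Adding lam/2 * ||W||^2 to the loss adds lam * W to the gradient.
    return grad + lam * W

def train_with_early_stopping(train_step, validate, W, max_epochs=100):
    best_err, best_W = np.inf, W.copy()
    for _ in range(max_epochs):
        W = train_step(W)
        err = validate(W)
        if err < best_err:                 # remember the best-so-far parameters
            best_err, best_W = err, W.copy()
    return best_W
```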

7. Recurrent neural network language model

The introduction to RNNs in this lecture is very brief. The model is called recurrent because its parameters are reused at every time step; what advantages this brings was not yet introduced.
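As a tiny sketch of why it is called recurrent (the dimensions below are illustrative): the same W_h and W_x are applied at every time step:

```python
# Minimal forward pass of a vanilla RNN: parameters shared across time steps.
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.uniform(-0.1, 0.1, size=(20, 20))   # hidden-to-hidden, shared over time
W_x = rng.uniform(-0.1, 0.1, size=(20, 50))   # input-to-hidden, shared over time

def rnn_forward(xs):
    h = np.zeros(20)
    for x in xs:                               # same parameters at every step
        h = np.tanh(W_h @ h + W_x @ x)
    return h

print(rnn_forward([rng.standard_normal(50) for _ in range(4)]).shape)
```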

I look forward to the detailed treatment in the next lecture.
